
issue_comments


3 rows where created_at is on date 2020-09-07 and issue = 688668680 (Handle case where subsequent records (after first batch) include extra columns), sorted by updated_at descending

Comment 688508510 · simonw (OWNER) · created 2020-09-07T20:56:03Z · updated 2020-09-07T20:56:24Z
https://github.com/simonw/sqlite-utils/pull/146#issuecomment-688508510

The problem with this approach is that it requires us to consume the entire iterator before we can start inserting rows into the table - here on line 1052:

https://github.com/simonw/sqlite-utils/blob/bb131793feac16bc7181ab997568f941b0220ef2/sqlite_utils/db.py#L1047-L1054

I designed .insert_all() to avoid doing this, because I want to be able to pass it an iterator (or more likely a generator) that could produce potentially millions of records. Doing things one batch of 100 records at a time means that the Python process doesn't need to pull millions of records into memory at once.
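
(For illustration: a minimal sketch of that streaming pattern. The chunks() helper below is hypothetical, not sqlite-utils' actual implementation.)

import itertools

def chunks(iterable, size):
    # Yield successive lists of up to `size` items without ever
    # materializing the whole iterable in memory.
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch

# A generator of millions of records stays lazy; only one batch of
# 100 records is held in memory at a time.
records = ({"id": i, "value": i * 2} for i in range(10_000_000))
for batch in chunks(records, 100):
    pass  # insert this batch into the table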

db-to-sqlite is one example of a tool that relies on that characteristic: https://github.com/simonw/db-to-sqlite/blob/63e4ee972f292de13bb11767c0fb64b35339d954/db_to_sqlite/cli.py#L94-L106

So we need to solve this issue without consuming the entire iterator with a records = list(records) call.

I think one way to do this is to execute the chunks one at a time and watch out for an exception that indicates we sent too many parameters - then adjust the chunk size down and try again.
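
(A rough sketch of that fallback, assuming each row is bound with one ? placeholder per column; exceeding SQLite's limit raises sqlite3.OperationalError with the message "too many SQL variables". The insert_chunked helper below is illustrative, not the actual sqlite-utils code.)

import sqlite3

def insert_chunked(conn, table, columns, rows, chunk_size):
    # Insert rows in chunks. If SQLite complains that the statement
    # binds too many parameters, halve the chunk size and retry the
    # same slice instead of giving up.
    i = 0
    while i < len(rows):
        chunk = rows[i : i + chunk_size]
        row_sql = "(" + ", ".join("?" * len(columns)) + ")"
        sql = "INSERT INTO [{}] ({}) VALUES {}".format(
            table,
            ", ".join("[{}]".format(c) for c in columns),
            ", ".join([row_sql] * len(chunk)),
        )
        params = [row[col] for row in chunk for col in columns]
        try:
            conn.execute(sql, params)
            i += len(chunk)
        except sqlite3.OperationalError as e:
            # SQLite's message here is "too many SQL variables"
            if "too many" not in str(e) or chunk_size == 1:
                raise
            chunk_size //= 2  # adjust the chunk size down and retry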

Comment 688481317 · simonwiles (CONTRIBUTOR) · created 2020-09-07T19:18:55Z
https://github.com/simonw/sqlite-utils/pull/146#issuecomment-688481317

Just force-pushed to update d042f9c with more formatting changes to satisfy black==20.8b1 and pass the GitHub Actions "Test" workflow.

Comment 688479163 · simonwiles (CONTRIBUTOR) · created 2020-09-07T19:10:33Z · updated 2020-09-07T19:11:57Z
https://github.com/simonw/sqlite-utils/pull/146#issuecomment-688479163

@simonw -- I've gone ahead and updated the documentation to reflect the changes introduced in this PR. IMO it's ready to merge now.

In writing the documentation changes, I began to wonder about the value and role of batch_size at all, tbh. May I assume it was originally intended to prevent using the entire row set to determine columns and column types, and that this was a performance consideration? If so, this PR entirely undermines its purpose. I've been passing in excess of 500,000 rows at a time to insert_all() with these changes, and although I'm sure the performance difference is measurable, it's not really noticeable; given #145, I don't know that any performance advantage outweighs the problems this approach removes.

What do you think about just dropping the argument and defaulting to the maximum batch_size permissible given SQLITE_MAX_VARS? Are there other reasons one might want to restrict batch_size that I've overlooked? I could open a new issue to discuss/implement this.
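
(For reference, a sketch of how such a default could be derived, assuming SQLITE_MAX_VARS names SQLite's per-statement limit on bound parameters: 999 by default in older SQLite builds, 32766 since SQLite 3.32.0.)

SQLITE_MAX_VARS = 999  # conservative default bound-parameter limit

def default_batch_size(num_columns, batch_size=None):
    # Each row consumes one bound parameter per column, so the
    # largest batch that fits in a single INSERT statement is the
    # floor division of the limit by the column count.
    limit = SQLITE_MAX_VARS // num_columns
    return limit if batch_size is None else min(batch_size, limit)

# e.g. the 12-column issue_comments table: 999 // 12 == 83 rows
print(default_batch_size(12))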

Of course the documentation will need to change again too if/when something is done about #147.



CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id]),
   [performed_via_github_app] TEXT
);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);