issue_comments


9 rows where issue = 707427200 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at author_association body reactions issue performed_via_github_app
698178101 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-698178101 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5ODE3ODEwMQ== simonw 9599 2020-09-24T07:48:57Z 2020-09-24T07:49:20Z OWNER

I wonder if I could make this faster by separating it out into a few steps:

* Create the new lookup table with all of the distinct rows

* Add the blank foreign key column

* Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`

* Drop the value columns
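
A minimal sketch of those four steps, run from Python with the standard `sqlite3` module - table and column names are borrowed from the prototype below, and step 4 assumes SQLite 3.35+ for `ALTER TABLE ... DROP COLUMN` (older versions need a table rebuild instead):

```python
import sqlite3

conn = sqlite3.connect("salaries.db")

# 1. Create the new lookup table from the distinct value pairs
conn.execute(
    """CREATE TABLE departments (
        id INTEGER PRIMARY KEY,
        code TEXT,
        name TEXT,
        UNIQUE (code, name)
    )"""
)
conn.execute(
    """INSERT INTO departments (code, name)
       SELECT DISTINCT [Department Code], [Department] FROM salaries"""
)

# 2. Add the blank foreign key column
conn.execute("ALTER TABLE salaries ADD COLUMN department_id INTEGER")

# 3. Populate it with a single correlated UPDATE
conn.execute(
    """UPDATE salaries SET department_id = (
        SELECT id FROM departments
        WHERE code = salaries.[Department Code]
          AND name = salaries.[Department]
    )"""
)

# 4. Drop the value columns (SQLite 3.35+ only)
conn.execute("ALTER TABLE salaries DROP COLUMN [Department Code]")
conn.execute("ALTER TABLE salaries DROP COLUMN [Department]")
conn.commit()
```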

My prototype of this knocked the time down from 10 minutes to 4 seconds, so I think the change is worth it!

```
% date
sqlite-utils extract salaries.db salaries \
    'Department Code' 'Department' \
    --table 'departments' \
    --fk-column 'department_id' \
    --rename 'Department Code' code \
    --rename 'Department' name
date
sqlite-utils extract salaries.db salaries \
    'Union Code' 'Union' \
    --table 'unions' \
    --fk-column 'union_id' \
    --rename 'Union Code' code \
    --rename 'Union' name
date
sqlite-utils extract salaries.db salaries \
    'Job Family Code' 'Job Family' \
    --table 'job_families' \
    --fk-column 'job_family_id' \
    --rename 'Job Family Code' code \
    --rename 'Job Family' name
date
sqlite-utils extract salaries.db salaries \
    'Job Code' 'Job' \
    --table 'jobs' \
    --fk-column 'job_id' \
    --rename 'Job Code' code \
    --rename 'Job' name
date
Thu Sep 24 00:48:16 PDT 2020
Thu Sep 24 00:48:20 PDT 2020
Thu Sep 24 00:48:24 PDT 2020
Thu Sep 24 00:48:28 PDT 2020
Thu Sep 24 00:48:32 PDT 2020
```

697869886 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697869886 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5Nzg2OTg4Ng== simonw 9599 2020-09-23T18:45:30Z 2020-09-23T18:45:30Z OWNER

There's something to be said for making this operation pausable and resumable, especially if I'm going to make it available in a Datasette plugin at some point.

697866885 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697866885 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5Nzg2Njg4NQ== simonw 9599 2020-09-23T18:43:37Z 2020-09-23T18:43:37Z OWNER

Also what would happen if the table had new rows added to it while that command was running?

697863116 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697863116 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5Nzg2MzExNg== simonw 9599 2020-09-23T18:41:06Z 2020-09-23T18:41:06Z OWNER

The problem with this approach is that it's not compatible with progress bars - but if it's several times faster, it's worth it.

697859772 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697859772 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5Nzg1OTc3Mg== simonw 9599 2020-09-23T18:38:43Z 2020-09-23T18:38:52Z OWNER

I wonder if I could make this faster by separating it out into a few steps:

- Create the new lookup table with all of the distinct rows
- Add the blank foreign key column
- Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
- Drop the value columns

697835956 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697835956 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5NzgzNTk1Ng== simonw 9599 2020-09-23T18:22:49Z 2020-09-23T18:22:49Z OWNER

I ran `sudo py-spy top -p 123` against the process while it was running, and the most time is definitely spent in `.update()`:

```
Total Samples 1000
GIL: 0.00%, Active: 90.00%, Threads: 1

  %Own   %Total   OwnTime  TotalTime  Function (filename:line)
 38.00%   38.00%    3.85s     3.85s   update (sqlite_utils/db.py:1283)
 27.00%   27.00%    2.12s     2.12s   execute (sqlite_utils/db.py:161)
 10.00%   10.00%   0.890s    0.890s   execute (sqlite_utils/db.py:163)
 10.00%   17.00%   0.870s     1.54s   columns (sqlite_utils/db.py:553)
  0.00%    0.00%   0.110s    0.210s   <listcomp> (sqlite_utils/db.py:554)
  0.00%    3.00%   0.100s    0.320s   table_names (sqlite_utils/db.py:191)
  0.00%    0.00%   0.100s    0.100s   new (<string>:1)
```

697473247 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697473247 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5NzQ3MzI0Nw== simonw 9599 2020-09-23T14:45:13Z 2020-09-23T14:45:13Z OWNER

`lookup_table.lookup(lookups)` is doing a SQL lookup. This could be cached in-memory, maybe with an LRU cache, to avoid looking up the primary key for records that we have recently used.
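
A minimal sketch of that caching idea, assuming a `departments` lookup table with `code` and `name` columns; `functools.lru_cache` needs hashable arguments, so the values are passed positionally rather than as the dict that `.lookup()` takes:

```python
from functools import lru_cache

import sqlite_utils

db = sqlite_utils.Database("salaries.db")
departments = db["departments"]

# Each distinct (code, name) pair hits SQL at most once; repeat
# lookups for recently used records are served from memory. The
# maxsize of 1024 is an arbitrary bound to keep memory flat.
@lru_cache(maxsize=1024)
def department_id(code, name):
    return departments.lookup({"code": code, "name": name})
```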

The `.update()` method it is calling first does a `get()` and then does a SQL `UPDATE ... WHERE`:

https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L1244-L1264

Batching those updates may have an effect, as might finding a way to skip the `.get()` since we already know we have a valid record.
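
A sketch of the batched alternative, skipping the `.get()` entirely by issuing the `UPDATE` statements through a single `executemany()` call per batch (the `(department_id, rowid)` pairs here are hypothetical stand-ins for values collected during the extract loop):

```python
import sqlite3

conn = sqlite3.connect("salaries.db")

# Hypothetical batch of (foreign key value, rowid) pairs accumulated
# while walking the source table
batch = [(1, 10), (1, 11), (2, 12)]

with conn:  # one transaction per batch, committed on success
    conn.executemany(
        "UPDATE salaries SET department_id = ? WHERE rowid = ?",
        batch,
    )
```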

697467833 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697467833 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5NzQ2NzgzMw== simonw 9599 2020-09-23T14:42:03Z 2020-09-23T14:42:03Z OWNER

Here's the loop that's taking the time: https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L892-L897

697466497 https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697466497 https://api.github.com/repos/simonw/sqlite-utils/issues/172 MDEyOklzc3VlQ29tbWVudDY5NzQ2NjQ5Nw== simonw 9599 2020-09-23T14:41:17Z 2020-09-23T14:41:17Z OWNER

Steps to produce that database:

```
curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'
sqlite-utils insert salaries.db salaries salaries.csv --csv
```


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);