{"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-698178101", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 698178101, "node_id": "MDEyOklzc3VlQ29tbWVudDY5ODE3ODEwMQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-24T07:48:57Z", "updated_at": "2020-09-24T07:49:20Z", "author_association": "OWNER", "body": "> I wonder if I could make this faster by separating it out into a few steps:\r\n> \r\n> * Create the new lookup table with all of the distinct rows\r\n> \r\n> * Add the blank foreign key column\r\n> \r\n> * Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`\r\n> \r\n> * Drop the value columns\r\n\r\nMy prototype of this knocked the time down from 10 minutes to 4 seconds, so I think the change is worth it!\r\n```\r\n% date\r\nsqlite-utils extract salaries.db salaries \\\r\n 'Department Code' 'Department' \\\r\n --table 'departments' \\\r\n --fk-column 'department_id' \\\r\n --rename 'Department Code' code \\\r\n --rename 'Department' name\r\ndate\r\nsqlite-utils extract salaries.db salaries \\\r\n 'Union Code' 'Union' \\\r\n --table 'unions' \\\r\n --fk-column 'union_id' \\\r\n --rename 'Union Code' code \\\r\n --rename 'Union' name\r\ndate\r\nsqlite-utils extract salaries.db salaries \\\r\n 'Job Family Code' 'Job Family' \\\r\n --table 'job_families' \\\r\n --fk-column 'job_family_id' \\\r\n --rename 'Job Family Code' code \\\r\n --rename 'Job Family' name\r\ndate\r\nsqlite-utils extract salaries.db salaries \\\r\n 'Job Code' 'Job' \\\r\n --table 'jobs' \\\r\n --fk-column 'job_id' \\\r\n --rename 'Job Code' code \\\r\n --rename 'Job' name\r\ndate\r\nThu Sep 24 00:48:16 PDT 2020\r\n\r\nThu Sep 24 00:48:20 PDT 2020\r\n\r\nThu Sep 24 00:48:24 PDT 2020\r\n\r\nThu Sep 24 00:48:28 PDT 2020\r\n\r\nThu Sep 24 00:48:32 PDT 2020\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, 
\"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697869886", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697869886, "node_id": "MDEyOklzc3VlQ29tbWVudDY5Nzg2OTg4Ng==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T18:45:30Z", "updated_at": "2020-09-23T18:45:30Z", "author_association": "OWNER", "body": "There's something to be said for making this operation pausable and resumable, especially if I'm going to make it available in a Datasette plugin at some point.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697866885", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697866885, "node_id": "MDEyOklzc3VlQ29tbWVudDY5Nzg2Njg4NQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T18:43:37Z", "updated_at": "2020-09-23T18:43:37Z", "author_association": "OWNER", "body": "Also what would happen if the table had new rows added to it while that command was running?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697863116", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697863116, "node_id": "MDEyOklzc3VlQ29tbWVudDY5Nzg2MzExNg==", "user": {"value": 9599, "label": 
"simonw"}, "created_at": "2020-09-23T18:41:06Z", "updated_at": "2020-09-23T18:41:06Z", "author_association": "OWNER", "body": "The problem with this approach is that it's not compatible with progress bars - but if it's several times faster, it's worth it.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697859772", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697859772, "node_id": "MDEyOklzc3VlQ29tbWVudDY5Nzg1OTc3Mg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T18:38:43Z", "updated_at": "2020-09-23T18:38:52Z", "author_association": "OWNER", "body": "I wonder if I could make this faster by separating it out into a few steps:\r\n- Create the new lookup table with all of the distinct rows\r\n- Add the blank foreign key column\r\n- Run an `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`\r\n- Drop the value columns", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697835956", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697835956, "node_id": "MDEyOklzc3VlQ29tbWVudDY5NzgzNTk1Ng==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T18:22:49Z", "updated_at": "2020-09-23T18:22:49Z", "author_association": "OWNER", "body": "I ran `sudo py-spy top -p 123` against the process while it was running and the most time is definitely spent in 
`.update()`:\r\n```\r\nTotal Samples 1000\r\nGIL: 0.00%, Active: 90.00%, Threads: 1\r\n\r\n %Own %Total OwnTime TotalTime Function (filename:line) \r\n 38.00% 38.00% 3.85s 3.85s update (sqlite_utils/db.py:1283)\r\n 27.00% 27.00% 2.12s 2.12s execute (sqlite_utils/db.py:161)\r\n 10.00% 10.00% 0.890s 0.890s execute (sqlite_utils/db.py:163)\r\n 10.00% 17.00% 0.870s 1.54s columns (sqlite_utils/db.py:553)\r\n 0.00% 0.00% 0.110s 0.210s (sqlite_utils/db.py:554)\r\n 0.00% 3.00% 0.100s 0.320s table_names (sqlite_utils/db.py:191)\r\n 0.00% 0.00% 0.100s 0.100s __new__ (:1)\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697473247", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697473247, "node_id": "MDEyOklzc3VlQ29tbWVudDY5NzQ3MzI0Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T14:45:13Z", "updated_at": "2020-09-23T14:45:13Z", "author_association": "OWNER", "body": "`lookup_table.lookup(lookups)` is doing a SQL lookup. This could be cached in-memory, maybe with an LRU cache, to avoid looking up the primary key for records that we have recently used.\r\n\r\nThe `.update()` method it calls first does a `get()` and then does a SQL `UPDATE ... WHERE`:\r\n\r\nhttps://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L1244-L1264\r\n\r\nBatching those updates may have an effect. 
Or finding a way to skip the `.get()` since we already know we have a valid record.\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697467833", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697467833, "node_id": "MDEyOklzc3VlQ29tbWVudDY5NzQ2NzgzMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T14:42:03Z", "updated_at": "2020-09-23T14:42:03Z", "author_association": "OWNER", "body": "Here's the loop that's taking the time: https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L892-L897", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract operations"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697466497", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/172", "id": 697466497, "node_id": "MDEyOklzc3VlQ29tbWVudDY5NzQ2NjQ5Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-09-23T14:41:17Z", "updated_at": "2020-09-23T14:41:17Z", "author_association": "OWNER", "body": "Steps to produce that database:\r\n```\r\ncurl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'\r\nsqlite-utils insert salaries.db salaries salaries.csv --csv\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 707427200, "label": "Improve performance of extract 
operations"}, "performed_via_github_app": null}