html_url,issue_url,id,node_id,user,user_label,created_at,updated_at,author_association,body,reactions,issue,issue_label,performed_via_github_app https://github.com/simonw/sqlite-utils/issues/172#issuecomment-698178101,https://api.github.com/repos/simonw/sqlite-utils/issues/172,698178101,MDEyOklzc3VlQ29tbWVudDY5ODE3ODEwMQ==,9599,simonw,2020-09-24T07:48:57Z,2020-09-24T07:49:20Z,OWNER,"> I wonder if I could make this faster by separating it out into a few steps: > > * Create the new lookup table with all of the distinct rows > > * Add the blank foreign key column > > * run a `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)` > > * Drop the value columns My prototype of this knocked the time down from 10 minutes to 4 seconds, so I think the change is worth it! ``` % date sqlite-utils extract salaries.db salaries \ 'Department Code' 'Department' \ --table 'departments' \ --fk-column 'department_id' \ --rename 'Department Code' code \ --rename 'Department' name date sqlite-utils extract salaries.db salaries \ 'Union Code' 'Union' \ --table 'unions' \ --fk-column 'union_id' \ --rename 'Union Code' code \ --rename 'Union' name date sqlite-utils extract salaries.db salaries \ 'Job Family Code' 'Job Family' \ --table 'job_families' \ --fk-column 'job_family_id' \ --rename 'Job Family Code' code \ --rename 'Job Family' name date sqlite-utils extract salaries.db salaries \ 'Job Code' 'Job' \ --table 'jobs' \ --fk-column 'job_id' \ --rename 'Job Code' code \ --rename 'Job' name date Thu Sep 24 00:48:16 PDT 2020 Thu Sep 24 00:48:20 PDT 2020 Thu Sep 24 00:48:24 PDT 2020 Thu Sep 24 00:48:28 PDT 2020 Thu Sep 24 00:48:32 PDT 2020 ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, 
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697869886,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697869886,MDEyOklzc3VlQ29tbWVudDY5Nzg2OTg4Ng==,9599,simonw,2020-09-23T18:45:30Z,2020-09-23T18:45:30Z,OWNER,"There's something to be said for making this operation pausable and resumable, especially if I'm going to make it available in a Datasette plugin at some point.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697866885,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697866885,MDEyOklzc3VlQ29tbWVudDY5Nzg2Njg4NQ==,9599,simonw,2020-09-23T18:43:37Z,2020-09-23T18:43:37Z,OWNER,Also what would happen if the table had new rows added to it while that command was running?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697863116,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697863116,MDEyOklzc3VlQ29tbWVudDY5Nzg2MzExNg==,9599,simonw,2020-09-23T18:41:06Z,2020-09-23T18:41:06Z,OWNER,Problem with this approach is it's not compatible with progress bars - but if it's a multiple of times faster it's worth it.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697859772,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697859772,MDEyOklzc3VlQ29tbWVudDY5Nzg1OTc3Mg==,9599,simonw,2020-09-23T18:38:43Z,2020-09-23T18:38:52Z,OWNER,"I wonder if I could make this faster by separating it out into a few steps: - Create the new lookup 
table with all of the distinct rows - Add the blank foreign key column - run a `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)` - Drop the value columns","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697835956,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697835956,MDEyOklzc3VlQ29tbWVudDY5NzgzNTk1Ng==,9599,simonw,2020-09-23T18:22:49Z,2020-09-23T18:22:49Z,OWNER,"I ran `sudo py-spy top -p 123` against the process while it was running and the most time is definitely spent in `.update()`: ``` Total Samples 1000 GIL: 0.00%, Active: 90.00%, Threads: 1 %Own %Total OwnTime TotalTime Function (filename:line) 38.00% 38.00% 3.85s 3.85s update (sqlite_utils/db.py:1283) 27.00% 27.00% 2.12s 2.12s execute (sqlite_utils/db.py:161) 10.00% 10.00% 0.890s 0.890s execute (sqlite_utils/db.py:163) 10.00% 17.00% 0.870s 1.54s columns (sqlite_utils/db.py:553) 0.00% 0.00% 0.110s 0.210s (sqlite_utils/db.py:554) 0.00% 3.00% 0.100s 0.320s table_names (sqlite_utils/db.py:191) 0.00% 0.00% 0.100s 0.100s __new__ (:1) ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697473247,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697473247,MDEyOklzc3VlQ29tbWVudDY5NzQ3MzI0Nw==,9599,simonw,2020-09-23T14:45:13Z,2020-09-23T14:45:13Z,OWNER,"`lookup_table.lookup(lookups)` is doing a SQL lookup. This could be cached in-memory, maybe with a LRU cache, to avoid looking up the primary key for records that we have recently used. The `.update()` method it is calling first does a `get()` and then does a SQL `UPDATE ... 
WHERE`: https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L1244-L1264 Batching those updates may have an effect. Or finding a way to skip the `.get()` since we already know we have a valid record. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697467833,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697467833,MDEyOklzc3VlQ29tbWVudDY5NzQ2NzgzMw==,9599,simonw,2020-09-23T14:42:03Z,2020-09-23T14:42:03Z,OWNER,Here's the loop that's taking the time: https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L892-L897,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations, https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697466497,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697466497,MDEyOklzc3VlQ29tbWVudDY5NzQ2NjQ5Nw==,9599,simonw,2020-09-23T14:41:17Z,2020-09-23T14:41:17Z,OWNER,"Steps to produce that database: ``` curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD' sqlite-utils insert salaries.db salaries salaries.csv --csv ```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,Improve performance of extract operations,
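The four-step approach proposed in the comments above (create the lookup table from the distinct rows, add a blank foreign key column, run one `UPDATE ... SET blah_id = (select id from lookup ...)`, then drop the value columns) can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions, not the code sqlite-utils actually uses: the helper name `extract_columns`, the sample rows, and the table-rebuild used for the final step are all hypothetical choices made for this example.

```python
import sqlite3


def quote(name):
    """Quote an SQL identifier (column names here contain spaces)."""
    return '"' + name.replace('"', '""') + '"'


def extract_columns(conn, table, columns, lookup_table, fk_column):
    """Hypothetical helper: move `columns` of `table` into `lookup_table`."""
    t, lk, fk = quote(table), quote(lookup_table), quote(fk_column)
    cols = ", ".join(quote(c) for c in columns)
    with conn:  # commit everything together, roll back on error
        # Step 1: create the lookup table with all of the distinct rows.
        conn.execute(f"CREATE TABLE {lk} (id INTEGER PRIMARY KEY, {cols})")
        conn.execute(f"INSERT INTO {lk} ({cols}) SELECT DISTINCT {cols} FROM {t}")
        # Step 2: add the blank foreign key column.
        conn.execute(f"ALTER TABLE {t} ADD COLUMN {fk} INTEGER")
        # Step 3: fill it with a single UPDATE instead of one query per row.
        # IS (rather than =) also matches rows where a value is NULL.
        match = " AND ".join(
            f"{lk}.{quote(c)} IS {t}.{quote(c)}" for c in columns
        )
        conn.execute(f"UPDATE {t} SET {fk} = (SELECT {lk}.id FROM {lk} WHERE {match})")
        # Step 4: drop the value columns. Older SQLite versions have no
        # DROP COLUMN, so this sketch rebuilds the table instead. Note that
        # CREATE TABLE ... AS SELECT does not preserve constraints.
        keep = [
            row[1]
            for row in conn.execute(f"PRAGMA table_info({t})")
            if row[1] not in columns
        ]
        keep_sql = ", ".join(quote(c) for c in keep)
        tmp = quote(table + "_new")
        conn.execute(f"CREATE TABLE {tmp} AS SELECT {keep_sql} FROM {t}")
        conn.execute(f"DROP TABLE {t}")
        conn.execute(f"ALTER TABLE {tmp} RENAME TO {t}")


# Made-up sample data, loosely shaped like the salaries table.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE salaries (name TEXT, "Department Code" TEXT, "Department" TEXT)'
)
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [("Ann", "POL", "Police"), ("Bob", "FIR", "Fire"), ("Cy", "POL", "Police")],
)
extract_columns(
    conn, "salaries", ["Department Code", "Department"], "departments", "department_id"
)
rows = conn.execute("SELECT name, department_id FROM salaries ORDER BY name").fetchall()
departments = conn.execute("SELECT * FROM departments").fetchall()
salary_columns = [r[1] for r in conn.execute("PRAGMA table_info(salaries)")]
```

Because all of the work happens inside a handful of SQL statements, there is no per-row loop in Python to attach a progress bar to, which is the trade-off noted in the thread.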