html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,issue,performed_via_github_app https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-888075098,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,888075098,IC_kwDODFE5qs407vNa,28565,2021-07-28T07:18:56Z,2021-07-28T07:18:56Z,NONE,"> I'm not sure why but my most recent import, when displayed in Datasette, looks like this: > > I did some investigation into this issue and made a fix [here](https://github.com/dogsheep/google-takeout-to-sqlite/pull/8/commits/8ee555c2889a38ff42b95664ee074b4a01a82f06). The problem was that some messages (like gchat logs) don't have a `Message-Id` and we need to use `X-GM-THRID` as the pkey instead. @simonw While looking into this I found something unexpected about how sqlite_utils handles upserts if the pkey column is `None`. When the pkey is NULL I'd expect the function to either use rowid or throw an exception. Instead, it seems upsert_all creates a row where all columns are NULL instead of using the values provided as parameters.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885098025,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,885098025,IC_kwDODFE5qs40wYYp,306240,2021-07-22T17:47:50Z,2021-07-22T17:47:50Z,NONE,"Hi @maxhawkins , I'm sorry, I haven't had any time to work on this. I'll have some time tomorrow to test your commits. I think they look great. I'm great with your commits superseding my initial attempt here.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885094284,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,885094284,IC_kwDODFE5qs40wXeM,28565,2021-07-22T17:41:32Z,2021-07-22T17:41:32Z,NONE,I added a follow-up commit that deals with emails that don't have a `Date` header: https://github.com/maxhawkins/google-takeout-to-sqlite/commit/4bc70103582c10802c85a523ef1e99a8a2154aa9,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885022230,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,885022230,IC_kwDODFE5qs40wF4W,28565,2021-07-22T15:51:46Z,2021-07-22T15:51:46Z,NONE,One thing I noticed is this importer doesn't save attachments along with the body of the emails. It would be nice if those got stored as blobs in a separate attachments table so attachments can be included while fetching search results.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-884672647,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,884672647,IC_kwDODFE5qs40uwiH,28565,2021-07-22T05:56:31Z,2021-07-22T14:03:08Z,NONE,"How does this commit look? https://github.com/maxhawkins/google-takeout-to-sqlite/commit/72802a83fee282eb5d02d388567731ba4301050d It seems that Takeout's mbox format is pretty simple, so we can get away with just splitting the file on lines begining with `From `. My commit just splits the file every time a line starts with `From ` and uses `email.message_from_bytes` to parse each chunk. I was able to load a 12GB takeout mbox without the program using more than a couple hundred MB of memory during the import process. It does make us lose the progress bar, but maybe I can add that back in a later commit.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-849708617,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,849708617,MDEyOklzc3VlQ29tbWVudDg0OTcwODYxNw==,28565,2021-05-27T15:01:42Z,2021-05-27T15:01:42Z,NONE,Any updates?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791089881,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,791089881,MDEyOklzc3VlQ29tbWVudDc5MTA4OTg4MQ==,28565,2021-03-05T02:03:19Z,2021-03-05T02:03:19Z,NONE,"I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout. Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790695126,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790695126,MDEyOklzc3VlQ29tbWVudDc5MDY5NTEyNg==,9599,2021-03-04T15:20:42Z,2021-03-04T15:20:42Z,MEMBER,"I'm not sure why but my most recent import, when displayed in Datasette, looks like this: Sorting by `id` in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790693674,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790693674,MDEyOklzc3VlQ29tbWVudDc5MDY5MzY3NA==,9599,2021-03-04T15:18:36Z,2021-03-04T15:18:36Z,MEMBER,"I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790669767,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790669767,MDEyOklzc3VlQ29tbWVudDc5MDY2OTc2Nw==,9599,2021-03-04T14:46:06Z,2021-03-04T14:46:06Z,MEMBER,"Solution could be to pre-process that string by splitting on `(` and dropping everything afterwards, assuming that the `(...)` bit isn't necessary for correctly parsing the date.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790668263,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790668263,MDEyOklzc3VlQ29tbWVudDc5MDY2ODI2Mw==,9599,2021-03-04T14:43:58Z,2021-03-04T14:43:58Z,MEMBER,"I added this code to output a message ID on errors: ```diff print(""Errors: {}"".format(num_errors)) print(traceback.format_exc()) + print(""Message-Id: {}"".format(email.get(""Message-Id"", ""None""))) continue ``` Having found a message ID that had an error, I ran this command to see the context: rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox This was for the following error: ``` File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 102, in get_mbox message[""date""] = get_message_date(email.get(""Date""), email.get_from()) File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 178, in get_message_date datetime_tuple = email.utils.parsedate_tz(mail_date) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 50, in parsedate_tz res = _parsedate_tz(data) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 69, in _parsedate_tz data = data.split() AttributeError: 'Header' object has no attribute 'split' ``` Here's what I spotted in the `ripgrep` output: ``` 177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI> 177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) 177133572-X-Mailer: IncrediMail (5002253) ``` So it could it be that `_parsedate_tz` is having trouble with that `Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)` string.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790391711,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790391711,MDEyOklzc3VlQ29tbWVudDc5MDM5MTcxMQ==,306240,2021-03-04T07:36:24Z,2021-03-04T07:36:24Z,NONE,"> Looks like you're doing this: > > ```python > elif message.get_content_type() == ""text/plain"": > body = message.get_payload(decode=True) > ``` > > So presumably that decodes to a unicode string? > > I imagine the reason the column is a `BLOB` for me is that `sqlite-utils` determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string. Ah, that's good to know. I think explicitly creating the tables will be a great improvement. I'll add that. Also, I noticed after I opened this PR that the `message.get_payload()` is being deprecated in favor of `message.get_content()` or something like that. I'll see if that handles the decoding better, too. Thanks for the feedback. I should have time tomorrow to put together some improvements.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790380839,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790380839,MDEyOklzc3VlQ29tbWVudDc5MDM4MDgzOQ==,9599,2021-03-04T07:17:05Z,2021-03-04T07:17:05Z,MEMBER,"Looks like you're doing this: ```python elif message.get_content_type() == ""text/plain"": body = message.get_payload(decode=True) ``` So presumably that decodes to a unicode string? I imagine the reason the column is a `BLOB` for me is that `sqlite-utils` determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790379629,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790379629,MDEyOklzc3VlQ29tbWVudDc5MDM3OTYyOQ==,9599,2021-03-04T07:14:41Z,2021-03-04T07:14:41Z,MEMBER,"Confirmed: removing the `len()` call does not speed things up, so it's reading through the entire file for some other purpose too.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790378658,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790378658,MDEyOklzc3VlQ29tbWVudDc5MDM3ODY1OA==,9599,2021-03-04T07:12:48Z,2021-03-04T07:12:48Z,MEMBER,"It looks like the `body` is being loaded into a BLOB column - so in Datasette default it looks like this: If I `datasette install datasette-render-binary` and then try again I get this: It would be great if we could store the `body` as unicode text instead. May have to do something clever to decode it based on some kind of charset header?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790373024,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790373024,MDEyOklzc3VlQ29tbWVudDc5MDM3MzAyNA==,9599,2021-03-04T07:01:58Z,2021-03-04T07:04:06Z,MEMBER,"I got 9 warnings that look like this: ``` Errors: 1 Traceback (most recent call last): File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 103, in get_mbox message[""date""] = get_message_date(email.get(""Date""), email.get_from()) File ""/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py"", line 167, in get_message_date datetime_tuple = email.utils.parsedate_tz(mail_date) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 50, in parsedate_tz res = _parsedate_tz(data) File ""/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py"", line 69, in _parsedate_tz data = data.split() AttributeError: 'Header' object has no attribute 'split' ``` It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the `mbox` and see what was going on. ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790372621,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790372621,MDEyOklzc3VlQ29tbWVudDc5MDM3MjYyMQ==,9599,2021-03-04T07:01:18Z,2021-03-04T07:01:18Z,MEMBER,"I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in `healthkit-to-sqlite` - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file. https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the `progress_callback()` bit) is where that happens. It can be a bit of a convoluted pattern, and I'm not at all sure it would work for `mbox` files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790370485,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790370485,MDEyOklzc3VlQ29tbWVudDc5MDM3MDQ4NQ==,9599,2021-03-04T06:57:25Z,2021-03-04T06:57:48Z,MEMBER,"The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count: https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67 I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790369076,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790369076,MDEyOklzc3VlQ29tbWVudDc5MDM2OTA3Ng==,9599,2021-03-04T06:54:46Z,2021-03-04T06:54:46Z,MEMBER,"The Rich-powered progress bar is pretty: ![rich](https://user-images.githubusercontent.com/9599/109923307-71f69200-7c73-11eb-9ee2-8f0a240f3994.gif) ","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790312268,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,790312268,MDEyOklzc3VlQ29tbWVudDc5MDMxMjI2OA==,9599,2021-03-04T05:48:16Z,2021-03-04T05:48:16Z,MEMBER,"Wow, my mbox is a 10.35 GB download!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-784638394,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,784638394,MDEyOklzc3VlQ29tbWVudDc4NDYzODM5NA==,306240,2021-02-24T00:36:18Z,2021-02-24T00:36:18Z,NONE,I noticed that @simonw is using black for formatting. I ran black on my additions in this PR.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401, https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-783794520,https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5,783794520,MDEyOklzc3VlQ29tbWVudDc4Mzc5NDUyMA==,306240,2021-02-23T01:13:54Z,2021-02-23T01:13:54Z,NONE,"Also, @simonw I created a test based off the existing tests. I think it's working correctly","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",813880401,