github


html_url	issue_url	id	node_id	user	created_at	updated_at	author_association	body	reactions	issue	performed_via_github_app
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790695126	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790695126	MDEyOklzc3VlQ29tbWVudDc5MDY5NTEyNg==	9599	2021-03-04T15:20:42Z	2021-03-04T15:20:42Z	MEMBER	I'm not sure why but my most recent import, when displayed in Datasette, looks like this: <img width="574" alt="mbox__mbox_emails__753_446_rows" src="https://user-images.githubusercontent.com/9599/109985836-0ab00080-7cba-11eb-97d5-0631a0835b61.png"> Sorting by `id` in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790693674	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790693674	MDEyOklzc3VlQ29tbWVudDc5MDY5MzY3NA==	9599	2021-03-04T15:18:36Z	2021-03-04T15:18:36Z	MEMBER	I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790669767	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790669767	MDEyOklzc3VlQ29tbWVudDc5MDY2OTc2Nw==	9599	2021-03-04T14:46:06Z	2021-03-04T14:46:06Z	MEMBER	Solution could be to pre-process that string by splitting on `(` and dropping everything afterwards, assuming that the `(...)` bit isn't necessary for correctly parsing the date.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790668263	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790668263	MDEyOklzc3VlQ29tbWVudDc5MDY2ODI2Mw==	9599	2021-03-04T14:43:58Z	2021-03-04T14:43:58Z	MEMBER	I added this code to output a message ID on errors: ```diff print("Errors: {}".format(num_errors)) print(traceback.format_exc()) + print("Message-Id: {}".format(email.get("Message-Id", "None"))) continue ``` Having found a message ID that had an error, I ran this command to see the context: rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox This was for the following error: ``` File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 102, in get_mbox message["date"] = get_message_date(email.get("Date"), email.get_from()) File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 178, in get_message_date datetime_tuple = email.utils.parsedate_tz(mail_date) File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz res = _parsedate_tz(data) File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz data = data.split() AttributeError: 'Header' object has no attribute 'split' ``` Here's what I spotted in the `ripgrep` output: ``` 177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI> 177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) 177133572-X-Mailer: IncrediMail (5002253) ``` So it could be that `_parsedate_tz` is having trouble with that `Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)` string.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790380839	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790380839	MDEyOklzc3VlQ29tbWVudDc5MDM4MDgzOQ==	9599	2021-03-04T07:17:05Z	2021-03-04T07:17:05Z	MEMBER	Looks like you're doing this: ```python elif message.get_content_type() == "text/plain": body = message.get_payload(decode=True) ``` So presumably that decodes to a unicode string? I imagine the reason the column is a `BLOB` for me is that `sqlite-utils` determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790379629	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790379629	MDEyOklzc3VlQ29tbWVudDc5MDM3OTYyOQ==	9599	2021-03-04T07:14:41Z	2021-03-04T07:14:41Z	MEMBER	Confirmed: removing the `len()` call does not speed things up, so it's reading through the entire file for some other purpose too.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790378658	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790378658	MDEyOklzc3VlQ29tbWVudDc5MDM3ODY1OA==	9599	2021-03-04T07:12:48Z	2021-03-04T07:12:48Z	MEMBER	It looks like the `body` is being loaded into a BLOB column - so in Datasette by default it looks like this: <img width="1650" alt="mbox__mbox_emails__753_446_rows" src="https://user-images.githubusercontent.com/9599/109924808-b4b96980-7c75-11eb-8c9e-307f2ae32d5a.png"> If I `datasette install datasette-render-binary` and then try again I get this: <img width="1487" alt="mbox__mbox_emails__753_446_rows" src="https://user-images.githubusercontent.com/9599/109924944-ea5e5280-7c75-11eb-9a32-404f3d68455f.png"> It would be great if we could store the `body` as unicode text instead. May have to do something clever to decode it based on some kind of charset header?	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790373024	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790373024	MDEyOklzc3VlQ29tbWVudDc5MDM3MzAyNA==	9599	2021-03-04T07:01:58Z	2021-03-04T07:04:06Z	MEMBER	I got 9 warnings that look like this: ``` Errors: 1 Traceback (most recent call last): File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 103, in get_mbox message["date"] = get_message_date(email.get("Date"), email.get_from()) File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 167, in get_message_date datetime_tuple = email.utils.parsedate_tz(mail_date) File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz res = _parsedate_tz(data) File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz data = data.split() AttributeError: 'Header' object has no attribute 'split' ``` It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the `mbox` and see what was going on.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790372621	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790372621	MDEyOklzc3VlQ29tbWVudDc5MDM3MjYyMQ==	9599	2021-03-04T07:01:18Z	2021-03-04T07:01:18Z	MEMBER	I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in `healthkit-to-sqlite` - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file. https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the `progress_callback()` bit) is where that happens. It can be a bit of a convoluted pattern, and I'm not at all sure it would work for `mbox` files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790370485	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790370485	MDEyOklzc3VlQ29tbWVudDc5MDM3MDQ4NQ==	9599	2021-03-04T06:57:25Z	2021-03-04T06:57:48Z	MEMBER	The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count: https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67 I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790369076	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790369076	MDEyOklzc3VlQ29tbWVudDc5MDM2OTA3Ng==	9599	2021-03-04T06:54:46Z	2021-03-04T06:54:46Z	MEMBER	The Rich-powered progress bar is pretty: ![rich](https://user-images.githubusercontent.com/9599/109923307-71f69200-7c73-11eb-9ee2-8f0a240f3994.gif)	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790312268	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	790312268	MDEyOklzc3VlQ29tbWVudDc5MDMxMjI2OA==	9599	2021-03-04T05:48:16Z	2021-03-04T05:48:16Z	MEMBER	Wow, my mbox is a 10.35 GB download!	{ "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-786925280	https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5	786925280	MDEyOklzc3VlQ29tbWVudDc4NjkyNTI4MA==	9599	2021-02-26T22:23:10Z	2021-02-26T22:23:10Z	MEMBER	Thanks! I requested my Gmail export from takeout - once that arrives I'll test it against this and then merge the PR.	{ "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 }	813880401
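
The date-parsing comments above (issuecomment-790668263 and issuecomment-790669767) suggest pre-processing the `Date` value by splitting on `(` and dropping everything after it. The traceback in those comments shows that `email.utils.parsedate_tz` received a `Header` object rather than a string. Below is a minimal sketch of that idea, not the project's actual code: the function name `parse_mail_date` is hypothetical, and it assumes, as the comment does, that the `(...)` part is not needed to parse the date correctly.

```python
import datetime
import email.utils


def parse_mail_date(raw_date):
    """Hypothetical helper: return a UTC datetime for a Date header, or None.

    Assumption (from the comment thread): the trailing "(...)" comment
    is not needed to parse the date, so it is safe to drop it.
    """
    if raw_date is None:
        return None
    # Message.get() may hand back an email.header.Header instead of str;
    # str() converts it so that parsedate_tz() can call .split() on it.
    text = str(raw_date)
    # Keep only the part before the first "(" and trim whitespace.
    text = text.split("(", 1)[0].strip()
    parsed = email.utils.parsedate_tz(text)
    if parsed is None:
        return None
    # mktime_tz() applies the parsed UTC offset and returns a UTC timestamp.
    timestamp = email.utils.mktime_tz(parsed)
    return datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
```

Called with the string from the `ripgrep` output, or with a `Header` wrapping it, this returns a timezone-aware datetime; returning `None` for unparseable values leaves it to the caller to decide whether to count the message as an error.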