20 rows where issue = 813880401 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
849708617 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-849708617 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDg0OTcwODYxNw== maxhawkins 28565 2021-05-27T15:01:42Z 2021-05-27T15:01:42Z NONE

Any updates?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
791530093 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791530093 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MTUzMDA5Mw== UtahDave 306240 2021-03-05T16:28:07Z 2021-03-05T16:28:07Z NONE

> I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.
>
> Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

@maxhawkins a limitation of the python mbox module is it loads the entire mbox into memory. I did find another approach to this problem that didn't use the builtin python mbox module and created a generator so that it didn't have to load the whole mbox into memory. I was hoping to use standard library modules, but this might be a good reason to investigate that approach a bit more. My worry is making sure a custom processor handles all the ins and outs of the mbox format correctly.

Hm. As I'm writing this, I thought of something. I think I can parse each message one at a time, and then use an mbox function to load each message using the python mbox module. That way the mbox module can still deal with the specifics of the mbox format, but I can use a generator.
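The idea described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function name `iter_mbox_messages` is hypothetical, and it assumes that every line beginning with `From ` starts a new message (body lines that begin with `From ` are expected to be escaped as `>From ` by the mbox writer). Only one message is held in memory at a time, and each one is still handed to the standard library's `mailbox.mboxMessage`.

```python
import email
import mailbox
from typing import BinaryIO, Iterator


def iter_mbox_messages(fp: BinaryIO) -> Iterator[mailbox.mboxMessage]:
    """Yield messages one at a time from an mbox file opened in binary mode."""
    buffer = []
    for line in fp:
        # Assumption: a line starting with "From " begins the next message.
        if line.startswith(b"From ") and buffer:
            yield mailbox.mboxMessage(email.message_from_bytes(b"".join(buffer)))
            buffer = []
        buffer.append(line)
    if buffer:
        # Flush the last message in the file.
        yield mailbox.mboxMessage(email.message_from_bytes(b"".join(buffer)))
```

A caller would open the file with `open(path, "rb")` and loop over the generator, inserting rows in small batches instead of building one large list.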

I'll give that a try. Thanks for the feedback @maxhawkins and @simonw.

@simonw can we hold off on merging this until I can test this new approach?

{
    "total_count": 3,
    "+1": 3,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
791089881 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791089881 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MTA4OTg4MQ== maxhawkins 28565 2021-03-05T02:03:19Z 2021-03-05T02:03:19Z NONE

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790695126 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790695126 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY5NTEyNg== simonw 9599 2021-03-04T15:20:42Z 2021-03-04T15:20:42Z MEMBER

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

https://user-images.githubusercontent.com/9599/109985836-0ab00080-7cba-11eb-97d5-0631a0835b61.png

Sorting by id in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790693674 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790693674 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY5MzY3NA== simonw 9599 2021-03-04T15:18:36Z 2021-03-04T15:18:36Z MEMBER

I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790669767 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790669767 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY2OTc2Nw== simonw 9599 2021-03-04T14:46:06Z 2021-03-04T14:46:06Z MEMBER

Solution could be to pre-process that string by splitting on ( and dropping everything afterwards, assuming that the (...) bit isn't necessary for correctly parsing the date.
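A minimal sketch of that pre-processing step, assuming the trailing parenthesised comment really is not needed to parse the date. The helper name `parse_date_without_comment` is hypothetical. The `str()` call is an added assumption: it is there because the traceback shows a `Header` object reaching `parsedate_tz`, which expects a string.

```python
import email.utils


def parse_date_without_comment(raw_date):
    """Drop any trailing "(...)" comment, then parse the date header."""
    # Coerce to str first: the value may be an email.header.Header object.
    text = str(raw_date)
    # Keep only the part before the first "(".
    text = text.split("(", 1)[0].strip()
    # Returns a 10-tuple, or None if the string cannot be parsed.
    return email.utils.parsedate_tz(text)
```

Callers still need to handle a `None` result for dates that remain unparseable after the comment is removed.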

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790668263 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790668263 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY2ODI2Mw== simonw 9599 2021-03-04T14:43:58Z 2021-03-04T14:43:58Z MEMBER

I added this code to output a message ID on errors:

             print("Errors: {}".format(num_errors))
             print(traceback.format_exc())
+            print("Message-Id: {}".format(email.get("Message-Id", "None")))
             continue

Having found a message ID that had an error, I ran this command to see the context:

rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox

This was for the following error:

  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 102, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 178, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

Here's what I spotted in the ripgrep output:

177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI>
177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)
177133572-X-Mailer: IncrediMail (5002253)

So it could be that _parsedate_tz is having trouble with that Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) string.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790391711 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790391711 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM5MTcxMQ== UtahDave 306240 2021-03-04T07:36:24Z 2021-03-04T07:36:24Z NONE

> Looks like you're doing this:
>
>     elif message.get_content_type() == "text/plain":
>         body = message.get_payload(decode=True)
>
> So presumably that decodes to a unicode string?
>
> I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

Ah, that's good to know. I think explicitly creating the tables will be a great improvement. I'll add that.
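As an illustration of what explicitly creating the table could mean, here is a small sketch using the standard library's `sqlite3` module rather than sqlite-utils. The table name `mbox_emails` and its columns are assumptions made for the example, not the schema used in the PR. Declaring `body` as `TEXT` up front means its type no longer depends on what the first batch of rows happens to contain.

```python
import sqlite3


def create_emails_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) emails table with explicit column types."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS mbox_emails (
            message_id TEXT PRIMARY KEY,
            subject TEXT,
            date TEXT,
            body TEXT
        )
        """
    )
    conn.commit()
```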

Also, I noticed after I opened this PR that the message.get_payload() is being deprecated in favor of message.get_content() or something like that. I'll see if that handles the decoding better, too.

Thanks for the feedback. I should have time tomorrow to put together some improvements.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790389335 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790389335 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM4OTMzNQ== UtahDave 306240 2021-03-04T07:32:04Z 2021-03-04T07:32:04Z NONE

> The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:
>
> https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67
>
> I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.

The wait is from Python loading the mbox file. This happens regardless of whether you're getting the length of the mbox. The mbox module is on the slow side. It is possible to do one's own parsing of the mbox, but I kind of wanted to avoid doing that.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790380839 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790380839 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM4MDgzOQ== simonw 9599 2021-03-04T07:17:05Z 2021-03-04T07:17:05Z MEMBER

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790379629 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790379629 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3OTYyOQ== simonw 9599 2021-03-04T07:14:41Z 2021-03-04T07:14:41Z MEMBER

Confirmed: removing the len() call does not speed things up, so it's reading through the entire file for some other purpose too.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790378658 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790378658 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3ODY1OA== simonw 9599 2021-03-04T07:12:48Z 2021-03-04T07:12:48Z MEMBER

It looks like the body is being loaded into a BLOB column - so in Datasette by default it looks like this:

https://user-images.githubusercontent.com/9599/109924808-b4b96980-7c75-11eb-8c9e-307f2ae32d5a.png

If I datasette install datasette-render-binary and then try again I get this:

https://user-images.githubusercontent.com/9599/109924944-ea5e5280-7c75-11eb-9a32-404f3d68455f.png

It would be great if we could store the body as unicode text instead. May have to do something clever to decode it based on some kind of charset header?
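One possible way to do that decoding, sketched under assumptions: the helper name `decode_text_payload` is hypothetical, and falling back to UTF-8 with replacement characters is a choice made for this example rather than anything settled in the thread. It reads the charset declared in the part's Content-Type header and uses it to turn the decoded payload bytes into a string.

```python
from email.message import Message


def decode_text_payload(part: Message) -> str:
    """Return the text of a message part as str, using its declared charset."""
    payload = part.get_payload(decode=True)  # bytes, or None for multipart
    if payload is None:
        return ""
    # Assumption: fall back to UTF-8 when no charset is declared.
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # The declared charset name is not known to Python.
        return payload.decode("utf-8", errors="replace")
```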

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790373024 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790373024 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3MzAyNA== simonw 9599 2021-03-04T07:01:58Z 2021-03-04T07:04:06Z MEMBER

I got 9 warnings that look like this:

Errors: 1
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 103, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 167, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the mbox and see what was going on.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790372621 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790372621 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3MjYyMQ== simonw 9599 2021-03-04T07:01:18Z 2021-03-04T07:01:18Z MEMBER

I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in healthkit-to-sqlite - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file.

https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the progress_callback() bit) is where that happens.

It can be a bit of a convoluted pattern, and I'm not at all sure it would work for mbox files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.
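For reference, the byte-counting pattern described above can be sketched with only the standard library. This is an illustration of the general technique, not the healthkit-to-sqlite code; `read_with_progress` and its parameters are hypothetical names. The total for the progress bar would come from `os.path.getsize(path)`, and the callback is told how many bytes were consumed on each read.

```python
from typing import BinaryIO, Callable, Iterator


def read_with_progress(
    fp: BinaryIO,
    progress_callback: Callable[[int], None],
    chunk_size: int = 1024 * 1024,
) -> Iterator[bytes]:
    """Yield chunks from fp, reporting the size of each chunk as it is read."""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        # Advance the progress bar by the number of bytes just read.
        progress_callback(len(chunk))
        yield chunk
```

With a progress bar whose length is the file size in bytes, the callback would simply be the bar's update method. Whether this fits the mbox case depends on being able to parse the file as a stream, which is the open question in this thread.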

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790370485 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790370485 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3MDQ4NQ== simonw 9599 2021-03-04T06:57:25Z 2021-03-04T06:57:48Z MEMBER

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790369076 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790369076 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM2OTA3Ng== simonw 9599 2021-03-04T06:54:46Z 2021-03-04T06:54:46Z MEMBER

The Rich-powered progress bar is pretty:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790312268 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790312268 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDMxMjI2OA== simonw 9599 2021-03-04T05:48:16Z 2021-03-04T05:48:16Z MEMBER

Wow, my mbox is a 10.35 GB download!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
786925280 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-786925280 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc4NjkyNTI4MA== simonw 9599 2021-02-26T22:23:10Z 2021-02-26T22:23:10Z MEMBER

Thanks!

I requested my Gmail export from takeout - once that arrives I'll test it against this and then merge the PR.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
784638394 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-784638394 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc4NDYzODM5NA== UtahDave 306240 2021-02-24T00:36:18Z 2021-02-24T00:36:18Z NONE

I noticed that @simonw is using black for formatting. I ran black on my additions in this PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
783794520 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-783794520 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc4Mzc5NDUyMA== UtahDave 306240 2021-02-23T01:13:54Z 2021-02-23T01:13:54Z NONE

Also, @simonw I created a test based off the existing tests. I think it's working correctly.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id]),
   [performed_via_github_app] TEXT
);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);