issue_comments


10 rows where user = 28565 sorted by updated_at descending


issue 2

  • WIP: Add Gmail takeout mbox import 6
  • Add Gmail takeout mbox import (v2) 4

user 1

  • maxhawkins · 10

author_association 1

  • NONE 10
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1710380941 https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-1710380941 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8 IC_kwDODFE5qs5l8leN maxhawkins 28565 2023-09-07T15:39:59Z 2023-09-07T15:39:59Z NONE

@maxhawkins curious why you didn't use the stdlib mailbox to parse the mbox files?

Mailbox parses the entire mbox into memory. Using the lower level library lets us stream the emails in one at a time to support larger archives. Both libraries are in the stdlib.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Add Gmail takeout mbox import (v2) 954546309  
1003437288 https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-1003437288 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8 IC_kwDODFE5qs47zzzo maxhawkins 28565 2021-12-31T19:06:20Z 2021-12-31T19:06:20Z NONE

@maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just tried your PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the plain-text body.

Shouldn't be hard. The easiest way is probably to remove the if body.content_type == "text/html" clause from utils.py:254 and just return content directly without parsing.
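
A minimal sketch of keeping the raw HTML next to the plain text, assuming messages are parsed with the standard library's `email.policy.default`. The function name `extract_bodies` and the keys `body_text` and `body_html` are hypothetical; this is not the code at utils.py:254.

```python
import email
from email import policy


def extract_bodies(raw: bytes) -> dict:
    """Return the plain-text and HTML bodies of a message, if present."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    # get_body() returns None when no part of the requested type exists.
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    return {
        "body_text": plain.get_content() if plain is not None else None,
        "body_html": html.get_content() if html is not None else None,
    }
```

Storing both values in separate columns would leave the existing plain-text search untouched while making the original markup available.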

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Add Gmail takeout mbox import (v2) 954546309  
896378525 https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-896378525 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8 IC_kwDODFE5qs41baad maxhawkins 28565 2021-08-10T23:28:45Z 2021-08-10T23:28:45Z NONE

I added parsing of text/html emails using BeautifulSoup.

Around half of the emails in my archive don't include a text/plain payload so adding html parsing makes a good chunk of them searchable.
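
To illustrate the idea of turning an HTML-only email into searchable text, here is a rough sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra dependencies; the names `_TextExtractor` and `html_to_text` are hypothetical and the PR's actual code may differ.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace so the result is easy to index.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```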

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Add Gmail takeout mbox import (v2) 954546309  
894581223 https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-894581223 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8 IC_kwDODFE5qs41Ujnn maxhawkins 28565 2021-08-07T00:57:48Z 2021-08-07T00:57:48Z NONE

Just added two more fixes:

  • Added parsing for RFC 2047-encoded Unicode headers
  • Body is now stored as TEXT rather than a BLOB regardless of what order the messages are parsed in.
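
For reference, RFC 2047 header decoding can be done with the standard library alone. This is a small sketch, not necessarily how the PR implements it; `decode_rfc2047` is a hypothetical helper name.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded words in a header value into a Unicode string."""
    return str(make_header(decode_header(value)))
```

Headers without encoded words pass through unchanged, so the helper can be applied to every header value.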

I was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Add Gmail takeout mbox import (v2) 954546309  
888075098 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-888075098 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 IC_kwDODFE5qs407vNa maxhawkins 28565 2021-07-28T07:18:56Z 2021-07-28T07:18:56Z NONE

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

I did some investigation into this issue and made a fix here. The problem was that some messages (like gchat logs) don't have a Message-Id and we need to use X-GM-THRID as the pkey instead.
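
The fallback described above might look roughly like this. It is a sketch under the assumption that both headers are read from a standard `email.message.Message`; `message_key` is a hypothetical name, not a function from the PR.

```python
from email.message import Message
from typing import Optional


def message_key(msg: Message) -> Optional[str]:
    """Prefer Message-Id; fall back to X-GM-THRID when it is absent."""
    for header in ("Message-Id", "X-GM-THRID"):
        value = msg.get(header)
        if value:
            return value.strip()
    # Neither header is present: the caller decides how to handle this.
    return None
```

Returning `None` explicitly lets the caller skip or log such messages instead of handing a NULL key to an upsert.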

@simonw While looking into this I found something unexpected about how sqlite_utils handles upserts if the pkey column is None. When the pkey is NULL I'd expect the function to either use rowid or throw an exception. Instead, it seems upsert_all creates a row where all columns are NULL instead of using the values provided as parameters.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
885094284 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885094284 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 IC_kwDODFE5qs40wXeM maxhawkins 28565 2021-07-22T17:41:32Z 2021-07-22T17:41:32Z NONE

I added a follow-up commit that deals with emails that don't have a Date header: https://github.com/maxhawkins/google-takeout-to-sqlite/commit/4bc70103582c10802c85a523ef1e99a8a2154aa9
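
One way to tolerate a missing or malformed Date header is sketched below. This is an assumption about the general approach, not a copy of the linked commit; `message_date` is a hypothetical helper.

```python
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import Optional


def message_date(msg: Message) -> Optional[str]:
    """Return the Date header as an ISO 8601 string, or None if unusable."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        return parsedate_to_datetime(raw).isoformat()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
```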

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
885022230 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885022230 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 IC_kwDODFE5qs40wF4W maxhawkins 28565 2021-07-22T15:51:46Z 2021-07-22T15:51:46Z NONE

One thing I noticed is this importer doesn't save attachments along with the body of the emails. It would be nice if those got stored as blobs in a separate attachments table so attachments can be included while fetching search results.
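
A rough sketch of what such an attachments table could look like, using only `sqlite3` and the standard `email` package. The table name, column names, and `save_attachments` are all hypothetical, and the message is assumed to be an `EmailMessage` (parsed with `email.policy.default`) so that `iter_attachments()` is available.

```python
import sqlite3
from email.message import EmailMessage


def save_attachments(conn: sqlite3.Connection, message_id: str, msg: EmailMessage) -> int:
    """Store each attachment of msg as a BLOB row; return how many were saved."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS attachments (
               message_id TEXT,
               filename TEXT,
               content_type TEXT,
               content BLOB
           )"""
    )
    count = 0
    for part in msg.iter_attachments():
        conn.execute(
            "INSERT INTO attachments VALUES (?, ?, ?, ?)",
            (
                message_id,
                part.get_filename(),
                part.get_content_type(),
                part.get_payload(decode=True),  # decoded bytes
            ),
        )
        count += 1
    conn.commit()
    return count
```

Keeping attachments in their own table means the main email table stays small, and search results can join to the attachments by `message_id` when needed.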

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
884672647 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-884672647 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 IC_kwDODFE5qs40uwiH maxhawkins 28565 2021-07-22T05:56:31Z 2021-07-22T14:03:08Z NONE

How does this commit look? https://github.com/maxhawkins/google-takeout-to-sqlite/commit/72802a83fee282eb5d02d388567731ba4301050d

It seems that Takeout's mbox format is pretty simple, so we can get away with just splitting the file on lines beginning with From. My commit splits the file every time a line starts with From and uses email.message_from_bytes to parse each chunk.

I was able to load a 12GB takeout mbox without the program using more than a couple hundred MB of memory during the import process. It does make us lose the progress bar, but maybe I can add that back in a later commit.
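
The splitting approach described above can be sketched as follows. This is an illustration rather than the code in the linked commit; `iter_mbox_messages` is a hypothetical name, and the sketch does not unescape `>From ` lines inside message bodies.

```python
import email
from email.message import Message
from typing import BinaryIO, Iterator


def iter_mbox_messages(fileobj: BinaryIO) -> Iterator[Message]:
    """Yield one parsed message at a time from an mbox opened in binary mode."""
    chunk = []
    for line in fileobj:
        if line.startswith(b"From "):
            # A separator line ends the previous message, if there was one.
            if chunk:
                yield email.message_from_bytes(b"".join(chunk))
                chunk = []
            continue  # the separator itself is not part of the message
        chunk.append(line)
    if chunk:
        yield email.message_from_bytes(b"".join(chunk))
```

Because only the lines of the current message are held in memory, peak usage depends on the largest single email rather than on the size of the archive.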

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
849708617 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-849708617 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDg0OTcwODYxNw== maxhawkins 28565 2021-05-27T15:01:42Z 2021-05-27T15:01:42Z NONE

Any updates?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
791089881 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791089881 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MTA4OTg4MQ== maxhawkins 28565 2021-03-05T02:03:19Z 2021-03-05T02:03:19Z NONE

I just tried to run this on a small VPS instance with 2GB of memory and it ran out of memory and crashed while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to SQLite instead of loading them all into memory and upserting at once?
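
A streaming import could, for example, write rows in fixed-size batches as they are produced. The sketch below uses plain `sqlite3` rather than the project's own code; the table name `mbox_emails`, its columns, and both function names are hypothetical.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (id, body)


def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `size` rows without materialising the input."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def stream_into_sqlite(conn: sqlite3.Connection, rows: Iterable[Row], batch_size: int = 1000) -> int:
    """Insert rows batch by batch; return the number of rows written."""
    conn.execute("CREATE TABLE IF NOT EXISTS mbox_emails (id TEXT PRIMARY KEY, body TEXT)")
    total = 0
    for batch in batched(rows, batch_size):
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT OR REPLACE INTO mbox_emails (id, body) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total
```

With a generator as the input, only one batch of rows is held in memory at any time.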

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id]),
   [performed_via_github_app] TEXT
);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);
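
For readers working with a local copy of this table, the filter shown on this page (rows for one user, sorted by updated_at descending) can be reproduced with a short query. This is a sketch using Python's `sqlite3`; `comments_by_user` is a hypothetical helper, and the database file name in the usage note is an assumption.

```python
import sqlite3


def comments_by_user(conn: sqlite3.Connection, user_id: int) -> list:
    """Return (id, issue_url, updated_at, body) rows for one user, newest first."""
    cur = conn.execute(
        "SELECT id, issue_url, updated_at, body "
        "FROM issue_comments WHERE [user] = ? "
        "ORDER BY updated_at DESC",
        (user_id,),
    )
    return cur.fetchall()
```

Usage would be along the lines of `comments_by_user(sqlite3.connect("github.db"), 28565)`. The ISO 8601 timestamps in `updated_at` sort correctly as text, which is why a plain `ORDER BY` is enough here.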
Powered by Datasette · Queries took 19.567ms · About: github-to-sqlite