{"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-1710380941", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8", "id": 1710380941, "node_id": "IC_kwDODFE5qs5l8leN", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2023-09-07T15:39:59Z", "updated_at": "2023-09-07T15:39:59Z", "author_association": "NONE", "body": "> @maxhawkins curious why you didn't use the stdlib `mailbox` to parse the `mbox` files?\r\n\r\nMailbox parses the entire mbox into memory. Using the lower level library lets us stream the emails in one at a time to support larger archives. Both libraries are in the stdlib.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 954546309, "label": "Add Gmail takeout mbox import (v2)"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-1003437288", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8", "id": 1003437288, "node_id": "IC_kwDODFE5qs47zzzo", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-12-31T19:06:20Z", "updated_at": "2021-12-31T19:06:20Z", "author_association": "NONE", "body": "> @maxhawkins how hard would it be to add an entry to the table that includes the HTML version of the email, if it exists? I just attempted your the PR branch on a very small mbox file, and it worked great. My use case is a research project and I need to access more than just the body plain text.\r\n\r\nShouldn't be hard. The easiest way is probably to remove the `if body.content_type == \"text/html\"` clause from [utils.py:254](https://github.com/dogsheep/google-takeout-to-sqlite/pull/8/commits/8e6d487b697ce2e8ad885acf613a157bfba84c59#diff-25ad9dd1ced1b8bfc37fda8444819c803232c08891e4af3d4064aa205d8174eaR254) and just return content directly without parsing.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 954546309, "label": "Add Gmail takeout mbox import (v2)"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-896378525", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8", "id": 896378525, "node_id": "IC_kwDODFE5qs41baad", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-08-10T23:28:45Z", "updated_at": "2021-08-10T23:28:45Z", "author_association": "NONE", "body": "I added parsing of text/html emails using BeautifulSoup.\r\n\r\nAround half of the emails in my archive don't include a text/plain payload so adding html parsing makes a good chunk of them searchable.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 954546309, "label": "Add Gmail takeout mbox import (v2)"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/8#issuecomment-894581223", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/8", "id": 894581223, "node_id": "IC_kwDODFE5qs41Ujnn", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-08-07T00:57:48Z", "updated_at": "2021-08-07T00:57:48Z", "author_association": "NONE", "body": "Just added two more fixes:\r\n\r\n* Added parsing for rfc 2047 encoded unicode headers\r\n* Body is now stored as TEXT rather than a BLOB regardless of what order the messages are parsed in.\r\n\r\nI was able to run this on my Takeout export and everything seems to work fine. @simonw let me know if this looks good to merge.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 954546309, "label": "Add Gmail takeout mbox import (v2)"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-888075098", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 888075098, "node_id": "IC_kwDODFE5qs407vNa", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-07-28T07:18:56Z", "updated_at": "2021-07-28T07:18:56Z", "author_association": "NONE", "body": "> I'm not sure why but my most recent import, when displayed in Datasette, looks like this:\r\n> \r\n> \"mbox__mbox_emails__753_446_rows\"\r\n\r\nI did some investigation into this issue and made a fix [here](https://github.com/dogsheep/google-takeout-to-sqlite/pull/8/commits/8ee555c2889a38ff42b95664ee074b4a01a82f06). The problem was that some messages (like gchat logs) don't have a `Message-Id` and we need to use `X-GM-THRID` as the pkey instead.\r\n\r\n@simonw While looking into this I found something unexpected about how sqlite_utils handles upserts if the pkey column is `None`. When the pkey is NULL I'd expect the function to either use rowid or throw an exception. Instead, it seems upsert_all creates a row where all columns are NULL instead of using the values provided as parameters.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885094284", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 885094284, "node_id": "IC_kwDODFE5qs40wXeM", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-07-22T17:41:32Z", "updated_at": "2021-07-22T17:41:32Z", "author_association": "NONE", "body": "I added a follow-up commit that deals with emails that don't have a `Date` header: https://github.com/maxhawkins/google-takeout-to-sqlite/commit/4bc70103582c10802c85a523ef1e99a8a2154aa9", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-885022230", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 885022230, "node_id": "IC_kwDODFE5qs40wF4W", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-07-22T15:51:46Z", "updated_at": "2021-07-22T15:51:46Z", "author_association": "NONE", "body": "One thing I noticed is this importer doesn't save attachments along with the body of the emails. It would be nice if those got stored as blobs in a separate attachments table so attachments can be included while fetching search results.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-884672647", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 884672647, "node_id": "IC_kwDODFE5qs40uwiH", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-07-22T05:56:31Z", "updated_at": "2021-07-22T14:03:08Z", "author_association": "NONE", "body": "How does this commit look? https://github.com/maxhawkins/google-takeout-to-sqlite/commit/72802a83fee282eb5d02d388567731ba4301050d\r\n\r\nIt seems that Takeout's mbox format is pretty simple, so we can get away with just splitting the file on lines begining with `From `. My commit just splits the file every time a line starts with `From ` and uses `email.message_from_bytes` to parse each chunk.\r\n\r\nI was able to load a 12GB takeout mbox without the program using more than a couple hundred MB of memory during the import process. It does make us lose the progress bar, but maybe I can add that back in a later commit.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-849708617", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 849708617, "node_id": "MDEyOklzc3VlQ29tbWVudDg0OTcwODYxNw==", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-05-27T15:01:42Z", "updated_at": "2021-05-27T15:01:42Z", "author_association": "NONE", "body": "Any updates?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791089881", "issue_url": "https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5", "id": 791089881, "node_id": "MDEyOklzc3VlQ29tbWVudDc5MTA4OTg4MQ==", "user": {"value": 28565, "label": "maxhawkins"}, "created_at": "2021-03-05T02:03:19Z", "updated_at": "2021-03-05T02:03:19Z", "author_association": "NONE", "body": "I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.\r\n\r\nIs it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 813880401, "label": "WIP: Add Gmail takeout mbox import"}, "performed_via_github_app": null}