5,315 rows sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
792386484 https://github.com/simonw/datasette/issues/1250#issuecomment-792386484 https://api.github.com/repos/simonw/datasette/issues/1250 MDEyOklzc3VlQ29tbWVudDc5MjM4NjQ4NA== simonw 9599 2021-03-08T00:29:06Z 2021-03-08T00:29:06Z OWNER

DuckDB has a read-only mechanism: https://duckdb.org/docs/api/python

import duckdb
con = duckdb.connect(database="/tmp/blah.db", read_only=True)
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: Plugin hook for alternative database connections 824067604  
792385274 https://github.com/simonw/datasette/issues/1248#issuecomment-792385274 https://api.github.com/repos/simonw/datasette/issues/1248 MDEyOklzc3VlQ29tbWVudDc5MjM4NTI3NA== simonw 9599 2021-03-08T00:25:10Z 2021-03-08T00:25:10Z OWNER

It's not possible yet, unfortunately. This came up on the forums recently: https://github.com/simonw/datasette/discussions/968

I'm leaning further towards making the database connection layer itself work via a plugin hook, which would open up the possibility of supporting DuckDB and other databases as well. I've not committed to doing this yet though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
duckdb database (very low performance in SQLite) 823035080  
792384854 https://github.com/simonw/datasette/issues/1249#issuecomment-792384854 https://api.github.com/repos/simonw/datasette/issues/1249 MDEyOklzc3VlQ29tbWVudDc5MjM4NDg1NA== simonw 9599 2021-03-08T00:23:38Z 2021-03-08T00:23:38Z OWNER

One reason to prioritize this issue: Homebrew upgraded to SpatiaLite 5.0 recently https://formulae.brew.sh/formula/spatialite-tools and as a result SpatiaLite databases created on my laptop don't appear to be compatible with Datasette when published using datasette publish.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Upgrade SpatiaLite to version 5.0 824064069  
792384382 https://github.com/simonw/datasette/issues/1249#issuecomment-792384382 https://api.github.com/repos/simonw/datasette/issues/1249 MDEyOklzc3VlQ29tbWVudDc5MjM4NDM4Mg== simonw 9599 2021-03-08T00:22:02Z 2021-03-08T00:22:02Z OWNER

I tried this patch against Dockerfile:

diff --git a/Dockerfile b/Dockerfile
index f4b1414..dd659e1 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,25 +1,26 @@
-FROM python:3.7.10-slim-stretch as build
+FROM python:3.9.2-slim-buster as build

 # Setup build dependencies
 RUN apt update \
-&& apt install -y python3-dev build-essential wget libxml2-dev libproj-dev libgeos-dev libsqlite3-dev zlib1g-dev pkg-config git \
- && apt clean
+  && apt install -y python3-dev build-essential wget libxml2-dev libproj-dev \
+  libminizip-dev libgeos-dev libsqlite3-dev zlib1g-dev pkg-config git \
+  && apt clean

-
-RUN wget "https://www.sqlite.org/2020/sqlite-autoconf-3310100.tar.gz" && tar xzf sqlite-autoconf-3310100.tar.gz \
-    && cd sqlite-autoconf-3310100 && ./configure --disable-static --enable-fts5 --enable-json1 CFLAGS="-g -O2 -DSQLITE_ENABLE_FTS3=1 -DSQLITE_ENABLE_FTS3_PARENTHESIS -DSQLITE_ENABLE_FTS4=1 -DSQLITE_ENABLE_RTREE=1 -DSQLITE_ENABLE_JSON1" \
+RUN wget "https://www.sqlite.org/2021/sqlite-autoconf-3340100.tar.gz" && tar xzf sqlite-autoconf-3340100.tar.gz \
+    && cd sqlite-autoconf-3340100 && ./configure --disable-static --enable-fts5 --enable-json1 \
+    CFLAGS="-g -O2 -DSQLITE_ENABLE_FTS3=1 -DSQLITE_ENABLE_FTS3_PARENTHESIS -DSQLITE_ENABLE_FTS4=1 -DSQLITE_ENABLE_RTREE=1 -DSQLITE_ENABLE_JSON1" \
     && make && make install

-RUN wget "http://www.gaia-gis.it/gaia-sins/freexl-sources/freexl-1.0.5.tar.gz" && tar zxf freexl-1.0.5.tar.gz \
-    && cd freexl-1.0.5 && ./configure && make && make install
+RUN wget "http://www.gaia-gis.it/gaia-sins/freexl-1.0.6.tar.gz" && tar zxf freexl-1.0.6.tar.gz \
+    && cd freexl-1.0.6 && ./configure && make && make install

-RUN wget "http://www.gaia-gis.it/gaia-sins/libspatialite-sources/libspatialite-4.4.0-RC0.tar.gz" && tar zxf libspatialite-4.4.0-RC0.tar.gz \
-    && cd libspatialite-4.4.0-RC0 && ./configure && make && make install
+RUN wget "http://www.gaia-gis.it/gaia-sins/libspatialite-5.0.1.tar.gz" && tar zxf libspatialite-5.0.1.tar.gz \
+    && cd libspatialite-5.0.1 && ./configure --disable-rttopo && make && make install

 RUN wget "http://www.gaia-gis.it/gaia-sins/readosm-sources/readosm-1.1.0.tar.gz" && tar zxf readosm-1.1.0.tar.gz && cd readosm-1.1.0 && ./configure && make && make install

-RUN wget "http://www.gaia-gis.it/gaia-sins/spatialite-tools-sources/spatialite-tools-4.4.0-RC0.tar.gz" && tar zxf spatialite-tools-4.4.0-RC0.tar.gz \
-    && cd spatialite-tools-4.4.0-RC0 && ./configure && make && make install
+RUN wget "http://www.gaia-gis.it/gaia-sins/spatialite-tools-5.0.0.tar.gz" && tar zxf spatialite-tools-5.0.0.tar.gz \
+    && cd spatialite-tools-5.0.0 && ./configure --disable-rttopo && make && make install


 # Add local code to the image instead of fetching from pypi.
@@ -27,7 +28,7 @@ COPY . /datasette

 RUN pip install /datasette

-FROM python:3.7.10-slim-stretch
+FROM python:3.9.2-slim-buster

 # Copy python dependencies and spatialite libraries
 COPY --from=build /usr/local/lib/ /usr/local/lib/

I had to use --disable-rttopo from the tip in https://github.com/OSGeo/gdal/pull/3443 and also needed to install libminizip-dev.

This works, sort of... I'm getting a weird issue where the /dbname page is hanging some of the time instead of loading correctly. Other than that it seems to work, but a hanging page is bad!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Upgrade SpatiaLite to version 5.0 824064069  
792383956 https://github.com/simonw/datasette/issues/1249#issuecomment-792383956 https://api.github.com/repos/simonw/datasette/issues/1249 MDEyOklzc3VlQ29tbWVudDc5MjM4Mzk1Ng== simonw 9599 2021-03-08T00:20:09Z 2021-03-08T00:20:09Z OWNER

Worth noting that the Docker image used by datasette publish cloudrun doesn't actually use that Datasette docker image - it does this:

https://github.com/simonw/datasette/blob/d0fd833b8cdd97e1b91d0f97a69b494895d82bee/datasette/utils/__init__.py#L349-L353

Where the apt extras for SpatiaLite are: https://github.com/simonw/datasette/blob/d0fd833b8cdd97e1b91d0f97a69b494895d82bee/datasette/utils/__init__.py#L344-L345

libsqlite3-mod-spatialite against that official python:3.8 image doesn't appear to install SpatiaLite 5.0.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Upgrade SpatiaLite to version 5.0 824064069  
792308036 https://github.com/simonw/datasette/issues/858#issuecomment-792308036 https://api.github.com/repos/simonw/datasette/issues/858 MDEyOklzc3VlQ29tbWVudDc5MjMwODAzNg== robroc 1219001 2021-03-07T16:41:54Z 2021-03-07T16:41:54Z NONE

Apologies if I sound dense but I don't see where you would pass
'shell=True'. I'm using the CLI installed via pip.

On Sun., Mar. 7, 2021, 2:15 a.m. David Smith, notifications@github.com
wrote:

To get it to work I had to:

- add shell=true to the various commands in datasette
- use the name argument of the publish command (https://docs.datasette.io/en/stable/publish.html)
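
On the question of where shell=True goes: it is a keyword argument accepted by Python's subprocess functions, so the change would be made in the Python source that launches the external CLI, not on the command line itself. The following is a minimal sketch under that assumption. The helper name run_cli is hypothetical and is not Datasette's actual code; it only shows the shape of such a change.

```python
import subprocess
import sys


def run_cli(args):
    """Run an external command, going through the shell only on Windows.

    Hypothetical helper for illustration; not part of Datasette.
    """
    use_shell = sys.platform == "win32"
    return subprocess.run(args, shell=use_shell, capture_output=True, text=True)


# Example: run the current Python interpreter as a stand-in for a CLI tool.
result = run_cli([sys.executable, "-c", "print('ok')"])
print(result.stdout.strip())
```

One likely reason this helps on Windows is that some command-line tools are installed as .cmd wrappers, which are resolved by the shell. That is an assumption here, not something confirmed in this thread.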



{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
publish heroku does not work on Windows 10 642388564  
792233255 https://github.com/simonw/datasette/pull/1223#issuecomment-792233255 https://api.github.com/repos/simonw/datasette/issues/1223 MDEyOklzc3VlQ29tbWVudDc5MjIzMzI1NQ== simonw 9599 2021-03-07T07:41:01Z 2021-03-07T07:41:01Z OWNER

This is fantastic, thanks so much for tracking this down.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
Add compile option to Dockerfile to fix failing test (fixes #696) 806918878  
792230560 https://github.com/simonw/datasette/issues/858#issuecomment-792230560 https://api.github.com/repos/simonw/datasette/issues/858 MDEyOklzc3VlQ29tbWVudDc5MjIzMDU2MA== smithdc1 39445562 2021-03-07T07:14:58Z 2021-03-07T07:14:58Z NONE

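To get it to work I had to:

- add shell=true to the various commands in datasette
- use the name argument of the publish command (https://docs.datasette.io/en/stable/publish.html)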

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
publish heroku does not work on Windows 10 642388564  
792129022 https://github.com/simonw/datasette/issues/858#issuecomment-792129022 https://api.github.com/repos/simonw/datasette/issues/858 MDEyOklzc3VlQ29tbWVudDc5MjEyOTAyMg== robroc 1219001 2021-03-07T00:23:34Z 2021-03-07T00:23:34Z NONE

@smithdc1 Can you tell us what you did to get it to publish in Windows? What commands did you pass?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
publish heroku does not work on Windows 10 642388564  
791509910 https://github.com/simonw/datasette/issues/766#issuecomment-791509910 https://api.github.com/repos/simonw/datasette/issues/766 MDEyOklzc3VlQ29tbWVudDc5MTUwOTkxMA== JBPressac 6371750 2021-03-05T15:57:35Z 2021-03-05T16:35:21Z NONE

Hello,
I have the same wildcard search problem with an instance of Datasette. http://crbc-dataset.huma-num.fr/inventaires/fonds_auguste_dupouy_1872_1967?_search=gwerz&_sort=rowid is OK but http://crbc-dataset.huma-num.fr/inventaires/fonds_auguste_dupouy_1872_1967?_search=gwe* is not (FTS is activated on "Reference" "IntituleAnalyse" "NomDuProducteur" "PresentationDuContenu" "Notes").

Note that the SQL query below, run directly in SQLite from the server's shell, does return results.

select * from fonds_auguste_dupouy_1872_1967_fts where IntituleAnalyse MATCH "gwe*";
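
A small sketch of why the same term can behave differently depending on how it reaches SQLite. It assumes an FTS5 table with the default tokenizer (the table in this report may use a different FTS version), and the table name docs_fts and the sample row are made up for illustration. An unquoted gwe* is a prefix query, while a double-quoted "gwe*" is passed to the tokenizer as a phrase, so the asterisk no longer acts as a wildcard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and sample data, for illustration only.
conn.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(IntituleAnalyse)")
conn.execute("INSERT INTO docs_fts (IntituleAnalyse) VALUES ('gwerz sample row')")


def search(term):
    """Return matching titles for an FTS5 query string."""
    sql = "SELECT IntituleAnalyse FROM docs_fts WHERE docs_fts MATCH ?"
    return [row[0] for row in conn.execute(sql, (term,))]


raw_hits = search("gwe*")       # prefix query
quoted_hits = search('"gwe*"')  # quoted phrase, asterisk is not a wildcard
print(len(raw_hits), len(quoted_hits))
```

If Datasette quotes the _search term before passing it to MATCH, as I understand its default behaviour to be, that would explain the difference; the documented ?_searchmode=raw option may be worth trying.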

Thanks,

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Enable wildcard-searches by default 617323873  
791530093 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791530093 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MTUzMDA5Mw== UtahDave 306240 2021-03-05T16:28:07Z 2021-03-05T16:28:07Z NONE

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

@maxhawkins a limitation of the python mbox module is it loads the entire mbox into memory. I did find another approach to this problem that didn't use the builtin python mbox module and created a generator so that it didn't have to load the whole mbox into memory. I was hoping to use standard library modules, but this might be a good reason to investigate that approach a bit more. My worry is making sure a custom processor handles all the ins and outs of the mbox format correctly.

Hm. As I'm writing this, I thought of something. I think I can parse each message one at a time, and then use an mbox function to load each message using the python mbox module. That way the mbox module can still deal with the specifics of the mbox format, but I can use a generator.
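
A rough sketch of that idea, assuming messages are separated by lines beginning with "From " (the same convention the standard library's mbox reader relies on). The function names are hypothetical and this is not the code in the PR. Each message is buffered on its own and handed to mailbox.mboxMessage, so only one message is held in memory at a time.

```python
import mailbox


def _build_message(from_line, lines):
    """Parse one buffered message and keep its envelope "From " line."""
    msg = mailbox.mboxMessage(b"".join(lines))
    msg.set_from(from_line[5:].decode("ascii", errors="replace").strip())
    return msg


def iter_mbox_messages(path):
    """Yield mboxMessage objects one at a time instead of loading the whole file."""
    from_line, lines = None, []
    with open(path, "rb") as fp:
        for line in fp:
            if line.startswith(b"From "):
                # A new envelope line closes the previous message, if any.
                if from_line is not None:
                    yield _build_message(from_line, lines)
                from_line, lines = line, []
            elif from_line is not None:
                lines.append(line)
    if from_line is not None:
        yield _build_message(from_line, lines)
```

This relies on body lines that start with "From " having been escaped by whatever wrote the file, which is worth verifying against a real Takeout export.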

I'll give that a try. Thanks for the feedback @maxhawkins and @simonw.

@simonw can we hold off on merging this until I can test this new approach?

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
791089881 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-791089881 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MTA4OTg4MQ== maxhawkins 28565 2021-03-05T02:03:19Z 2021-03-05T02:03:19Z NONE

I just tried to run this on a small VPS instance with 2GB of memory and it crashed out of memory while processing a 12GB mbox from Takeout.

Is it possible to stream the emails to sqlite instead of loading it all into memory and upserting at once?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
791053721 https://github.com/dogsheep/dogsheep-photos/issues/32#issuecomment-791053721 https://api.github.com/repos/dogsheep/dogsheep-photos/issues/32 MDEyOklzc3VlQ29tbWVudDc5MTA1MzcyMQ== dsisnero 6213 2021-03-05T00:31:27Z 2021-03-05T00:31:27Z NONE

I am getting the same thing for US West (N. California) us-west-1

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
KeyError: 'Contents' on running upload 803333769  
790934616 https://github.com/dogsheep/google-takeout-to-sqlite/issues/4#issuecomment-790934616 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/4 MDEyOklzc3VlQ29tbWVudDc5MDkzNDYxNg== Btibert3 203343 2021-03-04T20:54:44Z 2021-03-04T20:54:44Z NONE

Sorry for the delay, I got sidetracked after class last night. I am getting the following error:

/content# google-takeout-to-sqlite mbox takeout.db Takeout/Mail/gmail.mbox 
Usage: google-takeout-to-sqlite [OPTIONS] COMMAND [ARGS]...
Try 'google-takeout-to-sqlite --help' for help.

Error: No such command 'mbox'.

On the box, I installed with pip after cloning: https://github.com/UtahDave/google-takeout-to-sqlite.git

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Feature Request: Gmail 778380836  
790857004 https://github.com/simonw/datasette/issues/1238#issuecomment-790857004 https://api.github.com/repos/simonw/datasette/issues/1238 MDEyOklzc3VlQ29tbWVudDc5MDg1NzAwNA== tsibley 79913 2021-03-04T19:06:55Z 2021-03-04T19:06:55Z NONE

@rgieseke Ah, that's super helpful. Thank you for the workaround for now!

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Custom pages don't work with base_url setting 813899472  
790695126 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790695126 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY5NTEyNg== simonw 9599 2021-03-04T15:20:42Z 2021-03-04T15:20:42Z MEMBER

I'm not sure why but my most recent import, when displayed in Datasette, looks like this:

https://user-images.githubusercontent.com/9599/109985836-0ab00080-7cba-11eb-97d5-0631a0835b61.png

Sorting by id in the opposite order gives me the data I would expect - so it looks like a bunch of null/blank messages are being imported at some point and showing up first due to ID ordering.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790693674 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790693674 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY5MzY3NA== simonw 9599 2021-03-04T15:18:36Z 2021-03-04T15:18:36Z MEMBER

I imported my 10GB mbox with 750,000 emails in it, ran this tool (with a hacked fix for the blob column problem) - and now a search that returns 92 results takes 25.37ms! This is fantastic.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790669767 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790669767 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY2OTc2Nw== simonw 9599 2021-03-04T14:46:06Z 2021-03-04T14:46:06Z MEMBER

Solution could be to pre-process that string by splitting on ( and dropping everything afterwards, assuming that the (...) bit isn't necessary for correctly parsing the date.
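
A minimal sketch of that pre-processing step. The helper name parse_mail_date is hypothetical. Since the traceback reports a Header object without a split method, the sketch also converts the value with str() first; whether dropping the parenthesised part is needed in addition is an assumption.

```python
import email.utils


def parse_mail_date(raw_date):
    """Parse a Date header value, ignoring a trailing "(...)" comment.

    Accepts a str or an email.header.Header; returns the 10-tuple from
    email.utils.parsedate_tz, or None if the value cannot be parsed.
    """
    if raw_date is None:
        return None
    # str() turns a Header object into plain text before splitting.
    cleaned = str(raw_date).split("(", 1)[0].strip()
    return email.utils.parsedate_tz(cleaned)


parsed = parse_mail_date("Mon, 28 Aug 2006 08:14:08 +0200 (some timezone name)")
print(parsed)
```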

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790668263 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790668263 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDY2ODI2Mw== simonw 9599 2021-03-04T14:43:58Z 2021-03-04T14:43:58Z MEMBER

I added this code to output a message ID on errors:

             print("Errors: {}".format(num_errors))
             print(traceback.format_exc())
+            print("Message-Id: {}".format(email.get("Message-Id", "None")))
             continue

Having found a message ID that had an error, I ran this command to see the context:

rg --text --context 20 '44F289B0.000001.02100@SCHWARZE-DWFXMI' ~/gmail.mbox

This was for the following error:

  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 102, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 178, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

Here's what I spotted in the ripgrep output:

177133570:Message-Id: <44F289B0.000001.02100@SCHWARZE-DWFXMI>
177133571-Date: Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit)
177133572-X-Mailer: IncrediMail (5002253)

So it could be that _parsedate_tz is having trouble with that Mon, 28 Aug 2006 08:14:08 +0200 (Westeurop�ische Sommerzeit) string.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790391711 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790391711 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM5MTcxMQ== UtahDave 306240 2021-03-04T07:36:24Z 2021-03-04T07:36:24Z NONE

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

Ah, that's good to know. I think explicitly creating the tables will be a great improvement. I'll add that.
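
One way explicit table creation could look, sketched with the standard library's sqlite3 module instead of sqlite-utils so that it stays self-contained. The table name mbox_emails and its columns are hypothetical. The body column is declared as TEXT up front, and bytes values are decoded before insertion so that the stored value is text regardless of what the first batch contains.

```python
import sqlite3


def create_mbox_table(conn):
    """Create the table with explicit column types (hypothetical schema)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS mbox_emails (
            message_id TEXT PRIMARY KEY,
            subject TEXT,
            body TEXT
        )
        """
    )


def insert_email(conn, message_id, subject, body):
    """Insert one email, decoding a bytes body so it is stored as text."""
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    conn.execute(
        "INSERT OR REPLACE INTO mbox_emails (message_id, subject, body) VALUES (?, ?, ?)",
        (message_id, subject, body),
    )


conn = sqlite3.connect(":memory:")
create_mbox_table(conn)
insert_email(conn, "<1@example.com>", "hello", b"plain text body")
body_type = conn.execute("SELECT typeof(body) FROM mbox_emails").fetchone()[0]
print(body_type)
```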

Also, I noticed after I opened this PR that the message.get_payload() is being deprecated in favor of message.get_content() or something like that. I'll see if that handles the decoding better, too.

Thanks for the feedback. I should have time tomorrow to put together some improvements.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790389335 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790389335 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM4OTMzNQ== UtahDave 306240 2021-03-04T07:32:04Z 2021-03-04T07:32:04Z NONE

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.

The wait is from python loading the mbox file. This happens regardless of whether you're getting the length of the mbox. The mbox module is on the slow side. It is possible to do one's own parsing of the mbox, but I kind of wanted to avoid doing that.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790384087 https://github.com/dogsheep/google-takeout-to-sqlite/issues/6#issuecomment-790384087 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/6 MDEyOklzc3VlQ29tbWVudDc5MDM4NDA4Nw== simonw 9599 2021-03-04T07:22:51Z 2021-03-04T07:22:51Z MEMBER

#3 also mentions the conflicting version with other tools.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Upgrade to latest sqlite-utils 821841046  
790380839 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790380839 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM4MDgzOQ== simonw 9599 2021-03-04T07:17:05Z 2021-03-04T07:17:05Z MEMBER

Looks like you're doing this:

    elif message.get_content_type() == "text/plain":
        body = message.get_payload(decode=True)

So presumably that decodes to a unicode string?

I imagine the reason the column is a BLOB for me is that sqlite-utils determines the column type based on the first batch of items - https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1927-L1928 - and I got unlucky and had something in my first batch that wasn't a unicode string.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790379629 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790379629 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3OTYyOQ== simonw 9599 2021-03-04T07:14:41Z 2021-03-04T07:14:41Z MEMBER

Confirmed: removing the len() call does not speed things up, so it's reading through the entire file for some other purpose too.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790378658 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790378658 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3ODY1OA== simonw 9599 2021-03-04T07:12:48Z 2021-03-04T07:12:48Z MEMBER

It looks like the body is being loaded into a BLOB column - so in Datasette default it looks like this:

https://user-images.githubusercontent.com/9599/109924808-b4b96980-7c75-11eb-8c9e-307f2ae32d5a.png

If I datasette install datasette-render-binary and then try again I get this:

https://user-images.githubusercontent.com/9599/109924944-ea5e5280-7c75-11eb-9a32-404f3d68455f.png

It would be great if we could store the body as unicode text instead. May have to do something clever to decode it based on some kind of charset header?
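
A sketch of charset-based decoding using only the standard email package. In Python 3, get_payload(decode=True) returns bytes, which would explain the BLOB values. The helper name body_as_text is hypothetical, and falling back to UTF-8 with replacement characters is an assumption about what is acceptable here.

```python
import email


def body_as_text(message):
    """Return a non-multipart message body as str, using its declared charset."""
    payload = message.get_payload(decode=True)  # bytes, or None for multipart
    if payload is None:
        return ""
    charset = message.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown or misspelled charset name in the header.
        return payload.decode("utf-8", errors="replace")


raw = (
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"Westeurop\xe4ische Sommerzeit\n"
)
text = body_as_text(email.message_from_bytes(raw))
print(text)
```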

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790373024 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790373024 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3MzAyNA== simonw 9599 2021-03-04T07:01:58Z 2021-03-04T07:04:06Z MEMBER

I got 9 warnings that look like this:

Errors: 1
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 103, in get_mbox
    message["date"] = get_message_date(email.get("Date"), email.get_from())
  File "/Users/simon/Dropbox/Development/google-takeout-to-sqlite/google_takeout_to_sqlite/utils.py", line 167, in get_message_date
    datetime_tuple = email.utils.parsedate_tz(mail_date)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/email/_parseaddr.py", line 69, in _parsedate_tz
    data = data.split()
AttributeError: 'Header' object has no attribute 'split'

It would be useful if those warnings told me the message ID (or similar) of the affected message so I could grep for it in the mbox and see what was going on.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790372621 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790372621 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3MjYyMQ== simonw 9599 2021-03-04T07:01:18Z 2021-03-04T07:01:18Z MEMBER

I'm not sure if it would work, but there is an alternative pattern for showing a progress bar against a really large file that I've used in healthkit-to-sqlite - you set the progress bar size to the size of the file in bytes, then update a counter as you read the file.

https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/cli.py#L24-L57 and https://github.com/dogsheep/healthkit-to-sqlite/blob/3eb2b06bfe3b4faaf10e9cf9dfcb28e3d16c14ff/healthkit_to_sqlite/utils.py#L4-L19 (the progress_callback() bit) is where that happens.

It can be a bit of a convoluted pattern, and I'm not at all sure it would work for mbox files since it looks like that library has other reasons it needs to do a file scan rather than streaming it through one chunk of bytes at a time. So I imagine this would not work here.
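
For reference, the byte-based pattern can be sketched without any progress bar library. This is an illustration with hypothetical names, not the healthkit-to-sqlite code: the total is the file size in bytes, and a callback receives the number of bytes consumed after each chunk.

```python
import os


def total_bytes(path):
    """Size of the file in bytes, used as the progress bar's total."""
    return os.path.getsize(path)


def read_with_progress(path, progress_callback, chunk_size=1024 * 1024):
    """Yield chunks of the file, reporting how many bytes each chunk contained."""
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            progress_callback(len(chunk))
            yield chunk
```

A progress bar would be created with total_bytes(path) as its length and advanced from progress_callback. As noted above, this only helps if the parser can consume the file as a stream.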

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790370485 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790370485 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM3MDQ4NQ== simonw 9599 2021-03-04T06:57:25Z 2021-03-04T06:57:48Z MEMBER

The command takes quite a while to start running, presumably because this line causes it to have to scan the WHOLE file in order to generate a count:

https://github.com/dogsheep/google-takeout-to-sqlite/blob/a3de045eba0fae4b309da21aa3119102b0efc576/google_takeout_to_sqlite/utils.py#L66-L67

I'm fine with waiting though. It's not like this is a command people run every day - and without that count we can't show a progress bar, which seems pretty important for a process that takes this long.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790369076 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790369076 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDM2OTA3Ng== simonw 9599 2021-03-04T06:54:46Z 2021-03-04T06:54:46Z MEMBER

The Rich-powered progress bar is pretty:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790312268 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-790312268 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc5MDMxMjI2OA== simonw 9599 2021-03-04T05:48:16Z 2021-03-04T05:48:16Z MEMBER

Wow, my mbox is a 10.35 GB download!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
790311215 https://github.com/simonw/datasette/pull/1243#issuecomment-790311215 https://api.github.com/repos/simonw/datasette/issues/1243 MDEyOklzc3VlQ29tbWVudDc5MDMxMTIxNQ== simonw 9599 2021-03-04T05:45:57Z 2021-03-04T05:45:57Z OWNER

Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
fix small typo 815955014  
790257263 https://github.com/simonw/datasette/issues/268#issuecomment-790257263 https://api.github.com/repos/simonw/datasette/issues/268 MDEyOklzc3VlQ29tbWVudDc5MDI1NzI2Mw== mhalle 649467 2021-03-04T03:20:23Z 2021-03-04T03:20:23Z NONE

It's kind of an ugly hack, but you can try out what using the fts5 table as an actual datasette-accessible table looks like without changing any datasette code by creating yet another view on top of the fts5 table:

create view proxyview as select *, rank, table_fts as fts from table_fts;

That's now visible from datasette, just like any other view, but you can use fts match escape_fts(search_string) order by rank.

This is only good as a proof of concept because you're inefficiently going from view -> fts5 external content table -> view -> data table. However, it does show it works.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Mechanism for ranking results from SQLite full-text search 323718842  
790198930 https://github.com/dogsheep/google-takeout-to-sqlite/issues/4#issuecomment-790198930 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/4 MDEyOklzc3VlQ29tbWVudDc5MDE5ODkzMA== Btibert3 203343 2021-03-04T00:58:40Z 2021-03-04T00:58:40Z NONE

I am just seeing this sorry, yes! I will kick the tires later on tonight. My apologies for the delay.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Feature Request: Gmail 778380836  
789680230 https://github.com/simonw/datasette/issues/283#issuecomment-789680230 https://api.github.com/repos/simonw/datasette/issues/283 MDEyOklzc3VlQ29tbWVudDc4OTY4MDIzMA== justinpinkney 605492 2021-03-03T12:28:42Z 2021-03-03T12:28:42Z NONE

One note on using this pragma: I got an error on starting datasette: no such table: pragma_database_list.

I traced this to an older version of sqlite3 (3.14.2); upgrading to a newer version (3.34.2) fixed the issue.
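
For context, table-valued PRAGMA functions such as pragma_database_list were, as far as I know, introduced in SQLite 3.16.0, which would explain 3.14.2 failing. A minimal sketch of a startup check follows; the name supports_pragma_functions and the version constant are illustrative assumptions, not something Datasette provides:

```python
import sqlite3

# Assumption: table-valued PRAGMA functions arrived in SQLite 3.16.0.
PRAGMA_FUNCTIONS_MIN_VERSION = (3, 16, 0)


def supports_pragma_functions(version_info=sqlite3.sqlite_version_info):
    """Return True if this SQLite can run 'select ... from pragma_database_list'."""
    return tuple(version_info) >= PRAGMA_FUNCTIONS_MIN_VERSION


if supports_pragma_functions():
    conn = sqlite3.connect(":memory:")
    # Each row is (seq, name, file) for one attached database.
    names = [row[1] for row in conn.execute("select * from pragma_database_list")]
    print(names)
else:
    print("SQLite", sqlite3.sqlite_version, "is too old for pragma_database_list")
```

Checking the linked SQLite version up front gives a clearer message than the "no such table" error.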

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Support cross-database joins 325958506  
789409126 https://github.com/simonw/datasette/issues/268#issuecomment-789409126 https://api.github.com/repos/simonw/datasette/issues/268 MDEyOklzc3VlQ29tbWVudDc4OTQwOTEyNg== mhalle 649467 2021-03-03T03:57:15Z 2021-03-03T03:58:40Z NONE

In FTS5, I think doing an FTS search is actually much easier than doing a join against the main table like datasette does now. In fact, FTS5 external content tables provide a transparent interface back to the original table or view.

Here's what I'm currently doing:

  • Build a view that joins whatever tables I want and rename the columns to non-joiny names (e.g., chapter.name AS chapter_name in the view where needed).
  • Create an FTS5 table with content="viewname".
  • As described in the "external content tables" section (https://www.sqlite.org/fts5.html#external_content_tables), SQL queries can be made directly to the FTS table, which behind the covers makes select calls to the content table when the content of the original columns is needed.
  • In addition, you get "rank" and "bm25()" available to you when you select on the _fts table.

Unfortunately, datasette doesn't currently seem happy being coerced into doing a real query on an fts5 table. This works:
select col1, col2, col3 from table_fts where col1="value" and table_fts match escape_fts("search term") order by rank

But this doesn't work in the datasette SQL query interface:
select col1, col2, col3 from table_fts where col1="value" and table_fts match escape_fts(:search) order by rank (the "search" input text field doesn't show up)

For what datasette is doing right now, I think you could just use contentless fts5 tables (content=""), since all you care about is the rowid: you're doing a subselect to get the rowid anyway. In fts5, that's just a contentless table.

I guess if you want to follow this suggestion, you'd need a somewhat different code path for fts5.
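
The external content setup described above can be sketched with Python's standard sqlite3 module. This is only an illustration under assumptions: the docs table, its columns and the sample rows are made up, the content table is a plain table rather than a view, the SQLite build must include FTS5, and escape_fts() is left out because it is a Datasette SQL function that plain sqlite3 does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table docs (id integer primary key, title text, body text);
    insert into docs (title, body) values
      ('SQLite notes', 'full-text search with fts5 and ranking'),
      ('Cooking', 'how to bake bread'),
      ('Search tips', 'ranking matters when you search');

    -- External content table: the index lives in docs_fts, the column values stay in docs
    create virtual table docs_fts using fts5(
      title, body, content='docs', content_rowid='id'
    );
    insert into docs_fts (rowid, title, body) select id, title, body from docs;
    """
)

# Query the FTS table directly; title is read back from the content table,
# and rank is available for ordering without a join against docs.
rows = conn.execute(
    "select rowid, title from docs_fts where docs_fts match ? order by rank",
    ("search",),
).fetchall()
print(rows)
```

The query never mentions docs, which is the "transparent interface" point: the FTS table fetches the original column values itself.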

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Mechanism for ranking results from SQLite full-text search 323718842  
789186458 https://github.com/simonw/datasette/issues/1238#issuecomment-789186458 https://api.github.com/repos/simonw/datasette/issues/1238 MDEyOklzc3VlQ29tbWVudDc4OTE4NjQ1OA== rgieseke 198537 2021-03-02T20:19:30Z 2021-03-02T20:19:30Z CONTRIBUTOR

A custom templates/index.html seems to work, and as a workaround custom pages work too if you move them to pages/base_url_dir.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Custom pages don't work with base_url setting 813899472  
787616446 https://github.com/simonw/datasette/issues/1247#issuecomment-787616446 https://api.github.com/repos/simonw/datasette/issues/1247 MDEyOklzc3VlQ29tbWVudDc4NzYxNjQ0Ng== simonw 9599 2021-03-01T03:50:37Z 2021-03-01T03:50:37Z OWNER

I like the .add_memory_database() option. I also like that it makes it more obvious that this is a capability of Datasette, since I'm excited to see more plugins, features and tests that take advantage of it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
datasette.add_memory_database() method 818430405  
787616158 https://github.com/simonw/datasette/issues/1247#issuecomment-787616158 https://api.github.com/repos/simonw/datasette/issues/1247 MDEyOklzc3VlQ29tbWVudDc4NzYxNjE1OA== simonw 9599 2021-03-01T03:49:27Z 2021-03-01T03:49:27Z OWNER

A couple of options:

datasette.add_memory_database("test_json_array")
# or make that first argument to add_database() optional and support:
datasette.add_database(memory_name="test_json_array")
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
datasette.add_memory_database() method 818430405  
787611153 https://github.com/simonw/datasette/issues/1246#issuecomment-787611153 https://api.github.com/repos/simonw/datasette/issues/1246 MDEyOklzc3VlQ29tbWVudDc4NzYxMTE1Mw== simonw 9599 2021-03-01T03:30:57Z 2021-03-01T03:30:57Z OWNER

I'm going to try a new pattern for testing this, enabled by #1151 - the test will create a new named in-memory database, write some records to it and then run some test facets against that. This will save me from having to add yet another fixtures table for this.
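
A rough sketch of that testing pattern using only the standard library, assuming SQLite's shared-cache URI syntax for named in-memory databases. The helper connect_named_memory and the foo table are hypothetical; this is not the Datasette test code.

```python
import sqlite3


def connect_named_memory(name):
    # Connections that use the same name share one in-memory database,
    # for as long as at least one of them stays open.
    return sqlite3.connect(
        "file:{}?mode=memory&cache=shared".format(name), uri=True
    )


# The test writes some records through one connection...
writer = connect_named_memory("test_json_array")
writer.execute("create table foo (tags text)")
writer.executemany(
    "insert into foo (tags) values (?)", [('["a", "b"]',), ("",), (None,)]
)
writer.commit()

# ...and the code under test can read them through another.
reader = connect_named_memory("test_json_array")
count = reader.execute("select count(*) from foo").fetchone()[0]
print(count)
```

Because the data lives only in memory, each test can build exactly the rows it needs without touching the shared fixtures file.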

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Suggest for ArrayFacet possibly confused by blank values 817597268  
787536267 https://github.com/simonw/datasette/issues/1005#issuecomment-787536267 https://api.github.com/repos/simonw/datasette/issues/1005 MDEyOklzc3VlQ29tbWVudDc4NzUzNjI2Nw== simonw 9599 2021-02-28T22:30:37Z 2021-02-28T22:30:37Z OWNER
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Remove xfail tests when new httpx is released 718259202  
787532279 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787532279 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzUzMjI3OQ== simonw 9599 2021-02-28T22:09:37Z 2021-02-28T22:09:37Z OWNER

Microsoft's playwright Python library solves this problem by code generating both their sync AND their async libraries https://github.com/microsoft/playwright-python/tree/master/scripts

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787198202 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787198202 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE5ODIwMg== simonw 9599 2021-02-27T22:33:58Z 2021-02-27T22:33:58Z OWNER

Hah or use this trick, which genuinely rewrites the code at runtime using a class decorator! https://github.com/python-happybase/aiohappybase/blob/0990ef45cfdb720dc987afdb4957a0fac591cb99/aiohappybase/sync/_util.py#L19-L32

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787195536 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787195536 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE5NTUzNg== simonw 9599 2021-02-27T22:13:24Z 2021-02-27T22:13:24Z OWNER

Some other interesting background reading: https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.html - in particular see how SQLAlchemy has an await conn.run_sync(meta.drop_all) mechanism for running methods that haven't themselves been provided in an async version

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787190562 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787190562 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE5MDU2Mg== simonw 9599 2021-02-27T22:04:00Z 2021-02-27T22:04:00Z OWNER
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787186826 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787186826 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE4NjgyNg== simonw 9599 2021-02-27T22:01:54Z 2021-02-27T22:01:54Z OWNER

unasync is an implementation of the exact pattern I was talking about above - it uses the tokenize module from the Python standard library to apply some clever rules to transform an async codebase into a sync one. https://unasync.readthedocs.io/en/latest/ - implementation here: https://github.com/python-trio/unasync/blob/v0.5.0/src/unasync/__init__.py

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787175126 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787175126 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE3NTEyNg== simonw 9599 2021-02-27T21:55:05Z 2021-02-27T21:55:05Z OWNER

"how to use some new tools to more easily maintain a codebase that supports both async and synchronous I/O and multiple async libraries" - yeah that's exactly what I need, thank you!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787150276 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787150276 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE1MDI3Ng== polyrand 37962604 2021-02-27T21:27:26Z 2021-02-27T21:27:26Z NONE

I had this resource by Seth Michael Larson saved: https://github.com/sethmlarson/pycon-async-sync-poster - I haven't had a look at it, but it may contain useful info.

On twitter, I mentioned passing an aiosqlite connection during the Database creation. I'm not 100% familiar with the sqlite-utils codebase, so I may be wrong here, but maybe decorating internal functions could be an option? Then they are awaited or not inside the decorator depending on how they are called.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787144523 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787144523 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE0NDUyMw== simonw 9599 2021-02-27T21:18:46Z 2021-02-27T21:18:46Z OWNER

Here's a really wild idea: I wonder if it would be possible to run a source transformation against either the sync or the async versions of the code to produce the equivalent for the other paradigm?

Could that even be as simple as a set of regular expressions against the await ... version that strips out or replaces the await and async def and async for statements?

If so... I could maintain just the async version, generate the sync version with a script and rely on robust unit testing to guarantee that this actually works.
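
A deliberately naive sketch of the regular expression idea, to show its shape. The sample source and the function name to_sync are made up for illustration. A regex pass like this would also rewrite the words "async" and "await" inside strings and comments, so a real tool would need something more careful.

```python
import re

ASYNC_SOURCE = '''
async def count_rows(db, table):
    rows = await db.execute("select count(*) from " + table)
    async for row in rows:
        return row[0]
'''


def to_sync(source):
    # "async def" / "async for" / "async with" become their plain forms
    source = re.sub(r"\basync\s+(def|for|with)\b", r"\1", source)
    # "await <expr>" becomes just "<expr>"
    source = re.sub(r"\bawait\s+", "", source)
    return source


sync_source = to_sync(ASYNC_SOURCE)
print(sync_source)
```

The generated code would still need the shared test suite to prove it behaves the same as the async original.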

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787142066 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787142066 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzE0MjA2Ng== simonw 9599 2021-02-27T21:17:10Z 2021-02-27T21:17:10Z OWNER

I have a hunch this is actually going to be quite difficult, due to the internal complexity of some of the sqlite-utils API methods.

Consider db[table].extract(...) for example. It does a whole bunch of extra queries inside the method - each of those would need to be turned into an await call for the async version. Here's the method body today:

https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1060-L1152

Writing this method twice - looking similar but with await ... tucked in before every internal method it calls that needs to execute SQL - is going to be pretty messy.

One thing that would help a LOT is figuring out how to share the majority of the test code. If the exact same tests could run against both the sync and async versions with a bit of test trickery, maintaining parallel implementations would at least be a bit more feasible.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787121933 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787121933 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzEyMTkzMw== eyeseast 25778 2021-02-27T19:18:57Z 2021-02-27T19:18:57Z NONE

I think HTTPX gets it exactly right, with a clear separation between sync and async clients, each with a basically identical API. (I'm about to switch feed-to-sqlite over to it, from Requests, to eventually make way for async support.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787120136 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787120136 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzEyMDEzNg== simonw 9599 2021-02-27T19:04:47Z 2021-02-27T19:04:47Z OWNER

Another option here would be to add https://github.com/omnilib/aiosqlite/blob/main/aiosqlite/core.py as a dependency - it's four years old now and actively maintained, and the code is pretty small so it looks like a solid, stable, reliable dependency.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
787118691 https://github.com/simonw/sqlite-utils/issues/242#issuecomment-787118691 https://api.github.com/repos/simonw/sqlite-utils/issues/242 MDEyOklzc3VlQ29tbWVudDc4NzExODY5MQ== simonw 9599 2021-02-27T18:53:23Z 2021-02-27T18:53:23Z OWNER

Datasette has its own implementation of a write queue for exactly this purpose - and there's no reason at all that should stay in Datasette rather than being extracted out and moved over here to sqlite-utils.

One small concern I have is around the API design. I'd want to keep supporting the existing synchronous API while also providing a similar API with await-based methods.

What are some good examples of libraries that do this? I like how https://www.python-httpx.org/ handles it, maybe that's a good example to imitate?
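
To make the API question concrete, here is a minimal sketch of a write queue that offers both a blocking method and an await-based one. It is not Datasette's implementation; the class WriteQueue and its method names are hypothetical.

```python
import asyncio
import concurrent.futures
import queue
import sqlite3
import threading


class WriteQueue:
    """Hypothetical helper: every write runs on one dedicated thread."""

    def __init__(self, path):
        self._path = path
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # The connection is created and used only on the writer thread.
        conn = sqlite3.connect(self._path)
        while True:
            fn, future = self._queue.get()
            try:
                with conn:  # commit on success, roll back on error
                    result = fn(conn)
            except Exception as exc:
                future.set_exception(exc)
            else:
                future.set_result(result)

    def _submit(self, fn):
        future = concurrent.futures.Future()
        self._queue.put((fn, future))
        return future

    def execute_write(self, fn):
        """Synchronous API: block until fn(conn) has run."""
        return self._submit(fn).result()

    async def execute_write_async(self, fn):
        """Await-based API backed by the same queue."""
        return await asyncio.wrap_future(self._submit(fn))
```

Both methods funnel into the same queue, so the sync and async APIs stay identical apart from the await, which is roughly the split HTTPX makes between its two clients.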

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Async support 817989436  
786925280 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-786925280 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc4NjkyNTI4MA== simonw 9599 2021-02-26T22:23:10Z 2021-02-26T22:23:10Z MEMBER

Thanks!

I requested my Gmail export from takeout - once that arrives I'll test it against this and then merge the PR.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
786849095 https://github.com/simonw/datasette/issues/1238#issuecomment-786849095 https://api.github.com/repos/simonw/datasette/issues/1238 MDEyOklzc3VlQ29tbWVudDc4Njg0OTA5NQ== simonw 9599 2021-02-26T19:29:38Z 2021-02-26T19:29:38Z OWNER

Here's the test I wrote:

git diff tests/test_custom_pages.py
diff --git a/tests/test_custom_pages.py b/tests/test_custom_pages.py
index 6a23192..5a71f56 100644
--- a/tests/test_custom_pages.py
+++ b/tests/test_custom_pages.py
@@ -2,11 +2,19 @@ import pathlib
 import pytest
 from .fixtures import make_app_client

+TEST_TEMPLATE_DIRS = str(pathlib.Path(__file__).parent / "test_templates")
+

 @pytest.fixture(scope="session")
 def custom_pages_client():
+    with make_app_client(template_dir=TEST_TEMPLATE_DIRS) as client:
+        yield client
+
+
+@pytest.fixture(scope="session")
+def custom_pages_client_with_base_url():
     with make_app_client(
-        template_dir=str(pathlib.Path(__file__).parent / "test_templates")
+        template_dir=TEST_TEMPLATE_DIRS, config={"base_url": "/prefix/"}
     ) as client:
         yield client

@@ -23,6 +31,12 @@ def test_request_is_available(custom_pages_client):
     assert "path:/request" == response.text


+def test_custom_pages_with_base_url(custom_pages_client_with_base_url):
+    response = custom_pages_client_with_base_url.get("/prefix/request")
+    assert 200 == response.status
+    assert "path:/prefix/request" == response.text
+
+
 def test_custom_pages_nested(custom_pages_client):
     response = custom_pages_client.get("/nested/nest")
     assert 200 == response.status
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Custom pages don't work with base_url setting 813899472  
786848654 https://github.com/simonw/datasette/issues/1238#issuecomment-786848654 https://api.github.com/repos/simonw/datasette/issues/1238 MDEyOklzc3VlQ29tbWVudDc4Njg0ODY1NA== simonw 9599 2021-02-26T19:28:48Z 2021-02-26T19:28:48Z OWNER

I added a debug line just before for regex, wildcard_template here:

https://github.com/simonw/datasette/blob/afed51b1e36cf275c39e71c7cb262d6c5bdbaa31/datasette/app.py#L1148-L1155

And it showed that for some reason request.path is /prefix/prefix/request here - the prefix got doubled somehow.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Custom pages don't work with base_url setting 813899472  
786841261 https://github.com/simonw/datasette/issues/1238#issuecomment-786841261 https://api.github.com/repos/simonw/datasette/issues/1238 MDEyOklzc3VlQ29tbWVudDc4Njg0MTI2MQ== simonw 9599 2021-02-26T19:13:44Z 2021-02-26T19:13:44Z OWNER

Sounds like a bug - thanks for reporting this.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Custom pages don't work with base_url setting 813899472  
786840734 https://github.com/simonw/datasette/issues/1246#issuecomment-786840734 https://api.github.com/repos/simonw/datasette/issues/1246 MDEyOklzc3VlQ29tbWVudDc4Njg0MDczNA== simonw 9599 2021-02-26T19:12:39Z 2021-02-26T19:12:47Z OWNER

Could I take this part:

             suggested_facet_sql = """ 
                 select distinct json_type({column}) 
                 from ({sql}) 
             """.format( 
                 column=escape_sqlite(column), sql=self.sql 
             ) 

And add where {column} is not null and {column} != '' perhaps?
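
A quick sketch of what that filtered query could look like when run against a small table with blank and null values. The quote_identifier helper is a stand-in assumption for Datasette's escape_sqlite(), the foo table is made up, and the SQLite build needs the JSON functions.

```python
import sqlite3


def quote_identifier(name):
    # Stand-in for Datasette's escape_sqlite(): wraps the name in brackets
    return "[{}]".format(name)


def suggested_facet_sql(column, sql):
    return """
        select distinct json_type({column})
        from ({sql})
        where {column} is not null and {column} != ''
    """.format(column=quote_identifier(column), sql=sql)


conn = sqlite3.connect(":memory:")
conn.execute("create table foo (tags text)")
conn.executemany(
    "insert into foo (tags) values (?)", [('["a", "b"]',), ("",), (None,)]
)
types = [
    row[0] for row in conn.execute(suggested_facet_sql("tags", "select * from foo"))
]
print(types)
```

With the extra where clause, the blank and null rows are skipped before json_type() is ever called on them.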

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Suggest for ArrayFacet possibly confused by blank values 817597268  
786840425 https://github.com/simonw/datasette/issues/1246#issuecomment-786840425 https://api.github.com/repos/simonw/datasette/issues/1246 MDEyOklzc3VlQ29tbWVudDc4Njg0MDQyNQ== simonw 9599 2021-02-26T19:11:56Z 2021-02-26T19:11:56Z OWNER
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Suggest for ArrayFacet possibly confused by blank values 817597268  
786830832 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-786830832 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NjgzMDgzMg== simonw 9599 2021-02-26T18:52:40Z 2021-02-26T18:52:40Z OWNER

Could this handle lists of objects too? That would be pretty amazing - if the column has a [{...}, {...}] list in it, it could turn that into a many-to-many.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
786813506 https://github.com/simonw/datasette/issues/1240#issuecomment-786813506 https://api.github.com/repos/simonw/datasette/issues/1240 MDEyOklzc3VlQ29tbWVudDc4NjgxMzUwNg== simonw 9599 2021-02-26T18:19:46Z 2021-02-26T18:19:46Z OWNER

Linking to rows from custom queries is a lot harder - because given an arbitrary string of SQL it's difficult to analyze it and figure out which (if any) of the returned columns represent a primary key.

It's possible to manually write a SQL query that returns a column that will be treated as a link to another page using this plugin, but it's not particularly straightforward: https://datasette.io/plugins/datasette-json-html

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Allow facetting on custom queries 814591962  
786812716 https://github.com/simonw/datasette/issues/1240#issuecomment-786812716 https://api.github.com/repos/simonw/datasette/issues/1240 MDEyOklzc3VlQ29tbWVudDc4NjgxMjcxNg== simonw 9599 2021-02-26T18:18:18Z 2021-02-26T18:18:18Z OWNER

Agreed, this would be extremely useful. I'd love to be able to facet against custom queries. It's a fair bit of work to implement but it's not impossible. Closing this as a duplicate of #972.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
Allow facetting on custom queries 814591962  
786795132 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-786795132 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4Njc5NTEzMg== simonw 9599 2021-02-26T17:45:53Z 2021-02-26T17:45:53Z OWNER

If there's no primary key in the JSON, this could use the hash_id mechanism.
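
The general idea behind a content-derived ID can be sketched as follows. This is an illustration only: the function content_hash is hypothetical and is not claimed to match the exact algorithm sqlite-utils uses for hash_id.

```python
import hashlib
import json


def content_hash(record):
    """Derive a stable ID from a record's content (illustrative only)."""
    # sort_keys makes the serialization independent of key order
    serialized = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(serialized.encode("utf8")).hexdigest()


a = content_hash({"name": "Cleo", "species": "dog"})
b = content_hash({"species": "dog", "name": "Cleo"})
print(a == b)
```

Identical nested objects then map to the same row in the extracted table, even without an explicit primary key.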

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
786794435 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-786794435 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4Njc5NDQzNQ== simonw 9599 2021-02-26T17:44:38Z 2021-02-26T17:44:38Z OWNER

This came up in office hours!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
786786645 https://github.com/simonw/datasette/issues/1244#issuecomment-786786645 https://api.github.com/repos/simonw/datasette/issues/1244 MDEyOklzc3VlQ29tbWVudDc4Njc4NjY0NQ== simonw 9599 2021-02-26T17:30:38Z 2021-02-26T17:30:38Z OWNER

New paragraph at the top of https://docs.datasette.io/en/latest/writing_plugins.html

Want to start by looking at an example? The Datasette plugins directory lists more than 50 open source plugins with code you can explore. The plugin hooks page includes links to example plugins for each of the documented hooks.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Plugin tip: look at the examples linked from the hooks page 817528452  
786050562 https://github.com/simonw/sqlite-utils/issues/237#issuecomment-786050562 https://api.github.com/repos/simonw/sqlite-utils/issues/237 MDEyOklzc3VlQ29tbWVudDc4NjA1MDU2Mg== simonw 9599 2021-02-25T16:57:56Z 2021-02-25T16:57:56Z OWNER

sqlite-utils create-view currently has a --ignore option, so adding that to sqlite-utils drop-view and sqlite-utils drop-table makes sense as well.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
db["my_table"].drop(ignore=True) parameter, plus sqlite-utils drop-table --ignore and drop-view --ignore 815554385  
786049686 https://github.com/simonw/sqlite-utils/issues/237#issuecomment-786049686 https://api.github.com/repos/simonw/sqlite-utils/issues/237 MDEyOklzc3VlQ29tbWVudDc4NjA0OTY4Ng== simonw 9599 2021-02-25T16:56:42Z 2021-02-25T16:56:42Z OWNER

So:

    db["my_table"].drop(ignore=True)
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
db["my_table"].drop(ignore=True) parameter, plus sqlite-utils drop-table --ignore and drop-view --ignore 815554385  
786049394 https://github.com/simonw/sqlite-utils/issues/237#issuecomment-786049394 https://api.github.com/repos/simonw/sqlite-utils/issues/237 MDEyOklzc3VlQ29tbWVudDc4NjA0OTM5NA== simonw 9599 2021-02-25T16:56:14Z 2021-02-25T16:56:14Z OWNER

Other methods (db.create_view() for example) have ignore=True to mean "don't throw an error if this causes a problem", so I'm good with adding that to .drop_view().

I don't like using it as the default partly because that would be a very minor breaking API change, but mainly because I don't want to hide mistakes people make - e.g. if you mistype the name of the table you are trying to drop.
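
The behaviour being discussed can be sketched with plain sqlite3. The drop_table function here is a hypothetical stand-in, not the sqlite-utils method:

```python
import sqlite3


def drop_table(conn, name, ignore=False):
    """Drop a table; only stay quiet about a missing table when ignore=True."""
    if_exists = "if exists " if ignore else ""
    conn.execute("drop table {}[{}]".format(if_exists, name))


conn = sqlite3.connect(":memory:")

# Opt-in: a missing table is silently skipped
drop_table(conn, "my_table", ignore=True)

# Default: a mistyped table name still surfaces as an error
try:
    drop_table(conn, "my_tabel")
except sqlite3.OperationalError as exc:
    print("error:", exc)
```

Keeping ignore=False as the default means a typo in the table name fails loudly instead of passing as a no-op.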

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
db["my_table"].drop(ignore=True) parameter, plus sqlite-utils drop-table --ignore and drop-view --ignore 815554385  
786037219 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786037219 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAzNzIxOQ== simonw 9599 2021-02-25T16:39:23Z 2021-02-25T16:39:23Z OWNER

Example from the docs:

>>> db = sqlite_utils.Database(memory=True)
>>> db["dogs"].insert({"name": "Cleo"})
>>> for pk, row in db["dogs"].pks_and_rows_where():
...     print(pk, row)
1 {'rowid': 1, 'name': 'Cleo'}

>>> db["dogs_with_pk"].insert({"id": 5, "name": "Cleo"}, pk="id")
>>> for pk, row in db["dogs_with_pk"].pks_and_rows_where():
...     print(pk, row)
5 {'id': 5, 'name': 'Cleo'}

>>> db["dogs_with_compound_pk"].insert(
...     {"species": "dog", "id": 3, "name": "Cleo"},
...     pk=("species", "id")
... )
>>> for pk, row in db["dogs_with_compound_pk"].pks_and_rows_where():
...     print(pk, row)
('dog', 3) {'species': 'dog', 'id': 3, 'name': 'Cleo'}
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
786036355 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786036355 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAzNjM1NQ== simonw 9599 2021-02-25T16:38:07Z 2021-02-25T16:38:07Z OWNER
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
786035142 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-786035142 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NjAzNTE0Mg== simonw 9599 2021-02-25T16:36:17Z 2021-02-25T16:36:17Z OWNER

WIP in a pull request.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
786016380 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786016380 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAxNjM4MA== simonw 9599 2021-02-25T16:10:01Z 2021-02-25T16:10:01Z OWNER

I prototyped this and I like it:

In [1]: import sqlite_utils
In [2]: db = sqlite_utils.Database("/Users/simon/Dropbox/Development/datasette/fixtures.db")
In [3]: list(db["compound_primary_key"].pks_and_rows_where())
Out[3]: [(('a', 'b'), {'pk1': 'a', 'pk2': 'b', 'content': 'c'})]
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
786007209 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786007209 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAwNzIwOQ== simonw 9599 2021-02-25T15:57:50Z 2021-02-25T15:57:50Z OWNER

table.pks_and_rows_where(...) is explicit and I think less ambiguous than the other options.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
786006794 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786006794 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAwNjc5NA== simonw 9599 2021-02-25T15:57:17Z 2021-02-25T15:57:28Z OWNER

I quite like pks_with_rows_where(...) - but grammatically it suggests it will return the primary keys that exist where their rows match the criteria - "pks with rows" can be interpreted as "pks for the rows that..." as opposed to "pks accompanied by rows"

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
786005078 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786005078 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAwNTA3OA== simonw 9599 2021-02-25T15:54:59Z 2021-02-25T15:56:16Z OWNER

Is pk_rows_where() a good name? It sounds like it returns "primary key rows" which isn't a thing. It actually returns rows along with their primary key.

Other options:

  • table.rows_with_pk_where(...) - should this return (row, pk) rather than (pk, row)?
  • table.rows_where_pk(...)
  • table.pk_and_rows_where(...)
  • table.pk_with_rows_where(...)
  • table.pks_with_rows_where(...) - because rows is pluralized, so pks should be pluralized too?
  • table.pks_rows_where(...)
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
786001768 https://github.com/simonw/sqlite-utils/issues/240#issuecomment-786001768 https://api.github.com/repos/simonw/sqlite-utils/issues/240 MDEyOklzc3VlQ29tbWVudDc4NjAwMTc2OA== simonw 9599 2021-02-25T15:50:28Z 2021-02-25T15:52:12Z OWNER

One option: .rows_where() could grow an ensure_pk=True option which checks to see if the table is a rowid table and, if it is, includes that in the select.

Or... how about you can call .rows_where(..., pks=True) and it will yield (pk, rowdict) tuple pairs instead of just returning the sequence of dictionaries?

I'm always a little bit nervous of methods that vary their return type based on their arguments. Maybe this would be a separate method instead?

    for pk, row in table.pk_rows_where(...):
        # ...
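
A rough standalone sketch of how such a separate method might behave, written against plain sqlite3. The function iter_pks_and_rows is hypothetical and only illustrates the (pk, row) shape, including the rowid fallback for tables that have no explicit primary key.

```python
import sqlite3


def iter_pks_and_rows(conn, table, where=None, params=()):
    """Yield (pk, row_dict) pairs; pk is a tuple for compound primary keys."""
    info = conn.execute("pragma table_info([{}])".format(table)).fetchall()
    # table_info rows: (cid, name, type, notnull, dflt_value, pk position)
    pks = [r[1] for r in sorted(info, key=lambda r: r[5]) if r[5]]
    select = "*"
    if not pks:
        # rowid table: include the rowid explicitly and treat it as the key
        pks = ["rowid"]
        select = "rowid, *"
    sql = "select {} from [{}]".format(select, table)
    if where:
        sql += " where " + where
    cursor = conn.execute(sql, params)
    columns = [d[0] for d in cursor.description]
    for values in cursor:
        row = dict(zip(columns, values))
        pk = tuple(row[c] for c in pks)
        yield (pk[0] if len(pk) == 1 else pk), row


conn = sqlite3.connect(":memory:")
conn.execute("create table dogs (name text)")
conn.execute("insert into dogs (name) values ('Cleo')")
for pk, row in iter_pks_and_rows(conn, "dogs"):
    print(pk, row)
```

A dedicated method keeps the return type of .rows_where() unchanged, which avoids the concern about return types that vary with arguments.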
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
table.pks_and_rows_where() method returning primary keys along with the rows 816560819  
785992158 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785992158 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk5MjE1OA== simonw 9599 2021-02-25T15:37:04Z 2021-02-25T15:37:04Z OWNER

Here's the current implementation of .extract(): https://github.com/simonw/sqlite-utils/blob/806c21044ac8d31da35f4c90600e98115aade7c6/sqlite_utils/db.py#L1049-L1074

Tricky detail here: I create the lookup table first, based on the types of the columns that are being extracted.

I need to do this because extraction currently uses unique tuples of values, so the table has to be created in advance.

But if I'm using these new expand functions to figure out what's going to be extracted, I don't know the names of the columns and their types in advance. I'm only going to find those out during the transformation.

This may turn out to be incompatible with how .extract() works at the moment. I may need a new method, .extract_expand() perhaps? It could be simpler - work only against a single column for example.

I can still use the existing sqlite-utils extract CLI command though, with a --json flag and a rule that you can't run it against multiple columns.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785983837 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785983837 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk4MzgzNw== simonw 9599 2021-02-25T15:25:21Z 2021-02-25T15:28:57Z OWNER

Problem with calling this argument transform= is that the term "transform" already means something else in this library.

I could use convert= instead.

... but that doesn't instantly make me think of turning a value into multiple columns.

How about expand=? I've not used that term anywhere yet.

db["Reports"].extract(["Reported by"], expand={"Reported by": json.loads})

I think that works. You're expanding a single value into several columns of information.
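
As a rough illustration of what "expanding" means here, the sketch below applies an expand mapping to plain Python dictionaries. The helper name expand_rows and the sample data are made up for this example. It only shows the expansion step, not the extraction into a lookup table.

```python
import json


def expand_rows(rows, expand):
    """Hypothetical helper: swap each expanded column for the columns its function returns."""
    for row in rows:
        new_row = {}
        for column, value in row.items():
            fn = expand.get(column)
            if fn is None:
                # Columns without an expand function pass through unchanged
                new_row[column] = value
            else:
                # The function must return a dict of new column names to values
                new_row.update(fn(value))
        yield new_row


rows = [{"id": 1, "Reported by": '{"name": "Ada", "email": "ada@example.com"}'}]
expanded = list(expand_rows(rows, {"Reported by": json.loads}))
```

Any callable that accepts the column value and returns a dictionary would work in place of json.loads.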

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785983070 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785983070 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk4MzA3MA== simonw 9599 2021-02-25T15:24:17Z 2021-02-25T15:24:17Z OWNER

I'm going to go with last-wins - so if multiple transform functions return the same key the last one will overwrite the others.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785980813 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785980813 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk4MDgxMw== simonw 9599 2021-02-25T15:21:02Z 2021-02-25T15:23:47Z OWNER

Maybe the Python version takes an optional dictionary mapping column names to transformation functions? It could then merge all of those results together - and maybe throw an error if the same key is produced by more than one column.

    db["Reports"].extract(["Reported by"], transform={"Reported by": json.loads})

Or it could have an option for different strategies if keys collide: first wins, last wins, throw exception, add a prefix to the new column name. That feels a bit too complex for an edge-case though.
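
A small sketch of how those collision strategies could look when merging the dictionaries produced for each column. The function merge_expanded and its on_collision parameter are hypothetical names, and the "add a prefix" strategy is left out to keep the example short.

```python
def merge_expanded(results, on_collision="last"):
    """Merge a list of dicts, one per expanded column. Hypothetical helper."""
    if on_collision not in ("first", "last", "error"):
        raise ValueError("Unknown strategy: {!r}".format(on_collision))
    merged = {}
    for result in results:
        for key, value in result.items():
            if key in merged:
                if on_collision == "first":
                    # Keep the value that was seen first
                    continue
                if on_collision == "error":
                    raise ValueError(
                        "Key {!r} produced by more than one column".format(key)
                    )
            # New key, or "last" strategy: the latest value replaces the earlier one
            merged[key] = value
    return merged
```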

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785980083 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785980083 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk4MDA4Mw== simonw 9599 2021-02-25T15:20:02Z 2021-02-25T15:20:02Z OWNER

It would be OK if the CLI version only allows you to specify a single column if you are using the --json option.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785979769 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785979769 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk3OTc2OQ== simonw 9599 2021-02-25T15:19:37Z 2021-02-25T15:19:37Z OWNER

For the Python version I'd like to be able to provide a transformation callback function - which can be json.loads but could also be anything else which accepts the value of the current column and returns a Python dictionary of columns and their values to use in the new table.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785979192 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785979192 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk3OTE5Mg== simonw 9599 2021-02-25T15:18:46Z 2021-02-25T15:18:46Z OWNER

Likewise the sqlite-utils extract command takes one or more columns:

Usage: sqlite-utils extract [OPTIONS] PATH TABLE COLUMNS...

  Extract one or more columns into a separate table

Options:
  --table TEXT             Name of the other table to extract columns to
  --fk-column TEXT         Name of the foreign key column to add to the table
  --rename <TEXT TEXT>...  Rename this column in extracted table
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785978689 https://github.com/simonw/sqlite-utils/issues/239#issuecomment-785978689 https://api.github.com/repos/simonw/sqlite-utils/issues/239 MDEyOklzc3VlQ29tbWVudDc4NTk3ODY4OQ== simonw 9599 2021-02-25T15:18:03Z 2021-02-25T15:18:03Z OWNER

The Python .extract() method currently starts like this:

def extract(self, columns, table=None, fk_column=None, rename=None):
    rename = rename or {}
    if isinstance(columns, str):
        columns = [columns]
    if not set(columns).issubset(self.columns_dict.keys()):
        raise InvalidColumns(
            "Invalid columns {} for table with columns {}".format(
                columns, list(self.columns_dict.keys())
            )
        )
    ...

Note that it takes a list of columns (and treats a string as a single item list). That's because it can be called with a list of columns and it will use them to populate another table of unique tuples of those column values.

So a new mechanism that can instead read JSON values from a single column needs to be compatible with that existing design.
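
For readers unfamiliar with the existing design, this is roughly what "populate another table of unique tuples" amounts to, written as plain SQL run through the standard library sqlite3 module. The table and column names are invented for the example, and it is a simplified sketch rather than the code sqlite-utils actually runs. For instance, it leaves the original columns in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table reports (
        id integer primary key, title text,
        reporter_name text, reporter_email text
    );
    insert into reports (title, reporter_name, reporter_email) values
        ('Bug A', 'Ada', 'ada@example.com'),
        ('Bug B', 'Ada', 'ada@example.com'),
        ('Bug C', 'Bob', 'bob@example.com');

    -- The lookup table is created up front, so column names and types
    -- must be known before any rows are copied across
    create table reporters (
        id integer primary key, reporter_name text, reporter_email text
    );
    create unique index idx_reporters
        on reporters (reporter_name, reporter_email);
    insert or ignore into reporters (reporter_name, reporter_email)
        select distinct reporter_name, reporter_email from reports;

    -- Point each original row at its unique tuple in the lookup table
    alter table reports add column reporter_id integer references reporters(id);
    update reports set reporter_id = (
        select id from reporters
        where reporters.reporter_name = reports.reporter_name
          and reporters.reporter_email = reports.reporter_email
    );
    """
)
lookup = conn.execute(
    "select reporter_name, reporter_email from reporters"
).fetchall()
fks = dict(conn.execute("select title, reporter_id from reports").fetchall())
```

The "create table reporters" step is the part that needs column names in advance, which is where a JSON-driven mechanism would differ.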

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
sqlite-utils extract could handle nested objects 816526538  
785972074 https://github.com/simonw/sqlite-utils/issues/238#issuecomment-785972074 https://api.github.com/repos/simonw/sqlite-utils/issues/238 MDEyOklzc3VlQ29tbWVudDc4NTk3MjA3NA== simonw 9599 2021-02-25T15:08:36Z 2021-02-25T15:08:36Z OWNER
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
.add_foreign_key() corrupts database if column contains a space 816523763  
785485597 https://github.com/simonw/datasette/pull/1243#issuecomment-785485597 https://api.github.com/repos/simonw/datasette/issues/1243 MDEyOklzc3VlQ29tbWVudDc4NTQ4NTU5Nw== codecov[bot] 22429695 2021-02-25T00:28:30Z 2021-02-25T00:28:30Z NONE

Codecov Report

Merging #1243 (887bfd2) into main (726f781) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1243   +/-   ##
=======================================
  Coverage   91.56%   91.56%           
=======================================
  Files          34       34           
  Lines        4242     4242           
=======================================
  Hits         3884     3884           
  Misses        358      358           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 726f781...32652d9. Read the comment docs.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
fix small typo 815955014  
784638394 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-784638394 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc4NDYzODM5NA== UtahDave 306240 2021-02-24T00:36:18Z 2021-02-24T00:36:18Z NONE

I noticed that @simonw is using black for formatting. I ran black on my additions in this PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
784567547 https://github.com/simonw/datasette/issues/1241#issuecomment-784567547 https://api.github.com/repos/simonw/datasette/issues/1241 MDEyOklzc3VlQ29tbWVudDc4NDU2NzU0Nw== simonw 9599 2021-02-23T22:45:56Z 2021-02-23T22:46:12Z OWNER

I really like the way the Share feature on Stack Overflow works: https://stackoverflow.com/questions/18934149/how-can-i-use-postgresqls-text-column-type-in-django

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
[Feature request] Button to copy URL 814595021  
784347646 https://github.com/simonw/datasette/issues/1241#issuecomment-784347646 https://api.github.com/repos/simonw/datasette/issues/1241 MDEyOklzc3VlQ29tbWVudDc4NDM0NzY0Ng== Kabouik 7107523 2021-02-23T16:55:26Z 2021-02-23T16:57:39Z NONE

I think it's possible that many users these days no longer assume they can paste a URL from the browser address bar (if they ever understood that at all) because too many apps are SPAs with broken URLs.

Absolutely, that's why I thought my corner case with iframe preventing access to the datasette URL could actually be relevant in more general situations.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
[Feature request] Button to copy URL 814595021  
784334931 https://github.com/simonw/datasette/issues/1241#issuecomment-784334931 https://api.github.com/repos/simonw/datasette/issues/1241 MDEyOklzc3VlQ29tbWVudDc4NDMzNDkzMQ== simonw 9599 2021-02-23T16:37:26Z 2021-02-23T16:37:26Z OWNER

A "Share link" button would only be needed on the table page and the arbitrary query page I think - and maybe on the row page, especially as that page starts to grow more features in the future.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
[Feature request] Button to copy URL 814595021  
784333768 https://github.com/simonw/datasette/issues/1241#issuecomment-784333768 https://api.github.com/repos/simonw/datasette/issues/1241 MDEyOklzc3VlQ29tbWVudDc4NDMzMzc2OA== simonw 9599 2021-02-23T16:35:51Z 2021-02-23T16:35:51Z OWNER

This can definitely be done with a plugin.

Adding to Datasette itself is an interesting idea. I think it's possible that many users these days no longer assume they can paste a URL from the browser address bar (if they ever understood that at all) because too many apps are SPAs with broken URLs.

The shareable URLs are actually a key feature of Datasette - so maybe they should be highlighted in the default UI?

I built a "copy to clipboard" feature for datasette-copyable and wrote up how that works here: https://til.simonwillison.net/javascript/copy-button

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
[Feature request] Button to copy URL 814595021  
784312460 https://github.com/simonw/datasette/issues/1240#issuecomment-784312460 https://api.github.com/repos/simonw/datasette/issues/1240 MDEyOklzc3VlQ29tbWVudDc4NDMxMjQ2MA== Kabouik 7107523 2021-02-23T16:07:10Z 2021-02-23T16:08:28Z NONE

Likewise, while replying to another issue about the Vega plugin, I realized that there is no way to link to individual rows after a custom query. I only get this "Link" column with individual URLs in the default SQL view:

Or is it there and I am just missing the option in my custom queries?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Allow facetting on custom queries 814591962  
784157345 https://github.com/simonw/datasette/issues/1218#issuecomment-784157345 https://api.github.com/repos/simonw/datasette/issues/1218 MDEyOklzc3VlQ29tbWVudDc4NDE1NzM0NQ== soobrosa 1244799 2021-02-23T12:12:17Z 2021-02-23T12:12:17Z NONE

Topline: this fixed the same problem for me.

brew install python@3.7
ln -s /usr/local/opt/python@3.7/bin/python3.7 /usr/local/opt/python/bin/python3.7
pip3 uninstall -y numpy
pip3 uninstall -y setuptools
pip3 install setuptools
pip3 install numpy
pip3 install datasette-publish-fly
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
/usr/local/opt/python3/bin/python3.6: bad interpreter: No such file or directory 803356942  
783794520 https://github.com/dogsheep/google-takeout-to-sqlite/pull/5#issuecomment-783794520 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/5 MDEyOklzc3VlQ29tbWVudDc4Mzc5NDUyMA== UtahDave 306240 2021-02-23T01:13:54Z 2021-02-23T01:13:54Z NONE

Also, @simonw I created a test based off the existing tests. I think it's working correctly.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
WIP: Add Gmail takeout mbox import 813880401  
783774084 https://github.com/simonw/datasette/issues/1239#issuecomment-783774084 https://api.github.com/repos/simonw/datasette/issues/1239 MDEyOklzc3VlQ29tbWVudDc4Mzc3NDA4NA== simonw 9599 2021-02-23T00:18:56Z 2021-02-23T00:19:18Z OWNER

Bug is here: https://github.com/simonw/datasette/blob/42caabf7e9e6e4d69ef6dd7de16f2cd96bc79d5b/datasette/filters.py#L149-L165

Those json_each lines should be:

select {t}.rowid from {t}, json_each([{t}].[{c}]) j
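
A quick way to check the quoting is to run the corrected line against an in-memory database. This sketch uses the standard library sqlite3 module and assumes the SQLite build includes the JSON functions. The table name, column name and the :p0 parameter are invented for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table items ([my tags] text)")
conn.executemany(
    "insert into items values (?)",
    [('["a", "b"]',), ('["c"]',)],
)

t, c = "items", "my tags"
# Same shape as the corrected line above: table and column are bracket-quoted
# inside json_each(), so a space in the column name no longer breaks parsing
sql = (
    "select {t}.rowid from {t}, json_each([{t}].[{c}]) j "
    "where j.value = :p0"
).format(t=t, c=c)
rowids = [row[0] for row in conn.execute(sql, {"p0": "a"})]

# The unquoted form, for comparison; SQLite refuses to parse it
unquoted_sql = (
    "select {t}.rowid from {t}, json_each({t}.{c}) j where j.value = :p0"
).format(t=t, c=c)
```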
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
JSON filter fails if column contains spaces 813978858  
783688547 https://github.com/dogsheep/google-takeout-to-sqlite/issues/4#issuecomment-783688547 https://api.github.com/repos/dogsheep/google-takeout-to-sqlite/issues/4 MDEyOklzc3VlQ29tbWVudDc4MzY4ODU0Nw== UtahDave 306240 2021-02-22T21:31:28Z 2021-02-22T21:31:28Z NONE

@Btibert3 I've opened a PR with my initial attempt at this. Would you be willing to give this a try?

https://github.com/dogsheep/google-takeout-to-sqlite/pull/5

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Feature Request: Gmail 778380836  
783676548 https://github.com/simonw/datasette/issues/1237#issuecomment-783676548 https://api.github.com/repos/simonw/datasette/issues/1237 MDEyOklzc3VlQ29tbWVudDc4MzY3NjU0OA== simonw 9599 2021-02-22T21:10:19Z 2021-02-22T21:10:25Z OWNER

This is another change which is a little bit hard to figure out because I haven't solved #878 yet.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
?_pretty=1 option for pretty-printing JSON output 812704869  
783674659 https://github.com/simonw/datasette/issues/1234#issuecomment-783674659 https://api.github.com/repos/simonw/datasette/issues/1234 MDEyOklzc3VlQ29tbWVudDc4MzY3NDY1OQ== simonw 9599 2021-02-22T21:06:28Z 2021-02-22T21:06:28Z OWNER

I'm not going to work on this for a while, but if anyone has needs or ideas around that they can add them to this issue.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Runtime support for ATTACHing multiple databases 811505638  
783674038 https://github.com/simonw/datasette/issues/1236#issuecomment-783674038 https://api.github.com/repos/simonw/datasette/issues/1236 MDEyOklzc3VlQ29tbWVudDc4MzY3NDAzOA== simonw 9599 2021-02-22T21:05:21Z 2021-02-22T21:05:21Z OWNER

It's good on mobile - iOS at least. Going to close this and open new issues if anyone reports bugs.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 1,
    "eyes": 0
}
Ability to increase size of the SQL editor window 812228314  
783662968 https://github.com/simonw/sqlite-utils/issues/220#issuecomment-783662968 https://api.github.com/repos/simonw/sqlite-utils/issues/220 MDEyOklzc3VlQ29tbWVudDc4MzY2Mjk2OA== mhalle 649467 2021-02-22T20:44:51Z 2021-02-22T20:44:51Z NONE

Actually, coming back to this, I have a clearer use case for enabling fts generation for views: making it easier to bring in text from lookup tables and other joins.

The datasette documentation describes populating an fts table like so:

INSERT INTO "items_fts" (rowid, name, description, category_name)
    SELECT items.rowid,
    items.name,
    items.description,
    categories.name
    FROM items JOIN categories ON items.category_id=categories.id;

Alternatively if you have fts support in sqlite_utils for views (which sqlite and fts5 support), you can do the same thing just by creating a view that captures the above joins as columns, then creating an fts table from that view. Such an fts table can be created using sqlite_utils, where one created with your method can't.

The resulting fts table can then be used by a whole family of related tables and views in the manner you described earlier in this issue.
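
To illustrate the view-based approach, here is a sketch using only the standard library sqlite3 module, assuming the SQLite build has FTS5 enabled. The view name items_with_category and the sample rows are invented. It populates the index by hand rather than through sqlite_utils, since that support is what this issue is asking for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table categories (id integer primary key, name text);
    create table items (
        id integer primary key, name text, description text,
        category_id integer references categories(id)
    );
    insert into categories (id, name) values (1, 'fruit'), (2, 'tools');
    insert into items (id, name, description, category_id) values
        (1, 'apple', 'crisp and red', 1),
        (2, 'hammer', 'claw hammer', 2);

    -- The view captures the join once, exposing the lookup text as a column
    create view items_with_category as
        select items.id as id, items.name as name,
               items.description as description,
               categories.name as category_name
        from items join categories on items.category_id = categories.id;

    -- The FTS table is then filled from the view instead of repeating the join
    create virtual table items_fts using fts5(name, description, category_name);
    insert into items_fts (rowid, name, description, category_name)
        select id, name, description, category_name from items_with_category;
    """
)
matches = [
    row[0]
    for row in conn.execute(
        "select rowid from items_fts where items_fts match ?", ("fruit",)
    )
]
```

The search for "fruit" finds the apple row only because the category name was pulled in through the view.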

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Better error message for *_fts methods against views 783778672  
783560017 https://github.com/simonw/datasette/issues/1166#issuecomment-783560017 https://api.github.com/repos/simonw/datasette/issues/1166 MDEyOklzc3VlQ29tbWVudDc4MzU2MDAxNw== thorn0 94334 2021-02-22T18:00:57Z 2021-02-22T18:13:11Z NONE

Hi! I don't think Prettier supports this syntax for globs: datasette/static/*[!.min].js Are you sure that works?
Prettier uses https://github.com/mrmlnc/fast-glob, which in turn uses https://github.com/micromatch/micromatch, and the docs for these packages don't mention this syntax. As per the docs, square brackets should work as in regexes (foo-[1-5].js).

Tested it. Apparently, it works as a negated character class in regexes (like [^.min]). I wonder where this syntax comes from. Micromatch doesn't support that:

micromatch(['static/table.js', 'static/n.js'], ['static/*[!.min].js']);
// result: ["static/n.js"] -- brackets are treated like [!.min] in regexes, without negation
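
For comparison, [!...] is the negated bracket expression from POSIX shell globbing, and Python's fnmatch module implements it the same way. The sketch below shows how fnmatch reads this pattern. It says nothing about what Prettier itself does, and the extra file names are made up. Note that the negation applies to a single character, not to the ".min" suffix as a whole.

```python
from fnmatch import fnmatchcase

pattern = "static/*[!.min].js"
names = [
    "static/table.js",
    "static/n.js",
    "static/app.min.js",
    "static/main.js",
]
# [!.min] matches exactly one character that is not ".", "m", "i" or "n",
# so any name whose last character before ".js" is one of those is rejected
matched = [name for name in names if fnmatchcase(name, pattern)]
```

Under shell-style rules the pattern does skip app.min.js, but it also skips unrelated files such as main.js.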
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Adopt Prettier for JavaScript code formatting 777140799  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);