23 rows where issue = 775666296 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
753568428 https://github.com/simonw/datasette/issues/1160#issuecomment-753568428 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MzU2ODQyOA== simonw 9599 2021-01-03T05:02:32Z 2021-01-03T05:02:32Z OWNER

Should this command include a --fts option for configuring full-text search on one or more columns?

I thought about doing that for sqlite-utils insert in https://github.com/simonw/sqlite-utils/issues/202 and decided not to because of the need to include extra options covering the FTS version, porter stemming options and whether or not to create triggers.

But maybe I can set sensible defaults for that with datasette insert ... -f title -f body? Worth thinking about a bit more.
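A rough sketch of what those sensible defaults might look like at the SQLite level - FTS5 with a trigger keeping the index in sync. Table and trigger names here are illustrative, not Datasette's actual implementation, and a real version would also need UPDATE/DELETE triggers:

```python
import sqlite3

def enable_fts(conn, table, columns):
    # Sketch of "sensible defaults": FTS5, external-content table,
    # and an insert trigger to keep the index in sync.
    cols = ", ".join(columns)
    fts = f"{table}_fts"
    new_vals = ", ".join("new." + c for c in columns)
    conn.executescript(f"""
        CREATE VIRTUAL TABLE {fts} USING fts5({cols}, content={table});
        CREATE TRIGGER {table}_ai AFTER INSERT ON {table} BEGIN
            INSERT INTO {fts}(rowid, {cols}) VALUES (new.rowid, {new_vals});
        END;
    """)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (title TEXT, body TEXT)")
enable_fts(conn, "docs", ["title", "body"])
conn.execute("INSERT INTO docs VALUES ('hello world', 'first post')")
rows = conn.execute(
    "SELECT title FROM docs_fts WHERE docs_fts MATCH 'hello'"
).fetchall()
```

sqlite-utils wraps all of this behind `table.enable_fts(...)`; the open question is which of its knobs (FTS version, tokenizer, triggers) to default rather than expose.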

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
"datasette insert" command and plugin hook 775666296  
752275611 https://github.com/simonw/datasette/issues/1160#issuecomment-752275611 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI3NTYxMQ== simonw 9599 2020-12-29T23:32:04Z 2020-12-29T23:32:04Z OWNER

If I can get this working for CSV, TSV, JSON and JSON-NL that should be enough to exercise the API design pretty well across both streaming and non-streaming formats.

752274509 https://github.com/simonw/datasette/issues/1160#issuecomment-752274509 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI3NDUwOQ== simonw 9599 2020-12-29T23:26:02Z 2020-12-29T23:26:02Z OWNER

The documentation for this plugin hook is going to be pretty detailed, since it involves writing custom classes.

I'll stick it all on the existing hooks page for the moment, but I should think about breaking up the plugin hook documentation into a page-per-hook in the future.

752274078 https://github.com/simonw/datasette/issues/1160#issuecomment-752274078 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI3NDA3OA== simonw 9599 2020-12-29T23:23:39Z 2020-12-29T23:23:39Z OWNER

If I design this right I can ship a full version of the command-line datasette insert command in a release without doing any work at all on the Web UI version of it - that UI can then come later, without needing any changes to be made to the plugin hook.

752273873 https://github.com/simonw/datasette/issues/1160#issuecomment-752273873 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI3Mzg3Mw== simonw 9599 2020-12-29T23:22:30Z 2020-12-29T23:22:30Z OWNER

How much of this should I get done in a branch before merging into main?

The challenge here is the plugin hook design: ideally I don't want an incomplete plugin hook design in main since that could be a blocker for a release.

752273400 https://github.com/simonw/datasette/issues/1160#issuecomment-752273400 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI3MzQwMA== simonw 9599 2020-12-29T23:19:46Z 2020-12-29T23:19:46Z OWNER

I'm going to break out some separate tickets.

752273306 https://github.com/simonw/datasette/issues/1160#issuecomment-752273306 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI3MzMwNg== simonw 9599 2020-12-29T23:19:15Z 2020-12-29T23:19:15Z OWNER

It would be nice if this abstraction could support progress bars as well. These won't necessarily work for every format - or they might work for things loaded from files but not things loaded over URLs (if the content-length HTTP header is missing) - but if they ARE possible it would be good to provide them - both for the CLI interface and the web insert UI.
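One way to keep that abstraction format-agnostic is to compute progress only when a total is known - file size from disk, Content-Length for URLs - and signal "indeterminate" otherwise. A sketch, not Datasette's API:

```python
def progress_fraction(bytes_read, content_length):
    """Fraction complete in [0, 1], or None when the total size is unknown
    (e.g. a streaming URL response with no Content-Length header)."""
    if not content_length:
        return None  # caller falls back to a spinner or a plain byte counter
    return min(bytes_read / content_length, 1.0)
```

Usage: `progress_fraction(512, 1024)` gives 0.5, while `progress_fraction(100, None)` returns None, so both the CLI and the web UI can decide between a bar and a spinner from the same value.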

752267905 https://github.com/simonw/datasette/issues/1160#issuecomment-752267905 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI2NzkwNQ== simonw 9599 2020-12-29T22:52:09Z 2020-12-29T22:52:09Z OWNER

What's the simplest thing that could possibly work? I think it's datasette insert blah.db data.csv - no URL handling, no other formats.

752266076 https://github.com/simonw/datasette/issues/1160#issuecomment-752266076 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI2NjA3Ng== simonw 9599 2020-12-29T22:42:23Z 2020-12-29T22:42:59Z OWNER

Aside: maybe datasette insert works against simple files, but a later mechanism called datasette import allows plugins to register sub-commands, like datasette import github ... or datasette import jira ... or whatever.

This would be useful for import mechanisms that are likely to need their own custom set of command-line options unique to that source.
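A minimal sketch of that registration pattern using stdlib argparse (Datasette itself uses Click, and the registry and names here are hypothetical):

```python
import argparse

# Hypothetical registry: plugins append (name, setup_fn) entries, each
# adding its own source-specific options to its sub-command.
IMPORT_SUBCOMMANDS = []

def register_import(name):
    def decorator(setup_fn):
        IMPORT_SUBCOMMANDS.append((name, setup_fn))
        return setup_fn
    return decorator

@register_import("github")
def setup_github(parser):
    # Options unique to the GitHub importer
    parser.add_argument("--token", help="GitHub API token")

def build_parser():
    parser = argparse.ArgumentParser(prog="datasette-import")
    subs = parser.add_subparsers(dest="source")
    for name, setup_fn in IMPORT_SUBCOMMANDS:
        setup_fn(subs.add_parser(name))
    return parser

args = build_parser().parse_args(["github", "--token", "abc"])
```

Each plugin owns its own option namespace, which is exactly what a `datasette import jira ...` with Jira-specific flags would need.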

752265600 https://github.com/simonw/datasette/issues/1160#issuecomment-752265600 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI2NTYwMA== simonw 9599 2020-12-29T22:39:56Z 2020-12-29T22:39:56Z OWNER

Does it definitely make sense to break this operation up into the code that turns the incoming format into an iterator of dictionaries, then the code that inserts those into the database using sqlite-utils?

That seems right for simple imports, where the incoming file represents a sequence of records in a single table. But what about more complex formats? What if a format needs to be represented as multiple tables?
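For the single-table case the split is straightforward - a sketch with stdlib csv and sqlite3 (a multi-table format would presumably need the hook to yield (table_name, rows) pairs instead):

```python
import csv
import io
import itertools
import sqlite3

def rows_from_csv(fp):
    """Stage 1: format-specific code yields plain dictionaries."""
    reader = csv.reader(fp)
    headers = next(reader)
    for row in reader:
        yield dict(zip(headers, row))

def insert_rows(conn, table, rows):
    """Stage 2: generic code writes dictionaries into SQLite."""
    rows = iter(rows)
    first = next(rows)  # peek one row to learn the column names
    cols = list(first)
    conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        ([r[c] for c in cols] for r in itertools.chain([first], rows)),
    )

conn = sqlite3.connect(":memory:")
insert_rows(conn, "t", rows_from_csv(io.StringIO("a,b\n1,2\n3,4\n")))
count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
```

The generator in stage 1 keeps streaming formats streaming; only stage 2 ever touches the database.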

752259345 https://github.com/simonw/datasette/issues/1160#issuecomment-752259345 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI1OTM0NQ== simonw 9599 2020-12-29T22:11:54Z 2020-12-29T22:11:54Z OWNER

Important detail from https://docs.python.org/3/library/csv.html#csv.reader

If csvfile is a file object, it should be opened with newline=''. [1]

[...]

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
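A quick demonstration of why that matters - a quoted field containing an embedded newline only round-trips correctly when the file is opened with newline='':

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as fp:
    # One record whose quoted field contains an embedded newline
    fp.write('id,note\r\n1,"line one\nline two"\r\n')

# newline="" lets the csv module do its own (universal) newline handling,
# so the newline inside the quoted field survives intact
with open(path, newline="") as fp:
    rows = list(csv.reader(fp))
```

Without `newline=""` the embedded `\n` inside the quoted field can be mangled by the text layer's newline translation before csv ever sees it.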

752257666 https://github.com/simonw/datasette/issues/1160#issuecomment-752257666 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjI1NzY2Ng== simonw 9599 2020-12-29T22:09:18Z 2020-12-29T22:09:18Z OWNER

Figuring out the API design

I want to be able to support different formats, and be able to parse them into tables either streaming or in one go depending on whether the format supports that.

Ideally I want to be able to pull the first 1,024 bytes for the purpose of detecting the format, then replay those bytes again later. I'm considering this a stretch goal though.
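One way that replay could work - read a fixed-size head for sniffing, then wrap the head and the remainder in a stream-like object. A sketch; the detection heuristic here is deliberately crude:

```python
import io

class ReplayStream(io.RawIOBase):
    """Serves a previously-read head, then continues from the live stream."""

    def __init__(self, head, stream):
        self._head = io.BytesIO(head)
        self._stream = stream

    def readable(self):
        return True

    def read(self, n=-1):
        if n is None or n < 0:
            return self._head.read() + self._stream.read()
        data = self._head.read(n)
        return data if data else self._stream.read(n)

def detect_format(stream):
    # Pull the first 1,024 bytes for sniffing, then hand back a stream
    # that replays them before the rest of the input.
    head = stream.read(1024)
    guess = "json" if head.lstrip()[:1] in (b"[", b"{") else "csv"
    return guess, ReplayStream(head, stream)

fmt, replayed = detect_format(io.BytesIO(b"a,b\n1,2\n"))
data = replayed.read()
```

For seekable files a plain `seek(0)` is simpler; the wrapper matters for pipes and HTTP responses, which can't rewind.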

CSV is easy to parse as a stream - here’s how sqlite-utils does it:

    # Excerpt from sqlite-utils (csv_std is "import csv as csv_std"):
    dialect = "excel-tab" if tsv else "excel"
    with file_progress(json_file, silent=silent) as json_file:
        reader = csv_std.reader(json_file, dialect=dialect)
        headers = next(reader)  # first row supplies the column names
        docs = (dict(zip(headers, row)) for row in reader)  # lazy row dicts

Problem: using db.insert_all() could block for a long time on a big set of rows. Probably easiest to batch the records before calling insert_all() and then run a batch at a time using a db.execute_write_fn() call.
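The batching itself can be a small helper; each batch would then be handed to the write connection via execute_write_fn (the loop below is commented out because it assumes a live Datasette Database object):

```python
import itertools

def batched(iterable, size):
    """Yield lists of up to `size` rows so each write transaction stays short."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage against Datasette's write thread - note the
# batch=batch default argument to avoid late binding in the lambda:
#
#   for batch in batched(docs, 100):
#       await db.execute_write_fn(
#           lambda conn, batch=batch: insert_batch(conn, batch)
#       )

batches = list(batched(range(250), 100))
```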

752236520 https://github.com/simonw/datasette/issues/1160#issuecomment-752236520 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjIzNjUyMA== simonw 9599 2020-12-29T20:48:51Z 2020-12-29T20:48:51Z OWNER

It would be neat if datasette insert could accept a --plugins-dir option which allowed one-off format plugins to be registered. Bit tricky to implement since the --format Click option will already be populated by that plugin hook call.

751925934 https://github.com/simonw/datasette/issues/1160#issuecomment-751925934 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTkyNTkzNA== simonw 9599 2020-12-29T02:40:13Z 2020-12-29T20:25:57Z OWNER

Basic command design:

datasette insert data.db blah.csv

The options can include:

  • --format to specify the exact format - without this it will be guessed based on the filename
  • --table to specify the table (otherwise the filename is used)
  • --pk to specify one or more primary key columns
  • --replace to specify that existing rows with a matching primary key should be replaced
  • --upsert to specify that existing matching rows should be upserted
  • --ignore to ignore matching rows
  • --alter to alter the table to add missing columns
  • --type column type to specify the type of a column - useful when working with CSV or TSV files
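At the SQLite level, --replace and --ignore map onto conflict clauses (sqlite-utils exposes the same behavior via replace=True / ignore=True on insert_all(); upsert differs in that it only updates the columns provided). A sketch of the semantics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'original')")

# --replace: an incoming row with a matching primary key overwrites
conn.execute("INSERT OR REPLACE INTO t VALUES (1, 'replaced')")

# --ignore: an incoming row with a matching primary key is skipped
conn.execute("INSERT OR IGNORE INTO t VALUES (1, 'ignored')")

name = conn.execute("SELECT name FROM t WHERE id = 1").fetchone()[0]
```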
752208036 https://github.com/simonw/datasette/issues/1160#issuecomment-752208036 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjIwODAzNg== simonw 9599 2020-12-29T19:06:35Z 2020-12-29T19:06:35Z OWNER

If I'm going to execute 1000s of writes in an async def operation it may make sense to break that up into smaller chunks, so as not to block the event loop for too long.

https://stackoverflow.com/a/36648102 and https://github.com/python/asyncio/issues/284 confirm that await asyncio.sleep(0) is the recommended way of doing this.
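A small self-contained illustration of the pattern - a long-running task that awaits asyncio.sleep(0) between chunks lets a concurrently scheduled task run in between:

```python
import asyncio

async def bulk_insert(log, batches):
    for i in range(batches):
        log.append(f"insert:{i}")  # stand-in for one chunk of writes
        await asyncio.sleep(0)     # yield control back to the event loop

async def other_request(log):
    log.append("request")          # stand-in for an unrelated HTTP request

async def main():
    log = []
    await asyncio.gather(bulk_insert(log, 2), other_request(log))
    return log

log = asyncio.run(main())
```

Without the `sleep(0)`, `bulk_insert` would run all of its chunks before `other_request` got a turn.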

752203909 https://github.com/simonw/datasette/issues/1160#issuecomment-752203909 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MjIwMzkwOQ== simonw 9599 2020-12-29T18:54:19Z 2020-12-29T18:54:19Z OWNER

More thoughts on this: the key mechanism that populates the tables needs to be an async def method of some sort so that it can run as part of the async loop in core Datasette - for importing from web uploads.

751947991 https://github.com/simonw/datasette/issues/1160#issuecomment-751947991 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTk0Nzk5MQ== simonw 9599 2020-12-29T05:06:50Z 2020-12-29T05:07:03Z OWNER

Given the URL option could it be possible for plugins to "subscribe" to URLs that keep on streaming?

datasette insert db.db https://example.com/streaming-api \
  --format api-stream
751946262 https://github.com/simonw/datasette/issues/1160#issuecomment-751946262 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTk0NjI2Mg== simonw 9599 2020-12-29T04:56:12Z 2020-12-29T04:56:32Z OWNER

Potential design for this: a datasette memory command which takes most of the same arguments as datasette serve but starts an in-memory database and treats the command arguments as things that should be inserted into that in-memory database.

tail -f access.log | datasette memory - \
  --format clf -p 8002 -o
751945094 https://github.com/simonw/datasette/issues/1160#issuecomment-751945094 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTk0NTA5NA== simonw 9599 2020-12-29T04:48:11Z 2020-12-29T04:48:11Z OWNER

It would be pretty cool if you could launch Datasette directly against an insert-compatible file or URL without first having to load it into a SQLite database file.

Or imagine being able to tail a log file and pipe that directly into a new Datasette process, which then runs a web server with the UI while simultaneously continuing to load new entries from that log into the in-memory SQLite database that it is serving...

Not quite sure what that CLI interface would look like. Maybe treat that as a future stretch goal for the moment.

751943837 https://github.com/simonw/datasette/issues/1160#issuecomment-751943837 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTk0MzgzNw== simonw 9599 2020-12-29T04:40:30Z 2020-12-29T04:40:30Z OWNER

The insert command should also accept URLs - anything starting with http:// or https://.

It should accept more than one file name at a time for bulk inserts.

If using a URL, that URL will be passed to the method that decides if a plugin implementation can handle the import or not. This will allow plugins to register themselves for specific websites.
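A sketch of how that dispatch might look - each plugin supplies a predicate over the URL and the first match wins (the registry and all names here are hypothetical):

```python
# Hypothetical plugin registry: (predicate, import_fn) pairs.
HANDLERS = []

def register(can_handle):
    def decorator(fn):
        HANDLERS.append((can_handle, fn))
        return fn
    return decorator

@register(lambda url: url.startswith("https://api.github.com/"))
def import_github(url):
    # A real plugin would fetch and parse here; return a marker instead
    return "github"

def pick_handler(url):
    # First registered plugin whose predicate matches wins;
    # None means fall back to generic format detection.
    for can_handle, fn in HANDLERS:
        if can_handle(url):
            return fn
    return None

handler = pick_handler("https://api.github.com/repos/simonw/datasette")
```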

751926437 https://github.com/simonw/datasette/issues/1160#issuecomment-751926437 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTkyNjQzNw== simonw 9599 2020-12-29T02:43:21Z 2020-12-29T02:43:37Z OWNER

Default formats to support:

  • CSV
  • TSV
  • JSON and newline-delimited JSON
  • YAML

Each of these will be implemented as a default plugin.

751926218 https://github.com/simonw/datasette/issues/1160#issuecomment-751926218 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTkyNjIxOA== simonw 9599 2020-12-29T02:41:57Z 2020-12-29T02:41:57Z OWNER

Other names I considered:

  • datasette load
  • datasette import - I decided to keep this name available for any future work that might involve plugins that help import data from APIs as opposed to inserting it from files
751926095 https://github.com/simonw/datasette/issues/1160#issuecomment-751926095 https://api.github.com/repos/simonw/datasette/issues/1160 MDEyOklzc3VlQ29tbWVudDc1MTkyNjA5NQ== simonw 9599 2020-12-29T02:41:15Z 2020-12-29T02:41:15Z OWNER

The UI can live at /-/insert and be available by default to the root user only. It can offer the following:

  • Upload a file and have the import type detected (equivalent to datasette insert data.db thatfile.csv)
  • Copy and paste the data to be inserted into a textarea
  • API equivalents of these

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);