
issue_comments


5 rows where issue = 749283032 ("register_output_renderer() should support streaming data") and "updated_at" is on date 2022-04-21, sorted by updated_at descending
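The underlying query, reconstructed from the filter description above (a sketch - the SQL Datasette generates may differ in detail):

```sql
select * from issue_comments
where issue = 749283032
  and date(updated_at) = '2022-04-21'
order by updated_at desc;
```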

eyeseast (CONTRIBUTOR) commented on 2022-04-21T18:59:08Z (https://github.com/simonw/datasette/issues/1101#issuecomment-1105642187):

Ha! That was your idea (and a good one).

But it's probably worth measuring to see what overhead it adds. It did require both passing in the database and making the whole thing async.

Just timing the queries themselves:

  1. Using `AsGeoJSON(geometry) as geometry` takes 10.235 ms
  2. Leaving as binary takes 8.63 ms
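
A rough sketch of how those two timings might be reproduced with plain sqlite3 and SpatiaLite; the database file and `places` table here are placeholders, not taken from the plugin:

```python
import sqlite3
import time

# Load SpatiaLite so AsGeoJSON() is available; the extension path varies
# by platform.
conn = sqlite3.connect("spatial.db")
conn.enable_load_extension(True)
conn.load_extension("mod_spatialite")

def time_query(sql):
    # Wall-clock time for one full fetch, in milliseconds
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    return (time.perf_counter() - start) * 1000

print("AsGeoJSON: %.3f ms" % time_query(
    "select AsGeoJSON(geometry) as geometry from places"))
print("binary:    %.3f ms" % time_query(
    "select geometry from places"))
```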

Looking at the network panel:

  1. The AsGeoJSON version takes about 200 ms for the fetch request
  2. The binary version takes about 300 ms

I'm not sure how best to time the GeoJSON generation, but it would be interesting to check. Maybe I'll write a plugin to add query times to response headers.
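
A minimal sketch of that plugin idea, using Datasette's asgi_wrapper hook; it times the whole request rather than individual queries, and the header name is invented:

```python
import time

from datasette import hookimpl


class TimingMiddleware:
    # ASGI middleware that stamps the elapsed request time onto the
    # response headers.
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        start = time.perf_counter()

        async def wrapped_send(event):
            if event["type"] == "http.response.start":
                elapsed_ms = (time.perf_counter() - start) * 1000
                headers = list(event.get("headers", []))
                headers.append(
                    (b"x-response-time", ("%.1f ms" % elapsed_ms).encode())
                )
                event = {**event, "headers": headers}
            await send(event)

        await self.app(scope, receive, wrapped_send)


@hookimpl
def asgi_wrapper(datasette):
    return TimingMiddleware
```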

The other thing to consider with async streaming is that it might be well-suited for a slower response. When I have to get the whole result and send a response in a fixed amount of time, I need the most efficient query possible. If I can hang onto a connection and get things one chunk at a time, maybe it's ok if there's some overhead.

simonw (OWNER) commented on 2022-04-21T18:31:41Z, last updated 2022-04-21T18:32:22Z (https://github.com/simonw/datasette/issues/1101#issuecomment-1105615625):

The datasette-geojson plugin is actually an interesting case here, because of the way it converts SpatiaLite geometries into GeoJSON: https://github.com/eyeseast/datasette-geojson/blob/602c4477dc7ddadb1c0a156cbcd2ef6688a5921d/datasette_geojson/__init__.py#L61-L66

```python
if isinstance(geometry, bytes):
    results = await db.execute(
        "SELECT AsGeoJSON(:geometry)", {"geometry": geometry}
    )
    return geojson.loads(results.single_value())
```

That actually seems to work really well as-is, but it does worry me a bit that it ends up having to execute an extra `SELECT` query for every single returned row - especially in streaming mode where it might be asked to return 1m rows at once.

My PostgreSQL/MySQL engineering brain says that this would be better handled by doing a chunk of these (maybe 100) at once, to avoid the per-query-overhead - but with SQLite that might not be necessary.
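
A sketch of that batched approach, assuming Datasette's async `db.execute()`; the helper name and calling convention are illustrative, not an actual API:

```python
import geojson  # the same library the plugin already uses


async def batch_as_geojson(db, geometries, batch_size=100):
    # Convert up to batch_size geometries per SELECT instead of issuing
    # one query per row, amortizing the per-query overhead.
    converted = []
    for i in range(0, len(geometries), batch_size):
        batch = geometries[i : i + batch_size]
        # One AsGeoJSON() column per geometry; SQLite returns a single row
        placeholders = ", ".join("AsGeoJSON(?)" for _ in batch)
        results = await db.execute("SELECT " + placeholders, batch)
        converted.extend(geojson.loads(value) for value in results.first())
    return converted
```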

At any rate, this is one of the reasons I'm interested in "iterate over this sequence of chunks of 100 rows at a time" as a potential option here.

Of course, a better solution would be for datasette-geojson to have a way to influence the SQL query before it is executed, adding an `AsGeoJSON(geometry)` clause to it - so that's something I'm open to as well.

simonw (OWNER) commented on 2022-04-21T18:26:29Z (https://github.com/simonw/datasette/issues/1101#issuecomment-1105608964):

I'm questioning whether the mechanisms should be separate at all now - rendering a single response is really just a streaming response that only pulls the first N records from the iterator.

It probably needs to be an `async for` iterator, which I've not worked with much before. Good opportunity to learn.
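
A toy sketch of that equivalence, with invented names - a single response is just the stream truncated after the first N rows:

```python
import asyncio


async def row_chunks(rows, size=100):
    # Async generator yielding the rows in fixed-size chunks
    for i in range(0, len(rows), size):
        yield rows[i : i + size]


async def render_single_response(chunks, max_rows=1000):
    # "Single response" mode: stop pulling from the stream at max_rows
    collected = []
    async for chunk in chunks:
        collected.extend(chunk)
        if len(collected) >= max_rows:
            break
    return collected[:max_rows]


async def main():
    rows = list(range(2500))
    body = await render_single_response(row_chunks(rows))
    print(len(body))  # 1000


asyncio.run(main())
```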

This actually gets a fair bit more complicated due to the work I'm doing right now to improve the default JSON API:

  • #1709

I want to do things like make faceting results optionally available to custom renderers - which is a separate concern from streaming rows.

I'm going to poke around with a bunch of prototypes and see what sticks.

eyeseast (CONTRIBUTOR) commented on 2022-04-21T18:15:39Z (https://github.com/simonw/datasette/issues/1101#issuecomment-1105588651):

What if you split rendering and streaming into two things:

  • `render` is a function that returns a response
  • `stream` is a function that sends chunks, or yields chunks passed to an ASGI send callback

That way current plugins still work, and streaming is purely additive. A `stream` function could get a cursor or iterator of rows, instead of a list, so it could handle large queries more efficiently.
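
A sketch of how that split might look in a plugin; the `stream` key is the proposal here, not part of Datasette's actual register_output_renderer() contract, and the argument names are assumed:

```python
from datasette import hookimpl


async def render_tsv(datasette, columns, rows, **kwargs):
    # Existing-style renderer: builds the whole body in memory
    lines = ["\t".join(columns)]
    lines += ["\t".join(str(v) for v in row) for row in rows]
    return {
        "body": "\n".join(lines) + "\n",
        "content_type": "text/tab-separated-values",
    }


async def stream_tsv(datasette, columns, row_iterator, **kwargs):
    # Proposed streaming variant: yields encoded chunks as rows arrive
    yield ("\t".join(columns) + "\n").encode("utf-8")
    async for row in row_iterator:
        yield ("\t".join(str(v) for v in row) + "\n").encode("utf-8")


@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "tsv",
        "render": render_tsv,
        "stream": stream_tsv,  # hypothetical additional key
    }
```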

simonw (OWNER) commented on 2022-04-21T18:10:38Z, last updated 2022-04-21T18:10:46Z (https://github.com/simonw/datasette/issues/1101#issuecomment-1105571003):

Maybe the simplest design for this is to add an optional `can_stream` key to the contract:

```python
@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "tsv",
        "render": render_tsv,
        "can_render": lambda: True,
        "can_stream": lambda: True,
    }
```

When streaming, a new parameter could be passed to the render function - maybe `chunks` - which is an iterator/generator over a sequence of chunks of rows.

Or it could use the existing `rows` parameter but treat that as an iterator?
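
For comparison, a renderer consuming that hypothetical `chunks` parameter might look like this (a sketch; the signature is assumed, not Datasette's actual one):

```python
async def render_tsv(datasette, columns, chunks, **kwargs):
    # Returning an async generator lets the response be streamed chunk
    # by chunk instead of materializing every row first.
    async def body():
        yield ("\t".join(columns) + "\n").encode("utf-8")
        async for chunk in chunks:
            for row in chunk:
                yield ("\t".join(str(v) for v in row) + "\n").encode("utf-8")

    return body()
```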



Table schema:

```sql
CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id]),
   [performed_via_github_app] TEXT
);
CREATE INDEX [idx_issue_comments_issue]
    ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
    ON [issue_comments] ([user]);
```