{"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-1399341761", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 1399341761, "node_id": "IC_kwDOBm6k_c5TaELB", "user": {"value": 9599, "label": "simonw"}, "created_at": "2023-01-21T22:07:19Z", "updated_at": "2023-01-21T22:07:19Z", "author_association": "OWNER", "body": "Idea for supporting streaming with the `register_output_renderer` hook:\r\n\r\n```python\r\n@hookimpl\r\ndef register_output_renderer(datasette):\r\n return {\r\n \"extension\": \"test\",\r\n \"render\": render_demo,\r\n \"can_render\": can_render_demo,\r\n \"render_stream\": render_demo_stream, # This is new\r\n }\r\n```\r\nSo there's a new `\"render_stream\"` key which can be returned, which if present means that the output renderer supports streaming.\r\n\r\nI'll play around with the design of that function signature in:\r\n\r\n- #1999\r\n- #1062 ", "reactions": "{\"total_count\": 1, \"+1\": 1, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-1105642187", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 1105642187, "node_id": "IC_kwDOBm6k_c5B5sLL", "user": {"value": 25778, "label": "eyeseast"}, "created_at": "2022-04-21T18:59:08Z", "updated_at": "2022-04-21T18:59:08Z", "author_association": "CONTRIBUTOR", "body": "Ha! That was your idea (and a good one).\r\n\r\nBut it's probably worth measuring to see what overhead it adds. It did require both passing in the database and making the whole thing `async`. \r\n\r\nJust timing the queries themselves:\r\n\r\n1. 
[Using `AsGeoJSON(geometry) as geometry`](https://alltheplaces-datasette.fly.dev/alltheplaces?sql=select%0D%0A++id%2C%0D%0A++properties%2C%0D%0A++AsGeoJSON%28geometry%29+as+geometry%2C%0D%0A++spider%0D%0Afrom%0D%0A++places%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++1000) takes 10.235 ms\r\n2. [Leaving as binary](https://alltheplaces-datasette.fly.dev/alltheplaces?sql=select%0D%0A++id%2C%0D%0A++properties%2C%0D%0A++geometry%2C%0D%0A++spider%0D%0Afrom%0D%0A++places%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++1000) takes 8.63 ms\r\n\r\nLooking at the network panel:\r\n\r\n1. Takes about 200 ms for the `fetch` request\r\n2. Takes about 300 ms\r\n\r\nI'm not sure how best to time the GeoJSON generation, but it would be interesting to check. Maybe I'll write a plugin to add query times to response headers.\r\n\r\nThe other thing to consider with async streaming is that it might be well-suited for a slower response. When I have to get the whole result and send a response in a fixed amount of time, I need the most efficient query possible. 
If I can hang onto a connection and get things one chunk at a time, maybe it's ok if there's some overhead.\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-1105615625", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 1105615625, "node_id": "IC_kwDOBm6k_c5B5lsJ", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-21T18:31:41Z", "updated_at": "2022-04-21T18:32:22Z", "author_association": "OWNER", "body": "The `datasette-geojson` plugin is actually an interesting case here, because of the way it converts SpatiaLite geometries into GeoJSON: https://github.com/eyeseast/datasette-geojson/blob/602c4477dc7ddadb1c0a156cbcd2ef6688a5921d/datasette_geojson/__init__.py#L61-L66\r\n\r\n```python\r\n\r\n if isinstance(geometry, bytes):\r\n results = await db.execute(\r\n \"SELECT AsGeoJSON(:geometry)\", {\"geometry\": geometry}\r\n )\r\n return geojson.loads(results.single_value())\r\n```\r\nThat actually seems to work really well as-is, but it does worry me a bit that it ends up having to execute an extra `SELECT` query for every single returned row - especially in streaming mode where it might be asked to return 1m rows at once.\r\n\r\nMy PostgreSQL/MySQL engineering brain says that this would be better handled by doing a chunk of these (maybe 100) at once, to avoid the per-query-overhead - but with SQLite that might not be necessary.\r\n\r\nAt any rate, this is one of the reasons I'm interested in \"iterate over this sequence of chunks of 100 rows at a time\" as a potential option here.\r\n\r\nOf course, a better solution would be for `datasette-geojson` to have a way to influence the SQL query before it is executed, 
adding an `AsGeoJSON(geometry)` clause to it - so that's something I'm open to as well.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-1105608964", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 1105608964, "node_id": "IC_kwDOBm6k_c5B5kEE", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-21T18:26:29Z", "updated_at": "2022-04-21T18:26:29Z", "author_association": "OWNER", "body": "I'm questioning whether the mechanisms should be separate at all now - a single response rendering is really just a case of a streaming response that only pulls the first N records from the iterator.\r\n\r\nIt probably needs to be an `async for` iterator, which I've not worked with much before. 
Good opportunity to learn.\r\n\r\nThis actually gets a fair bit more complicated due to the work I'm doing right now to improve the default JSON API:\r\n\r\n- #1709\r\n\r\nI want to do things like make faceting results optionally available to custom renderers - which is a separate concern from streaming rows.\r\n\r\nI'm going to poke around with a bunch of prototypes and see what sticks.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-1105588651", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 1105588651, "node_id": "IC_kwDOBm6k_c5B5fGr", "user": {"value": 25778, "label": "eyeseast"}, "created_at": "2022-04-21T18:15:39Z", "updated_at": "2022-04-21T18:15:39Z", "author_association": "CONTRIBUTOR", "body": "What if you split rendering and streaming into two things:\r\n\r\n- `render` is a function that returns a response\r\n- `stream` is a function that sends chunks, or yields chunks passed to an ASGI `send` callback\r\n\r\nThat way current plugins still work, and streaming is purely additive. 
A `stream` function could get a cursor or iterator of rows, instead of a list, so it could more efficiently handle large queries.\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-1105571003", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 1105571003, "node_id": "IC_kwDOBm6k_c5B5ay7", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-21T18:10:38Z", "updated_at": "2022-04-21T18:10:46Z", "author_association": "OWNER", "body": "Maybe the simplest design for this is to add an optional `can_stream` to the contract:\r\n\r\n```python\r\n @hookimpl\r\n def register_output_renderer(datasette):\r\n return {\r\n \"extension\": \"tsv\",\r\n \"render\": render_tsv,\r\n \"can_render\": lambda: True,\r\n \"can_stream\": lambda: True\r\n }\r\n```\r\nWhen streaming, a new parameter could be passed to the render function - maybe `chunks` - which is an iterator/generator over a sequence of chunks of rows.\r\n\r\nOr it could use the existing `rows` parameter but treat that as an iterator?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-869812567", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 869812567, "node_id": "MDEyOklzc3VlQ29tbWVudDg2OTgxMjU2Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-06-28T16:06:57Z", "updated_at": "2021-06-28T16:07:24Z", "author_association": 
"OWNER", "body": "Relevant blog post: https://simonwillison.net/2021/Jun/25/streaming-large-api-responses/ - including notes on efficiently streaming formats with some kind of separator in between the records (regular JSON).\r\n\r\n> Some export formats are friendlier for streaming than others. CSV and TSV are pretty easy to stream, as is newline-delimited JSON.\r\n> \r\n> Regular JSON requires a bit more thought: you can output a `[` character, then output each row in a stream with a comma suffix, then skip the comma for the last row and output a `]`. Doing that requires peeking ahead (looping two at a time) to verify that you haven't yet reached the end.\r\n> \r\n> Or... Martin De Wulf [pointed out](https://twitter.com/madewulf/status/1405559088994467844) that you can output the first row, then output every other row with a preceding comma---which avoids the whole \"iterate two at a time\" problem entirely.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-869191854", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 869191854, "node_id": "MDEyOklzc3VlQ29tbWVudDg2OTE5MTg1NA==", "user": {"value": 25778, "label": "eyeseast"}, "created_at": "2021-06-27T16:42:14Z", "updated_at": "2021-06-27T16:42:14Z", "author_association": "CONTRIBUTOR", "body": "This would really help with this issue: https://github.com/eyeseast/datasette-geojson/issues/7", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": 
"https://github.com/simonw/datasette/issues/1101#issuecomment-755134771", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 755134771, "node_id": "MDEyOklzc3VlQ29tbWVudDc1NTEzNDc3MQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-01-06T07:28:01Z", "updated_at": "2021-01-06T07:28:01Z", "author_association": "OWNER", "body": "With this structure it will become possible to stream non-newline-delimited JSON array-of-objects too - the `stream_rows()` method could output `[` first, then each row followed by a comma, then `]` after the very last row.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-755133937", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 755133937, "node_id": "MDEyOklzc3VlQ29tbWVudDc1NTEzMzkzNw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-01-06T07:25:48Z", "updated_at": "2021-01-06T07:26:43Z", "author_association": "OWNER", "body": "Idea: instead of returning a dictionary, `register_output_renderer` could return an object. 
The object could have the following properties:\r\n\r\n- `.extension` - the extension to use\r\n- `.can_render(...)` - says if it can render this\r\n- `.can_stream(...)` - says if streaming is supported\r\n- `async .stream_rows(rows_iterator, send)` - method that loops through all rows and uses `send` to send them to the response in the correct format\r\n\r\nI can then deprecate the existing `dict` return type for 1.0.", "reactions": "{\"total_count\": 2, \"+1\": 2, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-755128038", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 755128038, "node_id": "MDEyOklzc3VlQ29tbWVudDc1NTEyODAzOA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-01-06T07:10:22Z", "updated_at": "2021-01-06T07:10:22Z", "author_association": "OWNER", "body": "Yet another use-case for this: I want to be able to stream newline-delimited JSON in order to better import into Pandas:\r\n\r\n pandas.read_json(\"https://latest.datasette.io/fixtures/compound_three_primary_keys.json?_shape=array&_nl=on\", lines=True)", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-732544590", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 732544590, "node_id": "MDEyOklzc3VlQ29tbWVudDczMjU0NDU5MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-11-24T02:22:55Z", "updated_at": "2020-11-24T02:22:55Z", 
"author_association": "OWNER", "body": "The trick I'm using here is to follow the `next_url` in order to paginate through all of the matching results. The loop calls the `data()` method multiple times, once for each page of results: https://github.com/simonw/datasette/blob/4bac9f18f9d04e5ed10f072502bcc508e365438e/datasette/views/base.py#L304-L307", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1101#issuecomment-732543700", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1101", "id": 732543700, "node_id": "MDEyOklzc3VlQ29tbWVudDczMjU0MzcwMA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-11-24T02:20:30Z", "updated_at": "2020-11-24T02:20:30Z", "author_association": "OWNER", "body": "Current design: https://docs.datasette.io/en/stable/plugin_hooks.html#register-output-renderer-datasette\r\n\r\n```python\r\n@hookimpl\r\ndef register_output_renderer(datasette):\r\n return {\r\n \"extension\": \"test\",\r\n \"render\": render_demo,\r\n \"can_render\": can_render_demo, # Optional\r\n }\r\n```\r\nWhere `render_demo` looks something like this:\r\n```python\r\nasync def render_demo(datasette, columns, rows):\r\n db = datasette.get_database()\r\n result = await db.execute(\"select sqlite_version()\")\r\n first_row = \" | \".join(columns)\r\n lines = [first_row]\r\n lines.append(\"=\" * len(first_row))\r\n for row in rows:\r\n lines.append(\" | \".join(row))\r\n return Response(\r\n \"\\n\".join(lines),\r\n content_type=\"text/plain; charset=utf-8\",\r\n headers={\"x-sqlite-version\": result.first()[0]}\r\n )\r\n```\r\nMeanwhile here's where the CSV streaming mode is implemented: 
https://github.com/simonw/datasette/blob/4bac9f18f9d04e5ed10f072502bcc508e365438e/datasette/views/base.py#L297-L380", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 749283032, "label": "register_output_renderer() should support streaming data"}, "performed_via_github_app": null}