html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | issue | performed_via_github_app |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/simonw/datasette/pull/1574#issuecomment-1105464661 | https://api.github.com/repos/simonw/datasette/issues/1574 | 1105464661 | IC_kwDOBm6k_c5B5A1V | 208018 | 2022-04-21T16:51:24Z | 2022-04-21T16:51:24Z | NONE | tfw you have more ephemeral storage than upstream bandwidth <code>FROM python:3.10-slim AS base<br>RUN apt update && apt -y install zstd<br>ENV DATASETTE_SECRET 'sosecret'<br>RUN --mount=type=cache,target=/root/.cache/pip pip install -U datasette datasette-pretty-json datasette-graphql<br>ENV PORT 8080<br>EXPOSE 8080<br><br>FROM base AS pack<br>COPY . /app<br>WORKDIR /app<br>RUN datasette inspect --inspect-file inspect-data.json<br>RUN zstd --rm *.db<br><br>FROM base AS unpack<br>COPY --from=pack /app /app<br>WORKDIR /app<br>CMD ["/bin/bash", "-c", "shopt -s nullglob && zstd --rm -d *.db.zst && datasette serve --host 0.0.0.0 --cors --inspect-file inspect-data.json --metadata metadata.json --create --port $PORT *.db"]</code> | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 1084193403 | |
https://github.com/simonw/datasette/issues/1101#issuecomment-1105571003 | https://api.github.com/repos/simonw/datasette/issues/1101 | 1105571003 | IC_kwDOBm6k_c5B5ay7 | 9599 | 2022-04-21T18:10:38Z | 2022-04-21T18:10:46Z | OWNER | Maybe the simplest design for this is to add an optional `can_stream` to the contract: ```python @hookimpl def register_output_renderer(datasette): return { "extension": "tsv", "render": render_tsv, "can_render": lambda: True, "can_stream": lambda: True } ``` When streaming, a new parameter could be passed to the render function - maybe `chunks` - which is an iterator/generator over a sequence of chunks of rows. Or it could use the existing `rows` parameter but treat that as an iterator? (A sketch of how such a `chunks` iterator could be built follows the table.) | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 749283032 | |
https://github.com/simonw/datasette/issues/1101#issuecomment-1105588651 | https://api.github.com/repos/simonw/datasette/issues/1101 | 1105588651 | IC_kwDOBm6k_c5B5fGr | 25778 | 2022-04-21T18:15:39Z | 2022-04-21T18:15:39Z | CONTRIBUTOR | What if you split rendering and streaming into two things: - `render` is a function that returns a response - `stream` is a function that sends chunks, or yields chunks passed to an ASGI `send` callback That way current plugins still work, and streaming is purely additive. A `stream` function could get a cursor or iterator of rows, instead of a list, so it could more efficiently handle large queries. (A sketch of this render/stream split follows the table.) | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 749283032 | |
https://github.com/simonw/datasette/issues/1101#issuecomment-1105608964 | https://api.github.com/repos/simonw/datasette/issues/1101 | 1105608964 | IC_kwDOBm6k_c5B5kEE | 9599 | 2022-04-21T18:26:29Z | 2022-04-21T18:26:29Z | OWNER | I'm questioning if the mechanisms should be separate at all now - a single response rendering is really just a case of a streaming response that only pulls the first N records from the iterator (a small helper along those lines is sketched after the table). It probably needs to be an `async for` iterator, which I've not worked with much before. Good opportunity to learn. This actually gets a fair bit more complicated due to the work I'm doing right now to improve the default JSON API: - #1709 I want to do things like make faceting results optionally available to custom renderers - which is a separate concern from streaming rows. I'm going to poke around with a bunch of prototypes and see what sticks. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 749283032 | |
https://github.com/simonw/datasette/issues/1101#issuecomment-1105615625 | https://api.github.com/repos/simonw/datasette/issues/1101 | 1105615625 | IC_kwDOBm6k_c5B5lsJ | 9599 | 2022-04-21T18:31:41Z | 2022-04-21T18:32:22Z | OWNER | The `datasette-geojson` plugin is actually an interesting case here, because of the way it converts SpatiaLite geometries into GeoJSON: https://github.com/eyeseast/datasette-geojson/blob/602c4477dc7ddadb1c0a156cbcd2ef6688a5921d/datasette_geojson/__init__.py#L61-L66 ```python if isinstance(geometry, bytes): results = await db.execute( "SELECT AsGeoJSON(:geometry)", {"geometry": geometry} ) return geojson.loads(results.single_value()) ``` That actually seems to work really well as-is, but it does worry me a bit that it ends up having to execute an extra `SELECT` query for every single returned row - especially in streaming mode where it might be asked to return 1m rows at once. My PostgreSQL/MySQL engineering brain says that this would be better handled by doing a chunk of these (maybe 100) at once, to avoid the per-query overhead - but with SQLite that might not be necessary. At any rate, this is one of the reasons I'm interested in "iterate over this sequence of chunks of 100 rows at a time" as a potential option here (a batched-conversion sketch follows the table). Of course, a better solution would be for `datasette-geojson` to have a way to influence the SQL query before it is executed, adding an `AsGeoJSON(geometry)` clause to it - so that's something I'm open to as well. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 749283032 | |
https://github.com/simonw/datasette/issues/1101#issuecomment-1105642187 | https://api.github.com/repos/simonw/datasette/issues/1101 | 1105642187 | IC_kwDOBm6k_c5B5sLL | 25778 | 2022-04-21T18:59:08Z | 2022-04-21T18:59:08Z | CONTRIBUTOR | Ha! That was your idea (and a good one). But it's probably worth measuring to see what overhead it adds. It did require both passing in the database and making the whole thing `async`. Just timing the queries themselves: 1. [Using `AsGeoJSON(geometry) as geometry`](https://alltheplaces-datasette.fly.dev/alltheplaces?sql=select%0D%0A++id%2C%0D%0A++properties%2C%0D%0A++AsGeoJSON%28geometry%29+as+geometry%2C%0D%0A++spider%0D%0Afrom%0D%0A++places%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++1000) takes 10.235 ms 2. [Leaving as binary](https://alltheplaces-datasette.fly.dev/alltheplaces?sql=select%0D%0A++id%2C%0D%0A++properties%2C%0D%0A++geometry%2C%0D%0A++spider%0D%0Afrom%0D%0A++places%0D%0Aorder+by%0D%0A++id%0D%0Alimit%0D%0A++1000) takes 8.63 ms Looking at the network panel: 1. Takes about 200 ms for the `fetch` request 2. Takes about 300 ms I'm not sure how best to time the GeoJSON generation, but it would be interesting to check. Maybe I'll write a plugin to add query times to response headers (a timing-middleware sketch follows the table). The other thing to consider with async streaming is that it might be well-suited for a slower response. When I have to get the whole result and send a response in a fixed amount of time, I need the most efficient query possible. If I can hang onto a connection and get things one chunk at a time, maybe it's ok if there's some overhead. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 749283032 | |
https://github.com/dogsheep/github-to-sqlite/issues/72#issuecomment-1105474232 | https://api.github.com/repos/dogsheep/github-to-sqlite/issues/72 | 1105474232 | IC_kwDODFdgUs5B5DK4 | 9599 | 2022-04-21T17:02:15Z | 2022-04-21T17:02:15Z | MEMBER | That's interesting - yeah it looks like the number of pages can be derived from the `Link` header, which is enough information to show a progress bar - probably using Click, which is already a dependency, to avoid adding a new one. https://docs.github.com/en/rest/guides/traversing-with-pagination (a Link-header parsing sketch follows the table) | { "total_count": 1, "+1": 1, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } | 1211283427 | |
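
A few hedged sketches of the ideas discussed in the comments above follow. First, comment 1105571003's proposed `chunks` parameter: one way the caller could build it is an async generator that batches an async iterator of rows into fixed-size lists. The name `chunk_rows` and the default batch size are illustrative, not part of any existing Datasette API.

```python
from typing import Any, AsyncIterator, List


async def chunk_rows(
    rows: AsyncIterator[Any], size: int = 100
) -> AsyncIterator[List[Any]]:
    """Batch an async iterator of rows into lists of at most `size` rows."""
    chunk: List[Any] = []
    async for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the trailing partial chunk, if any.
        yield chunk
```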
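Next, the render/stream split proposed in comment 1105588651: `render` builds a whole response body, while `stream` pushes blocks through an ASGI `send` callable as they are produced. All names and signatures here are assumptions about how such a contract could look, not current plugin API.

```python
def render_tsv(columns, rows):
    """Non-streaming path: build the entire TSV body in memory."""
    lines = ["\t".join(columns)]
    lines.extend("\t".join(str(row[c]) for c in columns) for row in rows)
    return "\n".join(lines) + "\n"


async def stream_tsv(columns, chunks, send):
    """Streaming path: emit one ASGI http.response.body event per chunk.

    `chunks` is assumed to be an async iterator of lists of row dicts.
    """
    header = ("\t".join(columns) + "\n").encode()
    await send({"type": "http.response.body", "body": header, "more_body": True})
    async for chunk in chunks:
        block = "".join(
            "\t".join(str(row[c]) for c in columns) + "\n" for row in chunk
        )
        await send(
            {"type": "http.response.body", "body": block.encode(), "more_body": True}
        )
    # Empty final event closes the response.
    await send({"type": "http.response.body", "body": b"", "more_body": False})
```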
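Comment 1105608964's observation that a single response is just "a streaming response that only pulls the first N records from the iterator" reduces to a small helper over an async iterator; a minimal sketch:

```python
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")


async def take(rows: AsyncIterator[T], n: int) -> List[T]:
    """Pull at most the first n records from an async iterator, then stop."""
    out: List[T] = []
    async for row in rows:
        out.append(row)
        if len(out) == n:
            break
    return out
```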
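The batching idea from comment 1105615625 - converting maybe 100 geometries per SQL round-trip instead of one `SELECT` per row - could look something like the sketch below. The helper name, batch size, and the `VALUES`-based SQL shape are illustrative; `db.execute` follows the Datasette database API used in the quoted `datasette-geojson` snippet, and `json.loads` stands in for `geojson.loads`.

```python
import json


async def geometries_to_geojson(db, geometries, batch_size=100):
    """Convert many SpatiaLite geometry blobs per SQL round-trip."""
    parsed = []
    for start in range(0, len(geometries), batch_size):
        batch = geometries[start : start + batch_size]
        values = ", ".join("(?)" for _ in batch)
        # SQLite names VALUES columns column1, column2, ...; SpatiaLite's
        # AsGeoJSON then runs once per geometry in the batch.
        results = await db.execute(
            f"SELECT AsGeoJSON(column1) FROM (VALUES {values})", batch
        )
        parsed.extend(json.loads(row[0]) for row in results.rows)
    return parsed
```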
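Comment 1105642187's "plugin to add query times to response headers" idea: `asgi_wrapper` is a real Datasette plugin hook, but this sketch only measures total response time via a `Server-Timing` header, which is an approximation of the per-query timing being discussed.

```python
import time

from datasette import hookimpl


@hookimpl
def asgi_wrapper(datasette):
    def wrap(app):
        async def timed_app(scope, receive, send):
            start = time.perf_counter()

            async def timed_send(event):
                if event["type"] == "http.response.start":
                    # Attach elapsed time to the response headers.
                    elapsed_ms = (time.perf_counter() - start) * 1000
                    headers = list(event.get("headers", []))
                    headers.append(
                        (b"server-timing", f"total;dur={elapsed_ms:.1f}".encode())
                    )
                    event = {**event, "headers": headers}
                await send(event)

            await app(scope, receive, timed_send)

        return timed_app

    return wrap
```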
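Finally, the progress-bar idea from comment 1105474232: the page count can be read from the `Link` header's `rel="last"` URL (requests parses that header into `response.links`) and used to drive `click.progressbar`. The function names are illustrative, not github-to-sqlite's actual code, and the sketch assumes a GitHub list endpoint that returns a JSON array per page.

```python
import urllib.parse

import click
import requests


def last_page(response):
    """Total number of pages per the Link header's rel="last" URL, or None."""
    link = response.links.get("last")
    if link is None:
        return None
    query = urllib.parse.urlparse(link["url"]).query
    return int(urllib.parse.parse_qs(query)["page"][0])


def fetch_all_pages(url):
    """Yield items from every page, advancing a progress bar per page."""
    response = requests.get(url)
    total = last_page(response) or 1
    with click.progressbar(length=total, label="Fetching pages") as bar:
        while True:
            bar.update(1)
            yield from response.json()
            nxt = response.links.get("next")
            if nxt is None:
                break
            response = requests.get(nxt["url"])
```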