{"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1260355224", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1260355224, "node_id": "IC_kwDOBm6k_c5LH36Y", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-28T04:01:25Z", "updated_at": "2022-09-28T04:01:25Z", "author_association": "OWNER", "body": "The ultimate protection against those memory bombs is to support more streaming output formats. Related issues:\r\n\r\n- #1177 \r\n- #1062", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1259693536", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1259693536, "node_id": "IC_kwDOBm6k_c5LFWXg", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T15:42:55Z", "updated_at": "2022-09-27T15:42:55Z", "author_association": "OWNER", "body": "It's interesting to note WHY the time limit works against this so well.\r\n\r\nThe time limit as-implemented looks like this:\r\n\r\nhttps://github.com/simonw/datasette/blob/5f9f567acbc58c9fcd88af440e68034510fb5d2b/datasette/utils/__init__.py#L181-L201\r\n\r\nThe key here is `conn.set_progress_handler(handler, n)` - which specifies that the handler function should be called every `n` SQLite operations.\r\n\r\nThe handler function then checks to see if too much time has transpired and conditionally cancels the query.\r\n\r\nThis also doubles up as a \"maximum number of operations\" guard, which is what's happening when you attempt to fetch an infinite number of rows from an infinite table.\r\n\r\nThat limit code could even be extended to say \"exit the query after either 5s or 50,000,000 operations\".\r\n\r\nI don't think that's necessary though.\r\n\r\nTo be honest I'm having trouble with the idea of dropping `max_returned_rows` mainly because what Datasette does (allow arbitrary untrusted SQL queries) is dangerous, so I've designed in multiple redundant defence-in-depth mechanisms right from the start.", "reactions": "{\"total_count\": 1, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 1, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258906440", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258906440, "node_id": "IC_kwDOBm6k_c5LCWNI", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T03:04:37Z", "updated_at": "2022-09-27T03:04:37Z", "author_association": "OWNER", "body": "It would be really neat if we could explore this idea in a plugin, but I don't think Datasette has plugin hooks in the right place for that at the moment.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258905781", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258905781, 
"node_id": "IC_kwDOBm6k_c5LCWC1", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T03:03:35Z", "updated_at": "2022-09-27T03:03:47Z", "author_association": "OWNER", "body": "Yes good point, the time limit does already protect against that. I've been contemplating a permissioned-users-only relaxation of that time limit too, and I got that idea mixed up with this one in my head.\r\n\r\nOn that basis maybe this feature would be safe after all? Would need to do some testing, but it may be that the existing time limit provides enough protection here already.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258864140", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258864140, "node_id": "IC_kwDOBm6k_c5LCL4M", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T01:55:32Z", "updated_at": "2022-09-27T01:55:32Z", "author_association": "OWNER", "body": "That recursive query is a great example of the kind of thing having a maximum row limit protects against.\r\n\r\nImagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.\r\n\r\nEven if this feature becomes a permission-guarded thing we still need to take that case into account.\r\n\r\nAt the very least it would be good if the query could be cancelled if the client disconnects - so if someone accidentally starts an infinite query they can cancel the request and free up the server resources.\r\n\r\nIt might be a good idea to implement a page that shows \"currently running\" queries and allows users with the right permission to terminate them from that page.\r\n\r\nAnother option: a \"limit of last resource\" - either a very high row limit (10,000,000 perhaps) or even a time limit, saying that all queries will be cancelled if they take longer than thirty minutes or similar.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258860845", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258860845, "node_id": "IC_kwDOBm6k_c5LCLEt", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T01:48:31Z", "updated_at": "2022-09-27T01:50:01Z", "author_association": "OWNER", "body": "The protection is supposed to be from this line:\r\n```python\r\nrows = cursor.fetchmany(max_returned_rows + 1) \r\n```\r\nBy capping the call to `.fetchman()` at `max_returned_rows + 1` (the `+ 1` is to allow detection of whether or not there is a next page) I'm ensuring that Datasette never attempts to iterate over a huge result set.\r\n\r\nSQLite and the `sqlite3` library seem to handle this correctly. Here's an example:\r\n\r\n```pycon\r\n>>> import sqlite3\r\n>>> conn = sqlite3.connect(\":memory:\")\r\n>>> cursor = conn.execute(\"\"\"\r\n... with recursive counter(x) as (\r\n... select 0\r\n... union\r\n... select x + 1 from counter\r\n... )\r\n... 
select * from counter\"\"\")\r\n>>> cursor.fetchmany(10)\r\n[(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,)]\r\n```\r\n`counter` there is an infinitely long table ([see TIL](https://til.simonwillison.net/sqlite/simple-recursive-cte)) - but we can retrieve the first 10 results without going into an infinite loop.\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258846992", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258846992, "node_id": "IC_kwDOBm6k_c5LCHsQ", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T01:21:41Z", "updated_at": "2022-09-27T01:21:41Z", "author_association": "OWNER", "body": "My main concern here is that public Datasette instances could easily have all of their available database connections consumed by long-running queries - either accidentally or deliberately.\r\n\r\nI do totally understand the need for this feature though. I think it can absolutely make sense provided it's protected by authentication and permissions.\r\n\r\nMaybe even limit the number of concurrent downloads at once such that there's always at least one database connection free for other requests.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1074019047", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1074019047, "node_id": "IC_kwDOBm6k_c5ABDrn", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-03-21T15:09:56Z", "updated_at": "2022-03-21T15:09:56Z", "author_association": "OWNER", "body": "I should research how much overhead creating a new connection costs - it may be that an easy way to solve this is to create A dedicated connection for the query and then close that connection at the end.", "reactions": "{\"total_count\": 1, \"+1\": 1, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-853567413", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 853567413, "node_id": "MDEyOklzc3VlQ29tbWVudDg1MzU2NzQxMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-06-03T05:11:27Z", "updated_at": "2021-06-03T05:11:27Z", "author_association": "OWNER", "body": "Another potential way to implement this would be to hold the SQLite connection open and execute the full query there.\r\n\r\nI've avoided this in the past due to concerns of resource exhaustion - if multiple requests attempt this at the same time all of the connections in the pool will become tied up and the site will be unable to respond to further requests.\r\n\r\nBut... 
now that Datasette has authentication there's the possibility of making this feature only available to specific authenticated users - the `--root` user for example. Which avoids the danger while unlocking a super-useful feature.\r\n\r\nNot to mention people who are running Datasette privately on their own laptop, or the proposed `--query` CLI feature in #1356.", "reactions": "{\"total_count\": 1, \"+1\": 1, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-505162238", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 505162238, "node_id": "MDEyOklzc3VlQ29tbWVudDUwNTE2MjIzOA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2019-06-24T20:14:51Z", "updated_at": "2019-06-24T20:14:51Z", "author_association": "OWNER", "body": "The other reason I didn't implement this in the first place is that adding offset/limit to a custom query (as opposed to a view) requires modifying the existing SQL - what if that SQL already has its own offset/limit clause?\r\n\r\nIt looks like I can solve that using a nested query:\r\n```sql\r\nselect * from (\r\n select * from compound_three_primary_keys limit 1000\r\n) limit 10 offset 100\r\n```\r\nhttps://latest.datasette.io/fixtures?sql=select+*+from+%28%0D%0A++select+*+from+compound_three_primary_keys+limit+1000%0D%0A%29+limit+10+offset+100\r\n\r\nSo I can wrap any user-provided SQL query in an outer offset/limit and implement pagination that way.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-505161008", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 505161008, "node_id": "MDEyOklzc3VlQ29tbWVudDUwNTE2MTAwOA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2019-06-24T20:11:15Z", "updated_at": "2019-06-24T20:11:15Z", "author_association": "OWNER", "body": "Views already use offset/limit pagination so actually I may be over-thinking this.\r\n\r\nMaybe the right thing to do here is to have the feature enabled by default, since it will work for the VAST majority of queries - the only ones that might cause problems are complex queries across millions of rows. 
It can continue to use aggressive internal time limits so if someone DOES trigger something expensive they'll get an error.\r\n\r\nI can allow users to disable the feature with a config setting, or increase the time limit if they need to.\r\n\r\nDowngrading this from a medium to a small since it's much less effort to enable the existing pagination method for this type of query.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-505060332", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 505060332, "node_id": "MDEyOklzc3VlQ29tbWVudDUwNTA2MDMzMg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2019-06-24T15:28:16Z", "updated_at": "2019-06-24T15:28:16Z", "author_association": "OWNER", "body": "This is currently a deliberate feature decision.\r\n\r\nThe problem is that the streaming CSV feature relies on Datasette's automated efficient pagination under the hood. When you stream a CSV you're actually causing Datasette to paginate through the full set of \"pages\" under the hood, streaming each page out as a new chunk of CSV rows.\r\n\r\nThis mechanism only works if the `next_url` has been generated for the page. Currently the `next_url` is available for table views (where it uses [the primary key or the sort column](https://datasette.readthedocs.io/en/stable/sql_queries.html#pagination)) and for views, but it's not set for canned queries because I can't be certain they can be efficiently paginated.\r\n\r\nOffset/limit pagination for canned queries would be a pretty nasty performance hit, because each subsequent page would require even more time for SQLite to scroll through to the specified offset.\r\n\r\nThis does seem like it's worth fixing though: pulling every row for a canned query would definitely be useful. The problem is that the pagination trick used elsewhere isn't right for canned queries - instead I would need to keep the database cursor open until ALL rows had been fetched. Figuring out how to do that efficiently within an asyncio-managed thread pool may take some thought.\r\n\r\nMaybe this feature ends up as something which is turned off by default (due to the risk of it causing uptime problems for public sites) but that users working on their own private environments can turn on?\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null}
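The progress-handler time limit described in the comments above can be illustrated with a small, self-contained sketch. This is not the actual Datasette implementation (which lives in `datasette/utils/__init__.py`); the `time_limit` helper name, the handler interval and the 20ms timeout here are illustrative assumptions.

```python
import sqlite3
import time
from contextlib import contextmanager


@contextmanager
def time_limit(conn, ms, n=1000):
    # Ask SQLite to call handler() every n virtual-machine instructions;
    # a truthy return value from the handler aborts the running query.
    deadline = time.perf_counter() + ms / 1000

    def handler():
        return time.perf_counter() >= deadline

    conn.set_progress_handler(handler, n)
    try:
        yield
    finally:
        conn.set_progress_handler(None, n)


conn = sqlite3.connect(":memory:")
try:
    with time_limit(conn, 20):
        # Aggregating over an infinite recursive CTE would never finish
        # without the progress handler cancelling it.
        conn.execute(
            "with recursive counter(x) as (select 0 union select x + 1 from counter) "
            "select max(x) from counter"
        ).fetchall()
except sqlite3.OperationalError as ex:
    print(ex)  # "interrupted" - the runaway query was cancelled
```

As the comments note, the same hook could also be extended to cap total operations (for example "5s or 50,000,000 operations") by counting handler invocations.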
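The "keep the database cursor open until all rows have been fetched" approach mentioned for canned queries could look roughly like the chunked iteration below. Again this is only a hedged sketch, not Datasette's code; the `stream_all_rows` helper and the 1,000-row chunk size are illustrative assumptions.

```python
import sqlite3


def stream_all_rows(conn, sql, params=None, chunk_size=1000):
    # Hold a single cursor open and yield rows in fixed-size chunks,
    # so the full result set never has to be materialized in memory.
    cursor = conn.execute(sql, params or [])
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield from rows


conn = sqlite3.connect(":memory:")
conn.execute("create table t (x integer)")
conn.executemany("insert into t values (?)", [(i,) for i in range(10_000)])
print(sum(1 for _ in stream_all_rows(conn, "select x from t")))  # 10000
```

In a real server this iteration would need to run in a worker thread (the asyncio-managed thread pool mentioned above) and be combined with the time limit, so a disconnecting or malicious client cannot tie up a connection indefinitely.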