{"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1259718517", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1259718517, "node_id": "IC_kwDOBm6k_c5LFcd1", "user": {"value": 536941, "label": "fgregg"}, "created_at": "2022-09-27T16:02:51Z", "updated_at": "2022-09-27T16:04:46Z", "author_association": "CONTRIBUTOR", "body": "i think that `max_returned_rows` **is** a defense mechanism, just not for connection exhaustion. `max_returned_rows` is a defense mechanism against **memory bombs**.\r\n\r\nif you are potentially yielding out hundreds of thousands or even millions of rows, you need to be quite careful about data flow to not run out of memory on the server, or on the client.\r\n\r\nyou have a lot of places in your code that are protective of that right now, but `max_returned_rows` acts as the final backstop.\r\n\r\nso, given that, it makes sense to have removing `max_returned_rows` altogether be a non-goal, but instead allow for for specific codepaths (like streaming csv's) be able to bypass.\r\n\r\nthat could dramatically lower the surface area for a memory-bomb attack.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1259693536", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1259693536, "node_id": "IC_kwDOBm6k_c5LFWXg", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T15:42:55Z", "updated_at": "2022-09-27T15:42:55Z", "author_association": "OWNER", "body": "It's interesting to note WHY the time limit works against this so well.\r\n\r\nThe time limit as-implemented looks like this:\r\n\r\nhttps://github.com/simonw/datasette/blob/5f9f567acbc58c9fcd88af440e68034510fb5d2b/datasette/utils/__init__.py#L181-L201\r\n\r\nThe key here is `conn.set_progress_handler(handler, n)` - which specifies that the handler function should be called every `n` SQLite operations.\r\n\r\nThe handler function then checks to see if too much time has transpired and conditionally cancels the query.\r\n\r\nThis also doubles up as a \"maximum number of operations\" guard, which is what's happening when you attempt to fetch an infinite number of rows from an infinite table.\r\n\r\nThat limit code could even be extended to say \"exit the query after either 5s or 50,000,000 operations\".\r\n\r\nI don't think that's necessary though.\r\n\r\nTo be honest I'm having trouble with the idea of dropping `max_returned_rows` mainly because what Datasette does (allow arbitrary untrusted SQL queries) is dangerous, so I've designed in multiple redundant defence-in-depth mechanisms right from the start.", "reactions": "{\"total_count\": 1, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 1, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258910228", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258910228, "node_id": "IC_kwDOBm6k_c5LCXIU", "user": {"value": 536941, "label": "fgregg"}, "created_at": "2022-09-27T03:11:07Z", "updated_at": "2022-09-27T03:11:07Z", 
"author_association": "CONTRIBUTOR", "body": "i think this feature would be safe, as its really only the time limit that can, and imo, should protect against long running queries, as it is pretty easy to make very expensive queries that don't return many rows.\r\n\r\nmoving away from `max_returned_rows` will requires some thinking about:\r\n\r\n1. memory usage and data flows to handle potentially very large result sets\r\n2. how to avoid rendering tens or hundreds of thousands of [html rows](#1655).", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258906440", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258906440, "node_id": "IC_kwDOBm6k_c5LCWNI", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T03:04:37Z", "updated_at": "2022-09-27T03:04:37Z", "author_association": "OWNER", "body": "It would be really neat if we could explore this idea in a plugin, but I don't think Datasette has plugin hooks in the right place for that at the moment.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258905781", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258905781, "node_id": "IC_kwDOBm6k_c5LCWC1", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T03:03:35Z", "updated_at": "2022-09-27T03:03:47Z", "author_association": "OWNER", "body": "Yes good point, the time limit does already protect against that. I've been contemplating a permissioned-users-only relaxation of that time limit too, and I got that idea mixed up with this one in my head.\r\n\r\nOn that basis maybe this feature would be safe after all? 
Would need to do some testing, but it may be that the existing time limit provides enough protection here already.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258878311", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258878311, "node_id": "IC_kwDOBm6k_c5LCPVn", "user": {"value": 536941, "label": "fgregg"}, "created_at": "2022-09-27T02:19:48Z", "updated_at": "2022-09-27T02:19:48Z", "author_association": "CONTRIBUTOR", "body": "this sql query doesn't trip up `max_returned_rows` but does time out\r\n\r\n```sql\r\nwith recursive counter(x) as (\r\n  select 0\r\n  union\r\n  select x + 1 from counter\r\n)\r\nselect * from counter LIMIT 10 OFFSET 100000000\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258871525", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258871525, "node_id": "IC_kwDOBm6k_c5LCNrl", "user": {"value": 536941, "label": "fgregg"}, "created_at": "2022-09-27T02:09:32Z", "updated_at": "2022-09-27T02:14:53Z", "author_association": "CONTRIBUTOR", "body": "thanks @simonw, i learned something i didn't know about sqlite's execution model!\r\n\r\n> Imagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.\r\n\r\nwhy wouldn't the `sqlite_timelimit` guard prevent that?\r\n\r\n---\r\non my local version which has the code to [turn off truncations for query csv](#1820), `sqlite_timelimit` does protect me.\r\n\r\n![Screenshot 2022-09-26 at 22-14-31 Error 500](https://user-images.githubusercontent.com/536941/192415680-94b32b7f-868f-4b89-8194-5752d45f6009.png)\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258864140", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258864140, "node_id": "IC_kwDOBm6k_c5LCL4M", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T01:55:32Z", "updated_at": "2022-09-27T01:55:32Z", "author_association": "OWNER", "body": "That recursive query is a great example of the kind of thing a maximum row limit protects against.\r\n\r\nImagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.
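\r\n\r\nTo make that concrete, a no-limit CSV stream boils down to a fetch loop like this sketch (illustrative only, not Datasette's actual CSV code) - run against the recursive `counter` query it would simply never finish:\r\n\r\n```python\r\nimport sqlite3\r\n\r\nconn = sqlite3.connect(\":memory:\")\r\ncursor = conn.execute(\"\"\"\r\nwith recursive counter(x) as (\r\n  select 0\r\n  union\r\n  select x + 1 from counter\r\n)\r\nselect * from counter\"\"\")\r\n\r\n# For the infinite counter table fetchmany() always returns a full batch,\r\n# so with no row cap this loop holds the connection until something\r\n# like a time limit interrupts it\r\nwhile True:\r\n    batch = cursor.fetchmany(1000)\r\n    if not batch:\r\n        break\r\n    for row in batch:\r\n        pass  # write the row out as CSV here\r\n```\r\n\r\n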
Even if this feature becomes a permission-guarded thing we still need to take that case into account.\r\n\r\nAt the very least it would be good if the query could be cancelled if the client disconnects - so if someone accidentally starts an infinite query they can cancel the request and free up the server resources.\r\n\r\nIt might be a good idea to implement a page that shows \"currently running\" queries and allows users with the right permission to terminate them from that page.\r\n\r\nAnother option: a \"limit of last resort\" - either a very high row limit (10,000,000 perhaps) or even a time limit, saying that all queries will be cancelled if they take longer than thirty minutes or similar.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258860845", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258860845, "node_id": "IC_kwDOBm6k_c5LCLEt", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T01:48:31Z", "updated_at": "2022-09-27T01:50:01Z", "author_association": "OWNER", "body": "The protection is supposed to be from this line:\r\n```python\r\nrows = cursor.fetchmany(max_returned_rows + 1)\r\n```\r\nBy capping the call to `.fetchmany()` at `max_returned_rows + 1` (the `+ 1` is to allow detection of whether or not there is a next page) I'm ensuring that Datasette never attempts to iterate over a huge result set.\r\n\r\nSQLite and the `sqlite3` library seem to handle this correctly. Here's an example:\r\n\r\n```pycon\r\n>>> import sqlite3\r\n>>> conn = sqlite3.connect(\":memory:\")\r\n>>> cursor = conn.execute(\"\"\"\r\n... with recursive counter(x) as (\r\n... select 0\r\n... union\r\n... select x + 1 from counter\r\n... )\r\n... select * from counter\"\"\")\r\n>>> cursor.fetchmany(10)\r\n[(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)]\r\n```\r\n`counter` there is an infinitely long table ([see TIL](https://til.simonwillison.net/sqlite/simple-recursive-cte)) - but we can retrieve the first 10 results without going into an infinite loop.\r\n
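\r\nSpelled out, the next-page detection that the `+ 1` enables looks something like this (a sketch - the variable names and cap value here are illustrative, not the exact Datasette code):\r\n\r\n```python\r\nmax_returned_rows = 100\r\nrows = cursor.fetchmany(max_returned_rows + 1)\r\n# An extra row means there is a next page; trim it before returning\r\ntruncated = len(rows) > max_returned_rows\r\nrows = rows[:max_returned_rows]\r\n```\r\n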
", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258849766", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258849766, "node_id": "IC_kwDOBm6k_c5LCIXm", "user": {"value": 536941, "label": "fgregg"}, "created_at": "2022-09-27T01:27:03Z", "updated_at": "2022-09-27T01:27:03Z", "author_association": "CONTRIBUTOR", "body": "i agree with that concern! but if i'm understanding the code correctly, `max_returned_rows` does not protect against long-running queries in any way.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/526#issuecomment-1258846992", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/526", "id": 1258846992, "node_id": "IC_kwDOBm6k_c5LCHsQ", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-09-27T01:21:41Z", "updated_at": "2022-09-27T01:21:41Z", "author_association": "OWNER", "body": "My main concern here is that public Datasette instances could easily have all of their available database connections consumed by long-running queries - either accidentally or deliberately.\r\n\r\nI do totally understand the need for this feature though. I think it can absolutely make sense provided it's protected by authentication and permissions.\r\n\r\nMaybe even limit the number of concurrent downloads so that there's always at least one database connection free for other requests.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 459882902, "label": "Stream all results for arbitrary SQL and canned queries"}, "performed_via_github_app": null}