
issue_comments

23 rows where issue = 459882902 sorted by updated_at descending


user 3

  • simonw 12
  • fgregg 10
  • jokull 1

author_association 3

  • OWNER 12
  • CONTRIBUTOR 10
  • NONE 1

issue 1

  • Stream all results for arbitrary SQL and canned queries · 23 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1260355224 https://github.com/simonw/datasette/issues/526#issuecomment-1260355224 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LH36Y simonw 9599 2022-09-28T04:01:25Z 2022-09-28T04:01:25Z OWNER

The ultimate protection against those memory bombs is to support more streaming output formats. Related issues:

  • 1177

  • 1062

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1259718517 https://github.com/simonw/datasette/issues/526#issuecomment-1259718517 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LFcd1 fgregg 536941 2022-09-27T16:02:51Z 2022-09-27T16:04:46Z CONTRIBUTOR

i think that max_returned_rows is a defense mechanism, just not for connection exhaustion. max_returned_rows is a defense mechanism against memory bombs.

if you are potentially yielding out hundreds of thousands or even millions of rows, you need to be quite careful about data flow to not run out of memory on the server, or on the client.

you have a lot of places in your code that are protective of that right now, but max_returned_rows acts as the final backstop.

so, given that, it makes sense to treat removing max_returned_rows altogether as a non-goal, and instead allow specific codepaths (like streaming csvs) to bypass it.

that could dramatically lower the surface area for a memory-bomb attack.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1259693536 https://github.com/simonw/datasette/issues/526#issuecomment-1259693536 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LFWXg simonw 9599 2022-09-27T15:42:55Z 2022-09-27T15:42:55Z OWNER

It's interesting to note WHY the time limit works against this so well.

The time limit as-implemented looks like this:

https://github.com/simonw/datasette/blob/5f9f567acbc58c9fcd88af440e68034510fb5d2b/datasette/utils/__init__.py#L181-L201

The key here is conn.set_progress_handler(handler, n) - which specifies that the handler function should be called every n SQLite operations.

The handler function then checks to see if too much time has transpired and conditionally cancels the query.

This also doubles up as a "maximum number of operations" guard, which is what's happening when you attempt to fetch an infinite number of rows from an infinite table.

That limit code could even be extended to say "exit the query after either 5s or 50,000,000 operations".

I don't think that's necessary though.
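
A minimal sketch of that progress-handler pattern (a hypothetical helper for illustration, not Datasette's exact implementation):

```python
import sqlite3
import time
from contextlib import contextmanager

@contextmanager
def sqlite_timelimit(conn, ms, n=1000):
    # Ask SQLite to call handler() every n virtual machine operations;
    # returning a non-zero value from the handler aborts the running query.
    deadline = time.perf_counter() + ms / 1000

    def handler():
        return 1 if time.perf_counter() > deadline else 0

    conn.set_progress_handler(handler, n)
    try:
        yield
    finally:
        # Remove the handler once the guarded block is finished.
        conn.set_progress_handler(None, n)

conn = sqlite3.connect(":memory:")
try:
    with sqlite_timelimit(conn, ms=20):
        conn.execute(
            "with recursive counter(x) as (select 0 union select x + 1 from counter) "
            "select * from counter"
        ).fetchall()
except sqlite3.OperationalError:
    # The aborted query surfaces as an "interrupted" error.
    print("query cancelled by the progress handler")
```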

To be honest I'm having trouble with the idea of dropping max_returned_rows mainly because what Datasette does (allow arbitrary untrusted SQL queries) is dangerous, so I've designed in multiple redundant defence-in-depth mechanisms right from the start.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 1,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258910228 https://github.com/simonw/datasette/issues/526#issuecomment-1258910228 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCXIU fgregg 536941 2022-09-27T03:11:07Z 2022-09-27T03:11:07Z CONTRIBUTOR

i think this feature would be safe, as it's really only the time limit that can (and, imo, should) protect against long-running queries, since it is pretty easy to make very expensive queries that don't return many rows.

moving away from max_returned_rows will require some thinking about:

  1. memory usage and data flows to handle potentially very large result sets
  2. how to avoid rendering tens or hundreds of thousands of html rows.
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258906440 https://github.com/simonw/datasette/issues/526#issuecomment-1258906440 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCWNI simonw 9599 2022-09-27T03:04:37Z 2022-09-27T03:04:37Z OWNER

It would be really neat if we could explore this idea in a plugin, but I don't think Datasette has plugin hooks in the right place for that at the moment.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258905781 https://github.com/simonw/datasette/issues/526#issuecomment-1258905781 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCWC1 simonw 9599 2022-09-27T03:03:35Z 2022-09-27T03:03:47Z OWNER

Yes good point, the time limit does already protect against that. I've been contemplating a permissioned-users-only relaxation of that time limit too, and I got that idea mixed up with this one in my head.

On that basis maybe this feature would be safe after all? Would need to do some testing, but it may be that the existing time limit provides enough protection here already.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258878311 https://github.com/simonw/datasette/issues/526#issuecomment-1258878311 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCPVn fgregg 536941 2022-09-27T02:19:48Z 2022-09-27T02:19:48Z CONTRIBUTOR

this sql query doesn't trip up maximum_returned_rows but does time out:

```sql
with recursive counter(x) as (
  select 0
  union
  select x + 1 from counter
)
select * from counter LIMIT 10 OFFSET 100000000
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258871525 https://github.com/simonw/datasette/issues/526#issuecomment-1258871525 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCNrl fgregg 536941 2022-09-27T02:09:32Z 2022-09-27T02:14:53Z CONTRIBUTOR

thanks @simonw, i learned something i didn't know about sqlite's execution model!

> Imagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.

why wouldn't the sqlite_timelimit guard prevent that?


on my local version which has the code to turn off truncations for query csv, sqlite_timelimit does protect me.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258864140 https://github.com/simonw/datasette/issues/526#issuecomment-1258864140 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCL4M simonw 9599 2022-09-27T01:55:32Z 2022-09-27T01:55:32Z OWNER

That recursive query is a great example of the kind of thing having a maximum row limit protects against.

Imagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.

Even if this feature becomes a permission-guarded thing we still need to take that case into account.

At the very least it would be good if the query could be cancelled if the client disconnects - so if someone accidentally starts an infinite query they can cancel the request and free up the server resources.

It might be a good idea to implement a page that shows "currently running" queries and allows users with the right permission to terminate them from that page.

Another option: a "limit of last resort" - either a very high row limit (10,000,000 perhaps) or even a time limit, saying that all queries will be cancelled if they take longer than thirty minutes or similar.
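
For the cancel-on-disconnect idea, one rough sketch (not an existing Datasette feature) is sqlite3's interrupt() call, which aborts whatever query is currently running on a connection when called from another thread:

```python
import sqlite3
import threading
import time

conn = sqlite3.connect(":memory:", check_same_thread=False)

def watchdog():
    # Stand-in for "the client disconnected": wait briefly, then abort
    # whatever query is currently executing on this connection.
    time.sleep(0.1)
    conn.interrupt()

threading.Thread(target=watchdog, daemon=True).start()

try:
    conn.execute(
        "with recursive counter(x) as (select 0 union select x + 1 from counter) "
        "select * from counter"
    ).fetchall()
except sqlite3.OperationalError as ex:
    print(f"query aborted: {ex}")  # "interrupted"
```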

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258860845 https://github.com/simonw/datasette/issues/526#issuecomment-1258860845 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCLEt simonw 9599 2022-09-27T01:48:31Z 2022-09-27T01:50:01Z OWNER

The protection is supposed to be from this line:

```python
rows = cursor.fetchmany(max_returned_rows + 1)
```

By capping the call to .fetchmany() at max_returned_rows + 1 (the + 1 is to allow detection of whether or not there is a next page) I'm ensuring that Datasette never attempts to iterate over a huge result set.

SQLite and the sqlite3 library seem to handle this correctly. Here's an example:

```pycon
>>> import sqlite3
>>> conn = sqlite3.connect(":memory:")
>>> cursor = conn.execute("""
... with recursive counter(x) as (
...     select 0
...     union
...     select x + 1 from counter
... )
... select * from counter""")
>>> cursor.fetchmany(10)
[(0,), (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,)]
```

`counter` there is an infinitely long table (see TIL) - but we can retrieve the first 10 results without going into an infinite loop.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258849766 https://github.com/simonw/datasette/issues/526#issuecomment-1258849766 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCIXm fgregg 536941 2022-09-27T01:27:03Z 2022-09-27T01:27:03Z CONTRIBUTOR

i agree with that concern! but if i'm understanding the code correctly, maximum_returned_rows does not protect against long-running queries in any way.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258846992 https://github.com/simonw/datasette/issues/526#issuecomment-1258846992 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCHsQ simonw 9599 2022-09-27T01:21:41Z 2022-09-27T01:21:41Z OWNER

My main concern here is that public Datasette instances could easily have all of their available database connections consumed by long-running queries - either accidentally or deliberately.

I do totally understand the need for this feature though. I think it can absolutely make sense provided it's protected by authentication and permissions.

Maybe even limit the number of concurrent downloads at once such that there's always at least one database connection free for other requests.
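
A rough asyncio sketch of that last idea (hypothetical names, not Datasette code), gating the streaming path behind a semaphore sized one below the connection pool:

```python
import asyncio

POOL_SIZE = 4  # hypothetical connection pool size
# Allow at most POOL_SIZE - 1 simultaneous streaming downloads, so one
# connection always stays free for ordinary requests.
stream_slots = asyncio.Semaphore(POOL_SIZE - 1)

async def stream_query(sql: str):
    async with stream_slots:
        # ... execute sql and stream rows out to the client here ...
        await asyncio.sleep(0)  # placeholder for the actual streaming work
```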

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258337011 https://github.com/simonw/datasette/issues/526#issuecomment-1258337011 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LALLz fgregg 536941 2022-09-26T16:49:48Z 2022-09-26T16:49:48Z CONTRIBUTOR

i think the smallest change that gets close to what i want is to change the behavior so that max_returned_rows is not applied in the execute method when we are asking for a csv of a query.

there are some infelicities for that approach, but i'll make a PR to make it easier to discuss.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258167564 https://github.com/simonw/datasette/issues/526#issuecomment-1258167564 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5K_h0M fgregg 536941 2022-09-26T14:57:44Z 2022-09-26T15:08:36Z CONTRIBUTOR

reading the database execute method i have a few questions.

https://github.com/simonw/datasette/blob/cb1e093fd361b758120aefc1a444df02462389a3/datasette/database.py#L229-L242


unless i'm missing something (which is very likely!!), the max_returned_rows argument doesn't actually offer any protections against running very expensive queries.

It's not like adding a LIMIT max_rows clause. it makes sense that it isn't, because the query could already have a LIMIT clause of its own. Doing something like select * from (query) limit {max_returned_rows} might be protective, but wouldn't always be.

Instead, the code executes the full original query, and if it still has time, it fetches the first max_rows + 1 rows.

this does offer some protection against memory exhaustion, as you won't hydrate a huge result set into python (however, there are data flow patterns that could avoid that too)

given the current architecture, i don't see how creating a new connection would be of use?


If we just removed the max_returned_rows limitation, then i think most things would be fine except for the QueryViews. Right now, rendering just 5000 rows takes a lot of client-side memory, so some form of pagination would be required.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1254064260 https://github.com/simonw/datasette/issues/526#issuecomment-1254064260 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5Kv4CE fgregg 536941 2022-09-21T18:17:04Z 2022-09-21T18:18:01Z CONTRIBUTOR

hi @simonw, this is becoming more of a bother for my labor data warehouse. Is there any research or a spike i could do that would help you investigate this issue?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1074019047 https://github.com/simonw/datasette/issues/526#issuecomment-1074019047 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5ABDrn simonw 9599 2022-03-21T15:09:56Z 2022-03-21T15:09:56Z OWNER

I should research how much overhead creating a new connection costs - it may be that an easy way to solve this is to create a dedicated connection for the query and then close that connection at the end.
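
A throwaway timing sketch for getting a ballpark number (hypothetical database path, not a rigorous benchmark):

```python
import sqlite3
import timeit

def open_and_close(path="fixtures.db"):  # hypothetical database file
    conn = sqlite3.connect(path)
    conn.close()

# Average cost of opening and closing a fresh connection, in milliseconds.
runs = 1000
seconds = timeit.timeit(open_and_close, number=runs)
print(f"{seconds / runs * 1000:.3f} ms per connection")
```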

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
993078038 https://github.com/simonw/datasette/issues/526#issuecomment-993078038 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c47MSsW fgregg 536941 2021-12-14T01:46:52Z 2021-12-14T01:46:52Z CONTRIBUTOR

the nested query idea is very nice, and i stole it for my client side paginator. However, it won't do the right thing if the original query orders by random().

If you go the nested query route, maybe raise a 4XX status code if the query has such a clause?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
992971072 https://github.com/simonw/datasette/issues/526#issuecomment-992971072 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c47L4lA fgregg 536941 2021-12-13T22:29:34Z 2021-12-13T22:29:34Z CONTRIBUTOR

just came by to open this issue. would make my data analysis in observable a lot better!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
853567413 https://github.com/simonw/datasette/issues/526#issuecomment-853567413 https://api.github.com/repos/simonw/datasette/issues/526 MDEyOklzc3VlQ29tbWVudDg1MzU2NzQxMw== simonw 9599 2021-06-03T05:11:27Z 2021-06-03T05:11:27Z OWNER

Another potential way to implement this would be to hold the SQLite connection open and execute the full query there.

I've avoided this in the past due to concerns of resource exhaustion - if multiple requests attempt this at the same time all of the connections in the pool will become tied up and the site will be unable to respond to further requests.

But... now that Datasette has authentication there's the possibility of making this feature only available to specific authenticated users - the --root user for example. Which avoids the danger while unlocking a super-useful feature.

Not to mention people who are running Datasette privately on their own laptop, or the proposed --query CLI feature in #1356.

{
    "total_count": 1,
    "+1": 1,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
810943882 https://github.com/simonw/datasette/issues/526#issuecomment-810943882 https://api.github.com/repos/simonw/datasette/issues/526 MDEyOklzc3VlQ29tbWVudDgxMDk0Mzg4Mg== jokull 701 2021-03-31T10:03:55Z 2021-03-31T10:03:55Z NONE

+1 on using nested queries to achieve this! Would be great as streaming CSV is an amazing feature.

Some UX/DX details:

I was expecting it to work to simply add &_stream=on to custom SQL queries because the docs say:

> Any Datasette table, view or custom SQL query can be exported as CSV.

After a bit of testing back and forth I realized streaming only works for full tables.

Would love this feature because I'm using pandas.read_csv to paint graphs from custom queries and the graphs are cut off because of the 1000 row limit.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
505162238 https://github.com/simonw/datasette/issues/526#issuecomment-505162238 https://api.github.com/repos/simonw/datasette/issues/526 MDEyOklzc3VlQ29tbWVudDUwNTE2MjIzOA== simonw 9599 2019-06-24T20:14:51Z 2019-06-24T20:14:51Z OWNER

The other reason I didn't implement this in the first place is that adding offset/limit to a custom query (as opposed to a view) requires modifying the existing SQL - what if that SQL already has its own offset/limit clause?

It looks like I can solve that using a nested query:

```sql
select * from (
  select * from compound_three_primary_keys limit 1000
) limit 10 offset 100
```

https://latest.datasette.io/fixtures?sql=select++from+%28%0D%0A++select++from+compound_three_primary_keys+limit+1000%0D%0A%29+limit+10+offset+100

So I can wrap any user-provided SQL query in an outer offset/limit and implement pagination that way.
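
For illustration, that wrapping step could look something like this (a hypothetical helper name; real code would also need to reject queries that can't safely be wrapped, such as ones ordered by random()):

```python
def paginate_sql(user_sql: str, page_size: int, page: int) -> str:
    # Wrap the user's query so any LIMIT/OFFSET it already contains applies
    # inside the subquery, then page over the outer result set.
    offset = page * page_size
    return f"select * from ({user_sql}) limit {page_size} offset {offset}"

print(paginate_sql("select * from compound_three_primary_keys limit 1000", 10, 10))
# select * from (select * from compound_three_primary_keys limit 1000) limit 10 offset 100
```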

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
505161008 https://github.com/simonw/datasette/issues/526#issuecomment-505161008 https://api.github.com/repos/simonw/datasette/issues/526 MDEyOklzc3VlQ29tbWVudDUwNTE2MTAwOA== simonw 9599 2019-06-24T20:11:15Z 2019-06-24T20:11:15Z OWNER

Views already use offset/limit pagination so actually I may be over-thinking this.

Maybe the right thing to do here is to have the feature enabled by default, since it will work for the VAST majority of queries - the only ones that might cause problems are complex queries across millions of rows. It can continue to use aggressive internal time limits so if someone DOES trigger something expensive they'll get an error.

I can allow users to disable the feature with a config setting, or increase the time limit if they need to.

Downgrading this from a medium to a small since it's much less effort to enable the existing pagination method for this type of query.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
505060332 https://github.com/simonw/datasette/issues/526#issuecomment-505060332 https://api.github.com/repos/simonw/datasette/issues/526 MDEyOklzc3VlQ29tbWVudDUwNTA2MDMzMg== simonw 9599 2019-06-24T15:28:16Z 2019-06-24T15:28:16Z OWNER

This is currently a deliberate feature decision.

The problem is that the streaming CSV feature relies on Datasette's automated efficient pagination under the hood. When you stream a CSV you're actually causing Datasette to paginate through the full set of "pages" under the hood, streaming each page out as a new chunk of CSV rows.

This mechanism only works if the next_url has been generated for the page. Currently the next_url is available for table views (where it uses the primary key or the sort column) and for views, but it's not set for canned queries because I can't be certain they can be efficiently paginated.

Offset/limit pagination for canned queries would be a pretty nasty performance hit, because each subsequent page would require even more time for SQLite to scroll through to the specified offset.

This does seem like it's worth fixing though: pulling every row for a canned query would definitely be useful. The problem is that the pagination trick used elsewhere isn't right for canned queries - instead I would need to keep the database cursor open until ALL rows had been fetched. Figuring out how to do that efficiently within an asyncio managed thread pool may take some thought.
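
A minimal sketch of that keep-the-cursor-open approach (a hypothetical stream_csv generator, ignoring the asyncio thread-pool plumbing):

```python
import csv
import io
import sqlite3

def stream_csv(conn: sqlite3.Connection, sql: str, chunk_size: int = 1000):
    # Hold a single cursor open and yield CSV text chunk by chunk, so no
    # more than chunk_size rows are ever materialised in memory at once.
    cursor = conn.execute(sql)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([col[0] for col in cursor.description])
    yield buffer.getvalue()
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        buffer.seek(0)
        buffer.truncate(0)
        writer.writerows(rows)
        yield buffer.getvalue()
```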

Maybe this feature ends up as something which is turned off by default (due to the risk of it causing uptime problems for public sites) but that users working on their own private environments can turn on?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);
Powered by Datasette · Queries took 1.2ms · About: github-to-sqlite