issue_comments

6 rows where author_association = "CONTRIBUTOR" and updated_at is on date 2022-09-26, sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1258712931 https://github.com/simonw/sqlite-utils/issues/491#issuecomment-1258712931 https://api.github.com/repos/simonw/sqlite-utils/issues/491 IC_kwDOCGYnMM5LBm9j eyeseast 25778 2022-09-26T22:31:58Z 2022-09-26T22:31:58Z CONTRIBUTOR

Right. The backup command will copy tables completely, but in the case of conflicting table names, the destination gets overwritten silently. That might not be what you want here.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Ability to merge databases and tables 1383646615  
1258508215 https://github.com/simonw/sqlite-utils/issues/491#issuecomment-1258508215 https://api.github.com/repos/simonw/sqlite-utils/issues/491 IC_kwDOCGYnMM5LA0-3 eyeseast 25778 2022-09-26T19:22:14Z 2022-09-26T19:22:14Z CONTRIBUTOR

This might be fairly straightforward using SQLite's backup utility: https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.backup

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Ability to merge databases and tables 1383646615  
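
Both comments above refer to Python's sqlite3.Connection.backup, which copies an entire database page by page. A minimal sketch of that behaviour, using throwaway in-memory databases and a made-up dogs table purely for illustration:

    import sqlite3

    # Source database with one table.
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, name TEXT)")
    src.execute("INSERT INTO dogs (name) VALUES ('Cleo')")
    src.commit()

    # Destination database that already has a conflicting table.
    dst = sqlite3.connect(":memory:")
    dst.execute("CREATE TABLE dogs (id INTEGER PRIMARY KEY, breed TEXT)")
    dst.commit()

    # backup() replaces the destination's contents with a copy of the source;
    # whatever was in dst before the call is gone afterwards.
    src.backup(dst)

    print(dst.execute("SELECT name FROM dogs").fetchall())  # [('Cleo',)]

Because the whole destination is replaced, this copies a database cleanly but does not merge tables that already exist on both sides, which is the limitation noted above (the destination gets overwritten silently).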
1258337011 https://github.com/simonw/datasette/issues/526#issuecomment-1258337011 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LALLz fgregg 536941 2022-09-26T16:49:48Z 2022-09-26T16:49:48Z CONTRIBUTOR

I think the smallest change that gets close to what I want is to change the behavior so that max_returned_rows is not applied in the execute method when we are asking for a CSV of a query.

There are some infelicities with that approach, but I'll make a PR to make it easier to discuss.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
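
The comment above is about Datasette's internals, but the shape of the idea can be sketched outside Datasette: stream an arbitrary query to CSV from a generator, with no row cap applied. Everything below is illustrative; the function name and signature are made up and this is not Datasette's execute method:

    import csv
    import io
    import sqlite3

    def stream_query_as_csv(db_path, sql, params=None):
        # Hypothetical helper: yields CSV text chunks for an arbitrary query,
        # without applying any max_returned_rows-style cap.
        conn = sqlite3.connect(db_path)
        try:
            cursor = conn.execute(sql, params or [])
            buffer = io.StringIO()
            writer = csv.writer(buffer)

            # Header row from the cursor description.
            writer.writerow([col[0] for col in cursor.description])
            yield buffer.getvalue()
            buffer.seek(0)
            buffer.truncate(0)

            # One chunk per row; rows are pulled lazily from the cursor.
            for row in cursor:
                writer.writerow(row)
                yield buffer.getvalue()
                buffer.seek(0)
                buffer.truncate(0)
        finally:
            conn.close()

Each yielded chunk could be written straight into a streaming HTTP response, so memory use stays flat however many rows the query returns.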
1258167564 https://github.com/simonw/datasette/issues/526#issuecomment-1258167564 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5K_h0M fgregg 536941 2022-09-26T14:57:44Z 2022-09-26T15:08:36Z CONTRIBUTOR

Reading the database execute method, I have a few questions.

https://github.com/simonw/datasette/blob/cb1e093fd361b758120aefc1a444df02462389a3/datasette/database.py#L229-L242


Unless I'm missing something (which is very likely!!), the max_returned_rows argument doesn't actually offer any protection against running very expensive queries.

It's not like adding a LIMIT max_rows argument. It makes sense that it isn't, because the query could already have a LIMIT clause. Doing something like select * from (query) limit {max_returned_rows} might be protective, but not always.

Instead the code executes the full original query and, if it still has time, fetches out the first max_returned_rows + 1 rows.

This does offer some protection against memory exhaustion, as you won't hydrate a huge result set into Python (however, there are data flow patterns that could avoid that too).

Given the current architecture, I don't see how creating a new connection would be of use?

If we just removed the max_returned_rows limitation, then I think most things would be fine except for the QueryViews. Right now, rendering just 5,000 rows takes a lot of client-side memory, so some form of pagination would be required.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
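
A stripped-down illustration of the behaviour described above (this is not Datasette's actual code, and github.db is a hypothetical filename): no LIMIT is added to the SQL, so an expensive query still does its full work on the SQLite side, and only the number of rows pulled into Python is capped.

    import sqlite3

    max_returned_rows = 1000  # illustrative value

    conn = sqlite3.connect("github.db")  # hypothetical database file
    # The query text is passed through unchanged -- no LIMIT is injected,
    # so a costly sort or scan still runs in full.
    cursor = conn.execute("select * from issue_comments order by updated_at desc")

    # Only max_returned_rows + 1 rows ever reach Python; the extra row is
    # just a signal that the result was truncated.
    rows = cursor.fetchmany(max_returned_rows + 1)
    truncated = len(rows) > max_returned_rows
    rows = rows[:max_returned_rows]

    print(len(rows), "rows,", "truncated" if truncated else "complete")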
1258166572 https://github.com/simonw/datasette/issues/1655#issuecomment-1258166572 https://api.github.com/repos/simonw/datasette/issues/1655 IC_kwDOBm6k_c5K_hks fgregg 536941 2022-09-26T14:57:04Z 2022-09-26T14:57:04Z CONTRIBUTOR

I think that paginating, even in JavaScript, could be very helpful. Maybe render JSON or CSV into the page and let JavaScript load that into the DOM?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
query result page is using 400mb of browser memory 40x size of html page and 400x size of csv data 1163369515  
1258129113 https://github.com/simonw/datasette/issues/1727#issuecomment-1258129113 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5K_YbZ fgregg 536941 2022-09-26T14:30:11Z 2022-09-26T14:48:31Z CONTRIBUTOR

From your analysis, it seems like the GIL is blocking on loading the data from SQLite into Python (particularly in the fetchmany call).

This is probably a simplistic idea, but what if you had the Python code in the execute method iterate over the cursor and yield out rows, or small chunks of rows?

Something like:
    with sqlite_timelimit(conn, time_limit_ms):
        try:
            cursor = conn.cursor()
            cursor.execute(sql, params if params is not None else {})
        except:
            ...
        max_returned_rows = self.ds.max_returned_rows
        if max_returned_rows == page_size:
            max_returned_rows += 1
        if max_returned_rows and truncate:
            for i, row in enumerate(cursor):
                yield row
                if i == max_returned_rows - 1:
                    break
        else:
            for row in cursor:
                yield row
            truncated = False

This kind of thing works well with a Postgres server-side cursor, but I'm not sure if it will hold for SQLite.

You would still spend about the same amount of time in Python and would be contending for the GIL, but it could be non-blocking.

Depending on the data flow, this could also have some benefit for memory (data stays in more compact SQLite-land until you need it).

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);
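
For reference, the filter described at the top of this page (author_association = "CONTRIBUTOR", updated_at on 2022-09-26, newest first) corresponds to a query along these lines against the schema above; the github.db filename is an assumption:

    import sqlite3

    conn = sqlite3.connect("github.db")  # assumed name of the github-to-sqlite database
    conn.row_factory = sqlite3.Row

    rows = conn.execute(
        """
        select id, user, created_at, updated_at, author_association, body, issue
        from issue_comments
        where author_association = :assoc
          and date(updated_at) = :day
        order by updated_at desc
        """,
        {"assoc": "CONTRIBUTOR", "day": "2022-09-26"},
    ).fetchall()

    for row in rows:
        print(row["id"], row["updated_at"], (row["body"] or "")[:60])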