issue_comments


31 rows where "updated_at" is on date 2022-04-28 sorted by updated_at descending


issue 5

  • Writable canned queries fail with useless non-error against immutable databases 11
  • Research: demonstrate if parallel SQL queries are worthwhile 10
  • Implement ?_extra and new API design for TableView 8
  • base_url or prefix does not work with _exact match 1
  • Refactor TableView to use asyncinject 1

user 3

  • simonw 26
  • wragge 4
  • henrikek 1

author_association 3

  • OWNER 26
  • CONTRIBUTOR 4
  • NONE 1
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1112734577 https://github.com/simonw/datasette/issues/1729#issuecomment-1112734577 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUvtx simonw 9599 2022-04-28T23:08:42Z 2022-04-28T23:08:42Z OWNER

That prototype is a very small amount of code so far:

```diff
diff --git a/datasette/renderer.py b/datasette/renderer.py
index 4508949..b600e1b 100644
--- a/datasette/renderer.py
+++ b/datasette/renderer.py
@@ -28,6 +28,10 @@ def convert_specific_columns_to_json(rows, columns, json_cols):
 
 def json_renderer(args, data, view_name):
     """Render a response as JSON"""
+    from pprint import pprint
+
+    pprint(data)
+
     status_code = 200
 
     # Handle the _json= parameter which may modify data["rows"]
@@ -43,6 +47,41 @@ def json_renderer(args, data, view_name):
     if "rows" in data and not value_as_boolean(args.get("_json_infinity", "0")):
         data["rows"] = [remove_infinites(row) for row in data["rows"]]
 
+    # Start building the default JSON here
+    columns = data["columns"]
+    next_url = data.get("next_url")
+    output = {
+        "rows": [dict(zip(columns, row)) for row in data["rows"]],
+        "next": data["next"],
+        "next_url": next_url,
+    }
+
+    extras = set(args.getlist("_extra"))
+
+    extras_map = {
+        # _extra=          :  data[field]
+        "count": "filtered_table_rows_count",
+        "facet_results": "facet_results",
+        "suggested_facets": "suggested_facets",
+        "columns": "columns",
+        "primary_keys": "primary_keys",
+        "query_ms": "query_ms",
+        "query": "query",
+    }
+    for extra_key, data_key in extras_map.items():
+        if extra_key in extras:
+            output[extra_key] = data[data_key]
+
+    body = json.dumps(output, cls=CustomJSONEncoder)
+    content_type = "application/json; charset=utf-8"
+    headers = {}
+    if next_url:
+        headers["link"] = f'<{next_url}>; rel="next"'
+    return Response(
+        body, status=status_code, headers=headers, content_type=content_type
+    )
+
+
     # Deal with the _shape option
     shape = args.get("_shape", "arrays")
     # if there's an error, ignore the shape entirely
```
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112732563 https://github.com/simonw/datasette/issues/1729#issuecomment-1112732563 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUvOT simonw 9599 2022-04-28T23:05:03Z 2022-04-28T23:05:03Z OWNER

OK, the prototype of this is looking really good - it's very pleasant to use.

http://127.0.0.1:8001/github_memory/issue_comments.json?_search=simon&_sort=id&_size=5&_extra=query_ms&_extra=count&_col=body returns this:

json { "rows": [ { "id": 338854988, "body": " /database-name/table-name?name__contains=simon&sort=id+desc\r\n\r\nNote that if there's a column called \"sort\" you can still do sort__exact=blah\r\n\r\n" }, { "id": 346427794, "body": "Thanks. There is a way to use pip to grab apsw, which also let's you configure it (flags to build extensions, use an internal sqlite, etc). Don't know how that works as a dependency for another package, though.\n\nOn November 22, 2017 11:38:06 AM EST, Simon Willison <notifications@github.com> wrote:\n>I have a solution for FTS already, but I'm interested in apsw as a\n>mechanism for allowing custom virtual tables to be written in Python\n>(pysqlite only lets you write custom functions)\n>\n>Not having PyPI support is pretty tough though. I'm planning a\n>plugin/extension system which would be ideal for things like an\n>optional apsw mode, but that's a lot harder if apsw isn't in PyPI.\n>\n>-- \n>You are receiving this because you authored the thread.\n>Reply to this email directly or view it on GitHub:\n>https://github.com/simonw/datasette/issues/144#issuecomment-346405660\n" }, { "id": 348252037, "body": "WOW!\n\n--\nPaul Ford // (646) 369-7128 // @ftrain\n\nOn Thu, Nov 30, 2017 at 11:47 AM, Simon Willison <notifications@github.com>\nwrote:\n\n> Remaining work on this now lives in a milestone:\n> https://github.com/simonw/datasette/milestone/6\n>\n> —\n> You are receiving this because you were mentioned.\n> Reply to this email directly, view it on GitHub\n> <https://github.com/simonw/datasette/issues/153#issuecomment-348248406>,\n> or mute the thread\n> <https://github.com/notifications/unsubscribe-auth/AABPKHzaVPKwTOoHouK2aMUnM-mPnPk6ks5s7twzgaJpZM4Qq2zW>\n> .\n>\n" }, { "id": 391141391, "body": "I'm going to clean this up for consistency tomorrow morning so hold off\nmerging until then please\n\nOn Tue, May 22, 2018 at 6:34 PM, Simon Willison <notifications@github.com>\nwrote:\n\n> Yeah let's try this without pysqlite3 and see if we still get the correct\n> version.\n>\n> —\n> You are receiving this because you authored the thread.\n> Reply to this email directly, view it on GitHub\n> <https://github.com/simonw/datasette/pull/280#issuecomment-391076458>, or mute\n> the thread\n> <https://github.com/notifications/unsubscribe-auth/AAihfMI-H6CBt-Py0xdBbH2xDK0KsjT2ks5t1EwYgaJpZM4UI_2m>\n> .\n>\n" }, { "id": 391355030, "body": "No objections;\r\nIt's good to go @simonw\r\n\r\nOn Wed, 23 May 2018, 14:51 Simon Willison, <notifications@github.com> wrote:\r\n\r\n> @r4vi <https://github.com/r4vi> any objections to me merging this?\r\n>\r\n> —\r\n> You are receiving this because you were mentioned.\r\n> Reply to this email directly, view it on GitHub\r\n> <https://github.com/simonw/datasette/pull/280#issuecomment-391354237>, or mute\r\n> the thread\r\n> <https://github.com/notifications/unsubscribe-auth/AAihfM_2DN5WR2mkO-VK6ozDmkUQ4IMjks5t1WlcgaJpZM4UI_2m>\r\n> .\r\n>\r\n" } ], "next": "391355030,391355030", "next_url": "http://127.0.0.1:8001/github_memory/issue_comments.json?_search=simon&_size=5&_extra=query_ms&_extra=count&_col=body&_next=391355030%2C391355030&_sort=id", "count": 57, "query_ms": 21.780223003588617 }

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112730416 https://github.com/simonw/datasette/issues/1729#issuecomment-1112730416 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUusw simonw 9599 2022-04-28T23:01:21Z 2022-04-28T23:01:21Z OWNER

I'm not sure what to do about the "truncated": true/false key.

It's not really relevant to table results, since they are paginated whether or not you ask for them to be.

It plays a role in query results, where you might run select * from table and get back 1000 results because Datasette truncates at that point rather than returning everything.

Adding it to every table result and always setting it to "truncated": false feels confusing.

I think I'm going to keep it exclusively in the default representation for the /db?sql=... query endpoint, and not return it at all for tables.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112721321 https://github.com/simonw/datasette/issues/1729#issuecomment-1112721321 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUsep simonw 9599 2022-04-28T22:44:05Z 2022-04-28T22:44:14Z OWNER

I may be able to implement this mostly in the json_renderer() function: https://github.com/simonw/datasette/blob/94a3171b01fde5c52697aeeff052e3ad4bab5391/datasette/renderer.py#L29-L34

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112717745 https://github.com/simonw/datasette/issues/1729#issuecomment-1112717745 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUrmx simonw 9599 2022-04-28T22:38:39Z 2022-04-28T22:39:05Z OWNER

(I remain keen on the idea of shipping a plugin that restores the old default API shape to people who have written pre-Datasette-1.0 code against it, but I'll tackle that much later. I really like how jQuery has a culture of doing this.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112717210 https://github.com/simonw/datasette/issues/1729#issuecomment-1112717210 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUrea simonw 9599 2022-04-28T22:37:37Z 2022-04-28T22:37:37Z OWNER

This means filtered_table_rows_count is going to become count. I had originally picked that terrible name to avoid confusion between the count of all rows in the table and the count of rows that were filtered.

I'll add ?_extra=table_count for getting back the full table count instead. I think count is clear enough!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112716611 https://github.com/simonw/datasette/issues/1729#issuecomment-1112716611 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUrVD simonw 9599 2022-04-28T22:36:24Z 2022-04-28T22:36:24Z OWNER

Then I'm going to implement the following ?_extra= options:

  • ?_extra=facet_results - to see facet results
  • ?_extra=suggested_facets - for suggested facets
  • ?_extra=count - for the count of total rows
  • ?_extra=columns - for a list of column names
  • ?_extra=primary_keys - for a list of primary keys
  • ?_extra=query - a {"sql" "select ...", "params": {}} object

I thought about having ?_extra=facet_results returned automatically if the user specifies at least one ?_facet - but that doesn't work for default facets configured in metadata.json - how can the user opt out of those being returned? So I'm going to say you don't see facets at all if you don't include ?_extra=facet_results.

I'm tempted to add ?_extra=_all to return everything, but I can decide if that's a good idea later.
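
To make the design above concrete, here is a minimal sketch - not Datasette's implementation - of how a set of ?_extra= values, including the still-undecided _extra=_all shortcut, could be resolved against the data dictionary. The EXTRAS_MAP keys mirror the list above; everything else (function name, sample data) is illustrative:

```python
# A minimal sketch of resolving ?_extra= values against the view's data dict.
EXTRAS_MAP = {
    "count": "filtered_table_rows_count",
    "facet_results": "facet_results",
    "suggested_facets": "suggested_facets",
    "columns": "columns",
    "primary_keys": "primary_keys",
    "query": "query",
}


def build_output(data, extras):
    """Start from the minimal rows/next shape, then add any requested extras."""
    output = {
        "rows": data["rows"],
        "next": data.get("next"),
        "next_url": data.get("next_url"),
    }
    if "_all" in extras:  # hypothetical catch-all, not settled in the issue
        extras = set(EXTRAS_MAP)
    for extra_key, data_key in EXTRAS_MAP.items():
        if extra_key in extras:
            output[extra_key] = data[data_key]
    return output


if __name__ == "__main__":
    data = {
        "rows": [{"id": 1}],
        "next": None,
        "next_url": None,
        "filtered_table_rows_count": 1,
        "facet_results": {},
        "suggested_facets": [],
        "columns": ["id"],
        "primary_keys": ["id"],
        "query": {"sql": "select id from t", "params": {}},
    }
    print(build_output(data, {"count", "columns"}))
```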

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112713581 https://github.com/simonw/datasette/issues/1729#issuecomment-1112713581 https://api.github.com/repos/simonw/datasette/issues/1729 IC_kwDOBm6k_c5CUqlt simonw 9599 2022-04-28T22:31:11Z 2022-04-28T22:31:11Z OWNER

I'm going to change the default API response to look like this:

```json
{
    "rows": [
        {
            "pk": 1,
            "created": "2019-01-14 08:00:00",
            "planet_int": 1,
            "on_earth": 1,
            "state": "CA",
            "_city_id": 1,
            "_neighborhood": "Mission",
            "tags": "[\"tag1\", \"tag2\"]",
            "complex_array": "[{\"foo\": \"bar\"}]",
            "distinct_some_null": "one",
            "n": "n1"
        },
        {
            "pk": 2,
            "created": "2019-01-14 08:00:00",
            "planet_int": 1,
            "on_earth": 1,
            "state": "CA",
            "_city_id": 1,
            "_neighborhood": "Dogpatch",
            "tags": "[\"tag1\", \"tag3\"]",
            "complex_array": "[]",
            "distinct_some_null": "two",
            "n": "n2"
        }
    ],
    "next": null,
    "next_url": null
}
```

Basically https://latest.datasette.io/fixtures/facetable.json?_shape=objects but with just the rows, next and next_url fields returned by default.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Implement ?_extra and new API design for TableView 1219385669  
1112711115 https://github.com/simonw/datasette/issues/1715#issuecomment-1112711115 https://api.github.com/repos/simonw/datasette/issues/1715 IC_kwDOBm6k_c5CUp_L simonw 9599 2022-04-28T22:26:56Z 2022-04-28T22:26:56Z OWNER

I'm not going to use asyncinject in this refactor - at least not until I really need it. My research in these issues has put me off the idea (in favour of asyncio.gather(), sketched below, or even not trying for parallel execution at all):

  • #1727
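
For reference, here is a minimal sketch of the asyncio.gather() approach mentioned above - plain sqlite3 plus a thread pool, not Datasette's internals; the queries and in-memory connections are throwaway examples:

```python
# Run two SQLite queries concurrently with asyncio.gather() and a thread pool.
import asyncio
import sqlite3


def run_query(sql):
    # Each task gets its own throwaway connection (connections aren't shared
    # across threads here).
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


async def main():
    loop = asyncio.get_running_loop()
    count_rows, facet_rows = await asyncio.gather(
        loop.run_in_executor(None, run_query, "select count(*) from sqlite_master"),
        loop.run_in_executor(None, run_query, "select 1"),
    )
    print(count_rows, facet_rows)


asyncio.run(main())
```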

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Refactor TableView to use asyncinject 1212823665  
1112668411 https://github.com/simonw/datasette/issues/1727#issuecomment-1112668411 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CUfj7 simonw 9599 2022-04-28T21:25:34Z 2022-04-28T21:25:44Z OWNER

The two most promising theories at the moment, from here and Twitter and the SQLite forum, are:

  • SQLite is I/O bound - it generally only goes as fast as it can load data from disk. Multiple connections all competing for the same file on disk are going to end up blocked at the file system layer. But maybe this means in-memory databases will perform better?
  • It's the GIL. The sqlite3 C code may release the GIL, but the bits that do things like assembling Row objects to return still happen in Python, and that Python can only run on a single core.

A couple of ways to research the in-memory theory:

  • Use a RAM disk on macOS (or Linux). https://stackoverflow.com/a/2033417/6083 has instructions - short version:

    ```
    hdiutil attach -nomount ram://$((2 * 1024 * 100))
    diskutil eraseVolume HFS+ RAMDisk name-returned-by-previous-command  # (was /dev/disk2 when I tried it)
    cd /Volumes/RAMDisk
    cp ~/fixtures.db .
    ```

  • Copy Datasette databases into an in-memory database on startup. I built a new plugin to do that here: https://github.com/simonw/datasette-copy-to-memory

I need to do some more, better benchmarks using these different approaches.

https://twitter.com/laurencerowe/status/1519780174560169987 also suggests:

Maybe try:

  1. Copy the sqlite file to /dev/shm and rerun (all in ram.)
  2. Create a CTE which calculates Fibonacci or similar so you can test something completely cpu bound (only return max value or something to avoid crossing between sqlite/Python.)

I like that second idea a lot - I could use the mandelbrot example from https://www.sqlite.org/lang_with.html#outlandish_recursive_query_examples
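
A rough sketch of that second idea - assuming a recursive counting CTE is CPU-bound enough to stand in for the Mandelbrot example - run the query once serially, then twice in parallel threads, and compare wall-clock times:

```python
# Compare one CPU-bound SQLite query run serially vs two run in parallel threads.
# If SQLite releases the GIL for the whole execution, the two-thread run should
# take roughly as long as the single run on a multi-core machine; if Python-level
# work dominates, it should take roughly twice as long.
import sqlite3
import threading
import time

CPU_BOUND_SQL = """
with recursive counter(n) as (
  select 1 union all select n + 1 from counter where n < 2000000
)
select max(n) from counter
"""


def run_query():
    conn = sqlite3.connect(":memory:")
    conn.execute(CPU_BOUND_SQL).fetchall()
    conn.close()


start = time.perf_counter()
run_query()
serial = time.perf_counter() - start

threads = [threading.Thread(target=run_query) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel = time.perf_counter() - start

print(f"one query: {serial:.2f}s  two queries in threads: {parallel:.2f}s")
```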

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111955628 https://github.com/simonw/datasette/issues/1633#issuecomment-1111955628 https://api.github.com/repos/simonw/datasette/issues/1633 IC_kwDOBm6k_c5CRxis henrikek 6613091 2022-04-28T09:12:56Z 2022-04-28T09:12:56Z NONE

I have verified that the problem with base_url still exists in the latest version, 0.61.1. I would need some guidance on whether my code change suggestion is correct, or whether base_url should be handled in some other part of the code.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
base_url or prefix does not work with _exact match 1129052172  
1111752676 https://github.com/simonw/datasette/issues/1728#issuecomment-1111752676 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ__k wragge 127565 2022-04-28T05:11:54Z 2022-04-28T05:11:54Z CONTRIBUTOR

And in terms of the bug, yep I agree that option 2 would be the most useful and least frustrating.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111751734 https://github.com/simonw/datasette/issues/1728#issuecomment-1111751734 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ_w2 wragge 127565 2022-04-28T05:09:59Z 2022-04-28T05:09:59Z CONTRIBUTOR

Thanks, I'll give it a try!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111726586 https://github.com/simonw/datasette/issues/1727#issuecomment-1111726586 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQ5n6 simonw 9599 2022-04-28T04:17:16Z 2022-04-28T04:19:31Z OWNER

I could experiment with the await asyncio.run_in_executor(processpool_executor, fn) mechanism described in https://stackoverflow.com/a/29147750

Code examples: https://cs.github.com/?scopeName=All+repos&scope=&q=run_in_executor+ProcessPoolExecutor
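
A minimal sketch of that run_in_executor() + ProcessPoolExecutor pattern (illustrative queries and in-memory connections only, not something Datasette currently does):

```python
# Push queries into worker processes so each one gets its own interpreter and GIL.
import asyncio
import sqlite3
from concurrent.futures import ProcessPoolExecutor


def run_query(sql):
    # Each worker process opens its own connection; SQLite connections can't be
    # shared across processes anyway.
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        results = await asyncio.gather(
            loop.run_in_executor(pool, run_query, "select 1"),
            loop.run_in_executor(pool, run_query, "select 2"),
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```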

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111725638 https://github.com/simonw/datasette/issues/1727#issuecomment-1111725638 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQ5ZG simonw 9599 2022-04-28T04:15:15Z 2022-04-28T04:15:15Z OWNER

Useful theory from Keith Medcalf https://sqlite.org/forum/forumpost/e363c69d3441172e

This is true, but the concurrency is limited to the execution which occurs with the GIL released (that is, in the native C sqlite3 library itself). Each row (for example) can be retrieved in parallel but "constructing the python return objects for each row" will be serialized (by the GIL).

That is to say that if you have two python threads, each with their own connection, and each one is performing a select that returns 1,000,000 rows (let's say that is 25% of the candidates for each select), then the difference in execution time between executing two python threads in parallel vs a single serial thread will not be much (if even detectable at all). In fact it is possible that the multiple-threaded version takes longer to run both queries to completion because of the increased contention over a shared resource (the GIL).

So maybe this is a GIL thing.

I should test with some expensive SQL queries (maybe big aggregations against large tables) and see if I can spot an improvement there.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111714665 https://github.com/simonw/datasette/issues/1728#issuecomment-1111714665 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ2tp simonw 9599 2022-04-28T03:52:47Z 2022-04-28T03:52:58Z OWNER

Nice custom template/theme!

Yeah, for that I'd recommend hosting elsewhere - on a regular VPS (I use systemd like this: https://docs.datasette.io/en/stable/deploying.html#running-datasette-using-systemd ) or using Fly if you want to run containers without managing a full server.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111712953 https://github.com/simonw/datasette/issues/1728#issuecomment-1111712953 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ2S5 wragge 127565 2022-04-28T03:48:36Z 2022-04-28T03:48:36Z CONTRIBUTOR

I don't think that'd work for this project. The db is very big, and my aim was to have an environment where researchers could be making use of the data, but be easily able to add corrections to the HTR/OCR extracted data when they came across problems. It's in its immutable (!) form here: https://sydney-stock-exchange-xqtkxtd5za-ts.a.run.app/stock_exchange/stocks

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111708206 https://github.com/simonw/datasette/issues/1728#issuecomment-1111708206 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ1Iu simonw 9599 2022-04-28T03:38:56Z 2022-04-28T03:38:56Z OWNER

In terms of this bug, there are a few potential fixes:

  1. Detect the write to an immutable database and show the user a proper, meaningful error message in the red error box at the top of the page
  2. Don't allow the user to even submit the form - show a message saying that this canned query is unavailable because the database cannot be written to
  3. Don't even allow Datasette to start running at all - if there's a canned query configured in metadata.yml and the database it refers to is in -i immutable mode, throw an error on startup

I'm not keen on that last one because it would be frustrating if you couldn't launch Datasette just because you had an old canned query lying around in your metadata file.

So I'm leaning towards option 2.
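
A hedged sketch of what that option 2 check could look like - is_mutable matches the attribute on Datasette's Database class, but the function and the fake class below are made up for illustration, not the eventual fix:

```python
# Decide whether a canned query form should be shown at all.
def canned_query_is_available(database, canned_query):
    """A write canned query is only usable if the database accepts writes."""
    if not canned_query.get("write"):
        return True  # read-only canned queries are always fine
    return getattr(database, "is_mutable", False)


class FakeImmutableDatabase:
    # What you effectively get when the file is served with `datasette -i data.db`
    is_mutable = False


print(canned_query_is_available(
    FakeImmutableDatabase(),
    {"sql": "insert into names(name) values (:name)", "write": True},
))  # False - so the form would be replaced with an explanatory message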

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111707384 https://github.com/simonw/datasette/issues/1728#issuecomment-1111707384 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ074 simonw 9599 2022-04-28T03:36:46Z 2022-04-28T03:36:56Z OWNER

A more realistic solution (which I've been using on several of my own projects) is to keep the data itself in GitHub and encourage users to edit it there - using the GitHub web interface to edit YAML files or similar.

Needs your users to be comfortable hand-editing YAML though! You can at least guard against critical errors by having CI run tests against their YAML before deploying.
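
As a sketch of that kind of CI guard (the file name and expected key are made-up examples, and it assumes PyYAML is installed):

```python
# Fail the build if a hand-edited YAML file doesn't parse or is missing a key.
import sys

import yaml  # PyYAML


def check(path, required_key="records"):
    with open(path) as fp:
        data = yaml.safe_load(fp)
    if not isinstance(data, dict) or required_key not in data:
        sys.exit(f"{path}: expected a top-level {required_key!r} key")
    print(f"{path}: OK")


if __name__ == "__main__":
    check(sys.argv[1])
```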

I have a dream of building a more friendly web forms interface which edits the YAML back on GitHub for the user, but that's just a concept at the moment.

Even more fun would be if a user-friendly form could submit PRs for review without the user having to know what a PR is!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111706519 https://github.com/simonw/datasette/issues/1728#issuecomment-1111706519 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ0uX simonw 9599 2022-04-28T03:34:49Z 2022-04-28T03:34:49Z OWNER

I've wanted to do stuff like that on Cloud Run too. So far I've assumed that it's not feasible, but recently I've been wondering how hard it would be to have a small (like less than 100KB or so) Datasette instance which persists data to a backing GitHub repository such that when it starts up it can pull the latest copy and any time someone edits it can push their changes.

I'm still not sure it would work well on Cloud Run due to the uncertainty at what would happen if Cloud Run decided to boot up a second instance - but it's still an interesting thought exercise.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111705323 https://github.com/simonw/datasette/issues/1728#issuecomment-1111705323 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ0br wragge 127565 2022-04-28T03:32:06Z 2022-04-28T03:32:06Z CONTRIBUTOR

Ah, that would be it! I have a core set of data which doesn't change to which I want authorised users to be able to submit corrections. I was going to deal with the persistence issue by just grabbing the user corrections at regular intervals and saving to GitHub. I might need to rethink. Thanks!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111705069 https://github.com/simonw/datasette/issues/1728#issuecomment-1111705069 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQ0Xt simonw 9599 2022-04-28T03:31:33Z 2022-04-28T03:31:33Z OWNER

Confirmed - this is a bug where immutable databases fail to show a useful error if you write to them with a canned query.

Steps to reproduce:

```
echo '
databases:
  writable:
    queries:
      add_name:
        sql: insert into names(name) values (:name)
        write: true
' > write-metadata.yml
echo '{"name": "Simon"}' | sqlite-utils insert writable.db names -
datasette writable.db -m write-metadata.yml
```

Then visit http://127.0.0.1:8001/writable/add_name - adding names works.

Now do this instead:

datasette -i writable.db -m write-metadata.yml

And I'm getting a broken error:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111699175 https://github.com/simonw/datasette/issues/1727#issuecomment-1111699175 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQy7n simonw 9599 2022-04-28T03:19:48Z 2022-04-28T03:20:08Z OWNER

I ran py-spy and then hammered refresh a bunch of times on the http://127.0.0.1:8856/github/commits?_facet=repo&_facet=committer&_trace=1&_noparallel= page - it generated this SVG profile for me.

The area on the right is the threads running the DB queries:

Interactive version here: https://static.simonwillison.net/static/2022/datasette-parallel-profile.svg

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111698307 https://github.com/simonw/datasette/issues/1728#issuecomment-1111698307 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQyuD simonw 9599 2022-04-28T03:18:02Z 2022-04-28T03:18:02Z OWNER

If the behaviour you are seeing is because the database is running in immutable mode then that's a bug - you should get a useful error message instead!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111697985 https://github.com/simonw/datasette/issues/1728#issuecomment-1111697985 https://api.github.com/repos/simonw/datasette/issues/1728 IC_kwDOBm6k_c5CQypB simonw 9599 2022-04-28T03:17:20Z 2022-04-28T03:17:20Z OWNER

How did you deploy to Cloud Run?

datasette publish cloudrun defaults to running databases there in -i immutable mode, because if you managed to change a file on disk on Cloud Run those changes would be lost the next time your container restarted there.

That's why I upgraded datasette-publish-fly to provide a way of working with their volumes support - they're the best option I know of right now for running Datasette in a container with a persistent volume that can accept writes: https://simonwillison.net/2022/Feb/15/fly-volumes/

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Writable canned queries fail with useless non-error against immutable databases 1218133366  
1111683539 https://github.com/simonw/datasette/issues/1727#issuecomment-1111683539 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQvHT simonw 9599 2022-04-28T02:47:57Z 2022-04-28T02:47:57Z OWNER

Maybe this is the Python GIL after all?

I've been hoping that the GIL won't be an issue because the sqlite3 module releases the GIL for the duration of the execution of a SQL query - see https://github.com/python/cpython/blob/f348154c8f8a9c254503306c59d6779d4d09b3a9/Modules/_sqlite/cursor.c#L749-L759

So I've been hoping this means that SQLite code itself can run concurrently on multiple cores even when Python threads cannot.

But maybe I'm misunderstanding how that works?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111681513 https://github.com/simonw/datasette/issues/1727#issuecomment-1111681513 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQunp simonw 9599 2022-04-28T02:44:26Z 2022-04-28T02:44:26Z OWNER

I could try py-spy top, which I previously used here: - https://github.com/simonw/datasette/issues/1673

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111661331 https://github.com/simonw/datasette/issues/1727#issuecomment-1111661331 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQpsT simonw 9599 2022-04-28T02:07:31Z 2022-04-28T02:07:31Z OWNER

Asked on the SQLite forum about this here: https://sqlite.org/forum/forumpost/ffbfa9f38e

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111602802 https://github.com/simonw/datasette/issues/1727#issuecomment-1111602802 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQbZy simonw 9599 2022-04-28T00:21:35Z 2022-04-28T00:21:35Z OWNER

Tried this but I'm getting back an empty JSON array of traces at the bottom of the page most of the time (intermittently it works correctly):

```diff
diff --git a/datasette/database.py b/datasette/database.py
index ba594a8..d7f9172 100644
--- a/datasette/database.py
+++ b/datasette/database.py
@@ -7,7 +7,7 @@ import sys
 import threading
 import uuid
 
-from .tracer import trace
+from .tracer import trace, trace_child_tasks
 from .utils import (
     detect_fts,
     detect_primary_keys,
@@ -207,30 +207,31 @@ class Database:
                 time_limit_ms = custom_time_limit
 
             with sqlite_timelimit(conn, time_limit_ms):
-                try:
-                    cursor = conn.cursor()
-                    cursor.execute(sql, params if params is not None else {})
-                    max_returned_rows = self.ds.max_returned_rows
-                    if max_returned_rows == page_size:
-                        max_returned_rows += 1
-                    if max_returned_rows and truncate:
-                        rows = cursor.fetchmany(max_returned_rows + 1)
-                        truncated = len(rows) > max_returned_rows
-                        rows = rows[:max_returned_rows]
-                    else:
-                        rows = cursor.fetchall()
-                        truncated = False
-                except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
-                    if e.args == ("interrupted",):
-                        raise QueryInterrupted(e, sql, params)
-                    if log_sql_errors:
-                        sys.stderr.write(
-                            "ERROR: conn={}, sql = {}, params = {}: {}\n".format(
-                                conn, repr(sql), params, e
-                            )
-                        )
-                        sys.stderr.flush()
-                    raise
+                with trace("sql", database=self.name, sql=sql.strip(), params=params):
+                    try:
+                        cursor = conn.cursor()
+                        cursor.execute(sql, params if params is not None else {})
+                        max_returned_rows = self.ds.max_returned_rows
+                        if max_returned_rows == page_size:
+                            max_returned_rows += 1
+                        if max_returned_rows and truncate:
+                            rows = cursor.fetchmany(max_returned_rows + 1)
+                            truncated = len(rows) > max_returned_rows
+                            rows = rows[:max_returned_rows]
+                        else:
+                            rows = cursor.fetchall()
+                            truncated = False
+                    except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:
+                        if e.args == ("interrupted",):
+                            raise QueryInterrupted(e, sql, params)
+                        if log_sql_errors:
+                            sys.stderr.write(
+                                "ERROR: conn={}, sql = {}, params = {}: {}\n".format(
+                                    conn, repr(sql), params, e
+                                )
+                            )
+                            sys.stderr.flush()
+                        raise
 
             if truncate:
                 return Results(rows, truncated, cursor.description)
@@ -238,9 +239,8 @@ class Database:
             else:
                 return Results(rows, False, cursor.description)
 
-        with trace("sql", database=self.name, sql=sql.strip(), params=params):
-            results = await self.execute_fn(sql_operation_in_thread)
-        return results
+        with trace_child_tasks():
+            return await self.execute_fn(sql_operation_in_thread)
 
     @property
     def size(self):
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111597176 https://github.com/simonw/datasette/issues/1727#issuecomment-1111597176 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQaB4 simonw 9599 2022-04-28T00:11:44Z 2022-04-28T00:11:44Z OWNER

Though it would be interesting to also have the trace reveal how much time is spent in the functions that wrap that core SQL - the stuff that is being measured at the moment.

I have a hunch that this could help solve the over-arching performance mystery.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  
1111595319 https://github.com/simonw/datasette/issues/1727#issuecomment-1111595319 https://api.github.com/repos/simonw/datasette/issues/1727 IC_kwDOBm6k_c5CQZk3 simonw 9599 2022-04-28T00:09:45Z 2022-04-28T00:11:01Z OWNER

Here's where read queries are instrumented: https://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L241-L242

So the instrumentation is actually capturing quite a bit of Python activity before it gets to SQLite:

https://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L179-L190

And then:

https://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L204-L233

Ideally I'd like that trace() block to wrap just the cursor.execute() and cursor.fetchmany(...) or cursor.fetchall() calls.
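
A sketch of what that narrower instrumentation might look like - trace() is the real datasette.tracer context manager, but the standalone function below is simplified for illustration and is not the shipped code:

```python
# Time only the cursor work inside the thread, not the surrounding Python setup.
from datasette.tracer import trace


def sql_operation_in_thread(conn, sql, params=None):
    cursor = conn.cursor()
    with trace("sql", sql=sql.strip(), params=params):
        cursor.execute(sql, params if params is not None else {})
        rows = cursor.fetchall()
    return rows
```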

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Research: demonstrate if parallel SQL queries are worthwhile 1217759117  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);