{"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1258129113", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1258129113, "node_id": "IC_kwDOBm6k_c5K_YbZ", "user": {"value": 536941, "label": "fgregg"}, "created_at": "2022-09-26T14:30:11Z", "updated_at": "2022-09-26T14:48:31Z", "author_association": "CONTRIBUTOR", "body": "from your analysis, it seems like the GIL is blocking on loading of the data from sqlite to python, (particularly in the `fetchmany` call)\r\n\r\nthis is probably a simplistic idea, but what if you had the python code in the `execute` method iterate over the cursor and yield out rows or small chunks of rows?\r\n\r\nsomething like: \r\n```python\r\nwith sqlite_timelimit(conn, time_limit_ms):\r\n    try:\r\n        cursor = conn.cursor()\r\n        cursor.execute(sql, params if params is not None else {})\r\n    except:\r\n        ...\r\n    max_returned_rows = self.ds.max_returned_rows\r\n    if max_returned_rows == page_size:\r\n        max_returned_rows += 1\r\n    if max_returned_rows and truncate:\r\n        for i, row in enumerate(cursor):\r\n            yield row\r\n            if i == max_returned_rows - 1:\r\n                break\r\n    else:\r\n        for row in cursor:\r\n            yield row\r\n        truncated = False\r\n```\r\n\r\nthis kind of thing works well with a postgres server side cursor, but i'm not sure if it will hold for sqlite. \r\n\r\nyou would still spend about the same amount of time in python and would be contending for the gil, but it could be non-blocking.\r\n\r\ndepending on the data flow, this could also have some benefit for memory. 
(data stays in more compact sqlite-land until you need it)", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111448928", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111448928, "node_id": "IC_kwDOBm6k_c5CP11g", "user": {"value": 716529, "label": "glyph"}, "created_at": "2022-04-27T20:27:05Z", "updated_at": "2022-04-27T20:27:05Z", "author_association": "NONE", "body": "You don't want to re-use an SQLite connection from multiple threads anyway: https://www.sqlite.org/threadsafe.html\r\n\r\nMultiple connections can operate on the file in parallel, but a single connection can't:\r\n\r\n> Multi-thread. In this mode, SQLite can be safely used by multiple threads **provided that no single database connection is used simultaneously in two or more threads**.\r\n\r\n(emphasis mine)", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111456500", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111456500, "node_id": "IC_kwDOBm6k_c5CP3r0", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T20:36:01Z", "updated_at": "2022-04-27T20:36:01Z", "author_association": "OWNER", "body": "Yeah all of this is pretty much assuming read-only connections. 
Datasette has a separate mechanism for ensuring that writes are executed one at a time against a dedicated connection from an in-memory queue:\r\n- https://github.com/simonw/datasette/issues/682", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111380282", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111380282, "node_id": "IC_kwDOBm6k_c5CPlE6", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T19:10:27Z", "updated_at": "2022-04-27T19:10:27Z", "author_association": "OWNER", "body": "Wrote more about that here: https://simonwillison.net/2022/Apr/27/parallel-queries/\r\n\r\nCompare https://latest-with-plugins.datasette.io/github/commits?_facet=repo&_facet=committer&_trace=1\r\n\r\n![image](https://user-images.githubusercontent.com/9599/165601503-2083c5d2-d740-405c-b34d-85570744ca82.png)\r\n\r\nWith the same thing but with parallel execution disabled:\r\n\r\nhttps://latest-with-plugins.datasette.io/github/commits?_facet=repo&_facet=committer&_trace=1&_noparallel=1\r\n\r\n![image](https://user-images.githubusercontent.com/9599/165601525-98abbfb1-5631-4040-b6bd-700948d1db6e.png)\r\n\r\nThose total page load time numbers are very similar. Is this parallel optimization worthwhile?\r\n\r\nMaybe it's only worth it on larger databases? 
Or maybe larger databases perform worse with this?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111460068", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111460068, "node_id": "IC_kwDOBm6k_c5CP4jk", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T20:38:32Z", "updated_at": "2022-04-27T20:38:32Z", "author_association": "OWNER", "body": "WAL mode didn't seem to make a difference. I thought there was a chance it might help multiple read connections operate at the same time but it looks like it really does only matter for when writes are going on.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111725638", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111725638, "node_id": "IC_kwDOBm6k_c5CQ5ZG", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T04:15:15Z", "updated_at": "2022-04-28T04:15:15Z", "author_association": "OWNER", "body": "Useful theory from Keith Medcalf https://sqlite.org/forum/forumpost/e363c69d3441172e\r\n\r\n> This is true, but the concurrency is limited to the execution which occurs with the GIL released (that is, in the native C sqlite3 library itself). 
Each row (for example) can be retrieved in parallel but \"constructing the python return objects for each row\" will be serialized (by the GIL).\r\n> \r\n> That is to say that if your have two python threads each with their own connection, and each one is performing a select that returns 1,000,000 rows (lets say that is 25% of the candidates for each select) then the difference in execution time between executing two python threads in parallel vs a single serial thead will not be much different (if even detectable at all). In fact it is possible that the multiple-threaded version takes longer to run both queries to completion because of the increased contention over a shared resource (the GIL).\r\n\r\nSo maybe this is a GIL thing.\r\n\r\nI should test with some expensive SQL queries (maybe big aggregations against large tables) and see if I can spot an improvement there.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111602802", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111602802, "node_id": "IC_kwDOBm6k_c5CQbZy", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T00:21:35Z", "updated_at": "2022-04-28T00:21:35Z", "author_association": "OWNER", "body": "Tried this but I'm getting back an empty JSON array of traces at the bottom of the page most of the time (intermittently it works correctly):\r\n\r\n```diff\r\ndiff --git a/datasette/database.py b/datasette/database.py\r\nindex ba594a8..d7f9172 100644\r\n--- a/datasette/database.py\r\n+++ b/datasette/database.py\r\n@@ -7,7 +7,7 @@ import sys\r\n import threading\r\n import uuid\r\n \r\n-from .tracer import trace\r\n+from .tracer import trace, 
trace_child_tasks\r\n from .utils import (\r\n detect_fts,\r\n detect_primary_keys,\r\n@@ -207,30 +207,31 @@ class Database:\r\n time_limit_ms = custom_time_limit\r\n \r\n with sqlite_timelimit(conn, time_limit_ms):\r\n- try:\r\n- cursor = conn.cursor()\r\n- cursor.execute(sql, params if params is not None else {})\r\n- max_returned_rows = self.ds.max_returned_rows\r\n- if max_returned_rows == page_size:\r\n- max_returned_rows += 1\r\n- if max_returned_rows and truncate:\r\n- rows = cursor.fetchmany(max_returned_rows + 1)\r\n- truncated = len(rows) > max_returned_rows\r\n- rows = rows[:max_returned_rows]\r\n- else:\r\n- rows = cursor.fetchall()\r\n- truncated = False\r\n- except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:\r\n- if e.args == (\"interrupted\",):\r\n- raise QueryInterrupted(e, sql, params)\r\n- if log_sql_errors:\r\n- sys.stderr.write(\r\n- \"ERROR: conn={}, sql = {}, params = {}: {}\\n\".format(\r\n- conn, repr(sql), params, e\r\n+ with trace(\"sql\", database=self.name, sql=sql.strip(), params=params):\r\n+ try:\r\n+ cursor = conn.cursor()\r\n+ cursor.execute(sql, params if params is not None else {})\r\n+ max_returned_rows = self.ds.max_returned_rows\r\n+ if max_returned_rows == page_size:\r\n+ max_returned_rows += 1\r\n+ if max_returned_rows and truncate:\r\n+ rows = cursor.fetchmany(max_returned_rows + 1)\r\n+ truncated = len(rows) > max_returned_rows\r\n+ rows = rows[:max_returned_rows]\r\n+ else:\r\n+ rows = cursor.fetchall()\r\n+ truncated = False\r\n+ except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:\r\n+ if e.args == (\"interrupted\",):\r\n+ raise QueryInterrupted(e, sql, params)\r\n+ if log_sql_errors:\r\n+ sys.stderr.write(\r\n+ \"ERROR: conn={}, sql = {}, params = {}: {}\\n\".format(\r\n+ conn, repr(sql), params, e\r\n+ )\r\n )\r\n- )\r\n- sys.stderr.flush()\r\n- raise\r\n+ sys.stderr.flush()\r\n+ raise\r\n \r\n if truncate:\r\n return Results(rows, truncated, cursor.description)\r\n@@ -238,9 +239,8 @@ class 
Database:\r\n else:\r\n return Results(rows, False, cursor.description)\r\n \r\n- with trace(\"sql\", database=self.name, sql=sql.strip(), params=params):\r\n- results = await self.execute_fn(sql_operation_in_thread)\r\n- return results\r\n+ with trace_child_tasks():\r\n+ return await self.execute_fn(sql_operation_in_thread)\r\n \r\n @property\r\n def size(self):\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111485722", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111485722, "node_id": "IC_kwDOBm6k_c5CP-0a", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T21:08:20Z", "updated_at": "2022-04-27T21:08:20Z", "author_association": "OWNER", "body": "Tried that and it didn't seem to make a difference either.\r\n\r\nI really need a much deeper view of what's going on here.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111597176", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111597176, "node_id": "IC_kwDOBm6k_c5CQaB4", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T00:11:44Z", "updated_at": "2022-04-28T00:11:44Z", "author_association": "OWNER", "body": "Though it would be interesting to also have the trace reveal how much time is spent in the functions that wrap that core SQL - the stuff that is being measured at the 
moment.\r\n\r\nI have a hunch that this could help solve the over-arching performance mystery.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111462442", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111462442, "node_id": "IC_kwDOBm6k_c5CP5Iq", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T20:40:59Z", "updated_at": "2022-04-27T20:42:49Z", "author_association": "OWNER", "body": "This looks VERY relevant: [SQLite Shared-Cache Mode](https://www.sqlite.org/sharedcache.html):\r\n\r\n> SQLite includes a special \"shared-cache\" mode (disabled by default) intended for use in embedded servers. If shared-cache mode is enabled and a thread establishes multiple connections to the same database, the connections share a single data and schema cache. 
This can significantly reduce the quantity of memory and IO required by the system.\r\n\r\nEnabled as part of the URI filename:\r\n\r\n ATTACH 'file:aux.db?cache=shared' AS aux;\r\n\r\nTurns out I'm already using this for in-memory databases that have `.memory_name` set, but not (yet) for regular file-backed databases:\r\n\r\nhttps://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L73-L75\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112668411", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112668411, "node_id": "IC_kwDOBm6k_c5CUfj7", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T21:25:34Z", "updated_at": "2022-04-28T21:25:44Z", "author_association": "OWNER", "body": "The two most promising theories at the moment, from here and Twitter and the SQLite forum, are:\r\n\r\n- SQLite is I/O bound - it generally only goes as fast as it can load data from disk. Multiple connections all competing for the same file on disk are going to end up blocked at the file system layer. But maybe this means in-memory databases will perform better?\r\n- It's the GIL. The sqlite3 C code may release the GIL, but the bits that do things like assembling `Row` objects to return still happen in Python, and that Python can only run on a single core.\r\n\r\nA couple of ways to research the in-memory theory:\r\n\r\n- Use a RAM disk on macOS (or Linux). 
https://stackoverflow.com/a/2033417/6083 has instructions - short version:\r\n\r\n hdiutil attach -nomount ram://$((2 * 1024 * 100))\r\n diskutil eraseVolume HFS+ RAMDisk name-returned-by-previous-command (was `/dev/disk2` when I tried it)\r\n cd /Volumes/RAMDisk\r\n cp ~/fixtures.db .\r\n\r\n- Copy Datasette databases into an in-memory database on startup. I built a new plugin to do that here: https://github.com/simonw/datasette-copy-to-memory\r\n\r\nI need to do some more, better benchmarks using these different approaches.\r\n\r\nhttps://twitter.com/laurencerowe/status/1519780174560169987 also suggests:\r\n\r\n> Maybe try:\r\n> 1. Copy the sqlite file to /dev/shm and rerun (all in ram.)\r\n> 2. Create a CTE which calculates Fibonacci or similar so you can test something completely cpu bound (only return max value or something to avoid crossing between sqlite/Python.)\r\n\r\nI like that second idea a lot - I could use the mandelbrot example from https://www.sqlite.org/lang_with.html#outlandish_recursive_query_examples", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111442012", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111442012, "node_id": "IC_kwDOBm6k_c5CP0Jc", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T20:19:00Z", "updated_at": "2022-04-27T20:19:00Z", "author_association": "OWNER", "body": "Something worth digging into: are these parallel queries running against the same SQLite connection or are they each running against a separate SQLite connection?\r\n\r\nJust realized I know the answer: they're running against separate SQLite connections, because that's how the time limit mechanism 
works: it installs a progress handler for each connection which terminates it after a set time.\r\n\r\nThis means that if SQLite benefits from multiple threads using the same connection (due to shared caches or similar) then Datasette will not be seeing those benefits.\r\n\r\nIt also means that if there's some mechanism within SQLite that penalizes you for having multiple parallel connections to a single file (just guessing here, maybe there's some kind of locking going on?) then Datasette will suffer those penalties.\r\n\r\nI should try seeing what happens with WAL mode enabled.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1114058210", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1114058210, "node_id": "IC_kwDOBm6k_c5CZy3i", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-30T21:39:34Z", "updated_at": "2022-04-30T21:39:34Z", "author_association": "OWNER", "body": "Something to consider if I look into subprocesses for parallel query execution:\r\n\r\nhttps://sqlite.org/howtocorrupt.html#_carrying_an_open_database_connection_across_a_fork_\r\n\r\n> Do not open an SQLite database connection, then fork(), then try to use that database connection in the child process. All kinds of locking problems will result and you can easily end up with a corrupt database. SQLite is not designed to support that kind of behavior. Any database connection that is used in a child process must be opened in the child process, not inherited from the parent. 
", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111408273", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111408273, "node_id": "IC_kwDOBm6k_c5CPr6R", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T19:40:51Z", "updated_at": "2022-04-27T19:42:17Z", "author_association": "OWNER", "body": "Relevant: here's the code that sets up a Datasette SQLite connection: https://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L73-L96\r\n\r\nIt's using `check_same_thread=False` - here's [the Python docs on that](https://docs.python.org/3/library/sqlite3.html#sqlite3.connect):\r\n\r\n> By default, *check_same_thread* is [`True`](https://docs.python.org/3/library/constants.html#True \"True\") and only the creating thread may use the connection. If set [`False`](https://docs.python.org/3/library/constants.html#False \"False\"), the returned connection may be shared across multiple threads. 
When using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption.\r\n\r\nThis is why Datasette reserves a single connection for write queries and queues them up in memory, [as described here](https://simonwillison.net/2020/Feb/26/weeknotes-datasette-writes/).", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111551076", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111551076, "node_id": "IC_kwDOBm6k_c5CQOxk", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T22:44:51Z", "updated_at": "2022-04-27T22:45:04Z", "author_association": "OWNER", "body": "Really wild idea: what if I created three copies of the SQLite database file - as three separate file names - and then balanced the parallel queries across all these? Any chance that could avoid any mysterious locking issues?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111390433", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111390433, "node_id": "IC_kwDOBm6k_c5CPnjh", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T19:21:02Z", "updated_at": "2022-04-27T19:21:02Z", "author_association": "OWNER", "body": "One weird thing: I noticed that in the parallel trace above the SQL query bars are wider. 
Mouseover shows duration in ms, and I got 13ms for this query:\r\n\r\n select message as value, count(*) as n from (\r\n\r\nBut in the `?_noparallel=1` version that same query took 2.97ms.\r\n\r\nGiven those numbers though I would expect the overall page time to be MUCH worse for the parallel version - but the page load times are instead very close to each other, with parallel often winning.\r\n\r\nThis is super-weird.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112889800", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112889800, "node_id": "IC_kwDOBm6k_c5CVVnI", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-29T05:29:38Z", "updated_at": "2022-04-29T05:29:38Z", "author_association": "OWNER", "body": "OK, I just got the most incredible result with that!\r\n\r\nI started up a container running `bash` like this, from my `datasette` checkout. I'm mapping port 8005 on my laptop to port 8001 inside the container because laptop port 8001 was already doing something else:\r\n```\r\ndocker run -it --rm --name my-running-script -p 8005:8001 -v \"$PWD\":/usr/src/myapp \\\r\n -w /usr/src/myapp nogil/python bash\r\n```\r\nThen in `bash` I ran the following commands to install Datasette and its dependencies:\r\n```\r\npip install -e '.[test]'\r\npip install datasette-pretty-traces # For debug tracing\r\n```\r\nThen I started Datasette against my `github.db` database (from github-to-sqlite.dogsheep.net/github.db) like this:\r\n\r\n```\r\ndatasette github.db -h --setting trace_debug 1\r\n```\r\nI hit the following two URLs to compare the parallel vs. non-parallel implementations:\r\n\r\n- ``\r\n- ``\r\n\r\nAnd... 
the parallel one beat the non-parallel one decisively, on multiple page refreshes!\r\n\r\nNot parallel: 77ms\r\n\r\nParallel: 47ms\r\n\r\nSo yeah, I'm very confident this is a problem with the GIL. And I am absolutely **stunned** that @colesbury's fork ran Datasette (which has some reasonably tricky threading and async stuff going on) out of the box!", "reactions": "{\"total_count\": 2, \"+1\": 2, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111683539", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111683539, "node_id": "IC_kwDOBm6k_c5CQvHT", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T02:47:57Z", "updated_at": "2022-04-28T02:47:57Z", "author_association": "OWNER", "body": "Maybe this is the Python GIL after all?\r\n\r\nI've been hoping that the GIL won't be an issue because the `sqlite3` module releases the GIL for the duration of the execution of a SQL query - see https://github.com/python/cpython/blob/f348154c8f8a9c254503306c59d6779d4d09b3a9/Modules/_sqlite/cursor.c#L749-L759\r\n\r\nSo I've been hoping this means that SQLite code itself can run concurrently on multiple cores even when Python threads cannot.\r\n\r\nBut maybe I'm misunderstanding how that works?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112879463", "issue_url": 
"https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112879463, "node_id": "IC_kwDOBm6k_c5CVTFn", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-29T05:03:58Z", "updated_at": "2022-04-29T05:03:58Z", "author_association": "OWNER", "body": "It would be _really_ fun to try running this with the in-development `nogil` Python from https://github.com/colesbury/nogil\r\n\r\nThere's a Docker container for it: https://hub.docker.com/r/nogil/python\r\n\r\nIt suggests you can run something like this:\r\n\r\n docker run -it --rm --name my-running-script -v \"$PWD\":/usr/src/myapp \\\r\n -w /usr/src/myapp nogil/python python your-daemon-or-script.py", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111553029", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111553029, "node_id": "IC_kwDOBm6k_c5CQPQF", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T22:48:21Z", "updated_at": "2022-04-27T22:48:21Z", "author_association": "OWNER", "body": "I wonder if it would be worth exploring multiprocessing here.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111431785", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111431785, "node_id": "IC_kwDOBm6k_c5CPxpp", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T20:07:16Z", 
"updated_at": "2022-04-27T20:07:16Z", "author_association": "OWNER", "body": "I think I need some much more in-depth tracing tricks for this.\r\n\r\nhttps://www.maartenbreddels.com/perf/jupyter/python/tracing/gil/2021/01/14/Tracing-the-Python-GIL.html looks relevant - uses the `perf` tool on Linux.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111558204", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111558204, "node_id": "IC_kwDOBm6k_c5CQQg8", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T22:58:39Z", "updated_at": "2022-04-27T22:58:39Z", "author_association": "OWNER", "body": "I should check my timing mechanism. Am I capturing the time taken just in SQLite or does it include time spent in Python crossing between async and threaded world and waiting for a thread pool worker to become available?\r\n\r\nThat could explain the longer query times.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111699175", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111699175, "node_id": "IC_kwDOBm6k_c5CQy7n", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T03:19:48Z", "updated_at": "2022-04-28T03:20:08Z", "author_association": "OWNER", "body": "I ran `py-spy` and then hammered refresh a bunch of times on the `` page - it generated this SVG profile 
for me.\r\n\r\nThe area on the right is the threads running the DB queries:\r\n\r\n![profile](https://user-images.githubusercontent.com/9599/165669677-5461ede5-3dc4-4b49-8319-bfe5fd8a723d.svg)\r\n\r\nInteractive version here: https://static.simonwillison.net/static/2022/datasette-parallel-profile.svg", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111385875", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111385875, "node_id": "IC_kwDOBm6k_c5CPmcT", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T19:16:57Z", "updated_at": "2022-04-27T19:16:57Z", "author_association": "OWNER", "body": "I just remembered the `--setting num_sql_threads` option... which defaults to 3! 
https://github.com/simonw/datasette/blob/942411ef946e9a34a2094944d3423cddad27efd3/datasette/app.py#L109-L113\r\n\r\nWould explain why the first trace never seems to show more than three SQL queries executing at once.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111681513", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111681513, "node_id": "IC_kwDOBm6k_c5CQunp", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T02:44:26Z", "updated_at": "2022-04-28T02:44:26Z", "author_association": "OWNER", "body": "I could try `py-spy top`, which I previously used here:\r\n- https://github.com/simonw/datasette/issues/1673", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111726586", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111726586, "node_id": "IC_kwDOBm6k_c5CQ5n6", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T04:17:16Z", "updated_at": "2022-04-28T04:19:31Z", "author_association": "OWNER", "body": "I could experiment with the `await asyncio.run_in_executor(processpool_executor, fn)` mechanism described in https://stackoverflow.com/a/29147750\r\n\r\nCode examples: https://cs.github.com/?scopeName=All+repos&scope=&q=run_in_executor+ProcessPoolExecutor", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, 
\"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111595319", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111595319, "node_id": "IC_kwDOBm6k_c5CQZk3", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T00:09:45Z", "updated_at": "2022-04-28T00:11:01Z", "author_association": "OWNER", "body": "Here's where read queries are instrumented: https://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L241-L242\r\n\r\nSo the instrumentation is actually capturing quite a bit of Python activity before it gets to SQLite:\r\n\r\nhttps://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L179-L190\r\n\r\nAnd then:\r\n\r\nhttps://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L204-L233\r\n\r\nIdeally I'd like that `trace()` block to wrap just the `cursor.execute()` and `cursor.fetchmany(...)` or `cursor.fetchall()` calls.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112878955", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112878955, "node_id": "IC_kwDOBm6k_c5CVS9r", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-29T05:02:40Z", "updated_at": "2022-04-29T05:02:40Z", "author_association": "OWNER", "body": "Here's a very useful (recent) article about how the GIL works and how to think about it: 
https://pythonspeed.com/articles/python-gil/ - via https://lobste.rs/s/9hj80j/when_python_can_t_thread_deep_dive_into_gil\r\n\r\nFrom that article:\r\n\r\n> For example, let's consider an extension module written in C or Rust that lets you talk to a PostgreSQL database server.\r\n> \r\n> Conceptually, handling a SQL query with this library will go through three steps:\r\n> \r\n> 1. Deserialize from Python to the internal library representation. Since this will be reading Python objects, it needs to hold the GIL.\r\n> 2. Send the query to the database server, and wait for a response. This doesn't need the GIL.\r\n> 3. Convert the response into Python objects. This needs the GIL again.\r\n> \r\n> As you can see, how much parallelism you can get depends on how much time is spent in each step. If the bulk of time is spent in step 2, you'll get parallelism there. But if, for example, you run a `SELECT` and get a large number of rows back, the library will need to create many Python objects, and step 3 will have to hold GIL for a while.\r\n\r\nThat explains what I'm seeing here. 
I'm pretty convinced now that the reason I'm not getting a performance boost from parallel queries is that there's more time spent in Python code assembling the results than in SQLite C code executing the query.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111661331", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111661331, "node_id": "IC_kwDOBm6k_c5CQpsT", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T02:07:31Z", "updated_at": "2022-04-28T02:07:31Z", "author_association": "OWNER", "body": "Asked on the SQLite forum about this here: https://sqlite.org/forum/forumpost/ffbfa9f38e", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111535818", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111535818, "node_id": "IC_kwDOBm6k_c5CQLDK", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T22:18:45Z", "updated_at": "2022-04-27T22:18:45Z", "author_association": "OWNER", "body": "Another avenue: https://twitter.com/weargoggles/status/1519426289920270337\r\n\r\n> SQLite has its own mutexes to provide thread safety, which as another poster noted are out of play in multi process setups. Perhaps downgrading from the \u201cserializable\u201d to \u201cmulti-threaded\u201d safety would be okay for Datasette? 
https://sqlite.org/c3ref/c_config_covering_index_scan.html#sqliteconfigmultithread\r\n\r\nDoesn't look like there's an obvious way to access that from Python via the `sqlite3` module though.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111432375", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111432375, "node_id": "IC_kwDOBm6k_c5CPxy3", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-27T20:07:57Z", "updated_at": "2022-04-27T20:07:57Z", "author_association": "OWNER", "body": "Also useful: https://avi.im/blag/2021/fast-sqlite-inserts/ - from a tip on Twitter: https://twitter.com/ricardoanderegg/status/1519402047556235264", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111451790", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111451790, "node_id": "IC_kwDOBm6k_c5CP2iO", "user": {"value": 716529, "label": "glyph"}, "created_at": "2022-04-27T20:30:33Z", "updated_at": "2022-04-27T20:30:33Z", "author_association": "NONE", "body": "> I should try seeing what happens with WAL mode enabled.\r\n\r\nI've only skimmed above but it looks like you're doing mainly read-only queries? 
WAL mode is about better interactions between writers & readers, primarily.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null}