{"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1112668411", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1112668411, "node_id": "IC_kwDOBm6k_c5CUfj7", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T21:25:34Z", "updated_at": "2022-04-28T21:25:44Z", "author_association": "OWNER", "body": "The two most promising theories at the moment, from here and Twitter and the SQLite forum, are:\r\n\r\n- SQLite is I/O bound - it generally only goes as fast as it can load data from disk. Multiple connections all competing for the same file on disk are going to end up blocked at the file system layer. But maybe this means in-memory databases will perform better?\r\n- It's the GIL. The sqlite3 C code may release the GIL, but the bits that do things like assembling `Row` objects to return still happen in Python, and that Python can only run on a single core.\r\n\r\nA couple of ways to research the in-memory theory:\r\n\r\n- Use a RAM disk on macOS (or Linux). https://stackoverflow.com/a/2033417/6083 has instructions - short version:\r\n\r\n hdiutil attach -nomount ram://$((2 * 1024 * 100))\r\n diskutil eraseVolume HFS+ RAMDisk name-returned-by-previous-command (was `/dev/disk2` when I tried it)\r\n cd /Volumes/RAMDisk\r\n cp ~/fixtures.db .\r\n\r\n- Copy Datasette databases into an in-memory database on startup. I built a new plugin to do that here: https://github.com/simonw/datasette-copy-to-memory\r\n\r\nI need to do some more, better benchmarks using these different approaches.\r\n\r\nhttps://twitter.com/laurencerowe/status/1519780174560169987 also suggests:\r\n\r\n> Maybe try:\r\n> 1. Copy the sqlite file to /dev/shm and rerun (all in ram.)\r\n> 2. Create a CTE which calculates Fibonacci or similar so you can test something completely cpu bound (only return max value or something to avoid crossing between sqlite/Python.)\r\n\r\nI like that second idea a lot - I could use the mandelbrot example from https://www.sqlite.org/lang_with.html#outlandish_recursive_query_examples", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111726586", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111726586, "node_id": "IC_kwDOBm6k_c5CQ5n6", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T04:17:16Z", "updated_at": "2022-04-28T04:19:31Z", "author_association": "OWNER", "body": "I could experiment with the `await asyncio.run_in_executor(processpool_executor, fn)` mechanism described in https://stackoverflow.com/a/29147750\r\n\r\nCode examples: https://cs.github.com/?scopeName=All+repos&scope=&q=run_in_executor+ProcessPoolExecutor", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111725638", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111725638, "node_id": "IC_kwDOBm6k_c5CQ5ZG", "user": {"value": 9599, "label": "simonw"}, "created_at": 
"2022-04-28T04:15:15Z", "updated_at": "2022-04-28T04:15:15Z", "author_association": "OWNER", "body": "Useful theory from Keith Medcalf https://sqlite.org/forum/forumpost/e363c69d3441172e\r\n\r\n> This is true, but the concurrency is limited to the execution which occurs with the GIL released (that is, in the native C sqlite3 library itself). Each row (for example) can be retrieved in parallel but \"constructing the python return objects for each row\" will be serialized (by the GIL).\r\n> \r\n> That is to say that if your have two python threads each with their own connection, and each one is performing a select that returns 1,000,000 rows (lets say that is 25% of the candidates for each select) then the difference in execution time between executing two python threads in parallel vs a single serial thead will not be much different (if even detectable at all). In fact it is possible that the multiple-threaded version takes longer to run both queries to completion because of the increased contention over a shared resource (the GIL).\r\n\r\nSo maybe this is a GIL thing.\r\n\r\nI should test with some expensive SQL queries (maybe big aggregations against large tables) and see if I can spot an improvement there.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111699175", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111699175, "node_id": "IC_kwDOBm6k_c5CQy7n", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T03:19:48Z", "updated_at": "2022-04-28T03:20:08Z", "author_association": "OWNER", "body": "I ran `py-spy` and then hammered refresh a bunch of times on the `http://127.0.0.1:8856/github/commits?_facet=repo&_facet=committer&_trace=1&_noparallel=` page - it generated this SVG profile for me.\r\n\r\nThe area on the right is the threads running the DB queries:\r\n\r\n![profile](https://user-images.githubusercontent.com/9599/165669677-5461ede5-3dc4-4b49-8319-bfe5fd8a723d.svg)\r\n\r\nInteractive version here: https://static.simonwillison.net/static/2022/datasette-parallel-profile.svg", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111683539", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111683539, "node_id": "IC_kwDOBm6k_c5CQvHT", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T02:47:57Z", "updated_at": "2022-04-28T02:47:57Z", "author_association": "OWNER", "body": "Maybe this is the Python GIL after all?\r\n\r\nI've been hoping that the GIL won't be an issue because the `sqlite3` module releases the GIL for the duration of the execution of a SQL query - see https://github.com/python/cpython/blob/f348154c8f8a9c254503306c59d6779d4d09b3a9/Modules/_sqlite/cursor.c#L749-L759\r\n\r\nSo I've been hoping this means that SQLite code itself can run concurrently on multiple cores even when Python threads cannot.\r\n\r\nBut maybe I'm misunderstanding how that works?", 
"reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111681513", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111681513, "node_id": "IC_kwDOBm6k_c5CQunp", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T02:44:26Z", "updated_at": "2022-04-28T02:44:26Z", "author_association": "OWNER", "body": "I could try `py-spy top`, which I previously used here:\r\n- https://github.com/simonw/datasette/issues/1673", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111661331", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111661331, "node_id": "IC_kwDOBm6k_c5CQpsT", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T02:07:31Z", "updated_at": "2022-04-28T02:07:31Z", "author_association": "OWNER", "body": "Asked on the SQLite forum about this here: https://sqlite.org/forum/forumpost/ffbfa9f38e", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111602802", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111602802, "node_id": "IC_kwDOBm6k_c5CQbZy", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T00:21:35Z", "updated_at": "2022-04-28T00:21:35Z", "author_association": "OWNER", "body": "Tried this but I'm getting back an empty JSON array of traces at the bottom of the page most of the time (intermittently it works correctly):\r\n\r\n```diff\r\ndiff --git a/datasette/database.py b/datasette/database.py\r\nindex ba594a8..d7f9172 100644\r\n--- a/datasette/database.py\r\n+++ b/datasette/database.py\r\n@@ -7,7 +7,7 @@ import sys\r\n import threading\r\n import uuid\r\n \r\n-from .tracer import trace\r\n+from .tracer import trace, trace_child_tasks\r\n from .utils import (\r\n detect_fts,\r\n detect_primary_keys,\r\n@@ -207,30 +207,31 @@ class Database:\r\n time_limit_ms = custom_time_limit\r\n \r\n with sqlite_timelimit(conn, time_limit_ms):\r\n- try:\r\n- cursor = conn.cursor()\r\n- cursor.execute(sql, params if params is not None else {})\r\n- max_returned_rows = self.ds.max_returned_rows\r\n- if max_returned_rows == page_size:\r\n- max_returned_rows += 1\r\n- if max_returned_rows and truncate:\r\n- rows = cursor.fetchmany(max_returned_rows + 1)\r\n- truncated = len(rows) > max_returned_rows\r\n- rows = rows[:max_returned_rows]\r\n- else:\r\n- rows = cursor.fetchall()\r\n- truncated = False\r\n- except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:\r\n- if e.args == (\"interrupted\",):\r\n- raise QueryInterrupted(e, sql, params)\r\n- if log_sql_errors:\r\n- sys.stderr.write(\r\n- \"ERROR: conn={}, sql = {}, params = {}: 
{}\\n\".format(\r\n- conn, repr(sql), params, e\r\n+ with trace(\"sql\", database=self.name, sql=sql.strip(), params=params):\r\n+ try:\r\n+ cursor = conn.cursor()\r\n+ cursor.execute(sql, params if params is not None else {})\r\n+ max_returned_rows = self.ds.max_returned_rows\r\n+ if max_returned_rows == page_size:\r\n+ max_returned_rows += 1\r\n+ if max_returned_rows and truncate:\r\n+ rows = cursor.fetchmany(max_returned_rows + 1)\r\n+ truncated = len(rows) > max_returned_rows\r\n+ rows = rows[:max_returned_rows]\r\n+ else:\r\n+ rows = cursor.fetchall()\r\n+ truncated = False\r\n+ except (sqlite3.OperationalError, sqlite3.DatabaseError) as e:\r\n+ if e.args == (\"interrupted\",):\r\n+ raise QueryInterrupted(e, sql, params)\r\n+ if log_sql_errors:\r\n+ sys.stderr.write(\r\n+ \"ERROR: conn={}, sql = {}, params = {}: {}\\n\".format(\r\n+ conn, repr(sql), params, e\r\n+ )\r\n )\r\n- )\r\n- sys.stderr.flush()\r\n- raise\r\n+ sys.stderr.flush()\r\n+ raise\r\n \r\n if truncate:\r\n return Results(rows, truncated, cursor.description)\r\n@@ -238,9 +239,8 @@ class Database:\r\n else:\r\n return Results(rows, False, cursor.description)\r\n \r\n- with trace(\"sql\", database=self.name, sql=sql.strip(), params=params):\r\n- results = await self.execute_fn(sql_operation_in_thread)\r\n- return results\r\n+ with trace_child_tasks():\r\n+ return await self.execute_fn(sql_operation_in_thread)\r\n \r\n @property\r\n def size(self):\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111597176", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111597176, "node_id": "IC_kwDOBm6k_c5CQaB4", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T00:11:44Z", "updated_at": "2022-04-28T00:11:44Z", "author_association": "OWNER", "body": "Though it would be interesting to also have the trace reveal how much time is spent in the functions that wrap that core SQL - the stuff that is being measured at the moment.\r\n\r\nI have a hunch that this could help solve the over-arching performance mystery.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/datasette/issues/1727#issuecomment-1111595319", "issue_url": "https://api.github.com/repos/simonw/datasette/issues/1727", "id": 1111595319, "node_id": "IC_kwDOBm6k_c5CQZk3", "user": {"value": 9599, "label": "simonw"}, "created_at": "2022-04-28T00:09:45Z", "updated_at": "2022-04-28T00:11:01Z", "author_association": "OWNER", "body": "Here's where read queries are instrumented: https://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L241-L242\r\n\r\nSo the instrumentation is actually capturing quite a bit of Python activity before it gets to SQLite:\r\n\r\nhttps://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L179-L190\r\n\r\nAnd 
then:\r\n\r\nhttps://github.com/simonw/datasette/blob/7a6654a253dee243518dc542ce4c06dbb0d0801d/datasette/database.py#L204-L233\r\n\r\nIdeally I'd like that `trace()` block to wrap just the `cursor.execute()` and `cursor.fetchmany(...)` or `cursor.fetchall()` calls.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1217759117, "label": "Research: demonstrate if parallel SQL queries are worthwhile"}, "performed_via_github_app": null}
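A minimal sketch of the CPU-bound CTE test suggested in the first comment above (illustrative only - the Fibonacci query shape, the 5,000,000 iteration count and the two-thread setup are assumptions, not code from this issue). The query returns a single row, so almost no time is spent constructing Python objects; any gap between the serial and parallel timings should therefore reflect the GIL rather than row construction:

```python
# Does a CPU-bound SQLite query running on two threads use two cores,
# or does the GIL serialize it? Stdlib only.
import sqlite3
import threading
import time

# Recursive CTE that burns CPU inside SQLite and returns one value.
# The modulo keeps the numbers small so the work stays pure integer math.
QUERY = """
WITH RECURSIVE fib(n, a, b) AS (
  SELECT 1, 0, 1
  UNION ALL
  SELECT n + 1, b, (a + b) % 2147483647 FROM fib WHERE n < 5000000
)
SELECT max(b) FROM fib
"""


def run_query():
    # A fresh in-memory connection per call: no file system effects and no
    # cross-thread connection sharing (mirroring Datasette's
    # one-connection-per-thread approach).
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(QUERY).fetchall()  # a single row comes back
    finally:
        conn.close()


def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")


def serial():
    run_query()
    run_query()


def parallel():
    threads = [threading.Thread(target=run_query) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    timed("two queries, serial", serial)
    timed("two queries, parallel", parallel)
```

If the parallel run takes about as long as the serial one, that supports the GIL theory; if it takes roughly half the time, the `sqlite3` module really is releasing the GIL for the duration of the query.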