{"id": 688670158, "node_id": "MDU6SXNzdWU2ODg2NzAxNTg=", "number": 147, "title": "SQLITE_MAX_VARS maybe hard-coded too low", "user": {"value": 96218, "label": "simonwiles"}, "state": "open", "locked": 0, "assignee": null, "milestone": null, "comments": 7, "created_at": "2020-08-30T07:26:45Z", "updated_at": "2021-02-15T21:27:55Z", "closed_at": null, "author_association": "CONTRIBUTOR", "pull_request": null, "body": "I came across this while about to open an issue and PR against the documentation for `batch_size`, which is a bit incomplete.\r\n\r\nAs mentioned in #145, while:\r\n\r\n> [`SQLITE_MAX_VARIABLE_NUMBER`](https://www.sqlite.org/limits.html#max_variable_number) ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.\r\n\r\nit is common that it is increased at compile time. Debian-based systems, for example, seem to ship with a version of sqlite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe this is the case for homebrew installations too.\r\n\r\nIn working to understand what `batch_size` was actually doing and why, I realized that by setting `SQLITE_MAX_VARS` in `db.py` to match the value my sqlite was compiled with (I'm on Debian), I was able to decrease the time to `insert_all()` my test data set (~128k records across 7 tables) from ~26.5s to ~3.5s. Given that this about .05% of my total dataset, this is time I am keen to save...\r\n\r\nUnfortunately, it seems that `sqlite3` in the python standard library doesn't expose the `get_limit()` C API (even though `pysqlite` used to), so it's hard to know what value sqlite has been compiled with (note that this could mean, I suppose, that it's less than 999, and even hardcoding `SQLITE_MAX_VARS` to the conservative default might not be adequate. It can also be lowered -- but not raised -- at runtime). The best I could come up with is `echo \"\" | sqlite3 -cmd \".limits variable_number\"` (only available in `sqlite >= 2015-05-07 (3.8.10)`).\r\n\r\nObviously this couldn't be relied upon in `sqlite_utils`, but I wonder what your opinion would be about exposing `SQLITE_MAX_VARS` as a user-configurable parameter (with suitable \"here be dragons\" warnings)? I'm going to go ahead and monkey-patch it for my purposes in any event, but it seems like it might be worth considering.", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/147/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": null} {"id": 697030843, "node_id": "MDExOlB1bGxSZXF1ZXN0NDgzMDI3NTg3", "number": 156, "title": "Typos in tests", "user": {"value": 96218, "label": "simonwiles"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 1, "created_at": "2020-09-09T18:00:58Z", "updated_at": "2020-09-09T18:24:50Z", "closed_at": "2020-09-09T18:21:23Z", "author_association": "CONTRIBUTOR", "pull_request": "simonw/sqlite-utils/pulls/156", "body": "One of these is my fault, and the other is one I just happened to come across. 
They're harmless, but might as well be fixed.", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "pull", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/156/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": 0, "state_reason": null} {"id": 688659182, "node_id": "MDU6SXNzdWU2ODg2NTkxODI=", "number": 145, "title": "Bug when first record contains fewer columns than subsequent records", "user": {"value": 96218, "label": "simonwiles"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 2, "created_at": "2020-08-30T05:44:44Z", "updated_at": "2020-09-08T23:21:23Z", "closed_at": "2020-09-08T23:21:23Z", "author_association": "CONTRIBUTOR", "pull_request": null, "body": "`insert_all()` selects the maximum batch size based on the number of fields in the first record. If the first record has fewer fields than subsequent records (and `alter=True` is passed), this can result in SQL statements with more than the maximum permitted number of host parameters. This situation is perhaps unlikely to occur, but could happen if the first record had, say, 10 columns, such that `batch_size` (based on `SQLITE_MAX_VARIABLE_NUMBER = 999`) would be 99. If the next 98 rows had 11 columns, the resulting SQL statement for the first batch would have `10 * 1 + 11 * 98 = 1088` host parameters (and subsequent batches, if the data were consistent from thereon out, would have `99 * 11 = 1089`).\r\n\r\nI suspect that this bug is masked somewhat by the fact that while:\r\n> [`SQLITE_MAX_VARIABLE_NUMBER`](https://www.sqlite.org/limits.html#max_variable_number) ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.\r\n\r\nit is common that it is increased at compile time. Debian-based systems, for example, seem to ship with a version of sqlite compiled with `SQLITE_MAX_VARIABLE_NUMBER` set to 250,000, and I believe this is the case for homebrew installations too.\r\n\r\nA test for this issue might look like this:\r\n```python\r\ndef test_columns_not_in_first_record_should_not_cause_batch_to_be_too_large(fresh_db):\r\n # sqlite on homebrew and Debian/Ubuntu etc. is typically compiled with\r\n # SQLITE_MAX_VARIABLE_NUMBER set to 250,000, so we need to exceed this value to\r\n # trigger the error on these systems.\r\n THRESHOLD = 250000\r\n extra_columns = 1 + (THRESHOLD - 1) // 99\r\n records = [\r\n {\"c0\": \"first record\"}, # one column in first record -> batch_size = 100\r\n # fill out the batch with 99 records with enough columns to exceed THRESHOLD\r\n *[\r\n dict([(\"c{}\".format(i), j) for i in range(extra_columns)])\r\n for j in range(99)\r\n ]\r\n ]\r\n try:\r\n fresh_db[\"too_many_columns\"].insert_all(records, alter=True)\r\n except sqlite3.OperationalError:\r\n raise\r\n```\r\n\r\nThe best solution, I think, is simply to process all the records when determining columns, column types, and the batch size. In my tests this doesn't seem to be particularly costly at all, and cuts out a lot of complications (including obviating my implementation of #139 at #142). 
I'll raise a PR for your consideration.\r\n\r\n", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/145/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": "completed"} {"id": 688668680, "node_id": "MDExOlB1bGxSZXF1ZXN0NDc1ODc0NDkz", "number": 146, "title": "Handle case where subsequent records (after first batch) include extra columns", "user": {"value": 96218, "label": "simonwiles"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 5, "created_at": "2020-08-30T07:13:58Z", "updated_at": "2020-09-08T23:20:37Z", "closed_at": "2020-09-08T23:20:37Z", "author_association": "CONTRIBUTOR", "pull_request": "simonw/sqlite-utils/pulls/146", "body": "Addresses #145.\r\n\r\nI think this should do the job. If it meets with your approval I'll update this PR to include an update to the documentation -- I came across this bug while preparing a PR to update the documentation around `batch_size` in any event.", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "pull", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/146/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": 0, "state_reason": null} {"id": 688386219, "node_id": "MDExOlB1bGxSZXF1ZXN0NDc1NjY1OTg0", "number": 142, "title": "insert_all(..., alter=True) should work for new columns introduced after the first 100 records", "user": {"value": 96218, "label": "simonwiles"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 3, "created_at": "2020-08-28T22:22:57Z", "updated_at": "2020-08-30T07:28:23Z", "closed_at": "2020-08-28T22:30:14Z", "author_association": "CONTRIBUTOR", "pull_request": "simonw/sqlite-utils/pulls/142", "body": "Closes #139.", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "pull", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/142/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": 0, "state_reason": null} {"id": 686978131, "node_id": "MDU6SXNzdWU2ODY5NzgxMzE=", "number": 139, "title": "insert_all(..., alter=True) should work for new columns introduced after the first 100 records", "user": {"value": 96218, "label": "simonwiles"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 7, "created_at": "2020-08-27T06:25:25Z", "updated_at": "2020-08-28T22:48:51Z", "closed_at": "2020-08-28T22:30:14Z", "author_association": "CONTRIBUTOR", "pull_request": null, "body": "Is there a way to make `.insert_all()` work properly when new columns are introduced outside the first 100 records (with or without the `alter=True` argument)?\r\n\r\nI'm using `.insert_all()` to bulk insert ~3-4k records at a time and it is common for records to need to introduce new columns. However, if new columns are introduced after the first 100 records, `sqlite_utils` doesn't even raise the `OperationalError: table ... 
has no column named ...` exception; it just silently drops the extra data and moves on.\r\n\r\nIt took me a while to find this little snippet in the [documentation for `.insert_all()`](https://sqlite-utils.readthedocs.io/en/stable/python-api.html#bulk-inserts) (it's not mentioned under [Adding columns automatically on insert/update](https://sqlite-utils.readthedocs.io/en/stable/python-api.html#bulk-inserts)):\r\n\r\n> The column types used in the CREATE TABLE statement are automatically derived from the types of data in that first batch of rows. **_Any additional or missing columns in subsequent batches will be ignored._**\r\n\r\nI tried changing the `batch_size` argument to the total number of records, but it seems only to affect the number of rows that are committed at a time, and has no influence on this problem.\r\n\r\nIs there a way around this that you would suggest? It seems like it should raise an exception at least.", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/139/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": "completed"}
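In #147 above, simonwiles describes monkey-patching `SQLITE_MAX_VARS` in `db.py` and reading the compiled limit with `echo "" | sqlite3 -cmd ".limits variable_number"`. The sketch below is an illustration of combining those two ideas, not code from sqlite-utils or from the issues: the `detected_max_vars` helper, the `example.db` database, and the `records` table are made up for the example, and `SQLITE_MAX_VARS` is an undocumented internal of `sqlite_utils.db`, so patching it only works as long as `insert_all()` keeps reading the module-level constant at call time.

```python
# Illustration only (not from the issues above): raise sqlite_utils' internal
# SQLITE_MAX_VARS to the limit the local SQLite was actually compiled with,
# using the ".limits variable_number" CLI workaround quoted in #147.
import subprocess

import sqlite_utils
import sqlite_utils.db


def detected_max_vars(default=999):
    # Hypothetical helper: best-effort read of SQLITE_MAX_VARIABLE_NUMBER via
    # the sqlite3 command-line shell (needs sqlite3 >= 3.8.10 on the PATH).
    try:
        out = subprocess.run(
            ["sqlite3", "-cmd", ".limits variable_number"],
            input="", capture_output=True, text=True, check=True,
        ).stdout
        # Output looks like "      variable_number 250000"
        return int(out.split()[-1])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return default  # fall back to the conservative pre-3.32.0 default


# "Here be dragons": SQLITE_MAX_VARS is an internal constant in sqlite_utils/db.py,
# not a supported setting; patching it is the workaround #147 proposes.
sqlite_utils.db.SQLITE_MAX_VARS = detected_max_vars()

db = sqlite_utils.Database("example.db")
db["records"].insert_all(
    ({"id": i, "value": i * 2} for i in range(100_000)),
    alter=True,
)
```

Falling back to 999 when the CLI is unavailable keeps the behaviour identical to the conservative hard-coded default discussed in #147, so the patch can only widen batches, never break on systems compiled with the stock limit.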