issues: 688659182


id: 688659182
node_id: MDU6SXNzdWU2ODg2NTkxODI=
number: 145
title: Bug when first record contains fewer columns than subsequent records
user: 96218
state: closed
locked: 0
comments: 2
created_at: 2020-08-30T05:44:44Z
updated_at: 2020-09-08T23:21:23Z
closed_at: 2020-09-08T23:21:23Z
author_association: CONTRIBUTOR
repo: 140912432
type: issue

insert_all() selects the maximum batch size based on the number of fields in the first record. If the first record has fewer fields than subsequent records (and alter=True is passed), this can result in SQL statements with more than the maximum permitted number of host parameters. This situation is perhaps unlikely to occur, but could happen if the first record had, say, 10 columns, such that batch_size (based on SQLITE_MAX_VARIABLE_NUMBER = 999) would be 99. If the next 98 rows had 11 columns, the resulting SQL statement for the first batch would have 10 * 1 + 11 * 98 = 1088 host parameters (and subsequent batches, if the data were consistent from thereon out, would have 99 * 11 = 1089).
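The arithmetic above can be checked directly, assuming the pre-3.32.0 default limit of 999:

```python
SQLITE_MAX_VARIABLE_NUMBER = 999  # default before SQLite 3.32.0

first_record_columns = 10
# batch size derived from the first record only
batch_size = SQLITE_MAX_VARIABLE_NUMBER // first_record_columns  # 999 // 10 = 99

# first batch: one 10-column record plus 98 records with 11 columns
first_batch_params = 10 * 1 + 11 * 98
# subsequent batches: 99 records of 11 columns each
later_batch_params = 11 * 99

print(batch_size, first_batch_params, later_batch_params)  # 99 1088 1089
```

Both values exceed the 999-parameter limit even though the batch size was chosen to stay under it.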

I suspect that this bug is masked somewhat by the fact that while:

SQLITE_MAX_VARIABLE_NUMBER ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

it is commonly increased at compile time. Debian-based systems, for example, seem to ship with SQLite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe the same is true of homebrew installations.
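One way to discover the effective limit on a given system is to probe it empirically with the standard-library sqlite3 module. This is a sketch (the helper name is mine, not part of any library): grow the parameter count until a statement fails, then binary-search for the exact boundary.

```python
import sqlite3


def max_variable_number():
    """Find the largest number of host parameters a single statement accepts."""
    conn = sqlite3.connect(":memory:")
    lo, hi = 1, 2
    # grow geometrically until a statement with `hi` parameters fails
    while True:
        try:
            conn.execute("SELECT " + ",".join("?" * hi), list(range(hi)))
            lo, hi = hi, hi * 2
        except sqlite3.OperationalError:  # "too many SQL variables"
            break
    # binary search for the exact limit between lo (works) and hi (fails)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        try:
            conn.execute("SELECT " + ",".join("?" * mid), list(range(mid)))
            lo = mid
        except sqlite3.OperationalError:
            hi = mid
    return lo


print(max_variable_number())  # e.g. 999, 32766 or 250,000 depending on the build
```

On Python 3.11+ the same value is available without probing via `conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER)`.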

A test for this issue might look like this:

def test_columns_not_in_first_record_should_not_cause_batch_to_be_too_large(fresh_db):
    # sqlite on homebrew and Debian/Ubuntu etc. is typically compiled with
    # SQLITE_MAX_VARIABLE_NUMBER set to 250,000, so we need to exceed this value
    # to trigger the error on these systems.
    THRESHOLD = 250000
    extra_columns = 1 + (THRESHOLD - 1) // 99
    records = [
        {"c0": "first record"},  # one column in first record -> batch_size = 100
        # fill out the batch with 99 records with enough columns to exceed THRESHOLD
        *[
            {"c{}".format(i): j for i in range(extra_columns)}
            for j in range(99)
        ],
    ]
    # with a batch size computed from the first record alone, this raises
    # sqlite3.OperationalError: too many SQL variables
    fresh_db["too_many_columns"].insert_all(records, alter=True)

The best solution, I think, is simply to process all the records when determining columns, column types, and the batch size. In my tests this doesn't seem to be particularly costly at all, and cuts out a lot of complications (including obviating my implementation of #139 at #142). I'll raise a PR for your consideration.
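A minimal sketch of that approach (the helper is hypothetical, not the actual PR): make a single pass over all the records to collect the union of columns, then derive the batch size from that full column count rather than from the first record alone.

```python
SQLITE_MAX_VARS = 999  # conservative lower bound across SQLite builds


def plan_insert(records, max_batch_size=100):
    """Collect the union of columns across all records, then size batches so
    no single statement exceeds SQLITE_MAX_VARS host parameters."""
    records = list(records)
    all_columns = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                all_columns.append(key)
    batch_size = max(
        1, min(max_batch_size, SQLITE_MAX_VARS // max(1, len(all_columns)))
    )
    return all_columns, batch_size


columns, batch_size = plan_insert(
    [{"c0": "first"}] + [{"c0": 1, "c1": 2, "c2": 3} for _ in range(99)]
)
print(len(columns), batch_size)  # 3 100
```

Because every batch is sized for the full column set, a sparse first record can no longer produce an oversized statement later in the stream.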


