issues: 688659182


id: 688659182
node_id: MDU6SXNzdWU2ODg2NTkxODI=
number: 145
title: Bug when first record contains fewer columns than subsequent records
user: 96218
state: closed
locked: 0
comments: 2
created_at: 2020-08-30T05:44:44Z
updated_at: 2020-09-08T23:21:23Z
closed_at: 2020-09-08T23:21:23Z
author_association: CONTRIBUTOR
repo: 140912432
type: issue

insert_all() selects the maximum batch size based on the number of fields in the first record. If the first record has fewer fields than subsequent records (and alter=True is passed), this can result in SQL statements with more than the maximum permitted number of host parameters. This situation is perhaps unlikely to occur, but could happen if the first record had, say, 10 columns, such that batch_size (based on SQLITE_MAX_VARIABLE_NUMBER = 999) would be 99. If the next 98 rows had 11 columns, the resulting SQL statement for the first batch would have 10 * 1 + 11 * 98 = 1088 host parameters (and subsequent batches, if the data were consistent from thereon out, would have 99 * 11 = 1089).
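The arithmetic above can be checked directly, assuming the pre-3.32.0 default limit of 999:

```python
SQLITE_MAX_VARIABLE_NUMBER = 999  # default before SQLite 3.32.0

first_record_columns = 10
# batch size derived from the first record only
batch_size = SQLITE_MAX_VARIABLE_NUMBER // first_record_columns  # 999 // 10 = 99

# first batch: one 10-column record plus 98 records with 11 columns
first_batch_params = 10 * 1 + 11 * 98
# subsequent batches: 99 records of 11 columns each
later_batch_params = 11 * 99

print(batch_size, first_batch_params, later_batch_params)  # 99 1088 1089
```

Both values exceed the 999-parameter limit even though the batch size was chosen to stay under it.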

I suspect that this bug is masked somewhat by the fact that while:

SQLITE_MAX_VARIABLE_NUMBER ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

it is commonly increased at compile time. Debian-based systems, for example, seem to ship with SQLite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe the same is true of homebrew installations.
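One way to discover the effective limit on a given system is to probe it empirically with the standard-library sqlite3 module. This is a sketch (the helper name is mine, not part of any library): grow the parameter count until a statement fails, then binary-search for the exact boundary.

```python
import sqlite3


def max_variable_number():
    """Find the largest number of host parameters a single statement accepts."""
    conn = sqlite3.connect(":memory:")
    lo, hi = 1, 2
    # grow geometrically until a statement with `hi` parameters fails
    while True:
        try:
            conn.execute("SELECT " + ",".join("?" * hi), list(range(hi)))
            lo, hi = hi, hi * 2
        except sqlite3.OperationalError:  # "too many SQL variables"
            break
    # binary search for the exact limit between lo (works) and hi (fails)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        try:
            conn.execute("SELECT " + ",".join("?" * mid), list(range(mid)))
            lo = mid
        except sqlite3.OperationalError:
            hi = mid
    return lo


print(max_variable_number())  # e.g. 999, 32766 or 250,000 depending on the build
```

On Python 3.11+ the same value is available without probing via `conn.getlimit(sqlite3.SQLITE_LIMIT_VARIABLE_NUMBER)`.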

A test for this issue might look like this:

def test_columns_not_in_first_record_should_not_cause_batch_to_be_too_large(fresh_db):
    # sqlite on homebrew and Debian/Ubuntu etc. is typically compiled with
    # SQLITE_MAX_VARIABLE_NUMBER set to 250,000, so we need to exceed this value
    # to trigger the error on these systems.
    THRESHOLD = 250000
    extra_columns = 1 + (THRESHOLD - 1) // 99
    records = [
        {"c0": "first record"},  # one column in first record -> batch_size = 100
        # fill out the batch with 99 records with enough columns to exceed THRESHOLD
        *[
            {"c{}".format(i): j for i in range(extra_columns)}
            for j in range(99)
        ],
    ]
    # with a batch size computed from the first record alone, this raises
    # sqlite3.OperationalError: too many SQL variables
    fresh_db["too_many_columns"].insert_all(records, alter=True)

The best solution, I think, is simply to process all the records when determining columns, column types, and the batch size. In my tests this doesn't seem to be particularly costly at all, and cuts out a lot of complications (including obviating my implementation of #139 at #142). I'll raise a PR for your consideration.
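A minimal sketch of that approach (the helper is hypothetical, not the actual PR): make a single pass over all the records to collect the union of columns, then derive the batch size from that full column count rather than from the first record alone.

```python
SQLITE_MAX_VARS = 999  # conservative lower bound across SQLite builds


def plan_insert(records, max_batch_size=100):
    """Collect the union of columns across all records, then size batches so
    no single statement exceeds SQLITE_MAX_VARS host parameters."""
    records = list(records)
    all_columns = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                all_columns.append(key)
    batch_size = max(
        1, min(max_batch_size, SQLITE_MAX_VARS // max(1, len(all_columns)))
    )
    return all_columns, batch_size


columns, batch_size = plan_insert(
    [{"c0": "first"}] + [{"c0": 1, "c1": 2, "c2": 3} for _ in range(99)]
)
print(len(columns), batch_size)  # 3 100
```

Because every batch is sized for the full column set, a sparse first record can no longer produce an oversized statement later in the stream.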


