{"html_url": "https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778854808", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/227", "id": 778854808, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg1NDgwOA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T22:46:54Z", "updated_at": "2021-02-14T22:46:54Z", "author_association": "OWNER", "body": "Fix is released in 3.5.", "reactions": "{\"total_count\": 1, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 1, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807174161, "label": "Error reading csv files with large column data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778851721", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778851721, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg1MTcyMQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T22:23:46Z", "updated_at": "2021-02-14T22:23:46Z", "author_association": "OWNER", "body": "I called this `--no-headers` for consistency with the existing output option: https://github.com/simonw/sqlite-utils/blob/427dace184c7da57f4a04df07b1e84cdae3261e8/sqlite_utils/cli.py#L61-L64", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778849394", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778849394, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0OTM5NA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T22:06:53Z", "updated_at": "2021-02-14T22:06:53Z", "author_association": "OWNER", "body": "For the moment I think just adding `--no-header` - which causes column 
names \"unknown1,unknown2,...\" to be used - should be enough.\r\n\r\nUsers can import with that option, then use `sqlite-utils transform --rename` to rename them.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778844016", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/229", "id": 778844016, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0NDAxNg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:22:45Z", "updated_at": "2021-02-14T21:22:45Z", "author_association": "OWNER", "body": "I'm going to use this pattern from https://stackoverflow.com/a/15063941\r\n```python\r\nimport sys\r\nimport csv\r\nmaxInt = sys.maxsize\r\n\r\nwhile True:\r\n    # decrease the maxInt value by factor 10\r\n    # as long as the OverflowError occurs.\r\n\r\n    try:\r\n        csv.field_size_limit(maxInt)\r\n        break\r\n    except OverflowError:\r\n        maxInt = int(maxInt/10)\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807817197, "label": "Hitting `_csv.Error: field larger than field limit (131072)`"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843503", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/229", "id": 778843503, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0MzUwMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:18:51Z", "updated_at": "2021-02-14T21:18:51Z", "author_association": "OWNER", "body": "I want to set this to the maximum allowed limit, which seems to be surprisingly hard! 
That StackOverflow thread is full of ideas for that, many of them involving `ctypes`. I'm a bit loath to add a dependency on `ctypes` though - even though it's in the Python standard library I worry that it might not be available on some architectures.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807817197, "label": "Hitting `_csv.Error: field larger than field limit (131072)`"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843362", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/229", "id": 778843362, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0MzM2Mg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:17:53Z", "updated_at": "2021-02-14T21:17:53Z", "author_association": "OWNER", "body": "Same issue as #227.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807817197, "label": "Hitting `_csv.Error: field larger than field limit (131072)`"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811746", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778811746, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxMTc0Ng==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T17:39:30Z", "updated_at": "2021-02-14T21:16:54Z", "author_association": "OWNER", "body": "I'm going to detach this from the #131 column types idea.\r\n\r\nThe three things I need to handle here are:\r\n\r\n- The CSV file doesn't have a header row at all, so I need to specify what the column names should be\r\n- The CSV file DOES have a header row but I want to ignore it and use alternative column names\r\n- The CSV doesn't have a header row at all and 
I want to automatically use `unknown1,unknown2...` so I can start exploring it as quickly as possible.\r\n\r\nHere's a potential design that covers the first two:\r\n\r\n`--replace-header=\"foo,bar,baz\"` - ignore whatever is in the first row and pretend it was this instead\r\n`--add-header=\"foo,bar,baz\"` - add a first row with these details, to use as the header\r\n\r\nIt doesn't cover the \"give me unknown column names\" case though.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778843086", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778843086, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0MzA4Ng==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:15:43Z", "updated_at": "2021-02-14T21:15:43Z", "author_association": "OWNER", "body": "I'm not convinced the `.has_header()` rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383\r\n\r\n```python\r\n    def has_header(self, sample):\r\n        # Creates a dictionary of types of data in each column. If any\r\n        # column is of a single type (say, integers), *except* for the first\r\n        # row, then the first row is presumed to be labels. 
If the type\r\n        # can't be determined, it is assumed to be a string in which case\r\n        # the length of the string is the determining factor: if all of the\r\n        # rows except for the first are the same length, it's a header.\r\n        # Finally, a 'vote' is taken at the end for each column, adding or\r\n        # subtracting from the likelihood of the first row being a header.\r\n```\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778842982", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778842982, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0Mjk4Mg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:15:11Z", "updated_at": "2021-02-14T21:15:11Z", "author_association": "OWNER", "body": "Implementation tip: I have code that reads the first row and uses it as headers here: https://github.com/simonw/sqlite-utils/blob/8f042ae1fd323995d966a94e8e6df85cc843b938/sqlite_utils/cli.py#L689-L691\r\n\r\nSo if I want to use `unknown1,unknown2...` I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with `itertools.chain()` to replay the record we already read).\r\n\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778841704", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/227", "id": 778841704, "node_id": 
"MDEyOklzc3VlQ29tbWVudDc3ODg0MTcwNA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:05:20Z", "updated_at": "2021-02-14T21:05:20Z", "author_association": "OWNER", "body": "This has also been reported in #229.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807174161, "label": "Error reading csv files with large column data"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778841547", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/225", "id": 778841547, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0MTU0Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:04:13Z", "updated_at": "2021-02-14T21:04:13Z", "author_association": "OWNER", "body": "I added a test and fixed this in #234 - thanks for the fix.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 797159961, "label": "fix for problem in Table.insert_all on search for columns per chunk of rows"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/234#issuecomment-778841278", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/234", "id": 778841278, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODg0MTI3OA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T21:02:11Z", "updated_at": "2021-02-14T21:02:11Z", "author_association": "OWNER", "body": "I managed to replicate this in a test:\r\n```python\r\ndef test_insert_all_with_extra_columns_in_later_chunks(fresh_db):\r\n    chunk = [\r\n        {\"record\": \"Record 1\"},\r\n        {\"record\": \"Record 2\"},\r\n        {\"record\": \"Record 3\"},\r\n        {\"record\": \"Record 4\", \"extra\": 1},\r\n    ]\r\n    fresh_db[\"t\"].insert_all(chunk, batch_size=2, 
alter=True)\r\n    assert list(fresh_db[\"t\"].rows) == [\r\n        {\"record\": \"Record 1\", \"extra\": None},\r\n        {\"record\": \"Record 2\", \"extra\": None},\r\n        {\"record\": \"Record 3\", \"extra\": None},\r\n        {\"record\": \"Record 4\", \"extra\": 1},\r\n    ]\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808046597, "label": ".insert_all() fails if subsequent chunks contain additional columns"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778834504", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/225", "id": 778834504, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgzNDUwNA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T20:09:30Z", "updated_at": "2021-02-14T20:09:30Z", "author_association": "OWNER", "body": "Thanks for this. I'm going to try and get the test suite to run in Windows on GitHub Actions.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 797159961, "label": "fix for problem in Table.insert_all on search for columns per chunk of rows"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778829456", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/231", "id": 778829456, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyOTQ1Ng==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T19:37:52Z", "updated_at": "2021-02-14T19:37:52Z", "author_association": "OWNER", "body": "I'm going to add `limit` and `offset` to the following methods:\r\n\r\n- `rows_where()`\r\n- `search_sql()`\r\n- `search()`", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, 
\"eyes\": 0}", "issue": {"value": 808028757, "label": "limit=X, offset=Y parameters for more Python methods"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778828758", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/231", "id": 778828758, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyODc1OA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T19:33:14Z", "updated_at": "2021-02-14T19:33:14Z", "author_association": "OWNER", "body": "The `limit=` parameter is currently only available on the `.search()` method - it would make sense to add this to other methods as well.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808028757, "label": "limit=X, offset=Y parameters for more Python methods"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/pull/224#issuecomment-778828495", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/224", "id": 778828495, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyODQ5NQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T19:31:06Z", "updated_at": "2021-02-14T19:31:06Z", "author_association": "OWNER", "body": "I'm going to add an `offset=` parameter to support this case. 
Thanks for the suggestion!", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 792297010, "label": "Add fts offset docs."}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778827570", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778827570, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyNzU3MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T19:24:20Z", "updated_at": "2021-02-14T19:24:20Z", "author_association": "OWNER", "body": "Here's the implementation in Python: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L204-L225", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778824361", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778824361, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyNDM2MQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:59:22Z", "updated_at": "2021-02-14T18:59:22Z", "author_association": "OWNER", "body": "I think I've got it. 
I can use `io.BufferedReader()` to get an object I can run `.peek(2048)` on, then wrap THAT in `io.TextIOWrapper`:\r\n\r\n```python\r\nencoding = encoding or \"utf-8\"\r\nbuffered = io.BufferedReader(json_file, buffer_size=4096)\r\ndecoded = io.TextIOWrapper(buffered, encoding=encoding, line_buffering=True)\r\nif pk and len(pk) == 1:\r\n    pk = pk[0]\r\nif csv or tsv:\r\n    if sniff:\r\n        # Read first 2048 bytes and use that to detect\r\n        first_bytes = buffered.peek(2048)\r\n        print('first_bytes', first_bytes)\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778821403", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778821403, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyMTQwMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:38:16Z", "updated_at": "2021-02-14T18:38:16Z", "author_association": "OWNER", "body": "There are two code paths here that matter:\r\n\r\n- For a regular file, I can read the first 2048 bytes, then `.seek(0)` before continuing. That's easy.\r\n- `stdin` is harder. I need to read and buffer the first 2048 bytes, then pass an object to `csv.reader()` which will replay that chunk and then play the rest of stdin.\r\n\r\nI'm a bit stuck on the second one. 
Ideally I could use something like `itertools.chain()` but I can't find an alternative for file-like objects.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778818639", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778818639, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxODYzOQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:22:38Z", "updated_at": "2021-02-14T18:22:38Z", "author_association": "OWNER", "body": "Maybe I shouldn't be using `StreamReader` at all - https://www.python.org/dev/peps/pep-0400/ suggests that it should be deprecated in favour of `io.TextIOWrapper`. I'm using `StreamReader` due to this line: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L667-L668", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778817494", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778817494, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxNzQ5NA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:16:06Z", "updated_at": "2021-02-14T18:16:06Z", "author_association": "OWNER", "body": "Types involved:\r\n```\r\n(Pdb) type(json_file.raw)\r\n\r\n(Pdb) type(json_file)\r\n\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": 
{"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778816333", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778816333, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxNjMzMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:08:44Z", "updated_at": "2021-02-14T18:08:44Z", "author_association": "OWNER", "body": "No, you can't `.seek(0)` on stdin:\r\n```\r\n  File \"/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py\", line 678, in insert_upsert_implementation\r\n    json_file.raw.seek(0)\r\nOSError: [Errno 29] Illegal seek\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778815740", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778815740, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxNTc0MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:05:03Z", "updated_at": "2021-02-14T18:05:03Z", "author_association": "OWNER", "body": "The challenge here is how to read the first 2048 bytes and then reset the incoming file.\r\n\r\nThe Python docs example looks like this:\r\n\r\n```python\r\nwith open('example.csv', newline='') as csvfile:\r\n    dialect = csv.Sniffer().sniff(csvfile.read(1024))\r\n    csvfile.seek(0)\r\n    reader = csv.reader(csvfile, dialect)\r\n```\r\nHere's the relevant code in `sqlite-utils`: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L671-L679\r\n\r\nThe challenge is going to be having the `--sniff` option work with the progress bar. 
Here's how `file_progress()` works: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/utils.py#L106-L113\r\n\r\nIf `file.raw` is `stdin` can I do the equivalent of `csvfile.seek(0)` on it?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778812684", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778812684, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxMjY4NA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T17:45:16Z", "updated_at": "2021-02-14T17:45:16Z", "author_association": "OWNER", "body": "Running this could take any CSV (or TSV) file and automatically detect the delimiter. If no header row is detected it could add `unknown1,unknown2` headers:\r\n\r\n    sqlite-utils insert db.db data file.csv --sniff\r\n\r\n(Using `--sniff` would imply `--csv`)\r\n\r\nThis could be called `--sniffer` instead but I like `--sniff` better.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778812050", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778812050, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxMjA1MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T17:41:30Z", "updated_at": "2021-02-14T17:41:30Z", "author_association": "OWNER", "body": "I just spotted that `csv.Sniffer` in the Python standard library has a `.has_header(sample)` method which 
detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811934", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/228", "id": 778811934, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxMTkzNA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T17:40:48Z", "updated_at": "2021-02-14T17:40:48Z", "author_association": "OWNER", "body": "Another pattern that might be useful is to generate a header that is just \"unknown1,unknown2,unknown3\" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use `sqlite-utils transform --rename` to rename the columns.\r\n\r\nI needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 807437089, "label": "--no-headers option for CSV and TSV"}, "performed_via_github_app": null}