html_url,issue_url,id,node_id,user,user_label,created_at,updated_at,author_association,body,reactions,issue,issue_label,performed_via_github_app
https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778854808,https://api.github.com/repos/simonw/sqlite-utils/issues/227,778854808,MDEyOklzc3VlQ29tbWVudDc3ODg1NDgwOA==,9599,simonw,2021-02-14T22:46:54Z,2021-02-14T22:46:54Z,OWNER,Fix is released in 3.5.,"{""total_count"": 1, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 1, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807174161,Error reading csv files with large column data,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778851721,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778851721,MDEyOklzc3VlQ29tbWVudDc3ODg1MTcyMQ==,9599,simonw,2021-02-14T22:23:46Z,2021-02-14T22:23:46Z,OWNER,I called this `--no-headers` for consistency with the existing output option: https://github.com/simonw/sqlite-utils/blob/427dace184c7da57f4a04df07b1e84cdae3261e8/sqlite_utils/cli.py#L61-L64,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778849394,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778849394,MDEyOklzc3VlQ29tbWVudDc3ODg0OTM5NA==,9599,simonw,2021-02-14T22:06:53Z,2021-02-14T22:06:53Z,OWNER,"For the moment I think just adding `--no-header` - which causes column names ""unknown1,unknown2,..."" to be used - should be enough. Users can import with that option, then use `sqlite-utils transform --rename` to rename them.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,
https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778844016,https://api.github.com/repos/simonw/sqlite-utils/issues/229,778844016,MDEyOklzc3VlQ29tbWVudDc3ODg0NDAxNg==,9599,simonw,2021-02-14T21:22:45Z,2021-02-14T21:22:45Z,OWNER,"I'm going to use this pattern from https://stackoverflow.com/a/15063941

```python
import sys
import csv

maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10
    # as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt/10)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807817197,Hitting `_csv.Error: field larger than field limit (131072)`,
https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843503,https://api.github.com/repos/simonw/sqlite-utils/issues/229,778843503,MDEyOklzc3VlQ29tbWVudDc3ODg0MzUwMw==,9599,simonw,2021-02-14T21:18:51Z,2021-02-14T21:18:51Z,OWNER,"I want to set this to the maximum allowed limit, which seems to be surprisingly hard! That StackOverflow thread is full of ideas for that, many of them involving `ctypes`.
I'm a bit loathe to add a dependency on `ctypes` though - even though it's in the Python standard library I worry that it might not be available on some architectures.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807817197,Hitting `_csv.Error: field larger than field limit (131072)`,
https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843362,https://api.github.com/repos/simonw/sqlite-utils/issues/229,778843362,MDEyOklzc3VlQ29tbWVudDc3ODg0MzM2Mg==,9599,simonw,2021-02-14T21:17:53Z,2021-02-14T21:17:53Z,OWNER,Same issue as #227.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807817197,Hitting `_csv.Error: field larger than field limit (131072)`,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811746,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778811746,MDEyOklzc3VlQ29tbWVudDc3ODgxMTc0Ng==,9599,simonw,2021-02-14T17:39:30Z,2021-02-14T21:16:54Z,OWNER,"I'm going to detach this from the #131 column types idea. The three things I need to handle here are:

- The CSV file doesn't have a header row at all, so I need to specify what the column names should be
- The CSV file DOES have a header row but I want to ignore it and use alternative column names
- The CSV doesn't have a header row at all and I want to automatically use `unknown1,unknown2...` so I can start exploring it as quickly as possible.

Here's a potential design that covers the first two:

`--replace-header=""foo,bar,baz""` - ignore whatever is in the first row and pretend it was this instead

`--add-header=""foo,bar,baz""` - add a first row with these details, to use as the header

It doesn't cover the ""give me unknown column names"" case though.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778843086,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778843086,MDEyOklzc3VlQ29tbWVudDc3ODg0MzA4Ng==,9599,simonw,2021-02-14T21:15:43Z,2021-02-14T21:15:43Z,OWNER,"I'm not convinced the `.has_header()` rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383

```python
def has_header(self, sample):
    # Creates a dictionary of types of data in each column. If any
    # column is of a single type (say, integers), *except* for the first
    # row, then the first row is presumed to be labels. If the type
    # can't be determined, it is assumed to be a string in which case
    # the length of the string is the determining factor: if all of the
    # rows except for the first are the same length, it's a header.
    # Finally, a 'vote' is taken at the end for each column, adding or
    # subtracting from the likelihood of the first row being a header.
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778842982,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778842982,MDEyOklzc3VlQ29tbWVudDc3ODg0Mjk4Mg==,9599,simonw,2021-02-14T21:15:11Z,2021-02-14T21:15:11Z,OWNER,"Implementation tip: I have code that reads the first row and uses it as headers here: https://github.com/simonw/sqlite-utils/blob/8f042ae1fd323995d966a94e8e6df85cc843b938/sqlite_utils/cli.py#L689-L691

So If I want to use `unknown1,unknown2...` I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with `itertools.chain()` to replay the record we already read).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,
https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778841704,https://api.github.com/repos/simonw/sqlite-utils/issues/227,778841704,MDEyOklzc3VlQ29tbWVudDc3ODg0MTcwNA==,9599,simonw,2021-02-14T21:05:20Z,2021-02-14T21:05:20Z,OWNER,This has also been reported in #229.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807174161,Error reading csv files with large column data,
https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778841547,https://api.github.com/repos/simonw/sqlite-utils/issues/225,778841547,MDEyOklzc3VlQ29tbWVudDc3ODg0MTU0Nw==,9599,simonw,2021-02-14T21:04:13Z,2021-02-14T21:04:13Z,OWNER,I added a test and fixed this in #234 - thanks for the fix.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",797159961,fix for problem in Table.insert_all on search for columns per chunk of rows,
https://github.com/simonw/sqlite-utils/issues/234#issuecomment-778841278,https://api.github.com/repos/simonw/sqlite-utils/issues/234,778841278,MDEyOklzc3VlQ29tbWVudDc3ODg0MTI3OA==,9599,simonw,2021-02-14T21:02:11Z,2021-02-14T21:02:11Z,OWNER,"I managed to replicate this in a test:

```python
def test_insert_all_with_extra_columns_in_later_chunks(fresh_db):
    chunk = [
        {""record"": ""Record 1""},
        {""record"": ""Record 2""},
        {""record"": ""Record 3""},
        {""record"": ""Record 4"", ""extra"": 1},
    ]
    fresh_db[""t""].insert_all(chunk, batch_size=2, alter=True)
    assert list(fresh_db[""t""].rows) == [
        {""record"": ""Record 1"", ""extra"": None},
        {""record"": ""Record 2"", ""extra"": None},
        {""record"": ""Record 3"", ""extra"": None},
        {""record"": ""Record 4"", ""extra"": 1},
    ]
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808046597,.insert_all() fails if subsequent chunks contain additional columns,
https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778834504,https://api.github.com/repos/simonw/sqlite-utils/issues/225,778834504,MDEyOklzc3VlQ29tbWVudDc3ODgzNDUwNA==,9599,simonw,2021-02-14T20:09:30Z,2021-02-14T20:09:30Z,OWNER,"Thanks for this.
I'm going to try and get the test suite to run in Windows on GitHub Actions.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",797159961,fix for problem in Table.insert_all on search for columns per chunk of rows,
https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778829456,https://api.github.com/repos/simonw/sqlite-utils/issues/231,778829456,MDEyOklzc3VlQ29tbWVudDc3ODgyOTQ1Ng==,9599,simonw,2021-02-14T19:37:52Z,2021-02-14T19:37:52Z,OWNER,"I'm going to add `limit` and `offset` to the following methods:

- `rows_where()`
- `search_sql()`
- `search()`","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808028757,"limit=X, offset=Y parameters for more Python methods",
https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778828758,https://api.github.com/repos/simonw/sqlite-utils/issues/231,778828758,MDEyOklzc3VlQ29tbWVudDc3ODgyODc1OA==,9599,simonw,2021-02-14T19:33:14Z,2021-02-14T19:33:14Z,OWNER,The `limit=` parameter is currently only available on the `.search()` method - it would make sense to add this to other methods as well.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808028757,"limit=X, offset=Y parameters for more Python methods",
https://github.com/simonw/sqlite-utils/pull/224#issuecomment-778828495,https://api.github.com/repos/simonw/sqlite-utils/issues/224,778828495,MDEyOklzc3VlQ29tbWVudDc3ODgyODQ5NQ==,9599,simonw,2021-02-14T19:31:06Z,2021-02-14T19:31:06Z,OWNER,I'm going to add a `offset=` parameter to support this case. Thanks for the suggestion!,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",792297010,Add fts offset docs.,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778827570,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778827570,MDEyOklzc3VlQ29tbWVudDc3ODgyNzU3MA==,9599,simonw,2021-02-14T19:24:20Z,2021-02-14T19:24:20Z,OWNER,Here's the implementation in Python: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L204-L225,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778824361,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778824361,MDEyOklzc3VlQ29tbWVudDc3ODgyNDM2MQ==,9599,simonw,2021-02-14T18:59:22Z,2021-02-14T18:59:22Z,OWNER,"I think I've got it.
I can use `io.BufferedReader()` to get an object I can run `.peek(2048)` on, then wrap THAT in `io.TextIOWrapper`:

```python
encoding = encoding or ""utf-8""
buffered = io.BufferedReader(json_file, buffer_size=4096)
decoded = io.TextIOWrapper(buffered, encoding=encoding, line_buffering=True)
if pk and len(pk) == 1:
    pk = pk[0]
if csv or tsv:
    if sniff:
        # Read first 2048 bytes and use that to detect
        first_bytes = buffered.peek(2048)
        print('first_bytes', first_bytes)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778821403,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778821403,MDEyOklzc3VlQ29tbWVudDc3ODgyMTQwMw==,9599,simonw,2021-02-14T18:38:16Z,2021-02-14T18:38:16Z,OWNER,"There are two code paths here that matter:

- For a regular file, can read the first 2048 bytes, then `.seek(0)` before continuing. That's easy.
- `stdin` is harder. I need to read and buffer the first 2048 bytes, then pass an object to `csv.reader()` which will replay that chunk and then play the rest of stdin.

I'm a bit stuck on the second one. Ideally I could use something like `itertools.chain()` but I can't find an alternative for file-like objects.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778818639,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778818639,MDEyOklzc3VlQ29tbWVudDc3ODgxODYzOQ==,9599,simonw,2021-02-14T18:22:38Z,2021-02-14T18:22:38Z,OWNER,"Maybe I shouldn't be using `StreamReader` at all - https://www.python.org/dev/peps/pep-0400/ suggests that it should be deprecated in favour of `io.TextIOWrapper`.
I'm using `StreamReader` due to this line: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L667-L668","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778817494,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778817494,MDEyOklzc3VlQ29tbWVudDc3ODgxNzQ5NA==,9599,simonw,2021-02-14T18:16:06Z,2021-02-14T18:16:06Z,OWNER,"Types involved:

```
(Pdb) type(json_file.raw)
(Pdb) type(json_file)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778816333,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778816333,MDEyOklzc3VlQ29tbWVudDc3ODgxNjMzMw==,9599,simonw,2021-02-14T18:08:44Z,2021-02-14T18:08:44Z,OWNER,"No, you can't `.seek(0)` on stdin:

```
  File ""/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py"", line 678, in insert_upsert_implementation
    json_file.raw.seek(0)
OSError: [Errno 29] Illegal seek
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778815740,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778815740,MDEyOklzc3VlQ29tbWVudDc3ODgxNTc0MA==,9599,simonw,2021-02-14T18:05:03Z,2021-02-14T18:05:03Z,OWNER,"The challenge here is how to read the first 2048 bytes and then reset the incoming file. The Python docs example looks like this:

```python
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
```

Here's the relevant code in `sqlite-utils`: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L671-L679

The challenge is going to be having the `--sniff` option work with the progress bar. Here's how `file_progress()` works: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/utils.py#L106-L113

If `file.raw` is `stdin` can I do the equivalent of `csvfile.seek(0)` on it?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778812684,https://api.github.com/repos/simonw/sqlite-utils/issues/230,778812684,MDEyOklzc3VlQ29tbWVudDc3ODgxMjY4NA==,9599,simonw,2021-02-14T17:45:16Z,2021-02-14T17:45:16Z,OWNER,"Running this could take any CSV (or TSV) file and automatically detect the delimiter.
If no header row is detected it could add `unknown1,unknown2` headers:

    sqlite-utils insert db.db data file.csv --sniff

(Using `--sniff` would imply `--csv`)

This could be called `--sniffer` instead but I like `--sniff` better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",808008305,--sniff option for sniffing delimiters,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778812050,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778812050,MDEyOklzc3VlQ29tbWVudDc3ODgxMjA1MA==,9599,simonw,2021-02-14T17:41:30Z,2021-02-14T17:41:30Z,OWNER,"I just spotted that `csv.Sniffer` in the Python standard library has a `.has_header(sample)` method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811934,https://api.github.com/repos/simonw/sqlite-utils/issues/228,778811934,MDEyOklzc3VlQ29tbWVudDc3ODgxMTkzNA==,9599,simonw,2021-02-14T17:40:48Z,2021-02-14T17:40:48Z,OWNER,"Another pattern that might be useful is to generate a header that is just ""unknown1,unknown2,unknown3"" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use `sqlite-utils transform --rename` to rename the columns. I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",807437089,--no-headers option for CSV and TSV,