html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | issue | performed_via_github_app |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778854808 | https://api.github.com/repos/simonw/sqlite-utils/issues/227 | 778854808 | MDEyOklzc3VlQ29tbWVudDc3ODg1NDgwOA== | 9599 | 2021-02-14T22:46:54Z | 2021-02-14T22:46:54Z | OWNER | Fix is released in 3.5. | { "total_count": 1, "+1": 0, "-1": 0, "laugh": 0, "hooray": 1, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807174161 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778851721 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778851721 | MDEyOklzc3VlQ29tbWVudDc3ODg1MTcyMQ== | 9599 | 2021-02-14T22:23:46Z | 2021-02-14T22:23:46Z | OWNER | I called this `--no-headers` for consistency with the existing output option: https://github.com/simonw/sqlite-utils/blob/427dace184c7da57f4a04df07b1e84cdae3261e8/sqlite_utils/cli.py#L61-L64 | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778849394 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778849394 | MDEyOklzc3VlQ29tbWVudDc3ODg0OTM5NA== | 9599 | 2021-02-14T22:06:53Z | 2021-02-14T22:06:53Z | OWNER | For the moment I think just adding `--no-header` - which causes column names "unknown1,unknown2,..." to be used - should be enough. Users can import with that option, then use `sqlite-utils transform --rename` to rename them. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 | |
https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778844016 | https://api.github.com/repos/simonw/sqlite-utils/issues/229 | 778844016 | MDEyOklzc3VlQ29tbWVudDc3ODg0NDAxNg== | 9599 | 2021-02-14T21:22:45Z | 2021-02-14T21:22:45Z | OWNER | I'm going to use this pattern from https://stackoverflow.com/a/15063941 ```python import sys import csv maxInt = sys.maxsize while True: # decrease the maxInt value by factor 10 # as long as the OverflowError occurs. try: csv.field_size_limit(maxInt) break except OverflowError: maxInt = int(maxInt/10) ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807817197 | |
https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843503 | https://api.github.com/repos/simonw/sqlite-utils/issues/229 | 778843503 | MDEyOklzc3VlQ29tbWVudDc3ODg0MzUwMw== | 9599 | 2021-02-14T21:18:51Z | 2021-02-14T21:18:51Z | OWNER | I want to set this to the maximum allowed limit, which seems to be surprisingly hard! That StackOverflow thread is full of ideas for that, many of them involving `ctypes`. I'm a bit loath to add a dependency on `ctypes` though - even though it's in the Python standard library I worry that it might not be available on some architectures. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807817197 | |
https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843362 | https://api.github.com/repos/simonw/sqlite-utils/issues/229 | 778843362 | MDEyOklzc3VlQ29tbWVudDc3ODg0MzM2Mg== | 9599 | 2021-02-14T21:17:53Z | 2021-02-14T21:17:53Z | OWNER | Same issue as #227. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807817197 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811746 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778811746 | MDEyOklzc3VlQ29tbWVudDc3ODgxMTc0Ng== | 9599 | 2021-02-14T17:39:30Z | 2021-02-14T21:16:54Z | OWNER | I'm going to detach this from the #131 column types idea. The three things I need to handle here are: - The CSV file doesn't have a header row at all, so I need to specify what the column names should be - The CSV file DOES have a header row but I want to ignore it and use alternative column names - The CSV doesn't have a header row at all and I want to automatically use `unknown1,unknown2...` so I can start exploring it as quickly as possible. Here's a potential design that covers the first two: `--replace-header="foo,bar,baz"` - ignore whatever is in the first row and pretend it was this instead `--add-header="foo,bar,baz"` - add a first row with these details, to use as the header It doesn't cover the "give me unknown column names" case though. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778843086 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778843086 | MDEyOklzc3VlQ29tbWVudDc3ODg0MzA4Ng== | 9599 | 2021-02-14T21:15:43Z | 2021-02-14T21:15:43Z | OWNER | I'm not convinced the `.has_header()` rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383 ```python def has_header(self, sample): # Creates a dictionary of types of data in each column. If any # column is of a single type (say, integers), *except* for the first # row, then the first row is presumed to be labels. If the type # can't be determined, it is assumed to be a string in which case # the length of the string is the determining factor: if all of the # rows except for the first are the same length, it's a header. # Finally, a 'vote' is taken at the end for each column, adding or # subtracting from the likelihood of the first row being a header. ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778842982 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778842982 | MDEyOklzc3VlQ29tbWVudDc3ODg0Mjk4Mg== | 9599 | 2021-02-14T21:15:11Z | 2021-02-14T21:15:11Z | OWNER | Implementation tip: I have code that reads the first row and uses it as headers here: https://github.com/simonw/sqlite-utils/blob/8f042ae1fd323995d966a94e8e6df85cc843b938/sqlite_utils/cli.py#L689-L691 So if I want to use `unknown1,unknown2...` I can do that by reading the first row, counting the number of columns, generating headers based on that range and then continuing to build that generator (maybe with `itertools.chain()` to replay the record we already read). | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 | |
https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778841704 | https://api.github.com/repos/simonw/sqlite-utils/issues/227 | 778841704 | MDEyOklzc3VlQ29tbWVudDc3ODg0MTcwNA== | 9599 | 2021-02-14T21:05:20Z | 2021-02-14T21:05:20Z | OWNER | This has also been reported in #229. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807174161 | |
https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778841547 | https://api.github.com/repos/simonw/sqlite-utils/issues/225 | 778841547 | MDEyOklzc3VlQ29tbWVudDc3ODg0MTU0Nw== | 9599 | 2021-02-14T21:04:13Z | 2021-02-14T21:04:13Z | OWNER | I added a test and fixed this in #234 - thanks for the fix. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
797159961 | |
https://github.com/simonw/sqlite-utils/issues/234#issuecomment-778841278 | https://api.github.com/repos/simonw/sqlite-utils/issues/234 | 778841278 | MDEyOklzc3VlQ29tbWVudDc3ODg0MTI3OA== | 9599 | 2021-02-14T21:02:11Z | 2021-02-14T21:02:11Z | OWNER | I managed to replicate this in a test: ```python def test_insert_all_with_extra_columns_in_later_chunks(fresh_db): chunk = [ {"record": "Record 1"}, {"record": "Record 2"}, {"record": "Record 3"}, {"record": "Record 4", "extra": 1}, ] fresh_db["t"].insert_all(chunk, batch_size=2, alter=True) assert list(fresh_db["t"].rows) == [ {"record": "Record 1", "extra": None}, {"record": "Record 2", "extra": None}, {"record": "Record 3", "extra": None}, {"record": "Record 4", "extra": 1}, ] ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808046597 | |
https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778834504 | https://api.github.com/repos/simonw/sqlite-utils/issues/225 | 778834504 | MDEyOklzc3VlQ29tbWVudDc3ODgzNDUwNA== | 9599 | 2021-02-14T20:09:30Z | 2021-02-14T20:09:30Z | OWNER | Thanks for this. I'm going to try and get the test suite to run in Windows on GitHub Actions. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
797159961 | |
https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778829456 | https://api.github.com/repos/simonw/sqlite-utils/issues/231 | 778829456 | MDEyOklzc3VlQ29tbWVudDc3ODgyOTQ1Ng== | 9599 | 2021-02-14T19:37:52Z | 2021-02-14T19:37:52Z | OWNER | I'm going to add `limit` and `offset` to the following methods: - `rows_where()` - `search_sql()` - `search()` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808028757 | |
https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778828758 | https://api.github.com/repos/simonw/sqlite-utils/issues/231 | 778828758 | MDEyOklzc3VlQ29tbWVudDc3ODgyODc1OA== | 9599 | 2021-02-14T19:33:14Z | 2021-02-14T19:33:14Z | OWNER | The `limit=` parameter is currently only available on the `.search()` method - it would make sense to add this to other methods as well. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808028757 | |
https://github.com/simonw/sqlite-utils/pull/224#issuecomment-778828495 | https://api.github.com/repos/simonw/sqlite-utils/issues/224 | 778828495 | MDEyOklzc3VlQ29tbWVudDc3ODgyODQ5NQ== | 9599 | 2021-02-14T19:31:06Z | 2021-02-14T19:31:06Z | OWNER | I'm going to add an `offset=` parameter to support this case. Thanks for the suggestion! | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
792297010 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778827570 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778827570 | MDEyOklzc3VlQ29tbWVudDc3ODgyNzU3MA== | 9599 | 2021-02-14T19:24:20Z | 2021-02-14T19:24:20Z | OWNER | Here's the implementation in Python: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L204-L225 | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778824361 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778824361 | MDEyOklzc3VlQ29tbWVudDc3ODgyNDM2MQ== | 9599 | 2021-02-14T18:59:22Z | 2021-02-14T18:59:22Z | OWNER | I think I've got it. I can use `io.BufferedReader()` to get an object I can run `.peek(2048)` on, then wrap THAT in `io.TextIOWrapper`: ```python encoding = encoding or "utf-8" buffered = io.BufferedReader(json_file, buffer_size=4096) decoded = io.TextIOWrapper(buffered, encoding=encoding, line_buffering=True) if pk and len(pk) == 1: pk = pk[0] if csv or tsv: if sniff: # Read first 2048 bytes and use that to detect first_bytes = buffered.peek(2048) print('first_bytes', first_bytes) ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778821403 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778821403 | MDEyOklzc3VlQ29tbWVudDc3ODgyMTQwMw== | 9599 | 2021-02-14T18:38:16Z | 2021-02-14T18:38:16Z | OWNER | There are two code paths here that matter: - For a regular file, I can read the first 2048 bytes, then `.seek(0)` before continuing. That's easy. - `stdin` is harder. I need to read and buffer the first 2048 bytes, then pass an object to `csv.reader()` which will replay that chunk and then play the rest of stdin. I'm a bit stuck on the second one. Ideally I could use something like `itertools.chain()` but I can't find an alternative for file-like objects. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778818639 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778818639 | MDEyOklzc3VlQ29tbWVudDc3ODgxODYzOQ== | 9599 | 2021-02-14T18:22:38Z | 2021-02-14T18:22:38Z | OWNER | Maybe I shouldn't be using `StreamReader` at all - https://www.python.org/dev/peps/pep-0400/ suggests that it should be deprecated in favour of `io.TextIOWrapper`. I'm using `StreamReader` due to this line: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L667-L668 | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778817494 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778817494 | MDEyOklzc3VlQ29tbWVudDc3ODgxNzQ5NA== | 9599 | 2021-02-14T18:16:06Z | 2021-02-14T18:16:06Z | OWNER | Types involved: ``` (Pdb) type(json_file.raw) <class '_io.FileIO'> (Pdb) type(json_file) <class 'encodings.utf_8.StreamReader'> ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778816333 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778816333 | MDEyOklzc3VlQ29tbWVudDc3ODgxNjMzMw== | 9599 | 2021-02-14T18:08:44Z | 2021-02-14T18:08:44Z | OWNER | No, you can't `.seek(0)` on stdin: ``` File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py", line 678, in insert_upsert_implementation json_file.raw.seek(0) OSError: [Errno 29] Illegal seek ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778815740 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778815740 | MDEyOklzc3VlQ29tbWVudDc3ODgxNTc0MA== | 9599 | 2021-02-14T18:05:03Z | 2021-02-14T18:05:03Z | OWNER | The challenge here is how to read the first 2048 bytes and then reset the incoming file. The Python docs example looks like this: ```python with open('example.csv', newline='') as csvfile: dialect = csv.Sniffer().sniff(csvfile.read(1024)) csvfile.seek(0) reader = csv.reader(csvfile, dialect) ``` Here's the relevant code in `sqlite-utils`: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L671-L679 The challenge is going to be having the `--sniff` option work with the progress bar. Here's how `file_progress()` works: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/utils.py#L106-L113 If `file.raw` is `stdin` can I do the equivalent of `csvfile.seek(0)` on it? | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778812684 | https://api.github.com/repos/simonw/sqlite-utils/issues/230 | 778812684 | MDEyOklzc3VlQ29tbWVudDc3ODgxMjY4NA== | 9599 | 2021-02-14T17:45:16Z | 2021-02-14T17:45:16Z | OWNER | Running this could take any CSV (or TSV) file and automatically detect the delimiter. If no header row is detected it could add `unknown1,unknown2` headers: sqlite-utils insert db.db data file.csv --sniff (Using `--sniff` would imply `--csv`) This could be called `--sniffer` instead but I like `--sniff` better. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
808008305 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778812050 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778812050 | MDEyOklzc3VlQ29tbWVudDc3ODgxMjA1MA== | 9599 | 2021-02-14T17:41:30Z | 2021-02-14T17:41:30Z | OWNER | I just spotted that `csv.Sniffer` in the Python standard library has a `.has_header(sample)` method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 | |
https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811934 | https://api.github.com/repos/simonw/sqlite-utils/issues/228 | 778811934 | MDEyOklzc3VlQ29tbWVudDc3ODgxMTkzNA== | 9599 | 2021-02-14T17:40:48Z | 2021-02-14T17:40:48Z | OWNER | Another pattern that might be useful is to generate a header that is just "unknown1,unknown2,unknown3" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use `sqlite-utils transform --rename` to rename the columns. I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
807437089 |