{"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778827570", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778827570, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyNzU3MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T19:24:20Z", "updated_at": "2021-02-14T19:24:20Z", "author_association": "OWNER", "body": "Here's the implementation in Python: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L204-L225", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778824361", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778824361, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyNDM2MQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:59:22Z", "updated_at": "2021-02-14T18:59:22Z", "author_association": "OWNER", "body": "I think I've got it. 
I can use `io.BufferedReader()` to get an object I can run `.peek(2048)` on, then wrap THAT in `io.TextIOWrapper`:\r\n\r\n```python\r\n    encoding = encoding or \"utf-8\"\r\n    buffered = io.BufferedReader(json_file, buffer_size=4096)\r\n    decoded = io.TextIOWrapper(buffered, encoding=encoding, line_buffering=True)\r\n    if pk and len(pk) == 1:\r\n        pk = pk[0]\r\n    if csv or tsv:\r\n        if sniff:\r\n            # Read first 2048 bytes and use that to detect\r\n            first_bytes = buffered.peek(2048)\r\n            print('first_bytes', first_bytes)\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778821403", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778821403, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgyMTQwMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:38:16Z", "updated_at": "2021-02-14T18:38:16Z", "author_association": "OWNER", "body": "There are two code paths here that matter:\r\n\r\n- For a regular file, can read the first 2048 bytes, then `.seek(0)` before continuing. That's easy.\r\n- `stdin` is harder. I need to read and buffer the first 2048 bytes, then pass an object to `csv.reader()` which will replay that chunk and then play the rest of stdin.\r\n\r\nI'm a bit stuck on the second one. 
Ideally I could use something like `itertools.chain()` but I can't find an alternative for file-like objects.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778818639", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778818639, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxODYzOQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:22:38Z", "updated_at": "2021-02-14T18:22:38Z", "author_association": "OWNER", "body": "Maybe I shouldn't be using `StreamReader` at all - https://www.python.org/dev/peps/pep-0400/ suggests that it should be deprecated in favour of `io.TextIOWrapper`. I'm using `StreamReader` due to this line: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L667-L668", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778817494", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778817494, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxNzQ5NA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:16:06Z", "updated_at": "2021-02-14T18:16:06Z", "author_association": "OWNER", "body": "Types involved:\r\n```\r\n(Pdb) type(json_file.raw)\r\n\r\n(Pdb) type(json_file)\r\n\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": 
{"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778816333", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778816333, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxNjMzMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:08:44Z", "updated_at": "2021-02-14T18:08:44Z", "author_association": "OWNER", "body": "No, you can't `.seek(0)` on stdin:\r\n```\r\n  File \"/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py\", line 678, in insert_upsert_implementation\r\n    json_file.raw.seek(0)\r\nOSError: [Errno 29] Illegal seek\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778815740", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778815740, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxNTc0MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T18:05:03Z", "updated_at": "2021-02-14T18:05:03Z", "author_association": "OWNER", "body": "The challenge here is how to read the first 2048 bytes and then reset the incoming file.\r\n\r\nThe Python docs example looks like this:\r\n\r\n```python\r\nwith open('example.csv', newline='') as csvfile:\r\n    dialect = csv.Sniffer().sniff(csvfile.read(1024))\r\n    csvfile.seek(0)\r\n    reader = csv.reader(csvfile, dialect)\r\n```\r\nHere's the relevant code in `sqlite-utils`: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L671-L679\r\n\r\nThe challenge is going to be having the `--sniff` option work with the progress bar. 
Here's how `file_progress()` works: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/utils.py#L106-L113\r\n\r\nIf `file.raw` is `stdin` can I do the equivalent of `csvfile.seek(0)` on it?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778812684", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/230", "id": 778812684, "node_id": "MDEyOklzc3VlQ29tbWVudDc3ODgxMjY4NA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2021-02-14T17:45:16Z", "updated_at": "2021-02-14T17:45:16Z", "author_association": "OWNER", "body": "Running this could take any CSV (or TSV) file and automatically detect the delimiter. If no header row is detected it could add `unknown1,unknown2` headers:\r\n\r\n    sqlite-utils insert db.db data file.csv --sniff\r\n\r\n(Using `--sniff` would imply `--csv`)\r\n\r\nThis could be called `--sniffer` instead but I like `--sniff` better.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 808008305, "label": "--sniff option for sniffing delimiters"}, "performed_via_github_app": null}