{"id": 1839344979, "node_id": "I_kwDOCGYnMM5toi1T", "number": 582, "title": "Handling CSV/file input that contains NUL bytes", "user": {"value": 1448859, "label": "betatim"}, "state": "open", "locked": 0, "assignee": null, "milestone": null, "comments": 0, "created_at": "2023-08-07T12:24:14Z", "updated_at": "2023-08-07T12:24:14Z", "closed_at": null, "author_association": "NONE", "pull_request": null, "body": "I was using sqlite-utils to create a DB from a CSV and it turns out the CSV contains a NUL byte.\r\n\r\nWhen the processing reaches the line that contains the NUL an exception is raised.\r\n\r\nI'm wondering if there is something that can be done in `sqlite-utils` to say \"skip lines with encoding errors\" or some such. I think it isn't super straightforward though as the exception comes from inside the `csv` module that does all the parsing.\r\n\r\nConcretely the file is the `KernelVersions.csv` from https://www.kaggle.com/datasets/kaggle/meta-kaggle\r\n\r\nThis is the command and output:\r\n```\r\n$ sqlite-utils insert --csv kaggle.db kaggle KernelVersions.csv\r\n [------------------------------------] 0%\r\n [#####################---------------] 60% 00:04:24Traceback (most recent call last):\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/bin/sqlite-utils\", line 10, in \r\n sys.exit(cli())\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py\", line 1128, in __call__\r\n return self.main(*args, **kwargs)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py\", line 1053, in main\r\n rv = self.invoke(ctx)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py\", line 1659, in invoke\r\n return _process_result(sub_ctx.command.invoke(sub_ctx))\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py\", line 1395, in invoke\r\n return ctx.invoke(self.callback, **ctx.params)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/click/core.py\", line 754, in invoke\r\n return __callback(*args, **kwargs)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py\", line 1223, in insert\r\n insert_upsert_implementation(\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py\", line 1085, in insert_upsert_implementation\r\n db[table].insert_all(\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/db.py\", line 3198, in insert_all\r\n chunk = list(chunk)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/db.py\", line 3742, in fix_square_braces\r\n for record in records:\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py\", line 1071, in \r\n docs = (decode_base64_values(doc) for doc in docs)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py\", line 1068, in \r\n docs = (verify_is_dict(doc) for doc in docs)\r\n File \"/home/foobar/miniconda/envs/meta-kaggle/lib/python3.10/site-packages/sqlite_utils/cli.py\", line 1003, in \r\n docs = (dict(zip(headers, row)) for row in reader)\r\n_csv.Error: line contains NUL\r\n```", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": 
\"https://api.github.com/repos/simonw/sqlite-utils/issues/582/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": null} {"id": 1257724585, "node_id": "I_kwDOCGYnMM5K91qp", "number": 441, "title": "Combining `rows_where()` and `search()` to limit which rows are searched", "user": {"value": 1448859, "label": "betatim"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 4, "created_at": "2022-06-02T06:01:55Z", "updated_at": "2022-06-14T21:57:57Z", "closed_at": "2022-06-14T21:54:38Z", "author_association": "NONE", "pull_request": null, "body": "What is the right way to limit a full text search query to some rows of a table?\r\n\r\nFor example, I have a table that contains the following columns: `title`, `content`, `owner` (each row represents a document). The `owner` column is a username. It feels right to store all documents in one table, instead of having one table per owner. In particular because I'd like to full text search all documents, only documents owned by one user and documents owned by a set of users.\r\n\r\nI tried to combine `.rows_where(\"owner = ?\", \"1234\")` and `.search()` from the `Table` class but I don't think that is meant to work. I discovered `.search_sql()` as a way to generate the FTS SQL statement. By hand I can edit it to add a `AND [original].[owner] = :owner` to the `where` clause. This seems to do what I want.\r\n\r\nMy two questions:\r\n1. is adding a `AND ...` to the `where` clause actually the right thing to do or should I be doing something else (my SQL skills are low)?\r\n2. is there a built-in to sqlite-utils way to achieve this?\r\n\r\nRight now I am thinking I will make my own version of `search_sql()` that generates a query that contains an additional `owner = :owner` for my particular use-case.\r\n\r\nBonus question: is this generally useful/something to add to sqlite-utils or too niche?", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/441/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": "completed"} {"id": 705057955, "node_id": "MDU6SXNzdWU3MDUwNTc5NTU=", "number": 969, "title": "Add --tar option to \"datasette publish heroku\"", "user": {"value": 1448859, "label": "betatim"}, "state": "closed", "locked": 0, "assignee": null, "milestone": {"value": 5971510, "label": "Datasette 0.50"}, "comments": 3, "created_at": "2020-09-20T06:54:53Z", "updated_at": "2020-10-08T23:55:59Z", "closed_at": "2020-10-08T23:30:59Z", "author_association": "NONE", "pull_request": null, "body": "This issue is about how best to pass additional options to tools used for publishing datasettes. A concrete example is wanting to pass the `--tar` flag to the heroku CLI tool. 
I think there are at least two options for doing this: documenting for each publishing tool how to set flags via env variables (if possible), or building a mechanism that lets users pass additional flags through datasette.\r\n\r\nWhen using `datasette publish heroku binder-launches.db --extra-options=\"--config facet_time_limit_ms:35000 --config sql_time_limit_ms:35000\" --name=binderlytics --install=datasette-vega` to publish https://binderlytics.herokuapp.com/ the following error happens:\r\n\r\n```\r\n \u203a Warning: heroku update available from 7.42.1 to 7.43.0.\r\n \u203a Warning: heroku update available from 7.42.1 to 7.43.0.\r\n \u203a Warning: heroku update available from 7.42.1 to 7.43.0.\r\nSetting WEB_CONCURRENCY and restarting \u2b22 binderlytics... done, v13\r\nWEB_CONCURRENCY: 1\r\n \u203a Warning: heroku update available from 7.42.1 to 7.43.0.\r\n \u25b8 Couldn't detect GNU tar. Builds could fail due to decompression errors\r\n \u25b8 See https://devcenter.heroku.com/articles/platform-api-deploying-slugs#create-slug-archive\r\n \u25b8 Please install it, or specify the '--tar' option\r\n \u25b8 Falling back to node's built-in compressor\r\nbuffer.js:358\r\n throw new ERR_INVALID_OPT_VALUE.RangeError('size', size);\r\n ^\r\n\r\nRangeError [ERR_INVALID_OPT_VALUE]: The value \"3303763968\" is invalid for option \"size\"\r\n at Function.alloc (buffer.js:367:3)\r\n at new Buffer (buffer.js:281:19)\r\n at Readable. (/Users/thead/.local/share/heroku/node_modules/archiver-utils/index.js:39:15)\r\n at Readable.emit (events.js:322:22)\r\n at endReadableNT (/Users/thead/.local/share/heroku/node_modules/readable-stream/lib/_stream_readable.js:1010:12)\r\n at processTicksAndRejections (internal/process/task_queues.js:84:21) {\r\n code: 'ERR_INVALID_OPT_VALUE'\r\n}\r\n```\r\n\r\nAfter installing GNU tar with `brew install gnu-tar` and modifying `datasette/publish/heroku.py` to include the `--tar=/path/to/gnu-tar` option, publishing works.\r\n\r\nI think the problem occurs once your heroku slug reaches a certain size. At least when I add only a few hundred entries to the datasette the error does not occur.\r\n\r\ndatasette version 0.49.1\r\nOSX 10.14.6 (18G103)", "repo": {"value": 107914493, "label": "datasette"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/datasette/issues/969/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": "completed"} {"id": 593751293, "node_id": "MDU6SXNzdWU1OTM3NTEyOTM=", "number": 97, "title": "Adding a \"recreate\" flag to the `Database` constructor", "user": {"value": 1448859, "label": "betatim"}, "state": "closed", "locked": 0, "assignee": null, "milestone": null, "comments": 4, "created_at": "2020-04-04T05:41:10Z", "updated_at": "2020-04-15T14:29:31Z", "closed_at": "2020-04-13T03:52:29Z", "author_association": "NONE", "pull_request": null, "body": "I have a [script](https://github.com/betatim/binder-datasette/blob/master/create-db.ipynb) that imports data into a sqlite DB. When I re-run that script I'd like to remove the existing sqlite DB, instead of adding to it. The pragmatic answer is to add the check and file deletion to my script.\r\n\r\nHowever I thought it would be easy and useful for others to add a `recreate=True` flag to `db = sqlite_utils.Database(\"binder-launches.db\")`. 
After taking a look at the code for it I am not so sure any more. This is because the connection string could be a URL (or \"connection string\") like `\"file:///tmp/foo.db\"`. I don't know what the equivalent of `os.path.exists()` is for a connection string or how to detect that something is a connection string and raise an error \"can't use recreate=True and conn_string at the same time\".\r\n\r\nDoes anyone have an idea/suggestion where to start investigating?", "repo": {"value": 140912432, "label": "sqlite-utils"}, "type": "issue", "active_lock_reason": null, "performed_via_github_app": null, "reactions": "{\"url\": \"https://api.github.com/repos/simonw/sqlite-utils/issues/97/reactions\", \"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "draft": null, "state_reason": "completed"}
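For reference, a minimal sketch of the "pragmatic answer" mentioned in issue #97 (checking for and deleting the existing database file from the calling script before rebuilding it), assuming a plain file path rather than a `file://...` connection string; the table name and records below are purely illustrative, not taken from the issue:

```python
import os
import sqlite_utils

DB_PATH = "binder-launches.db"  # plain file path, not a connection string

# Delete any previous database so the import starts from scratch,
# mirroring what a recreate=True flag on the Database constructor would do.
if os.path.exists(DB_PATH):
    os.remove(DB_PATH)

db = sqlite_utils.Database(DB_PATH)
# Hypothetical example data, just to show the rebuilt database being populated.
db["launches"].insert_all([{"id": 1, "repo": "example/repo"}], pk="id")
```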