github
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | issue | performed_via_github_app |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/simonw/datasette/issues/1439#issuecomment-900699670 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900699670 | IC_kwDOBm6k_c41r5YW | 9599 | 2021-08-17T23:34:23Z | 2021-08-17T23:34:23Z | OWNER | The challenge comes down to telling the difference between the following: - `/db/table` - an HTML table page - `/db/table.csv` - the CSV version of `/db/table` - `/db/table.csv` - no this one is actually a database table called `table.csv` - `/db/table.csv.csv` - the CSV version of `/db/table.csv` - `/db/table.csv.csv.csv` and so on... | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-900705226 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900705226 | IC_kwDOBm6k_c41r6vK | 9599 | 2021-08-17T23:50:32Z | 2021-08-17T23:50:47Z | OWNER | An alternative solution would be to use some form of escaping for the characters that form the name of the table. The obvious way to do this would be URL-encoding - but it doesn't hold for `.` characters. The hex for that is `%2E` but watch what happens with that in a URL: ``` # Against Cloud Run: curl -s 'https://datasette.io/-/asgi-scope/foo/bar%2Fbaz%2E' | rg path 'path': '/-/asgi-scope/foo/bar/baz.', 'raw_path': b'/-/asgi-scope/foo/bar%2Fbaz.', 'root_path': '', # Against Vercel: curl -s 'https://til.simonwillison.net/-/asgi-scope/foo/bar%2Fbaz%2E' | rg path 'path': '/-/asgi-scope/foo/bar%2Fbaz%2E', 'raw_path': b'/-/asgi-scope/foo/bar%2Fbaz%2E', 'root_path': '', ``` Surprisingly in this case Vercel DOES keep it intact, but Cloud Run does not. It's still no good though: I need a solution that works on Vercel, Cloud Run and every other potential hosting provider too. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-900709703 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900709703 | IC_kwDOBm6k_c41r71H | 9599 | 2021-08-18T00:03:09Z | 2021-08-18T00:03:09Z | OWNER | But... what if I invent my own escaping scheme? I actually did this once before, in https://github.com/simonw/datasette/commit/9fdb47ca952b93b7b60adddb965ea6642b1ff523 - while I was working on porting Datasette to ASGI in https://github.com/simonw/datasette/issues/272#issuecomment-494192779 because ASGI didn't yet have the `raw_path` mechanism. I could bring that back - it looked like this: ``` "table/and/slashes" => "tableU+002FandU+002Fslashes" "~table" => "U+007Etable" "+bobcats!" => "U+002Bbobcats!" "U+007Etable" => "UU+002B007Etable" ``` But I didn't particularly like it - it was quite verbose. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-900711967 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900711967 | IC_kwDOBm6k_c41r8Yf | 9599 | 2021-08-18T00:08:09Z | 2021-08-18T00:08:09Z | OWNER | Here's an alternative I just made up which I'm calling "dot dash" encoding: ```python def dot_dash_encode(s): return s.replace("-", "--").replace(".", "-.") def dot_dash_decode(s): return s.replace("-.", ".").replace("--", "-") ``` And some examples: ```python for example in ( "hello", "hello.csv", "hello-and-so-on.csv", "hello-.csv", "hello--and--so--on-.csv", "hello.csv.", "hello.csv.-", "hello.csv.--", ): print(example) print(dot_dash_encode(example)) print(example == dot_dash_decode(dot_dash_encode(example))) print() ``` Outputs: ``` hello hello True hello.csv hello-.csv True hello-and-so-on.csv hello--and--so--on-.csv True hello-.csv hello---.csv True hello--and--so--on-.csv hello----and----so----on---.csv True hello.csv. hello-.csv-. True hello.csv.- hello-.csv-.-- True hello.csv.-- hello-.csv-.---- True ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-900712981 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900712981 | IC_kwDOBm6k_c41r8oV | 9599 | 2021-08-18T00:09:59Z | 2021-08-18T00:12:32Z | OWNER | So given the original examples, a table called `table.csv` would have the following URLs: - `/db/table-.csv` - the HTML version - `/db/table-.csv.csv` - the CSV version - `/db/table-.csv.json` - the JSON version And if for some horrific reason you had a table with the name `/db/table-.csv.csv` (so `/db/` was the first part of the actual table name in SQLite) the URLs would look like this: - `/db/%2Fdb%2Ftable---.csv-.csv` - the HTML version - `/db/%2Fdb%2Ftable---.csv-.csv.csv` - the CSV version - `/db/%2Fdb%2Ftable---.csv-.csv.json` - the JSON version | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
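As a rough illustration of how the URLs listed above could be derived, the sketch below combines the `dot_dash_encode` function from the earlier comment with ordinary percent-encoding for whatever characters remain (such as `/`). The `table_urls` helper is a hypothetical name used only for this example; it is not Datasette's actual code.

```python
from urllib.parse import quote


def dot_dash_encode(s):
    # Escape "-" first so the "-" introduced for "." is not doubled afterwards.
    return s.replace("-", "--").replace(".", "-.")


def table_urls(database, table):
    # Hypothetical helper: percent-encode whatever dot-dash encoding leaves
    # behind (for example "/"), then append the format extension.
    base = "/" + quote(database, safe="") + "/" + quote(dot_dash_encode(table), safe="")
    return {"html": base, "csv": base + ".csv", "json": base + ".json"}


print(table_urls("db", "table.csv"))
print(table_urls("db", "/db/table-.csv.csv"))
```

Because `quote()` always leaves `-` and `.` untouched, the dot-dash escapes survive the percent-encoding step unchanged.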
https://github.com/simonw/datasette/issues/1439#issuecomment-900714630 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900714630 | IC_kwDOBm6k_c41r9CG | 9599 | 2021-08-18T00:13:33Z | 2021-08-18T00:13:33Z | OWNER | The documentation should definitely cover how table names become URLs, in case any third party code needs to be able to calculate this themselves. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-900715375 | https://api.github.com/repos/simonw/datasette/issues/1439 | 900715375 | IC_kwDOBm6k_c41r9Nv | 9599 | 2021-08-18T00:15:28Z | 2021-08-18T00:15:28Z | OWNER | Maybe I should use `-/` to encode forward slashes too, to defend against any ASGI servers that might not implement `raw_path` correctly. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1031141849 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1031141849 | IC_kwDOBm6k_c49dfnZ | 9599 | 2022-02-07T07:11:11Z | 2022-02-07T07:11:11Z | OWNER | I added a Link header to solve this problem for the JSON version in: - #1533 | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059802318 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059802318 | IC_kwDOBm6k_c4_K0zO | 9599 | 2022-03-05T17:34:33Z | 2022-03-05T17:34:33Z | OWNER | Wrote documentation: <img width="741" alt="Dash encoding. Datasette uses a custom encoding scheme in some places, called dash encoding. This is primarily used for table names and row primary keys, to avoid any confusion between / characters in those values and the Datasette URL that references them. Dash encoding applies the following rules, in order: 1. All single - characters are replaced by -- 2. . characters are replaced by -. 3. / characters are replaced by -/ These rules are applied in reverse order to decode a dash encoded string." src="https://user-images.githubusercontent.com/9599/156893903-5723f60e-e054-4365-84bc-f3084d11183d.png"> | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
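The three documented rules can be sketched roughly as follows, assuming `/` is escaped as `-/` as proposed earlier in the thread. The function names are illustrative. Decoding here uses a single left-to-right regular expression pass, which is a choice made for this sketch; applying the replacements in reverse order, as the documentation describes, is the other option.

```python
import re


def dash_encode(s):
    # Apply the rules in order: "-" first, then "." and "/".
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")


def dash_decode(s):
    # Every escape is "-" followed by one of "-", "." or "/". Scanning left
    # to right and keeping only the second character undoes all three rules.
    return re.sub(r"-([-./])", lambda m: m.group(1), s)


for example in ("table.csv", "table/and/slashes", "hello--and--so--on-.csv"):
    encoded = dash_encode(example)
    print(example, "=>", encoded, "=>", dash_decode(encoded))
```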
https://github.com/simonw/datasette/issues/1439#issuecomment-1059822151 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059822151 | IC_kwDOBm6k_c4_K5pH | 9599 | 2022-03-05T19:48:35Z | 2022-03-05T19:48:35Z | OWNER | Those new docs: https://github.com/simonw/datasette/blob/d1cb73180b4b5a07538380db76298618a5fc46b6/docs/internals.rst#dash-encoding | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059822391 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059822391 | IC_kwDOBm6k_c4_K5s3 | 9599 | 2022-03-05T19:50:12Z | 2022-03-05T19:50:12Z | OWNER | I'm going to move this work to a PR. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059836599 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059836599 | IC_kwDOBm6k_c4_K9K3 | 9599 | 2022-03-05T21:52:10Z | 2022-03-05T21:52:10Z | OWNER | Blogged about this here: https://simonwillison.net/2022/Mar/5/dash-encoding/ | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059850369 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059850369 | IC_kwDOBm6k_c4_LAiB | 9599 | 2022-03-05T23:28:56Z | 2022-03-05T23:28:56Z | OWNER | Lots of great conversations about the dash encoding implementation on Twitter: https://twitter.com/simonw/status/1500228316309061633 @dracos helped me figure out a simpler regex: https://twitter.com/dracos/status/1500236433809973248 `^/(?P<database>[^/]+)/(?P<table>[^\/\-\.]*|\-/|\-\.|\-\-)*(?P<format>\.\w+)?$` ![image](https://user-images.githubusercontent.com/9599/156903088-c01933ae-4713-4e91-8d71-affebf70b945.png) | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059851259 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059851259 | IC_kwDOBm6k_c4_LAv7 | 9599 | 2022-03-05T23:35:47Z | 2022-03-05T23:35:59Z | OWNER | This [comment from glyph](https://twitter.com/glyph/status/1500244937312329730) got me thinking: > Have you considered replacing % with some other character and then using percent-encoding? What happens if a table name includes a `%` character and that ends up getting mangled by a misbehaving proxy? I should consider `%` in the escaping system too. And maybe go with that suggestion of using percent-encoding directly but with a different character. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059853526 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059853526 | IC_kwDOBm6k_c4_LBTW | 9599 | 2022-03-05T23:49:59Z | 2022-03-05T23:49:59Z | OWNER | I want to try regular percentage encoding, except that it also encodes both the `-` and the `.` characters, AND it uses `-` instead of `%` as the encoding character. Should check what it does with emoji too. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059854864 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059854864 | IC_kwDOBm6k_c4_LBoQ | 9599 | 2022-03-05T23:59:05Z | 2022-03-05T23:59:05Z | OWNER | OK, for that percentage thing: the Python core implementation of URL percentage escaping deliberately ignores two of the characters we want to escape: `.` and `-`: https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L780-L783 ```python _ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ' b'abcdefghijklmnopqrstuvwxyz' b'0123456789' b'_.-~') ``` It also defaults to skipping `/` (passed as a `safe=` parameter to various things). I'm going to try borrowing and modifying the core of the Python implementation: https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L795-L814 ```python class _Quoter(dict): """A mapping from bytes numbers (in range(0,256)) to strings. String values are percent-encoded byte values, unless the key < 128, and in either of the specified safe set, or the always safe set. """ # Keeps a cache internally, via __missing__, for efficiency (lookups # of cached keys don't call Python code at all). def __init__(self, safe): """safe: bytes object.""" self.safe = _ALWAYS_SAFE.union(safe) def __repr__(self): return f"<Quoter {dict(self)!r}>" def __missing__(self, b): # Handle a cache miss. Store quoted string in cache and return. res = chr(b) if b in self.safe else '%{:02X}'.format(b) self[b] = res return res ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059855418 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059855418 | IC_kwDOBm6k_c4_LBw6 | 9599 | 2022-03-06T00:00:53Z | 2022-03-06T00:04:18Z | OWNER | ```python _ESCAPE_SAFE = frozenset( b'ABCDEFGHIJKLMNOPQRSTUVWXYZ' b'abcdefghijklmnopqrstuvwxyz' b'0123456789_' ) # I removed b'.-~') class Quoter(dict): # Keeps a cache internally, via __missing__ def __missing__(self, b): # Handle a cache miss. Store quoted string in cache and return. res = chr(b) if b in _ESCAPE_SAFE else '-{:02X}'.format(b) self[b] = res return res quoter = Quoter().__getitem__ ''.join([quoter(char) for char in b'foo/bar.csv']) # 'foo-2Fbar-2Ecsv' ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059864154 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059864154 | IC_kwDOBm6k_c4_LD5a | 9599 | 2022-03-06T00:59:04Z | 2022-03-06T00:59:04Z | OWNER | Needs more testing, but this seems to work for decoding the percent-escaped-with-dashes format: `urllib.parse.unquote(s.replace('-', '%'))` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
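Putting the encoder from the previous comment together with this decoding one-liner gives a round trip that can be checked directly. This is a minimal sketch: the names `dash_percent_encode` and `dash_percent_decode` are made up for illustration, and the input is assumed to be encoded as UTF-8 before escaping.

```python
import urllib.parse

# Same safe set as the previous comment: letters, digits and underscore only.
_ESCAPE_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_"
)


def dash_percent_encode(s):
    # Keep safe bytes as-is and write every other UTF-8 byte as -XX.
    return "".join(
        chr(b) if b in _ESCAPE_SAFE else "-{:02X}".format(b)
        for b in s.encode("utf-8")
    )


def dash_percent_decode(s):
    # Every "-" in an encoded string starts an escape, so swapping it for "%"
    # yields ordinary percent-encoding that unquote() can handle.
    return urllib.parse.unquote(s.replace("-", "%"))


for example in ("foo/bar.csv", "hello-world", "100%", "emoji 🐶"):
    encoded = dash_percent_encode(example)
    print(repr(example), "=>", encoded, "=>", repr(dash_percent_decode(encoded)))
```

Since `%` itself is not in the safe set, it is written as `-25`, so an encoded string never contains a literal `%` that could confuse the decoder.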
https://github.com/simonw/datasette/issues/1439#issuecomment-1059903309 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059903309 | IC_kwDOBm6k_c4_LNdN | 9599 | 2022-03-06T06:17:51Z | 2022-03-06T06:17:51Z | OWNER | Suggestion from a conversation with Seth Michael Larson: it would be neat if plugins could easily integrate with whatever scheme this ends up using, maybe with the `/db/table/-/plugin-name` standardized pattern or similar. Making it easy for plugins to do the right, consistent thing is a good idea. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1060044007 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1060044007 | IC_kwDOBm6k_c4_Lvzn | 9599 | 2022-03-06T21:38:15Z | 2022-03-06T21:38:15Z | OWNER | Test: https://github.com/simonw/datasette/blob/d2e3fe3facf0ed0abf8b00cd54463af90dd6904d/tests/test_utils.py#L651-L666 One big advantage to this scheme is that redirecting old links to `%2F` pages (e.g. https://fivethirtyeight.datasettes.com/fivethirtyeight/twitter-ratio%2Fsenators) is easy - if you see a `%` in the `raw_path`, redirect to that page with the `%` replaced by `-`. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
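The redirect rule described above could look something like the sketch below. `legacy_redirect_target` is a hypothetical helper, not part of Datasette, and it only mirrors the simple "replace `%` with `-`" idea; it makes no attempt to re-encode other characters (such as a literal `-`) that an old URL might contain.

```python
def legacy_redirect_target(raw_path):
    # raw_path is the undecoded request path as bytes, as found in the ASGI scope.
    if b"%" not in raw_path:
        return None  # nothing to redirect
    # Swap the escape character and return the new location as text.
    return raw_path.replace(b"%", b"-").decode("latin-1")


print(legacy_redirect_target(b"/db/table%2Fwith%2Fslashes"))
print(legacy_redirect_target(b"/db/table"))
```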
https://github.com/simonw/datasette/issues/1439#issuecomment-1060870237 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1060870237 | IC_kwDOBm6k_c4_O5hd | 9599 | 2022-03-07T16:19:22Z | 2022-03-07T16:19:22Z | OWNER | I didn't need to do any of the fancy regular expression routing stuff after all, since the new dash encoding format avoids using `/` so a simple `[^/]+` can capture the correct segments from the URL. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
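To illustrate why the routing gets simpler, here is a sketch of a route pattern built from plain `[^/]+` segments. The exact pattern and the `parse_table_path` helper are assumptions made for this example rather than Datasette's real routes; the table segment is matched non-greedily so that a trailing `.format` extension can be split off.

```python
import re

# Hypothetical route: /<database>/<table> with an optional .<format> suffix.
TABLE_ROUTE = re.compile(
    r"^/(?P<database>[^/]+)/(?P<table>[^/]+?)(\.(?P<format>\w+))?$"
)


def parse_table_path(path):
    match = TABLE_ROUTE.match(path)
    return match.groupdict() if match else None


print(parse_table_path("/db/table-2Ecsv.csv"))
print(parse_table_path("/db/table-2Ecsv"))
```

Because an encoded table name never contains a literal `.` or `/`, any `.` in the final segment can only be the start of the format extension.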
https://github.com/simonw/datasette/issues/1439#issuecomment-1065987808 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1065987808 | IC_kwDOBm6k_c4_ia7g | 9599 | 2022-03-13T00:02:32Z | 2022-03-13T00:02:32Z | OWNER | OK, this has broken a lot more than I expected it would. Turns out `-` is a very common character in existing Datasette database names! https://datasette.io/-/databases for example has two: ```json [ { "name": "docs-index", "path": "docs-index.db", "size": 1007616, "is_mutable": false, "is_memory": false, "hash": "0ac6c3de2762fcd174fd249fed8a8fa6046ea345173d22c2766186bf336462b2" }, { "name": "dogsheep-index", "path": "dogsheep-index.db", "size": 5496832, "is_mutable": false, "is_memory": false, "hash": "d1ea238d204e5b9ae783c86e4af5bcdf21267c1f391de3e468d9665494ee012a" } ] ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1065988403 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1065988403 | IC_kwDOBm6k_c4_ibEz | 9599 | 2022-03-13T00:06:38Z | 2022-03-13T00:07:19Z | OWNER | If I want to reserve `-` as a character that CAN be used in URLs, the only remaining character that might make sense for escape sequences is `~` - based on this last line of characters that are safe from percentage encoding: ```python _ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ' b'abcdefghijklmnopqrstuvwxyz' b'0123456789' b'_.-~') ``` So I'd add both `-` and `_` back to the safe list, but use `~` to escape `.` and `/` and suchlike. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
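A minimal sketch of that idea, assuming the same byte-by-byte approach as the earlier dash variant but with `-` and `_` treated as safe and `~` as the escape character. The function names are illustrative, and the version that eventually landed may differ in its details.

```python
import urllib.parse

# Letters, digits, "_" and "-" pass through; everything else is escaped.
_TILDE_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_-"
)


def tilde_encode_sketch(s):
    # Write every unsafe UTF-8 byte as ~XX, including "~" itself (~7E).
    return "".join(
        chr(b) if b in _TILDE_SAFE else "~{:02X}".format(b)
        for b in s.encode("utf-8")
    )


def tilde_decode_sketch(s):
    # Swap "~" for "%" and reuse the standard percent-decoding.
    return urllib.parse.unquote(s.replace("~", "%"))


for example in ("docs-index", "foo/bar.csv", "~table"):
    encoded = tilde_encode_sketch(example)
    print(example, "=>", encoded, "=>", tilde_decode_sketch(encoded))
```

With this choice, existing names such as `docs-index` keep their URLs unchanged, which addresses the breakage described in the previous comment.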
https://github.com/simonw/datasette/issues/1439#issuecomment-1068461449 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1068461449 | IC_kwDOBm6k_c4_r22J | 9599 | 2022-03-15T20:51:26Z | 2022-03-15T20:51:26Z | OWNER | I'm happy with this now that I've landed Tilde encoding in #1657. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 |