github
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | issue | performed_via_github_app |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/simonw/datasette/issues/1439#issuecomment-1045024276 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045024276 | IC_kwDOBm6k_c4-Sc4U | 9599 | 2022-02-18T19:01:42Z | 2022-02-18T19:55:24Z | OWNER | > Maybe I should use `-/` to encode forward slashes too, to defend against any ASGI servers that might not implement `raw_path` correctly. ```python def dash_encode(s): return s.replace("-", "--").replace(".", "-.").replace("/", "-/") def dash_decode(s): return s.replace("-/", "/").replace("-.", ".").replace("--", "-") ``` ```pycon >>> dash_encode("foo/bar/baz.csv") 'foo-/bar-/baz-.csv' >>> dash_decode('foo-/bar-/baz-.csv') 'foo/bar/baz.csv' ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045027067 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045027067 | IC_kwDOBm6k_c4-Sdj7 | 9599 | 2022-02-18T19:03:26Z | 2022-02-18T19:03:26Z | OWNER | (If I make this change it may break some existing Datasette installations when they upgrade - I could try and build a plugin for them which triggers on 404s and checks to see if the old format would return a 200 response, then returns that.) | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045032377 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045032377 | IC_kwDOBm6k_c4-Se25 | 9599 | 2022-02-18T19:06:50Z | 2022-02-18T19:06:50Z | OWNER | How does URL routing for https://latest.datasette.io/fixtures/table%2Fwith%2Fslashes.csv work? Right now it's https://github.com/simonw/datasette/blob/7d24fd405f3c60e4c852c5d746c91aa2ba23cf5b/datasette/app.py#L1098-L1101 That's not going to capture the dot-dash encoding version of that table name: ```pycon >>> dot_dash_encode("table/with/slashes.csv") 'table-/with-/slashes-.csv' ``` Probably needs a fancy regex trick like a negative lookbehind assertion or similar. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045055772 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045055772 | IC_kwDOBm6k_c4-Skkc | 9599 | 2022-02-18T19:23:33Z | 2022-02-18T19:25:42Z | OWNER | I want a match for this URL: /db/table-/with-/slashes-.csv Maybe this: ^/(?P<db_name>[^/]+)/(?P<table_and_format>([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*$) Here we are matching a sequence of: ([^/]*|(\-/)*|(\-\.)*|(\-\-)*)* So a combination of non-slash characters, or `-/`, `-.` or `--` sequences <img width="224" alt="image" src="https://user-images.githubusercontent.com/9599/154748362-84909d4e-dccf-454b-a9cd-a036f9f66f09.png"> ^/(?P<db_name>[^/]+)/(?P<table_and_format>([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*$) Try that with non-capturing bits: ^/(?P<db_name>[^/]+)/(?P<table_and_format>(?:[^/]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*$) `(?:[^/]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*` visualized is: <img width="193" alt="image" src="https://user-images.githubusercontent.com/9599/154748441-decea502-0d04-44f4-9ca9-fb6883767833.png"> Here's the explanation on regex101.com https://regex101.com/r/CPnsIO/1 <img width="1074" alt="image" src="https://user-images.githubusercontent.com/9599/154748720-cdda61db-5498-49a8-91c2-e726b394fa49.png"> | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045059427 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045059427 | IC_kwDOBm6k_c4-Sldj | 9599 | 2022-02-18T19:26:25Z | 2022-02-18T19:26:25Z | OWNER | With this new pattern I could probably extract out the optional `.json` format string as part of the initial route capturing regex too, rather than the current `table_and_format` hack. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045075207 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045075207 | IC_kwDOBm6k_c4-SpUH | 9599 | 2022-02-18T19:39:35Z | 2022-02-18T19:40:13Z | OWNER | > And if for some horrific reason you had a table with the name `/db/table-.csv.csv` (so `/db/` was the first part of the actual table name in SQLite) the URLs would look like this: > > * `/db/%2Fdb%2Ftable---.csv-.csv` - the HTML version > * `/db/%2Fdb%2Ftable---.csv-.csv.csv` - the CSV version > * `/db/%2Fdb%2Ftable---.csv-.csv.json` - the JSON version Here's what those look like with the updated version of `dot_dash_encode()` that also encodes `/` as `-/`: - `/db/-/db-/table---.csv-.csv` - HTML - `/db/-/db-/table---.csv-.csv.csv` - CSV - `/db/-/db-/table---.csv-.csv.json` - JSON <img width="1050" alt="image" src="https://user-images.githubusercontent.com/9599/154750631-a8a23c62-3dfc-43e4-8026-4d117dc4bf8d.png"> | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045077590 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045077590 | IC_kwDOBm6k_c4-Sp5W | 9599 | 2022-02-18T19:41:37Z | 2022-02-18T19:42:41Z | OWNER | Ugh, one disadvantage I just spotted with this: Datasette already has a `/-/versions.json` convention where "system" URLs are namespaced under `/-/` - but that could be confused under this new scheme with the `-/` escaping sequence. And I've thought about adding `/db/-/special` and `/db/table/-/special` URLs in the past too. Maybe change this system to use `.` as the escaping character instead of `-`? | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045081042 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045081042 | IC_kwDOBm6k_c4-SqvS | 9599 | 2022-02-18T19:44:12Z | 2022-02-18T19:51:34Z | OWNER | ```python def dot_encode(s): return s.replace(".", "..").replace("/", "./") def dot_decode(s): return s.replace("./", "/").replace("..", ".") ``` No need for hyphen encoding in this variant at all, which simplifies things a bit. (Update: this is flawed, see https://github.com/simonw/datasette/issues/1439#issuecomment-1045086033) | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045082891 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045082891 | IC_kwDOBm6k_c4-SrML | 9599 | 2022-02-18T19:45:32Z | 2022-02-18T19:45:32Z | OWNER | ```pycon >>> dot_encode("/db/table-.csv.csv") './db./table-..csv..csv' >>> dot_decode('./db./table-..csv..csv') '/db/table-.csv.csv' ``` I worry that web servers might treat `./` in a special way though. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045086033 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045086033 | IC_kwDOBm6k_c4-Sr9R | 9599 | 2022-02-18T19:47:43Z | 2022-02-18T19:51:11Z | OWNER | - https://datasette.io/-/asgi-scope/db/./db./table-..csv..csv - https://til.simonwillison.net/-/asgi-scope/db/./db./table-..csv..csv Do both of those survive the round-trip to populate `raw_path` correctly? No! In both cases the `/./` bit goes missing. It looks like this might even be a client issue - `curl` shows me this: ``` ~ % curl -vv -i 'https://datasette.io/-/asgi-scope/db/./db./table-..csv..csv' * Trying 216.239.32.21:443... * Connected to datasette.io (216.239.32.21) port 443 (#0) * ALPN, offering http/1.1 * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 * Server certificate: datasette.io * Server certificate: R3 * Server certificate: ISRG Root X1 > GET /-/asgi-scope/db/db./table-..csv..csv HTTP/1.1 ``` So `curl` decided to turn `/-/asgi-scope/db/./db./table` into `/-/asgi-scope/db/db./table` before even sending the request. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045095348 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045095348 | IC_kwDOBm6k_c4-SuO0 | 9599 | 2022-02-18T19:53:48Z | 2022-02-18T19:53:48Z | OWNER | > Ugh, one disadvantage I just spotted with this: Datasette already has a `/-/versions.json` convention where "system" URLs are namespaced under `/-/` - but that could be confused under this new scheme with the `-/` escaping sequence. > > And I've thought about adding `/db/-/special` and `/db/table/-/special` URLs in the past too. I don't think this matters. The new regex does indeed capture that kind of page: <img width="1052" alt="image" src="https://user-images.githubusercontent.com/9599/154752309-e1787755-3bdb-47c2-867c-7ac5fe65664d.png"> But Datasette goes through configured route regular expressions in order - so I can have the regex that captures `/db/-/special` routes listed before the one that captures tables and formats. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045099290 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045099290 | IC_kwDOBm6k_c4-SvMa | 9599 | 2022-02-18T19:56:18Z | 2022-02-18T19:56:30Z | OWNER | > ```python > def dash_encode(s): > return s.replace("-", "--").replace(".", "-.").replace("/", "-/") > > def dash_decode(s): > return s.replace("-/", "/").replace("-.", ".").replace("--", "-") > ``` I think **dash-encoding** (new name for this) is the right way forward here. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045108611 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045108611 | IC_kwDOBm6k_c4-SxeD | 9599 | 2022-02-18T20:02:19Z | 2022-02-18T20:08:34Z | OWNER | One other potential variant: ```python def dash_encode(s): return s.replace("-", "-dash-").replace(".", "-dot-").replace("/", "-slash-") def dash_decode(s): return s.replace("-slash-", "/").replace("-dot-", ".").replace("-dash-", "-") ``` Except this has bugs - it doesn't round-trip safely, because a sequence like `-dash-slash-` is ambiguous: it could be read as a `-dash-` or as a `-slash-`. ```pycon >>> dash_encode("/db/table-.csv.csv") '-slash-db-slash-table-dash--dot-csv-dot-csv' >>> dash_decode('-slash-db-slash-table-dash--dot-csv-dot-csv') '/db/table-.csv.csv' >>> dash_encode('-slash-db-slash-table-dash--dot-csv-dot-csv') '-dash-slash-dash-db-dash-slash-dash-table-dash-dash-dash--dash-dot-dash-csv-dash-dot-dash-csv' >>> dash_decode('-dash-slash-dash-db-dash-slash-dash-table-dash-dash-dash--dash-dot-dash-csv-dash-dot-dash-csv') '-dash/dash-db-dash/dash-table-dash--dash.dash-csv-dash.dash-csv' ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045111309 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045111309 | IC_kwDOBm6k_c4-SyIN | 9599 | 2022-02-18T20:04:24Z | 2022-02-18T20:05:40Z | OWNER | This made me worry that my current `dash_decode()` implementation had unknown round-trip bugs, but thankfully this works OK: ```pycon >>> dash_encode("/db/table-.csv.csv") '-/db-/table---.csv-.csv' >>> dash_encode('-/db-/table---.csv-.csv') '---/db---/table-------.csv---.csv' >>> dash_decode('---/db---/table-------.csv---.csv') '-/db-/table---.csv-.csv' >>> dash_decode('-/db-/table---.csv-.csv') '/db/table-.csv.csv' ``` The regex still works against that double-encoded example too: <img width="1032" alt="image" src="https://user-images.githubusercontent.com/9599/154753916-b7d2159e-4284-4c92-ae61-110671fa320e.png"> | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045117304 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045117304 | IC_kwDOBm6k_c4-Szl4 | 9599 | 2022-02-18T20:09:22Z | 2022-02-18T20:09:22Z | OWNER | Adopting this could result in supporting database files with surprising characters in their filename too. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045131086 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045131086 | IC_kwDOBm6k_c4-S29O | 9599 | 2022-02-18T20:22:13Z | 2022-02-18T20:22:47Z | OWNER | Should it encode `%` symbols too, since they have a special meaning in URLs and we can't guarantee that every single web server / proxy out there will round-trip them safely using percentage encoding? If so, would need to pick a different encoding character for them. Maybe `%` becomes `-p` - and in that case `/` could become `-s` too. Is it worth expanding dash-encoding outside of just `/` and `-` and `.` though? Not sure. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045134050 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045134050 | IC_kwDOBm6k_c4-S3ri | 9599 | 2022-02-18T20:25:04Z | 2022-02-18T20:25:04Z | OWNER | Here's a useful modern spec for how existing URL percentage encoding is supposed to work: https://url.spec.whatwg.org/#percent-encoded-bytes | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1045269544 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045269544 | IC_kwDOBm6k_c4-TYwo | 9599 | 2022-02-18T22:19:29Z | 2022-02-18T22:19:29Z | OWNER | Note that I've ruled out using `Accept: application/json` to return JSON because it turns out Cloudflare and potentially other CDNs ignore the `Vary: Accept` header entirely: - https://github.com/simonw/datasette/issues/1534 | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 |