github
html_url | issue_url | id | node_id | user | created_at | updated_at | author_association | body | reactions | issue | performed_via_github_app |
---|---|---|---|---|---|---|---|---|---|---|---|
https://github.com/simonw/datasette/issues/1439#issuecomment-1045069481 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1045069481 | IC_kwDOBm6k_c4-Sn6p | 9599 | 2022-02-18T19:34:41Z | 2022-03-05T21:32:22Z | OWNER | I think I got format extraction working! https://regex101.com/r/A0bW1D/1 ^/(?P<database>[^/]+)/(?P<table>(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*?)(?:(?<!\-)\.(?P<format>\w+))?$ I had to make that crazy inner one even more complicated to stop it from capturing `.` that was not part of `-.`. (?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)* Visualized: <img width="222" alt="image" src="https://user-images.githubusercontent.com/9599/154749714-44579899-5dc7-4e5f-ad4f-dc59dac48979.png"> So now I have a regex which can extract out the dot-encoded table name AND spot if there is an optional `.format` at the end: <img width="1090" alt="image" src="https://user-images.githubusercontent.com/9599/156900484-7912073f-28aa-4301-86e2-e5cbe625e1d5.png"> If I end up using this in Datasette it's going to need VERY comprehensive unit tests and inline documentation. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059802318 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059802318 | IC_kwDOBm6k_c4_K0zO | 9599 | 2022-03-05T17:34:33Z | 2022-03-05T17:34:33Z | OWNER | Wrote documentation: <img width="741" alt="Dash encoding. Datasette uses a custom encoding scheme in some places, called dash encoding. This is primarily used for table names and row primary keys, to avoid any confusion between / characters in those values and the Datasette URL that references them. Dash encoding applies the following rules, in order: 1. All single - characters are replaced by -- 2. . characters are replaced by -. 3. / characters are replaced by -/ These rules are applied in reverse order to decode a dash encoded string." src="https://user-images.githubusercontent.com/9599/156893903-5723f60e-e054-4365-84bc-f3084d11183d.png"> | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059822151 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059822151 | IC_kwDOBm6k_c4_K5pH | 9599 | 2022-03-05T19:48:35Z | 2022-03-05T19:48:35Z | OWNER | Those new docs: https://github.com/simonw/datasette/blob/d1cb73180b4b5a07538380db76298618a5fc46b6/docs/internals.rst#dash-encoding | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059822391 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059822391 | IC_kwDOBm6k_c4_K5s3 | 9599 | 2022-03-05T19:50:12Z | 2022-03-05T19:50:12Z | OWNER | I'm going to move this work to a PR. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059836599 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059836599 | IC_kwDOBm6k_c4_K9K3 | 9599 | 2022-03-05T21:52:10Z | 2022-03-05T21:52:10Z | OWNER | Blogged about this here: https://simonwillison.net/2022/Mar/5/dash-encoding/ | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059850369 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059850369 | IC_kwDOBm6k_c4_LAiB | 9599 | 2022-03-05T23:28:56Z | 2022-03-05T23:28:56Z | OWNER | Lots of great conversations about the dash encoding implementation on Twitter: https://twitter.com/simonw/status/1500228316309061633 @dracos helped me figure out a simpler regex: https://twitter.com/dracos/status/1500236433809973248 `^/(?P<database>[^/]+)/(?P<table>[^\/\-\.]*|\-/|\-\.|\-\-)*(?P<format>\.\w+)?$` ![image](https://user-images.githubusercontent.com/9599/156903088-c01933ae-4713-4e91-8d71-affebf70b945.png) | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059851259 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059851259 | IC_kwDOBm6k_c4_LAv7 | 9599 | 2022-03-05T23:35:47Z | 2022-03-05T23:35:59Z | OWNER | This [comment from glyph](https://twitter.com/glyph/status/1500244937312329730) got me thinking: > Have you considered replacing % with some other character and then using percent-encoding? What happens if a table name includes a `%` character and that ends up getting mangled by a misbehaving proxy? I should consider `%` in the escaping system too. And maybe go with that suggestion of using percent-encoding directly but with a different character. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059853526 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059853526 | IC_kwDOBm6k_c4_LBTW | 9599 | 2022-03-05T23:49:59Z | 2022-03-05T23:49:59Z | OWNER | I want to try regular percentage encoding, except that it also encodes both the `-` and the `.` characters, AND it uses `-` instead of `%` as the encoding character. Should check what it does with emoji too. | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 | |
https://github.com/simonw/datasette/issues/1439#issuecomment-1059854864 | https://api.github.com/repos/simonw/datasette/issues/1439 | 1059854864 | IC_kwDOBm6k_c4_LBoQ | 9599 | 2022-03-05T23:59:05Z | 2022-03-05T23:59:05Z | OWNER | OK, for that percentage thing: the Python core implementation of URL percentage escaping deliberately ignores two of the characters we want to escape: `.` and `-`: https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L780-L783 ```python _ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ' b'abcdefghijklmnopqrstuvwxyz' b'0123456789' b'_.-~') ``` It also defaults to skipping `/` (passed as a `safe=` parameter to various things). I'm going to try borrowing and modifying the core of the Python implementation: https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L795-L814 ```python class _Quoter(dict): """A mapping from bytes numbers (in range(0,256)) to strings. String values are percent-encoded byte values, unless the key < 128, and in either of the specified safe set, or the always safe set. """ # Keeps a cache internally, via __missing__, for efficiency (lookups # of cached keys don't call Python code at all). def __init__(self, safe): """safe: bytes object.""" self.safe = _ALWAYS_SAFE.union(safe) def __repr__(self): return f"<Quoter {dict(self)!r}>" def __missing__(self, b): # Handle a cache miss. Store quoted string in cache and return. res = chr(b) if b in self.safe else '%{:02X}'.format(b) self[b] = res return res ``` | { "total_count": 0, "+1": 0, "-1": 0, "laugh": 0, "hooray": 0, "confused": 0, "heart": 0, "rocket": 0, "eyes": 0 } |
973139047 |
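
The idea floated in the last few comments (percent-encoding that also escapes `-` and `.`, with `-` as the escape character instead of `%`) can be sketched roughly as follows. This is a minimal illustration and not Datasette's implementation: the names `SAFE_BYTES`, `dash_percent_encode` and `dash_percent_decode` are hypothetical, and the choice to also escape `~` is an assumption made here for simplicity. Rather than subclassing `_Quoter`, it applies the same per-byte rule directly.

```python
import re

# Bytes that are emitted unchanged. Unlike urllib.parse's _ALWAYS_SAFE,
# this set deliberately leaves out "-", "." and "~" (an assumption of this
# sketch), so all three are escaped along with "/" and "%".
SAFE_BYTES = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"_"
)

# An escape sequence is "-" followed by exactly two hex digits.
_ESCAPE_RE = re.compile(rb"-([0-9A-Fa-f]{2})")


def dash_percent_encode(s: str) -> str:
    """Percent-encode s as UTF-8 bytes, using "-" in place of "%"."""
    return "".join(
        chr(b) if b in SAFE_BYTES else "-{:02X}".format(b)
        for b in s.encode("utf-8")
    )


def dash_percent_decode(s: str) -> str:
    """Reverse dash_percent_encode: turn each -XX back into one byte."""
    raw = _ESCAPE_RE.sub(
        lambda m: bytes([int(m.group(1), 16)]),
        s.encode("ascii"),
    )
    return raw.decode("utf-8")
```

With this sketch, `dash_percent_encode("table/name.csv")` gives `table-2Fname-2Ecsv`. Because a literal `.` or `/` never survives in the encoded name, a trailing `.format` suffix in the URL could be split off without ambiguity. Non-ASCII input such as emoji is encoded one UTF-8 byte at a time and round-trips through the decoder.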