
issue_comments


48 rows where issue = 973139047 sorted by updated_at descending


user 2

  • simonw 47
  • karlcow 1

author_association 2

  • OWNER 47
  • NONE 1

issue 1

  • Rethink how .ext formats (v.s. ?_format=) works before 1.0 · 48 ✖
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1068461449 https://github.com/simonw/datasette/issues/1439#issuecomment-1068461449 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_r22J simonw 9599 2022-03-15T20:51:26Z 2022-03-15T20:51:26Z OWNER

I'm happy with this now that I've landed Tilde encoding in #1657.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1065988403 https://github.com/simonw/datasette/issues/1439#issuecomment-1065988403 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_ibEz simonw 9599 2022-03-13T00:06:38Z 2022-03-13T00:07:19Z OWNER

If I want to reserve - as a character that CAN be used in URLs, the only remaining character that might make sense for escape sequences is ~ - based on this last line of characters that are exempt from percentage encoding:

```python
_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-~')
```

So I'd add both - and _ back to the safe list, but use ~ to escape . and / and suchlike.
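A minimal sketch of what that tilde idea could look like, assuming a safe set of alphanumerics plus `_` and `-`. The names `tilde_encode_sketch`, `tilde_decode_sketch` and `_TILDE_SAFE` are made up for illustration and are not taken from Datasette:

```python
import re

# Hypothetical safe set for this sketch: alphanumerics plus "_" and "-".
_TILDE_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b"_-"
)


def tilde_encode_sketch(s: str) -> str:
    """Escape every UTF-8 byte outside the safe set as ~XX."""
    return "".join(
        chr(b) if b in _TILDE_SAFE else "~{:02X}".format(b)
        for b in s.encode("utf-8")
    )


def tilde_decode_sketch(s: str) -> str:
    """Reverse tilde_encode_sketch by turning each ~XX back into one byte."""
    raw = re.sub(
        rb"~([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        s.encode("utf-8"),
    )
    return raw.decode("utf-8")


print(tilde_encode_sketch("docs-index"))   # docs-index
print(tilde_encode_sketch("foo/bar.csv"))  # foo~2Fbar~2Ecsv
```

Names like docs-index pass through unchanged here, which is the property the comment is after.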

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1065987808 https://github.com/simonw/datasette/issues/1439#issuecomment-1065987808 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_ia7g simonw 9599 2022-03-13T00:02:32Z 2022-03-13T00:02:32Z OWNER

OK, this has broken a lot more than I expected it would.

Turns out - is a very common character in existing Datasette database names!

https://datasette.io/-/databases for example has two:

```json
[
    {
        "name": "docs-index",
        "path": "docs-index.db",
        "size": 1007616,
        "is_mutable": false,
        "is_memory": false,
        "hash": "0ac6c3de2762fcd174fd249fed8a8fa6046ea345173d22c2766186bf336462b2"
    },
    {
        "name": "dogsheep-index",
        "path": "dogsheep-index.db",
        "size": 5496832,
        "is_mutable": false,
        "is_memory": false,
        "hash": "d1ea238d204e5b9ae783c86e4af5bcdf21267c1f391de3e468d9665494ee012a"
    }
]
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1060870237 https://github.com/simonw/datasette/issues/1439#issuecomment-1060870237 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_O5hd simonw 9599 2022-03-07T16:19:22Z 2022-03-07T16:19:22Z OWNER

I didn't need to do any of the fancy regular expression routing stuff after all, since the new dash encoding format avoids using / so a simple [^/]+ can capture the correct segments from the URL.
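As a rough illustration of why the routing gets simpler: once `/` and `.` never appear unescaped inside an encoded table name, each path segment can be captured without lookbehinds. The pattern below is a sketch, not Datasette's actual route. Excluding `.` from the table segment so the format suffix can be split off in the same regex is an assumption made here, and `TABLE_ROUTE` and `parse_table_path` are hypothetical names:

```python
import re

# Hypothetical route: database segment, table segment, optional .format suffix.
TABLE_ROUTE = re.compile(
    r"^/(?P<database>[^/]+)/(?P<table>[^/.]+)(?:\.(?P<format>\w+))?$"
)


def parse_table_path(path):
    """Return the named groups for a table URL, or None if it does not match."""
    match = TABLE_ROUTE.match(path)
    return match.groupdict() if match else None


print(parse_table_path("/fixtures/table-2Fwith-2Fslashes-2Ecsv.json"))
```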

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1060044007 https://github.com/simonw/datasette/issues/1439#issuecomment-1060044007 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_Lvzn simonw 9599 2022-03-06T21:38:15Z 2022-03-06T21:38:15Z OWNER

Test: https://github.com/simonw/datasette/blob/d2e3fe3facf0ed0abf8b00cd54463af90dd6904d/tests/test_utils.py#L651-L666

One big advantage to this scheme is that redirecting old links to %2F pages (e.g. https://fivethirtyeight.datasettes.com/fivethirtyeight/twitter-ratio%2Fsenators) is easy - if you see a % in the raw_path, redirect to that page with the % replaced by -.
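The redirect rule described above could be sketched like this. It follows the comment literally (swap every `%` for `-`) and does nothing about dashes that were already present in the old URL. The function name `legacy_redirect_target` is hypothetical:

```python
def legacy_redirect_target(raw_path: bytes):
    """Return the path to redirect to, or None if no redirect is needed."""
    # latin-1 maps every byte to a character, so decoding cannot fail.
    path = raw_path.decode("latin-1")
    if "%" not in path:
        return None
    return path.replace("%", "-")


print(legacy_redirect_target(b"/fivethirtyeight/avengers%2Favengers"))
# /fivethirtyeight/avengers-2Favengers
```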

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059903309 https://github.com/simonw/datasette/issues/1439#issuecomment-1059903309 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LNdN simonw 9599 2022-03-06T06:17:51Z 2022-03-06T06:17:51Z OWNER

Suggestion from a conversation with Seth Michael Larson: it would be neat if plugins could easily integrate with whatever scheme this ends up using, maybe with the /db/table/-/plugin-name standardized pattern or similar.

Making it easy for plugins to do the right, consistent thing is a good idea.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059864154 https://github.com/simonw/datasette/issues/1439#issuecomment-1059864154 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LD5a simonw 9599 2022-03-06T00:59:04Z 2022-03-06T00:59:04Z OWNER

Needs more testing, but this seems to work for decoding the percent-escaped-with-dashes format: urllib.parse.unquote(s.replace('-', '%'))
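Wrapped in a function, that one-liner looks like the sketch below. The name `dash_decode_sketch` is made up here. It assumes the input came from the matching encoder: strings that already contain a literal `%`, or a `-` not followed by two hex digits, are not validated.

```python
import urllib.parse


def dash_decode_sketch(s: str) -> str:
    """Decode -XX sequences by reusing the standard percent-decoder."""
    return urllib.parse.unquote(s.replace("-", "%"))


print(dash_decode_sketch("foo-2Fbar-2Ecsv"))  # foo/bar.csv
print(dash_decode_sketch("docs-2Dindex"))     # docs-index
```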

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059863997 https://github.com/simonw/datasette/issues/1439#issuecomment-1059863997 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LD29 karlcow 505230 2022-03-06T00:57:57Z 2022-03-06T00:57:57Z NONE

Probably too late… but I have just seen this because http://simonwillison.net/2022/Mar/5/dash-encoding/#atom-everything

And it reminded me of comma tools at W3C. http://www.w3.org/,tools

Example, the text version of W3C homepage https://www.w3.org/,text

The challenge comes down to telling the difference between the following:

* `/db/table` - an HTML table page

/db/table

* `/db/table.csv` - the CSV version of `/db/table`

/db/table,csv

* `/db/table.csv` - no this one is actually a database table called `table.csv`

/db/table.csv

* `/db/table.csv.csv` - the CSV version of `/db/table.csv`

/db/table.csv,csv

* `/db/table.csv.csv.csv` and so on...

/db/table.csv.csv,csv

I haven't checked all the cases in the thread.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059855418 https://github.com/simonw/datasette/issues/1439#issuecomment-1059855418 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LBw6 simonw 9599 2022-03-06T00:00:53Z 2022-03-06T00:04:18Z OWNER

```python
_ESCAPE_SAFE = frozenset(
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'abcdefghijklmnopqrstuvwxyz'
    b'0123456789'
)
# I removed b'_.-~')

class Quoter(dict):
    # Keeps a cache internally, via __missing__
    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        res = chr(b) if b in _ESCAPE_SAFE else '-{:02X}'.format(b)
        self[b] = res
        return res

quoter = Quoter().__getitem__

''.join([quoter(char) for char in b'foo/bar.csv'])
# 'foo-2Fbar-2Ecsv'
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059854864 https://github.com/simonw/datasette/issues/1439#issuecomment-1059854864 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LBoQ simonw 9599 2022-03-05T23:59:05Z 2022-03-05T23:59:05Z OWNER

OK, for that percentage thing: the Python core implementation of URL percentage escaping deliberately ignores two of the characters we want to escape: . and -:

https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L780-L783

```python
_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-~')
```

It also defaults to skipping / (passed as a safe= parameter to various things).

I'm going to try borrowing and modifying the core of the Python implementation: https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L795-L814

```python
class _Quoter(dict):
    """A mapping from bytes numbers (in range(0,256)) to strings.

    String values are percent-encoded byte values, unless the key < 128, and
    in either of the specified safe set, or the always safe set.
    """
    # Keeps a cache internally, via __missing__, for efficiency (lookups
    # of cached keys don't call Python code at all).
    def __init__(self, safe):
        """safe: bytes object."""
        self.safe = _ALWAYS_SAFE.union(safe)

    def __repr__(self):
        return f"<Quoter {dict(self)!r}>"

    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res
        return res
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059853526 https://github.com/simonw/datasette/issues/1439#issuecomment-1059853526 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LBTW simonw 9599 2022-03-05T23:49:59Z 2022-03-05T23:49:59Z OWNER

I want to try regular percentage encoding, except that it also encodes both the - and the . characters, AND it uses - instead of % as the encoding character.

Should check what it does with emoji too.
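A quick way to check the emoji question, as a sketch: encode the string to UTF-8 first and escape byte by byte, so a multi-byte character becomes several -XX sequences. The safe set of alphanumerics only, and the name `dash_encode_sketch`, are assumptions for this example:

```python
# Hypothetical safe set: alphanumerics only, so "-", "." and "_" all get escaped.
_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
)


def dash_encode_sketch(s: str) -> str:
    """Percent-style encoding that uses "-" as the escape character."""
    return "".join(
        chr(b) if b in _SAFE else "-{:02X}".format(b)
        for b in s.encode("utf-8")
    )


print(dash_encode_sketch("a-b.c"))       # a-2Db-2Ec
print(dash_encode_sketch("\U0001F600"))  # -F0-9F-98-80
```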

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059851259 https://github.com/simonw/datasette/issues/1439#issuecomment-1059851259 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LAv7 simonw 9599 2022-03-05T23:35:47Z 2022-03-05T23:35:59Z OWNER

This comment from glyph got me thinking:

Have you considered replacing % with some other character and then using percent-encoding?

What happens if a table name includes a % character and that ends up getting mangled by a misbehaving proxy?

I should consider % in the escaping system too. And maybe go with that suggestion of using percent-encoding directly but with a different character.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059850369 https://github.com/simonw/datasette/issues/1439#issuecomment-1059850369 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_LAiB simonw 9599 2022-03-05T23:28:56Z 2022-03-05T23:28:56Z OWNER

Lots of great conversations about the dash encoding implementation on Twitter: https://twitter.com/simonw/status/1500228316309061633

@dracos helped me figure out a simpler regex: https://twitter.com/dracos/status/1500236433809973248

^/(?P<database>[^/]+)/(?P<table>[^\/\-\.]*|\-/|\-\.|\-\-)*(?P<format>\.\w+)?$

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059836599 https://github.com/simonw/datasette/issues/1439#issuecomment-1059836599 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_K9K3 simonw 9599 2022-03-05T21:52:10Z 2022-03-05T21:52:10Z OWNER

Blogged about this here: https://simonwillison.net/2022/Mar/5/dash-encoding/

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045069481 https://github.com/simonw/datasette/issues/1439#issuecomment-1045069481 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Sn6p simonw 9599 2022-02-18T19:34:41Z 2022-03-05T21:32:22Z OWNER

I think I got format extraction working! https://regex101.com/r/A0bW1D/1

^/(?P<database>[^/]+)/(?P<table>(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*?)(?:(?<!\-)\.(?P<format>\w+))?$

I had to make that crazy inner one even more complicated to stop it from capturing . that was not part of -..

(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*

Visualized:

So now I have a regex which can extract out the dot-encoded table name AND spot if there is an optional .format at the end:

If I end up using this in Datasette it's going to need VERY comprehensive unit tests and inline documentation.
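For experimenting with format extraction outside regex101, here is a sketch using a simplified pattern in the same spirit. It is not the regex from this comment: it treats the table name as a run of either ordinary characters or two-character escapes (`-/`, `-.`, `--`), followed by an optional `.format`. The names `ROUTE` and `split_table_and_format` are hypothetical:

```python
import re

# Simplified stand-in pattern, written for this sketch only.
ROUTE = re.compile(
    r"^/(?P<database>[^/]+)/"
    r"(?P<table>(?:[^/.\-]|-[/.\-])+)"  # plain characters or -/ -. -- escapes
    r"(?:\.(?P<format>\w+))?$"          # optional unescaped .format suffix
)


def split_table_and_format(path):
    """Return (database, encoded_table, format) or None if the path does not match."""
    match = ROUTE.match(path)
    if match is None:
        return None
    return match.group("database"), match.group("table"), match.group("format")


print(split_table_and_format("/db/table-/with-/slashes-.csv.json"))
# ('db', 'table-/with-/slashes-.csv', 'json')
```

Because every escape starts with `-`, a `.` that is not preceded by an escape dash can only be the format separator.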

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059822391 https://github.com/simonw/datasette/issues/1439#issuecomment-1059822391 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_K5s3 simonw 9599 2022-03-05T19:50:12Z 2022-03-05T19:50:12Z OWNER

I'm going to move this work to a PR.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059822151 https://github.com/simonw/datasette/issues/1439#issuecomment-1059822151 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_K5pH simonw 9599 2022-03-05T19:48:35Z 2022-03-05T19:48:35Z OWNER

Those new docs: https://github.com/simonw/datasette/blob/d1cb73180b4b5a07538380db76298618a5fc46b6/docs/internals.rst#dash-encoding

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1059802318 https://github.com/simonw/datasette/issues/1439#issuecomment-1059802318 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4_K0zO simonw 9599 2022-03-05T17:34:33Z 2022-03-05T17:34:33Z OWNER

Wrote documentation:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1053973425 https://github.com/simonw/datasette/issues/1439#issuecomment-1053973425 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-0lux simonw 9599 2022-02-28T07:40:12Z 2022-02-28T07:40:12Z OWNER

If I make this change it will break existing links to one of the oldest Datasette demos: http://fivethirtyeight.datasettes.com/fivethirtyeight/avengers%2Favengers

A plugin that fixes those by redirecting them on 404 would be neat.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1049126151 https://github.com/simonw/datasette/issues/1439#issuecomment-1049126151 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-iGUH simonw 9599 2022-02-23T19:17:01Z 2022-02-23T19:17:01Z OWNER

Actually the relevant code looks to be: https://github.com/simonw/datasette/blob/7d24fd405f3c60e4c852c5d746c91aa2ba23cf5b/datasette/views/base.py#L481-L498

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1049124390 https://github.com/simonw/datasette/issues/1439#issuecomment-1049124390 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-iF4m simonw 9599 2022-02-23T19:15:00Z 2022-02-23T19:15:00Z OWNER

I'll start by modifying this function: https://github.com/simonw/datasette/blob/458f03ad3a454d271f47a643f4530bd8b60ddb76/datasette/utils/__init__.py#L732-L749

Later I want to move this to the routing layer to split out format automatically, as seen in the regexes here: https://github.com/simonw/datasette/issues/1439#issuecomment-1045069481

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1049114724 https://github.com/simonw/datasette/issues/1439#issuecomment-1049114724 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-iDhk simonw 9599 2022-02-23T19:04:40Z 2022-02-23T19:04:40Z OWNER

I'm going to try dash encoding for table names (and row IDs) in a branch and see how I like it.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045269544 https://github.com/simonw/datasette/issues/1439#issuecomment-1045269544 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-TYwo simonw 9599 2022-02-18T22:19:29Z 2022-02-18T22:19:29Z OWNER

Note that I've ruled out using Accept: application/json to return JSON because it turns out Cloudflare and potentially other CDNs ignore the Vary: Accept header entirely: - https://github.com/simonw/datasette/issues/1534

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045134050 https://github.com/simonw/datasette/issues/1439#issuecomment-1045134050 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-S3ri simonw 9599 2022-02-18T20:25:04Z 2022-02-18T20:25:04Z OWNER

Here's a useful modern spec for how existing URL percentage encoding is supposed to work: https://url.spec.whatwg.org/#percent-encoded-bytes

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045131086 https://github.com/simonw/datasette/issues/1439#issuecomment-1045131086 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-S29O simonw 9599 2022-02-18T20:22:13Z 2022-02-18T20:22:47Z OWNER

Should it encode % symbols too, since they have a special meaning in URLs and we can't guarantee that every single web server / proxy out there will round-trip them safely using percentage encoding? If so, would need to pick a different encoding character for them. Maybe % becomes -p - and in that case / could become -s too.

Is it worth expanding dash-encoding outside of just / and - and . though? Not sure.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045117304 https://github.com/simonw/datasette/issues/1439#issuecomment-1045117304 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Szl4 simonw 9599 2022-02-18T20:09:22Z 2022-02-18T20:09:22Z OWNER

Adopting this could result in supporting database files with surprising characters in their filename too.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045108611 https://github.com/simonw/datasette/issues/1439#issuecomment-1045108611 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SxeD simonw 9599 2022-02-18T20:02:19Z 2022-02-18T20:08:34Z OWNER

One other potential variant:

```python
def dash_encode(s):
    return s.replace("-", "-dash-").replace(".", "-dot-").replace("/", "-slash-")

def dash_decode(s):
    return s.replace("-slash-", "/").replace("-dot-", ".").replace("-dash-", "-")
```

Except this has bugs - it doesn't round-trip safely, because it can get confused about things like `-dash-slash-` in terms of is that a `-dash-` or a `-slash-`?

```pycon
>>> dash_encode("/db/table-.csv.csv")
'-slash-db-slash-table-dash--dot-csv-dot-csv'
>>> dash_decode('-slash-db-slash-table-dash--dot-csv-dot-csv')
'/db/table-.csv.csv'
>>> dash_encode('-slash-db-slash-table-dash--dot-csv-dot-csv')
'-dash-slash-dash-db-dash-slash-dash-table-dash-dash-dash--dash-dot-dash-csv-dash-dot-dash-csv'
>>> dash_decode('-dash-slash-dash-db-dash-slash-dash-table-dash-dash-dash--dash-dot-dash-csv-dash-dot-dash-csv')
'-dash/dash-db-dash/dash-table-dash--dash.dash-csv-dash.dash-csv'
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045111309 https://github.com/simonw/datasette/issues/1439#issuecomment-1045111309 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SyIN simonw 9599 2022-02-18T20:04:24Z 2022-02-18T20:05:40Z OWNER

This made me worry that my current dash_decode() implementation had unknown round-trip bugs, but thankfully this works OK:

```pycon
>>> dash_encode("/db/table-.csv.csv")
'-/db-/table---.csv-.csv'
>>> dash_encode('-/db-/table---.csv-.csv')
'---/db---/table-------.csv---.csv'
>>> dash_decode('---/db---/table-------.csv---.csv')
'-/db-/table---.csv-.csv'
>>> dash_decode('-/db-/table---.csv-.csv')
'/db/table-.csv.csv'
```

The regex still works against that double-encoded example too:

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045099290 https://github.com/simonw/datasette/issues/1439#issuecomment-1045099290 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SvMa simonw 9599 2022-02-18T19:56:18Z 2022-02-18T19:56:30Z OWNER

```python
def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")

def dash_decode(s):
    return s.replace("-/", "/").replace("-.", ".").replace("--", "-")
```

I think dash-encoding (new name for this) is the right way forward here.
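One way to build confidence in the round-trip behaviour is a brute-force check over every short string drawn from the awkward characters. This is only a sketch: `find_round_trip_failure`, its alphabet and its length limit are arbitrary choices for illustration, and passing it is evidence rather than proof.

```python
import itertools


def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")


def dash_decode(s):
    return s.replace("-/", "/").replace("-.", ".").replace("--", "-")


def find_round_trip_failure(alphabet="-./a", max_len=6):
    """Return the first string that fails to round-trip, or None if all pass."""
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            s = "".join(chars)
            if dash_decode(dash_encode(s)) != s:
                return s
    return None


print(find_round_trip_failure())
```

The decode order matters: `-/` and `-.` are undone first, which leaves only runs of doubled dashes for the final replace.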

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045024276 https://github.com/simonw/datasette/issues/1439#issuecomment-1045024276 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Sc4U simonw 9599 2022-02-18T19:01:42Z 2022-02-18T19:55:24Z OWNER

Maybe I should use -/ to encode forward slashes too, to defend against any ASGI servers that might not implement raw_path correctly.

```python
def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")

def dash_decode(s):
    return s.replace("-/", "/").replace("-.", ".").replace("--", "-")
```

```pycon
>>> dash_encode("foo/bar/baz.csv")
'foo-/bar-/baz-.csv'
>>> dash_decode('foo-/bar-/baz-.csv')
'foo/bar/baz.csv'
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045095348 https://github.com/simonw/datasette/issues/1439#issuecomment-1045095348 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SuO0 simonw 9599 2022-02-18T19:53:48Z 2022-02-18T19:53:48Z OWNER

Ugh, one disadvantage I just spotted with this: Datasette already has a /-/versions.json convention where "system" URLs are namespaced under /-/ - but that could be confused under this new scheme with the -/ escaping sequence.

And I've thought about adding /db/-/special and /db/table/-/special URLs in the past too.

I don't think this matters. The new regex does indeed capture that kind of page:

But Datasette goes through configured route regular expressions in order - so I can have the regex that captures /db/-/special routes listed before the one that captures tables and formats.
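A small sketch of that ordering idea, with made-up patterns and names (`ROUTES`, `resolve`) rather than Datasette's real route table. The first pattern that matches wins, so listing the `/db/-/special` route first keeps it away from the table route, which would otherwise read `-/special` as an encoded table name:

```python
import re

# Hypothetical route list, checked in order.
ROUTES = [
    ("special", re.compile(r"^/(?P<database>[^/]+)/-/(?P<special>[^/]+)$")),
    (
        "table",
        re.compile(
            r"^/(?P<database>[^/]+)/"
            r"(?P<table>(?:[^/.\-]|-[/.\-])+)"
            r"(?:\.(?P<format>\w+))?$"
        ),
    ),
]


def resolve(path):
    """Return (route_name, captured_groups) for the first matching route."""
    for name, pattern in ROUTES:
        match = pattern.match(path)
        if match:
            return name, match.groupdict()
    return None


print(resolve("/db/-/special"))
print(resolve("/db/table-/with-/slashes-.csv.json"))
```

The cost of this ordering is that a table whose encoded name starts with `-/` followed by a single segment would be shadowed by the special route.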

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045081042 https://github.com/simonw/datasette/issues/1439#issuecomment-1045081042 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SqvS simonw 9599 2022-02-18T19:44:12Z 2022-02-18T19:51:34Z OWNER

```python
def dot_encode(s):
    return s.replace(".", "..").replace("/", "./")

def dot_decode(s):
    return s.replace("./", "/").replace("..", ".")
```

No need for hyphen encoding in this variant at all, which simplifies things a bit.

(Update: this is flawed, see https://github.com/simonw/datasette/issues/1439#issuecomment-1045086033)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045086033 https://github.com/simonw/datasette/issues/1439#issuecomment-1045086033 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Sr9R simonw 9599 2022-02-18T19:47:43Z 2022-02-18T19:51:11Z OWNER
  • https://datasette.io/-/asgi-scope/db/./db./table-..csv..csv
  • https://til.simonwillison.net/-/asgi-scope/db/./db./table-..csv..csv

Do both of those survive the round-trip to populate raw_path correctly?

No! In both cases the /./ bit goes missing.

It looks like this might even be a client issue - curl shows me this:

```
~ % curl -vv -i 'https://datasette.io/-/asgi-scope/db/./db./table-..csv..csv'
*   Trying 216.239.32.21:443...
* Connected to datasette.io (216.239.32.21) port 443 (#0)
* ALPN, offering http/1.1
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: datasette.io
* Server certificate: R3
* Server certificate: ISRG Root X1
> GET /-/asgi-scope/db/db./table-..csv..csv HTTP/1.1
```

So `curl` decided to turn `/-/asgi-scope/db/./db./table` into `/-/asgi-scope/db/db./table` before even sending the request.
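The same kind of dot-segment removal can be reproduced locally. The sketch below uses `posixpath.normpath` as a stand-in: it is not what curl runs, and it differs from URL normalization in other ways (it also collapses repeated slashes and drops trailing ones), but it drops the lone `.` segment in the same way. The wrapper name `strip_dot_segments_sketch` is made up:

```python
import posixpath


def strip_dot_segments_sketch(path: str) -> str:
    """Approximate client-side dot-segment removal with posixpath.normpath."""
    return posixpath.normpath(path)


print(strip_dot_segments_sketch("/-/asgi-scope/db/./db./table-..csv..csv"))
# /-/asgi-scope/db/db./table-..csv..csv
```

Only a segment that is exactly `.` is removed, which is why `db.` and `table-..csv..csv` survive while the `./` escape for a leading slash does not.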

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045082891 https://github.com/simonw/datasette/issues/1439#issuecomment-1045082891 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SrML simonw 9599 2022-02-18T19:45:32Z 2022-02-18T19:45:32Z OWNER

```pycon
>>> dot_encode("/db/table-.csv.csv")
'./db./table-..csv..csv'
>>> dot_decode('./db./table-..csv..csv')
'/db/table-.csv.csv'
```

I worry that web servers might treat `./` in a special way though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045077590 https://github.com/simonw/datasette/issues/1439#issuecomment-1045077590 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Sp5W simonw 9599 2022-02-18T19:41:37Z 2022-02-18T19:42:41Z OWNER

Ugh, one disadvantage I just spotted with this: Datasette already has a /-/versions.json convention where "system" URLs are namespaced under /-/ - but that could be confused under this new scheme with the -/ escaping sequence.

And I've thought about adding /db/-/special and /db/table/-/special URLs in the past too.

Maybe change this system to use . as the escaping character instead of -?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045075207 https://github.com/simonw/datasette/issues/1439#issuecomment-1045075207 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-SpUH simonw 9599 2022-02-18T19:39:35Z 2022-02-18T19:40:13Z OWNER

And if for some horrific reason you had a table with the name /db/table-.csv.csv (so /db/ was the first part of the actual table name in SQLite) the URLs would look like this:

* `/db/%2Fdb%2Ftable---.csv-.csv` - the HTML version
* `/db/%2Fdb%2Ftable---.csv-.csv.csv` - the CSV version
* `/db/%2Fdb%2Ftable---.csv-.csv.json` - the JSON version

Here's what those look like with the updated version of dot_dash_encode() that also encodes / as -/:

  • /db/-/db-/table---.csv-.csv - HTML
  • /db/-/db-/table---.csv-.csv.csv - CSV
  • /db/-/db-/table---.csv-.csv.json - JSON
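Those three URLs can be generated mechanically. A sketch, assuming the `dash_encode()` from elsewhere in this thread and a hypothetical helper `table_urls`:

```python
def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")


def table_urls(database, table):
    """Build the HTML, CSV and JSON URLs for a table (illustrative helper)."""
    base = "/{}/{}".format(database, dash_encode(table))
    return {"html": base, "csv": base + ".csv", "json": base + ".json"}


for kind, url in table_urls("db", "/db/table-.csv.csv").items():
    print(kind, url)
```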

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045059427 https://github.com/simonw/datasette/issues/1439#issuecomment-1045059427 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Sldj simonw 9599 2022-02-18T19:26:25Z 2022-02-18T19:26:25Z OWNER

With this new pattern I could probably extract out the optional .json format string as part of the initial route capturing regex too, rather than the current table_and_format hack.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045055772 https://github.com/simonw/datasette/issues/1439#issuecomment-1045055772 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Skkc simonw 9599 2022-02-18T19:23:33Z 2022-02-18T19:25:42Z OWNER

I want a match for this URL:

/db/table-/with-/slashes-.csv

Maybe this:

^/(?P<db_name>[^/]+)/(?P<table_and_format>([^/]*|(\-/)*|(\-\.)*|(\.\.)*)*$)

Here we are matching a sequence of:

([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*

So a combination of not-slashes, or -/, -. or -- sequences:

^/(?P<db_name>[^/]+)/(?P<table_and_format>([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*$)

Try that with non-capturing bits:

^/(?P<db_name>[^/]+)/(?P<table_and_format>(?:[^/]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*$)

(?:[^/]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)* visualized is:

Here's the explanation on regex101.com https://regex101.com/r/CPnsIO/1

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045032377 https://github.com/simonw/datasette/issues/1439#issuecomment-1045032377 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Se25 simonw 9599 2022-02-18T19:06:50Z 2022-02-18T19:06:50Z OWNER

How does URL routing for https://latest.datasette.io/fixtures/table%2Fwith%2Fslashes.csv work?

Right now it's https://github.com/simonw/datasette/blob/7d24fd405f3c60e4c852c5d746c91aa2ba23cf5b/datasette/app.py#L1098-L1101

That's not going to capture the dot-dash encoding version of that table name:

```pycon
>>> dot_dash_encode("table/with/slashes.csv")
'table-/with-/slashes-.csv'
```

Probably needs a fancy regex trick like a negative lookbehind assertion or similar.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1045027067 https://github.com/simonw/datasette/issues/1439#issuecomment-1045027067 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c4-Sdj7 simonw 9599 2022-02-18T19:03:26Z 2022-02-18T19:03:26Z OWNER

(If I make this change it may break some existing Datasette installations when they upgrade - I could try and build a plugin for them which triggers on 404s and checks to see if the old format would return a 200 response, then returns that.)

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
1031141849 https://github.com/simonw/datasette/issues/1439#issuecomment-1031141849 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c49dfnZ simonw 9599 2022-02-07T07:11:11Z 2022-02-07T07:11:11Z OWNER

I added a Link header to solve this problem for the JSON version in #1533.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900715375 https://github.com/simonw/datasette/issues/1439#issuecomment-900715375 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r9Nv simonw 9599 2021-08-18T00:15:28Z 2021-08-18T00:15:28Z OWNER

Maybe I should use -/ to encode forward slashes too, to defend against any ASGI servers that might not implement raw_path correctly.
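A minimal sketch of what that extension might look like, assuming the dot-dash functions from this thread are simply given one more replacement. The single-pass regex decoder is my own choice for illustration, not something taken from the Datasette codebase.

```python
import re


def dot_dash_encode(s):
    # Double existing dashes first, so the dashes added for "." and "/"
    # are not escaped a second time.
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")


def dot_dash_decode(s):
    # Single left-to-right pass: a dash followed by "-", "." or "/" is an
    # escape pair, so keep only the second character of each pair.
    return re.sub(r"-([-./])", r"\1", s)


for example in ("table/with/slashes.csv", "a-/b", "-.-/"):
    encoded = dot_dash_encode(example)
    print(example, encoded, dot_dash_decode(encoded) == example)
```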

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900714630 https://github.com/simonw/datasette/issues/1439#issuecomment-900714630 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r9CG simonw 9599 2021-08-18T00:13:33Z 2021-08-18T00:13:33Z OWNER

The documentation should definitely cover how table names become URLs, in case any third-party code needs to calculate them itself.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900712981 https://github.com/simonw/datasette/issues/1439#issuecomment-900712981 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r8oV simonw 9599 2021-08-18T00:09:59Z 2021-08-18T00:12:32Z OWNER

So given the original examples, a table called table.csv would have the following URLs:

  • /db/table-.csv - the HTML version
  • /db/table-.csv.csv - the CSV version
  • /db/table-.csv.json - the JSON version

And if for some horrific reason you had a table with the name /db/table-.csv.csv (so /db/ was the first part of the actual table name in SQLite) the URLs would look like this:

  • /db/%2Fdb%2Ftable---.csv-.csv - the HTML version
  • /db/%2Fdb%2Ftable---.csv-.csv.csv - the CSV version
  • /db/%2Fdb%2Ftable---.csv-.csv.json - the JSON version
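The URLs above can be reproduced with a short sketch. It assumes the dot-dash scheme from this thread is applied first and that anything left over, such as `/`, is then percent-encoded with `urllib.parse.quote`. `table_urls` is a hypothetical helper name.

```python
from urllib.parse import quote


def dot_dash_encode(s):
    # The dot-dash scheme from this thread: "-" -> "--", "." -> "-."
    return s.replace("-", "--").replace(".", "-.")


def table_urls(database, table):
    # Hypothetical helper: dot-dash encode, then percent-encode the rest.
    # quote() leaves "-" and "." untouched, so the escapes survive.
    base = "/{}/{}".format(database, quote(dot_dash_encode(table), safe=""))
    return {"html": base, "csv": base + ".csv", "json": base + ".json"}


print(table_urls("db", "table.csv"))
print(table_urls("db", "/db/table-.csv.csv"))
```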
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900711967 https://github.com/simonw/datasette/issues/1439#issuecomment-900711967 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r8Yf simonw 9599 2021-08-18T00:08:09Z 2021-08-18T00:08:09Z OWNER

Here's an alternative I just made up which I'm calling "dot dash" encoding:

```python
def dot_dash_encode(s):
    return s.replace("-", "--").replace(".", "-.")


def dot_dash_decode(s):
    return s.replace("-.", ".").replace("--", "-")
```

And some examples:

```python
for example in (
    "hello",
    "hello.csv",
    "hello-and-so-on.csv",
    "hello-.csv",
    "hello--and--so--on-.csv",
    "hello.csv.",
    "hello.csv.-",
    "hello.csv.--",
):
    print(example)
    print(dot_dash_encode(example))
    print(example == dot_dash_decode(dot_dash_encode(example)))
    print()
```

Outputs:

```
hello
hello
True

hello.csv
hello-.csv
True

hello-and-so-on.csv
hello--and--so--on-.csv
True

hello-.csv
hello---.csv
True

hello--and--so--on-.csv
hello----and----so----on---.csv
True

hello.csv.
hello-.csv-.
True

hello.csv.-
hello-.csv-.--
True

hello.csv.--
hello-.csv-.----
True
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900709703 https://github.com/simonw/datasette/issues/1439#issuecomment-900709703 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r71H simonw 9599 2021-08-18T00:03:09Z 2021-08-18T00:03:09Z OWNER

But... what if I invent my own escaping scheme?

I actually did this once before, in https://github.com/simonw/datasette/commit/9fdb47ca952b93b7b60adddb965ea6642b1ff523 - while I was working on porting Datasette to ASGI in https://github.com/simonw/datasette/issues/272#issuecomment-494192779 because ASGI didn't yet have the raw_path mechanism.

I could bring that back - it looked like this:

    "table/and/slashes" => "tableU+002FandU+002Fslashes"
    "~table" => "U+007Etable"
    "+bobcats!" => "U+002Bbobcats!"
    "U+007Etable" => "UU+002B007Etable"

But I didn't particularly like it - it was quite verbose.

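For reference, here is a rough reconstruction of that kind of scheme. It is a sketch inferred only from the four examples above, not the code from the linked commit: the set of characters treated as safe and the function names are assumptions.

```python
import re

# Assumption: letters, digits and "_-.!" pass through; everything else,
# including "+", is written as U+XXXX (four uppercase hex digits).
_UNSAFE = re.compile(r"[^A-Za-z0-9_\-.!]")
_ESCAPED = re.compile(r"U\+([0-9A-F]{4})")


def uplus_encode(s):
    # Only handles characters up to U+FFFF; wider code points would need
    # a longer escape than this sketch supports.
    return _UNSAFE.sub(lambda m: "U+{:04X}".format(ord(m.group(0))), s)


def uplus_decode(s):
    # Single pass, so a decoded "+" is never re-read as part of an escape.
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), s)


for example in ("table/and/slashes", "~table", "+bobcats!", "U+007Etable"):
    print(example, "=>", uplus_encode(example))
```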
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900705226 https://github.com/simonw/datasette/issues/1439#issuecomment-900705226 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r6vK simonw 9599 2021-08-17T23:50:32Z 2021-08-17T23:50:47Z OWNER

An alternative solution would be to use some form of escaping for the characters that form the name of the table.

The obvious way to do this would be URL-encoding - but it doesn't hold for . characters. The hex for that is %2E but watch what happens with that in a URL:

```
# Against Cloud Run:
curl -s 'https://datasette.io/-/asgi-scope/foo/bar%2Fbaz%2E' | rg path
 'path': '/-/asgi-scope/foo/bar/baz.',
 'raw_path': b'/-/asgi-scope/foo/bar%2Fbaz.',
 'root_path': '',

# Against Vercel:
curl -s 'https://til.simonwillison.net/-/asgi-scope/foo/bar%2Fbaz%2E' | rg path
 'path': '/-/asgi-scope/foo/bar%2Fbaz%2E',
 'raw_path': b'/-/asgi-scope/foo/bar%2Fbaz%2E',
 'root_path': '',
```

Surprisingly in this case Vercel DOES keep it intact, but Cloud Run does not.

It's still no good though: I need a solution that works on Vercel, Cloud Run and every other potential hosting provider too.
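The same limitation shows up in Python's standard library, which is a small illustration of why percent-encoding alone cannot protect a dot. This only demonstrates `urllib.parse` behaviour and says nothing about what any particular host does.

```python
from urllib.parse import quote, unquote

# quote() never escapes ".", even with no extra characters marked safe.
print(quote("table.csv", safe=""))  # table.csv

# "/" is escaped, but the trailing "." is left as-is.
print(quote("bar/baz.", safe=""))  # bar%2Fbaz.

# Once a proxy decodes %2E, the two spellings are indistinguishable.
print(unquote("bar%2Fbaz%2E") == unquote("bar%2Fbaz."))  # True
```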

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  
900699670 https://github.com/simonw/datasette/issues/1439#issuecomment-900699670 https://api.github.com/repos/simonw/datasette/issues/1439 IC_kwDOBm6k_c41r5YW simonw 9599 2021-08-17T23:34:23Z 2021-08-17T23:34:23Z OWNER

The challenge comes down to telling the difference between the following:

  • /db/table - an HTML table page
  • /db/table.csv - the CSV version of /db/table
  • /db/table.csv - no this one is actually a database table called table.csv
  • /db/table.csv.csv - the CSV version of /db/table.csv
  • /db/table.csv.csv.csv and so on...
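To make the ambiguity concrete, here is a small sketch of a naive resolver that strips known extensions. The `interpretations` function and the example table set are hypothetical, purely for illustration.

```python
def interpretations(component, tables, formats=("csv", "json")):
    """List every (table, format) reading of a URL path component."""
    results = []
    # Reading 1: the whole component is a table name, served as HTML.
    if component in tables:
        results.append((component, None))
    # Reading 2: the component is a table name plus a format extension.
    for fmt in formats:
        suffix = "." + fmt
        if component.endswith(suffix) and component[: -len(suffix)] in tables:
            results.append((component[: -len(suffix)], fmt))
    return results


tables = {"table", "table.csv"}
# Two valid readings for the same URL, which is exactly the problem.
print(interpretations("table.csv", tables))
print(interpretations("table.csv.csv", tables))
```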
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Rethink how .ext formats (v.s. ?_format=) works before 1.0 973139047  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);
Powered by Datasette · Queries took 27.117ms · About: github-to-sqlite