github
rowid | body | issue_comments_fts | rank |
---|---|---|---|
338523957 | I also need to solve for weird primary keys. If it’s a single integer or a single char field that’s easy. But what if it is a compound key with more than one char field? What delimiter can I use that will definitely be safe? Let’s say I use hyphen. Now I need to find a durable encoding for any hyphens that might exist in the key fields themselves. How about I use URL-encoding for every non-alphanumeric character? That will turn hyphens into (I think) %2D. It should also solve for unicode characters, but it means the vast majority of keys (integers) will display neatly, including a compound key of e.g. 5678-345 | 841 | |
338524454 | Table rendering logic needs to detect the primary key field and turn it into a hyperlink. If there is a compound primary key it should add an extra column at the start of the table which displays the compound key as a link | 841 | |
338524857 | https://stackoverflow.com/a/14468878/6083 Looks like I should order by compound primary key and implement cursor-based pagination. | 841 | |
338526148 | https://github.com/ahupp/python-magic/blob/master/README.md | 841 | |
338530389 | This means I need a good solution for these compile time options while running in development mode | 841 | |
338530480 | How about when the service starts up it checks for a compile.json file and, if it is missing, creates it using the same code we run at compile time normally | 841 | |
338530704 | Needed by https://github.com/simonw/stateless-datasets/issues/4#issuecomment-338530389 | 841 | |
338531827 | Many of the applications I want to implement with this would benefit from having permanent real URLs. So let’s have both. The sha1 urls will serve far future cache headers (and an etag derived from their path). The non sha1 URLs will serve 302 uncached redirects to the sha1 locations. We will have a setting that lets people opt out of this behavior. | 841 | |
338697223 | Now returning this: { "error": "attempt to write a readonly database", "ok": false } | 841 | |
338768860 | I could use the table-reflow mechanism demonstrated here: http://demos.jquerymobile.com/1.4.3/table-reflow/ | 841 | |
338769538 | Maybe this should be handled by views instead? https://stateless-datasets-wreplxalgu.now.sh/ lists some views https://stateless-datasets-wreplxalgu.now.sh/?sql=select%20*%20from%20%22Order%20Subtotals%22 is an example showing the content of a view. What would the URL to views be? I don't think a view can share a name with a table, so the same URL scheme could work for both. | 841 | |
338789734 | URL design: /database/table.json - redirects to /database-6753f4a/table.json So we always redirect to the version with the truncated hash in the URL. | 841 | |
338797522 | https://stackoverflow.com/a/18134919/6083 is a good answer about how many characters of the hash are needed to be unique. I say we default to 7 characters, like git does - but allow extras to be configured. | 841 | |
338799438 | Can I take advantage of HTTP/2 so even if you get redirected I start serving you the correct resource straight away? | 841 | |
338804173 | Looks like the easiest way to implement HTTP/2 server push today is to run behind Cloudflare and use this: Link: </asset/to/push.js>; rel=preload; as=script https://blog.cloudflare.com/announcing-support-for-http-2-server-push-2/ Here's the W3C draft: https://w3c.github.io/preload/ From https://w3c.github.io/preload/#as-attribute it looks like I should use `as=fetch` if the content is intended for consumption by fetch() or XMLHTTPRequest. Unclear if I should throw `as=fetch crossorigin` in there. Need to experiment on that. | 841 | |
338806718 | Here's what the homepage of cloudflare.com does (with newlines added within the link header for clarity): $ curl -i 'https://www.cloudflare.com/' HTTP/1.1 200 OK Date: Mon, 23 Oct 2017 21:45:58 GMT Content-Type: text/html; charset=utf-8 Transfer-Encoding: chunked Connection: keep-alive link: </vendor/bitdashplayer.min.css>; rel=preload; as=style, </vendor/bitdash-controls.min.css>; rel=preload; as=style, </video/marketing-video/cloudflare-marketing-video.mpd>; rel=preload, </video/marketing-video/cloudflare-marketing-video.m3u8>; rel=preload, </video/marketing-video/video_0_800000/dash/init.mp4>; rel=preload; as=video, </video/marketing-video/audio_0_128000/dash/init.mp4>; rel=preload; as=video, </video/marketing-video/video_0_800000/dash/segment_0.m4s>; rel=preload; as=video, </video/marketing-video/audio_0_128000/dash/segment_0.m4s>; rel=preload; as=video, </video/new-short-optimized-22.mp4>; rel=preload; as=video, </video/marketing-video/poster.jpg>; rel=preload; as=image The original header looked like this: link: </vendor/bitdashplayer.min.css>; rel=preload; as=style, </vendor/bitdash-controls.min.css>; rel=preload; as=style, </video/marketing-video/cloudflare-marketing-video.mpd>; rel=preload, </video/marketing-video/cloudflare-marketing-video.m3u8>; rel=preload, </video/marketing-video/video_0_800000/dash/init.mp4>; rel=preload; as=video, </video/marketing-video/audio_0_128000/dash/init.mp4>; rel=preload; as=video, </video/marketing-video/video_0_800000/dash/segment_0.m4s>; rel=preload; as=video, </video/marketing-video/audio_0_128000/dash/segment_0.m4s>; rel=preload; as=video, </video/new-short-optimized-22.mp4>; rel=preload; as=video, </video/marketing-video/poster.jpg>; rel=preload; as=image | 841 | |
338834213 | If I can’t detect a primary key, I won’t provide a URL for those records | 841 | |
338852971 | I'm not going to bother with this. | 841 | |
338853083 | Fixed in 9d219140694551453bfa528e0624919eb065f9d6 | 841 | |
338854988 | /database-name/table-name?name__contains=simon&sort=id+desc Note that if there's a column called "sort" you can still do sort__exact=blah | 841 | |
338857568 | I can find the primary keys using: PRAGMA table_info(myTable) | 841 | |
338859620 | I’m going to implement everything in https://docs.djangoproject.com/en/1.11/ref/models/querysets/#field-lookups with the exception of range and the various date ones. | 841 | |
338859709 | I’m going to need to write unit tests for this, as this depends on #9 | 841 | |
338861511 | Some tables won't have primary keys, in which case I won't generate pages for individual records. | 841 | |
338863155 | I’m going to use py.test and start with all tests in a single tests.py module | 841 | |
338872286 | I'm going to use `,` as the separator between elements of a compound primary key. If those elements themselves include a comma I will use `%2C` in its place. | 841 | |
338882110 | Well, I've started it at least. | 841 | |
338882207 | Next step: generate links to these. | 841 | |
339003850 | As of b46e370ee6126aa2fa85cf789a31da38aed98496 this is done. | 841 | |
339019873 | Here's what I've got now: <img width="693" alt="localhost_8006_northwind-40d049b_categories_json" src="https://user-images.githubusercontent.com/9599/31950748-1bc85f3e-b891-11e7-8205-d89da6129f4b.png"> | 841 | |
339027711 | I have code to detect primary keys on tables... but what should I do for tables that lack primary keys? How should I even sort them? | 841 | |
339028979 | Looks like I can use the SQLite specific “rowid” in that case. It isn’t guaranteed to stay consistent across a VACUUM but that’s ok because we are immutable anyway. https://www.sqlite.org/lang_createtable.html#rowid | 841 | |
339138809 | May as well support most of https://sqlite.org/lang_expr.html | 841 | |
339186887 | Still to do: - [x] `gt`, `gte`, `lt`, `lte` - [x] `like` - [x] `glob` | 841 | |
339210353 | I'm going to call this one done for the moment. The date filters can go in a stretch goal. | 841 | |
339366612 | I had to manually set the content disposition header: return await response.file_stream( filepath, headers={ 'Content-Disposition': 'attachment; filename="{}"'.format(filepath) } ) In the next release of Sanic I can just use the filename= argument instead: https://github.com/channelcat/sanic/commit/07e95dba4f5983afc1e673df14bdd278817288aa | 841 | |
339382054 | Could this be as simple as using the iterative JSON encoder and adding a yield statement in between each chunk? | 841 | |
339388215 | First experiment: hook up an iterative CSV dump (just because that’s a tiny bit easier to get started with than iterative JSON). Have it execute a big select statement and then iterate through the result set 100 rows at a time using SQLite fetchmany() - also have it async sleep for a second in between each batch of 100. Can this work without needing Python threads? | 841 | |
339388771 | If this does work, I need to figure out what to do about the HTML view. Assuming I can iteratively produce JSON and CSV, what to do about HTML? One option: render the first 500 rows as HTML, then hand off to an infinite scroll experience that iteratively loads more rows as JSON. | 841 | |
339389105 | The gold standard here is to be able to serve up increasingly large datasets without blocking the event loop and while using a sustainable amount of RAM | 841 | |
339389328 | Ideally we can get some serious gains from the fact that our database file is opened with the immutable option. | 841 | |
339395551 | Simplest implementation will be to create a temporary directory somewhere, copy in a Dockerfile and the databases and run “now” in it. Ideally I can use symlinks rather than copying potentially large database files around. | 841 | |
339406634 | It certainly looks like some of the stuff in https://sqlite.org/pragma.html could be used to screw around with things. Example: `PRAGMA case_sensitive_like = 1` - would that affect future queries? | 841 | |
339413825 | Could I use https://sqlparse.readthedocs.io/en/latest/ to parse incoming statements and ensure they are pure SELECTs? Would that prevent people from using a compound SELECT statement to trigger an evil PRAGMA of some sort? | 841 | |
339420462 | https://sitesforprofit.com/responsive-table-plugins-and-patterns has some useful links. I really like the pattern from https://css-tricks.com/responsive-data-tables/ /* Max width before this PARTICULAR table gets nasty This query will take effect for any screen smaller than 760px and also iPads specifically. */ @media only screen and (max-width: 760px), (min-device-width: 768px) and (max-device-width: 1024px) { /* Force table to not be like tables anymore */ table, thead, tbody, th, td, tr { display: block; } /* Hide table headers (but not display: none;, for accessibility) */ thead tr { position: absolute; top: -9999px; left: -9999px; } tr { border: 1px solid #ccc; } td { /* Behave like a "row" */ border: none; border-bottom: 1px solid #eee; position: relative; padding-left: 50%; } td:before { /* Now like a table header */ position: absolute; /* Top/left values mimic padding */ top: 6px; left: 6px; width: 45%; padding-right: 10px; white-space: nowrap; } /* Label the data */ td:nth-of-type(1):before { content: "First Name"; } td:nth-of-type(2):before { content: "Last Name"; } td:nth-of-type(3):before { content: "Job Title"; } td:nth-of-type(4):before { content: "Favorite Color"; } td:nth-of-type(5):before { content: "Wars of Trek?"; } td:nth-of-type(6):before { content: "Porn Name"; } td:nth-of-type(7):before { content: "Date of Birth"; } td:nth-of-type(8):before { content: "Dream Vacation City"; } td:nth-of-type(9):before { content: "GPA"; } td:nth-of-type(10):before { conten… | 841 | |
339510770 | It looks like I should double quote my columns and ensure they are correctly escaped https://blog.christosoft.de/2012/10/sqlite-escaping-table-acolumn-names/ - hopefully using ? placeholders for column names will work. I should use ? for tables too. | 841 | |
339514819 | I’m going to have a single command-line app that does everything. Name to be decided - options include dataset, stateless, datasite (I quite like that - it reflects SQLite and the fact that you create a website) | 841 | |
339515822 | datasite . - starts web app in current directory, serving all DB files datasite . -p 8001 - serves on custom port datasite blah.db blah2.db - serves specified files You can’t specify more than one directory. You can specify as many files as you like. If you specify two files with different paths but the same name then they must be accessed by hash. datasite publish . - publishes current directory to the internet! Uses now by default, if it detects it on your path. Other publishers will eventually be added as plugins. datasite publish http://path-to-db.db - publishes a DB available at a URL. Works by constructing the Dockerfile with wget calls in it. datasite blah.db -m metadata.json If you specify a directory it looks for metadata.json in that directory. Otherwise you can pass an explicit metadata file path with -m or --metadata | 841 | |
339516032 | Another potential name: datapi | 841 | |
339517846 | I’m going to use Click for this http://nvie.com/posts/writing-a-cli-in-python-in-under-60-seconds/ https://kushaldas.in/posts/building-command-line-tools-in-python-with-click.html | 841 | |
339724700 | Here’s how to make the “serve” subcommand the default if it is called with no arguments: @click.group(invoke_without_command=True) def serve(): # ... | 841 | |
339866724 | <img width="624" alt="databases" src="https://user-images.githubusercontent.com/9599/32087550-4e10b8a4-ba91-11e7-992b-e628d77114d0.png"> | 841 | |
339891755 | Deploys to Now aren't working at the moment - they aren't showing the uploaded databases, because I've broken the path handling somehow. I need to do a bit more work here. | 841 | |
340561577 | http://the-hitchhikers-guide-to-packaging.readthedocs.io/en/latest/quickstart.html describes how to package this for PyPI | 841 | |
340787868 | Here’s how I can (I think) provide safe execution of arbitrary SQL while blocking PRAGMA calls: let people use named parameters in their SQL and apply strict filtering to the SQL query but not to the parameter values. cur.execute( "select * from people where name_last=:who and age=:age", { "who": who, "age": age }) In URL form: ?sql=select...&who=Terry&age=34 Now we can apply strict, dumb validation rules to the SQL part while allowing anything in the named parameters - so people can execute a search for PRAGMA without being able to execute a PRAGMA statement. | 841 | |
341938424 | Done: https://github.com/simonw/stateless-datasets/commit/edaa10587e60946e0c1935333f6b79553db33798 | 841 | |
341945420 | To simplify things a bit, I'm going to require that every database is explicitly listed in the command line. I won't support "serve everything in this directory" for the moment. | 841 | |
342030075 | ... I tried that, I don't like it. I'm going to bring back "directory serving" by allowing you to pass a directory as an argument to `datasite` (including `datasite .`). I may even make `.` the default if you don't provide anything at all. | 841 | |
342032943 | Default look with Bootstrap 4 looks like this: <img width="871" alt="flights" src="https://user-images.githubusercontent.com/9599/32423242-1b4458ce-c25a-11e7-910f-2dc1de909b8f.png"> | 841 | |
342484889 | I’m going to call this feature “count values” | 841 | |
342521344 | GDS Registries could be fun too: https://registers.cloudapps.digital/ | 841 | |
343164111 | Implemented in 31b21f5c5e15fc3acab7fabb170c1da71dc3c98c | 841 | |
343168796 | Won't fix: ujson is not compatible with the custom JSON encoder I'm using here: https://github.com/simonw/immutabase/blob/b2dee11fcd989d9e2a7bf4de1e23dbc320c05013/immutabase/app.py#L401-L416 | 841 | |
343237982 | More terms: * publish * share * docker * host * stateless I want to capture the idea of publishing an immutable database in a stateless container. | 841 | |
343238262 | The name should ideally be available on PyPI and should make sense as both a command line application and a library. | 841 | |
343239062 | This looks promising: https://github.com/esnme/ultrajson/issues/124#issuecomment-323882878 | 841 | |
343266326 | http://sanic.readthedocs.io/en/latest/sanic/testing.html | 841 | |
343281876 | How about datasette? | 841 | |
343551356 | I'm going with datasette. | 841 | |
343557070 | https://file.io/ looks like it could be good for this. It's been around since 2015, and lets you upload a temporary file which can be downloaded once. $ curl -s -F "file=@database.db" "https://file.io/?expires=1d" {"success":true,"key":"ySrl1j","link":"https://file.io/ySrl1j","expiry":"1 day"} Downloading from that URL serves up the data with a `Content-disposition` header containing the filename: simonw$ curl -vv https://file.io/ySrl1j | more % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 34.232.1.167... * Connected to file.io (34.232.1.167) port 443 (#0) * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 * Server certificate: file.io * Server certificate: Amazon * Server certificate: Amazon Root CA 1 * Server certificate: Starfield Services Root Certificate Authority - G2 > GET /ySrl1j HTTP/1.1 > Host: file.io > User-Agent: curl/7.43.0 > Accept: */* > < HTTP/1.1 200 OK < Date: Fri, 10 Nov 2017 18:14:38 GMT < Content-Type: undefined < Transfer-Encoding: chunked < Connection: keep-alive < X-Powered-By: Express < X-RateLimit-Limit: 5 < X-RateLimit-Remaining: 4 < Access-Control-Allow-Origin: * < Access-Control-Allow-Headers: Cache-Control,X-reqed-With,x-requested-with < Content-disposition: attachment; filename=database.db ... | 841 | |
343581130 | I'm going to handle this a different way. I'm going to support a local history of your own queries stored in localStorage, but if you want to share a query you have to do it with a URL. If people really want canned query support, they can do that using custom templates - see #12 - or by adding views to their database before they publish it. | 841 | |
343581332 | I'm not going to use Sanic's mechanism for this. I'll use arguments passed to my cli instead. | 841 | |
343643332 | Here's what a table looks like now at a smaller screen size: <img width="496" alt="parlgov-development__info_data_source" src="https://user-images.githubusercontent.com/9599/32686818-7ffa691e-c662-11e7-964d-201fce0694e4.png"> | 841 | |
343644891 | I can detect something is a view like this: SELECT name from sqlite_master WHERE type ='view'; | 841 | |
343644976 | Simplest version of this: 1. Create a temporary directory 2. Write a Dockerfile into it that pulls an image and pip installs datasette 3. Add symlinks to the DBs they listed (so we don't have to copy them) 4. Shell out to "now" 5. Done! | 841 | |
343645249 | Doing this works: import os os.link('/tmp/databases/northwind.db', '/tmp/tmp-blah/northwind.db') That creates a link in tmp-blah - and then when I delete that entire directory like so: import shutil shutil.rmtree('/tmp/tmp-blah') The original database is not deleted, just the link. | 841 | |
343645327 | I can create the temporary directory like so: import tempfile t = tempfile.TemporaryDirectory() t <TemporaryDirectory '/var/folders/w9/0xm39tk94ng9h52g06z4b54c0000gp/T/tmpkym70wlp'> t.name '/var/folders/w9/0xm39tk94ng9h52g06z4b54c0000gp/T/tmpkym70wlp' And then to delete it all: t.cleanup() | 841 | |
343646740 | I'm happy with this now that I've implemented the publish command in #26 | 841 | |
343647102 | http://2016.padjo.org/tutorials/data-primer-census-acs1-demographics/ has a sqlite database: http://2016.padjo.org/files/data/starterpack/census-acs-1year/acs-1-year-2015.sqlite I tested this by deploying it here: https://datasette-fewuggrvwr.now.sh/ | 841 | |
343647300 | Still needed: - [ ] A link to the homepage from some kind of navigation bar in the header - [ ] link to github.com/simonw/datasette in the footer - [ ] Slightly better titles (maybe ditch the visited link colours for titles only? should keep those for primary key links) - [ ] Links to the .json and .jsono versions of every view | 841 | |
343675165 | The plugin system can also allow alternative providers for the `publish` command - e.g. maybe hook up hyper.sh as an option for publishing containers. | 841 | |
343676574 | See also #14 | 841 | |
343683566 | I’m going to solve this by making it an optional argument you can pass to the serve command. Then the Dockerfile can still build and use it but it won’t interfere with tests or dev. If argument is not passed, we will calculate hashes on startup and calculate table row counts on demand. | 841 | |
343690060 | "parlgov-development.db": { "url": "http://www.parlgov.org/" }, "nhsadmin.sqlite": { "url": "https://github.com/psychemedia/openHealthDataDoodles" } | 841 | |
343691342 | Closing this, opening a fresh ticket for the navigation stuff. | 841 | |
343697291 | I'm going to bundle sql and sql_params together into a query nested object like this: { "query": { "sql": "select ...", "params": { "p0": "blah" } } } | 841 | |
343698214 | I'm closing #50 - more tests will be added in the future, but the framework is neatly in place for them now. | 841 | |
343699115 | This needs to incorporate a sensible way of presenting custom SQL query results too. And let's get a textarea in there for executing SQL while we're at it. | 841 | |
343705966 | https://github.com/fivethirtyeight/data has a ton of CSVs | 841 | |
343707624 | Split the SQL thing out into #65 | 841 | |
343707676 | Here's the new design: <img width="691" alt="parlgov-development" src="https://user-images.githubusercontent.com/9599/32695161-82821226-c708-11e7-835c-b3d91850b2e0.png"> Also lists views at the bottom (refs #54): <img width="345" alt="parlgov-development" src="https://user-images.githubusercontent.com/9599/32695164-99efa7de-c708-11e7-8272-bc5f5b870b84.png"> | 841 | |
343708447 | I ditched the metadata file concept. | 841 | |
343709217 | <img width="982" alt="nhsadmin" src="https://user-images.githubusercontent.com/9599/32695392-3ea12612-c70f-11e7-873b-9e6ad2c869e7.png"> | 841 | |
343715915 | con = sqlite3.connect('existing_db.db') with open('dump.sql', 'w') as f: for line in con.iterdump(): f.write('%s\n' % line) | 841 | |
343752404 | Re-opening this - I've decided to bring back this concept, see #68 | 841 | |
343752579 | By default I'll allow LIMIT and OFFSET up to a maximum of X (where X is let's say 50,000 to start with, but can be custom configured to a larger number or set to None for no limit). | 841 | |
343752683 | Maybe SQL views should have their own Sanic view class (`ViewView` is kinda funny), subclassed from `TableView`? | 841 | |
343753999 | For initial launch, I could just support this as some optional command line arguments you pass to the publish command: datasette publish data.db --title="Title" --source="url" | 841 | |
343754058 | I’m going to store this stuff in a file called metadata.json and move the existing automatically generated metadata to a file called build.json | 841 | |
343769692 | I have created a Docker Hub public repository for this: https://hub.docker.com/r/simonwillison/datasette/ | 841 | |
343780039 | I think the only safe way to do this is using SQLite `.fetchmany(1000)` - I can't guarantee that the user has not entered SQL that will outfox a limit in some way. So instead of attempting to edit their SQL, I'll always return 1001 records and let them know if they went over 1000 or not. | 841 |
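The compound-key encoding settled on in comments 338523957 and 338872286 — a `,` separator, with `%2C` standing in for any literal commas inside key values — can be sketched in Python. The helper names here are hypothetical:

```python
import urllib.parse

def encode_compound_key(*parts):
    # Percent-encode each element so a literal "," (or "%") inside a key
    # value cannot collide with the "," used as the separator.
    return ",".join(urllib.parse.quote_plus(str(p)) for p in parts)

def decode_compound_key(encoded):
    return [urllib.parse.unquote_plus(p) for p in encoded.split(",")]

print(encode_compound_key(5678, 345))     # 5678,345
print(encode_compound_key("a,b", "c-d"))  # a%2Cb,c-d
```

As the comments hoped, plain integer keys pass through unchanged, so the common case stays readable in URLs.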
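The cursor-based pagination idea from comment 338524857 — order by the compound primary key and use the last-seen key as the cursor — is a keyset query. A minimal sketch, assuming a SQLite new enough (3.15+) to support row-value comparisons:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, PRIMARY KEY (a, b))")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(a, b) for a in range(3) for b in range(3)],
)

def page(after, size=4):
    # Row-value comparison keeps the ordering correct across both
    # components of the compound key.
    return conn.execute(
        "SELECT a, b FROM t WHERE (a, b) > (?, ?) ORDER BY a, b LIMIT ?",
        (*after, size),
    ).fetchall()

first = page((-1, -1))
print(first)            # [(0, 0), (0, 1), (0, 2), (1, 0)]
print(page(first[-1]))  # [(1, 1), (1, 2), (2, 0), (2, 1)]
```

Unlike OFFSET-based paging, each page costs the same regardless of how deep into the table it is.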
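The `PRAGMA table_info` approach from comment 338857568, plus the `rowid` fallback from comment 339028979, might look like this (a sketch: PRAGMA arguments cannot be bound as parameters, so the table name must come from a trusted source):

```python
import sqlite3

def primary_key_columns(conn, table):
    # PRAGMA table_info returns one row per column: (cid, name, type,
    # notnull, dflt_value, pk). pk is non-zero for key columns and gives
    # their position within a compound key.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [name for _, name in sorted((r[5], r[1]) for r in rows if r[5])]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (a TEXT, b TEXT, c TEXT, PRIMARY KEY (a, b))")
conn.execute("CREATE TABLE no_pk (x TEXT)")
print(primary_key_columns(conn, "compound"))  # ['a', 'b']
print(primary_key_columns(conn, "no_pk"))     # [] -> fall back to rowid
```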
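The URL scheme from comments 338531827 and 338797522 — a SHA-1 of the database contents, truncated git-style to a default of 7 characters — could be computed like this (function name is hypothetical):

```python
import hashlib
import os
import tempfile

def truncated_hash(path, length=7):
    # SHA-1 of the file contents, shortened like a git commit hash;
    # the comment suggests making `length` configurable.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()[:length]

# Demo against a throwaway file with known contents
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
digest = truncated_hash(path)
os.remove(path)
print(digest)  # aaf4c61
```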
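The named-parameter scheme from comment 340787868 — strict validation of the SQL string, none at all for the bound values — is straightforward with Python's `sqlite3` driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name_last TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)", [("Gilliam", 77), ("Terry", 34)]
)

sql = "select * from people where name_last = :who and age = :age"
# Validate the SQL string strictly (a real implementation would do far
# more than this prefix check). The values need no filtering: the driver
# binds them and never interpolates them into the SQL text.
assert sql.strip().lower().startswith("select")
rows = conn.execute(sql, {"who": "Terry", "age": 34}).fetchall()
print(rows)  # [('Terry', 34)]
```

This is why a user can *search for* the string "PRAGMA" without ever being able to *execute* a PRAGMA statement.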
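Detecting views, as in comment 343644891 — they live in `sqlite_master` alongside tables, distinguished by the `type` column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, subtotal REAL)")
conn.execute("CREATE VIEW order_subtotals AS SELECT id, subtotal FROM orders")

# type = 'view' filters out tables, indexes and triggers
views = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'"
)]
print(views)  # ['order_subtotals']
```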
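The publish flow from comments 343644976, 343645249 and 343645327 — temporary build directory, hard links instead of copies — fits in a few lines. The Dockerfile contents are only illustrative, and `os.link` requires both paths to be on the same filesystem:

```python
import os
import tempfile

# A stand-in database file for the demo
db_fd, db_path = tempfile.mkstemp(suffix=".db")
os.close(db_fd)

build_dir = tempfile.TemporaryDirectory()
# Hard-link the database into the build directory instead of copying it
os.link(db_path, os.path.join(build_dir.name, "northwind.db"))
with open(os.path.join(build_dir.name, "Dockerfile"), "w") as f:
    f.write("FROM python:3\nRUN pip install datasette\n")
# ... shell out to `now` inside build_dir.name here ...

build_dir.cleanup()
# Removing the directory deletes only the link; the original survives
assert os.path.exists(db_path)
os.remove(db_path)
```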
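Finally, the "fetch one more row than the limit" trick from comment 343780039: since arbitrary user SQL can't safely be rewritten to add a LIMIT, ask the cursor for limit + 1 rows and treat the extra row as proof of truncation.

```python
import sqlite3

MAX_RETURNED_ROWS = 1000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (n INTEGER)")
conn.executemany("INSERT INTO big VALUES (?)", [(i,) for i in range(1500)])

# Getting MAX_RETURNED_ROWS + 1 rows back proves the result set was
# truncated, without any attempt to edit the user's SQL.
rows = conn.execute("SELECT n FROM big").fetchmany(MAX_RETURNED_ROWS + 1)
truncated = len(rows) > MAX_RETURNED_ROWS
rows = rows[:MAX_RETURNED_ROWS]
print(len(rows), truncated)  # 1000 True
```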