home / github

Menu
  • Search all tables
  • GraphQL API

issue_comments

Table actions
  • GraphQL API for issue_comments

10 rows where issue = 447469253 sorted by updated_at descending

✖
✖

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: created_at (date), updated_at (date)

user 1

  • simonw 10

issue 1

  • Improvements to table label detection · 10 ✖

author_association 1

  • OWNER 10
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1264769569 https://github.com/simonw/datasette/issues/485#issuecomment-1264769569 https://api.github.com/repos/simonw/datasette/issues/485 IC_kwDOBm6k_c5LYtoh simonw 9599 2022-10-03T00:04:42Z 2022-10-03T00:04:42Z OWNER

I love these tips - tools that can compile a simple machine learning model to a SQL query! Would be pretty cool if I could bundle a model in Datasette itself as a big in-memory SQLite SQL query:

  • https://github.com/Chryzanthemum/xgb2sql
  • https://github.com/konstantint/SKompiler
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
1264737290 https://github.com/simonw/datasette/issues/485#issuecomment-1264737290 https://api.github.com/repos/simonw/datasette/issues/485 IC_kwDOBm6k_c5LYlwK simonw 9599 2022-10-02T21:29:59Z 2022-10-02T21:29:59Z OWNER

To clarify: the feature this issue is talking about relates to the way Datasette automatically displays foreign key relationships, for example on this page: https://github-to-sqlite.dogsheep.net/github/commits

Each of those columns is a foreign key to another table. The link text that is displayed there comes from the "label column" that has either been configured or automatically detected for that other table.

I wonder if this could be handled with a tiny machine learning model that's trained to help pick the best label column?

Inputs to that model could include:

  • The names of the columns
  • The number of unique values in each column
  • The type of each column (or maybe only TEXT columns should be considered)
  • How many null values there are
  • Is the column marked as unique?
  • What's the average (or median or some other statistic) string length of values in each column?

Output would be the most likely label column, or some indicator that no likely candidates had been found.

My hunch is that this would be better solved using a few extra heuristics rather than by training a model, but it does feel like an interesting opportunity to experiment with a tiny ML model.

Asked for tips about this on Twitter: https://twitter.com/simonw/status/1576680930680262658

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
497116074 https://github.com/simonw/datasette/issues/485#issuecomment-497116074 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NzExNjA3NA== simonw 9599 2019-05-29T21:29:16Z 2019-05-29T21:29:16Z OWNER

Another good rule of thumb: look for text fields with a unique constraint?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
496367866 https://github.com/simonw/datasette/issues/485#issuecomment-496367866 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NjM2Nzg2Ng== simonw 9599 2019-05-28T05:14:06Z 2019-05-28T05:14:06Z OWNER

I'm going to generate statistics for every TEXT column.

Any column with more than 90% distinct rows (compared to the total count of rows) will be a candidate for the label.

I will then pick the candidate column with the shortest average length.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
496283728 https://github.com/simonw/datasette/issues/485#issuecomment-496283728 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NjI4MzcyOA== simonw 9599 2019-05-27T18:44:07Z 2019-05-27T18:44:07Z OWNER

This code now lives in a method on the new datasette.database.Database class, which should make it easier to write unit tests for.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
496039483 https://github.com/simonw/datasette/issues/485#issuecomment-496039483 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NjAzOTQ4Mw== simonw 9599 2019-05-26T23:22:53Z 2019-05-26T23:22:53Z OWNER

Comparing these two SQL queries (the one with union and the one without) using explain:

With union: https://latest.datasette.io/fixtures?sql=explain+select+%27name%27+as+column%2C+count+%28distinct+name%29+as+count_distinct%2C+avg%28length%28name%29%29+as+avg_length+from+roadside_attractions%0D%0A++union%0D%0Aselect+%27address%27+as+column%2C+count%28distinct+address%29+as+count_distinct%2C+avg%28length%28address%29%29+as+avg_length+from+roadside_attractions produces 52 rows

Without union: https://latest.datasette.io/fixtures?sql=explain+select%0D%0A++count+(distinct+name)+as+count_distinct_column_1%2C%0D%0A++avg(length(name))+as+avg_length_column_1%2C%0D%0A++count(distinct+address)+as+count_distinct_column_2%2C%0D%0A++avg(length(address))+as+avg_length_column_2%0D%0Afrom+roadside_attractions produces 32 rows

So I'm going to use the one without the union.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
496039267 https://github.com/simonw/datasette/issues/485#issuecomment-496039267 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NjAzOTI2Nw== simonw 9599 2019-05-26T23:19:38Z 2019-05-26T23:20:10Z OWNER

Thinking about that union query: I imagine doing this with union could encourage multiple full table scans. Maybe this query would only do one? https://latest.datasette.io/fixtures?sql=select%0D%0A++count+%28distinct+name%29+as+count_distinct_column_1%2C%0D%0A++avg%28length%28name%29%29+as+avg_length_column_1%2C%0D%0A++count%28distinct+address%29+as+count_distinct_column_2%2C%0D%0A++avg%28length%28address%29%29+as+avg_length_column_2%0D%0Afrom+roadside_attractions

select count (distinct name) as count_distinct_column_1, avg(length(name)) as avg_length_column_1, count(distinct address) as count_distinct_column_2, avg(length(address)) as avg_length_column_2 from roadside_attractions

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
495085021 https://github.com/simonw/datasette/issues/485#issuecomment-495085021 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NTA4NTAyMQ== simonw 9599 2019-05-23T06:27:57Z 2019-05-26T23:15:51Z OWNER

I could attempt to calculate the statistics needed for this in a time limited SQL query something like this one: https://latest.datasette.io/fixtures?sql=select+%27name%27+as+column%2C+count+%28distinct+name%29+as+count_distinct%2C+avg%28length%28name%29%29+as+avg_length+from+roadside_attractions%0D%0A++union%0D%0Aselect+%27address%27+as+column%2C+count%28distinct+address%29+as+count_distinct%2C+avg%28length%28address%29%29+as+avg_length+from+roadside_attractions

select 'name' as column, count (distinct name) as count_distinct, avg(length(name)) as avg_length from roadside_attractions union select 'address' as column, count(distinct address) as count_distinct, avg(length(address)) as avg_length from roadside_attractions

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
496038601 https://github.com/simonw/datasette/issues/485#issuecomment-496038601 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NjAzODYwMQ== simonw 9599 2019-05-26T23:08:41Z 2019-05-26T23:08:41Z OWNER

The code currently assumes the primary key is called "id" or "pk" - improving it to detect the primary key using database introspection should work much better.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  
495083670 https://github.com/simonw/datasette/issues/485#issuecomment-495083670 https://api.github.com/repos/simonw/datasette/issues/485 MDEyOklzc3VlQ29tbWVudDQ5NTA4MzY3MA== simonw 9599 2019-05-23T06:21:52Z 2019-05-23T06:22:36Z OWNER

If a table has more than two columns we could do a betterl job at guessing the label column. A few potential tricks:

  • look for a column called name or title
  • look for the first column of type text
  • check for the text column with the most diversity in values
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Improvements to table label detection  447469253  

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);
Powered by Datasette · Queries took 36.121ms · About: github-to-sqlite
  • Sort ascending
  • Sort descending
  • Facet by this
  • Hide this column
  • Show all columns
  • Show not-blank rows