issue_comments

26 rows where author_association = "OWNER", "created_at" is on date 2021-02-14 and "updated_at" is on date 2021-02-14 sorted by updated_at descending

issue 8

  • --sniff option for sniffing delimiters 8
  • --no-headers option for CSV and TSV 7
  • Hitting `_csv.Error: field larger than field limit (131072)` 3
  • fix for problem in Table.insert_all on search for columns per chunk of rows 2
  • Error reading csv files with large column data 2
  • limit=X, offset=Y parameters for more Python methods 2
  • Add fts offset docs. 1
  • .insert_all() fails if subsequent chunks contain additional columns 1

user 1

  • simonw 26

author_association 1

  • OWNER 26
id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
778854808 https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778854808 https://api.github.com/repos/simonw/sqlite-utils/issues/227 MDEyOklzc3VlQ29tbWVudDc3ODg1NDgwOA== simonw 9599 2021-02-14T22:46:54Z 2021-02-14T22:46:54Z OWNER

Fix is released in 3.5.

{
    "total_count": 1,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 1,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Error reading csv files with large column data 807174161  
778851721 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778851721 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODg1MTcyMQ== simonw 9599 2021-02-14T22:23:46Z 2021-02-14T22:23:46Z OWNER

I called this --no-headers for consistency with the existing output option: https://github.com/simonw/sqlite-utils/blob/427dace184c7da57f4a04df07b1e84cdae3261e8/sqlite_utils/cli.py#L61-L64

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  
778849394 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778849394 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODg0OTM5NA== simonw 9599 2021-02-14T22:06:53Z 2021-02-14T22:06:53Z OWNER

For the moment I think just adding --no-header - which causes column names "unknown1,unknown2,..." to be used - should be enough.

Users can import with that option, then use sqlite-utils transform --rename to rename them.
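
A minimal sketch of that workflow using the Python API (the table and column names here are hypothetical):

```python
import sqlite_utils

db = sqlite_utils.Database("data.db")
# After importing with generated unknown1/unknown2 column names:
db["data"].transform(rename={"unknown1": "name", "unknown2": "score"})
```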

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  
778844016 https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778844016 https://api.github.com/repos/simonw/sqlite-utils/issues/229 MDEyOklzc3VlQ29tbWVudDc3ODg0NDAxNg== simonw 9599 2021-02-14T21:22:45Z 2021-02-14T21:22:45Z OWNER

I'm going to use this pattern from https://stackoverflow.com/a/15063941:

```python
import sys
import csv

maxInt = sys.maxsize

while True:
    # decrease the maxInt value by factor 10
    # as long as the OverflowError occurs
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Hitting `_csv.Error: field larger than field limit (131072)` 807817197  
778843503 https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843503 https://api.github.com/repos/simonw/sqlite-utils/issues/229 MDEyOklzc3VlQ29tbWVudDc3ODg0MzUwMw== simonw 9599 2021-02-14T21:18:51Z 2021-02-14T21:18:51Z OWNER

I want to set this to the maximum allowed limit, which seems to be surprisingly hard! That StackOverflow thread is full of ideas, many of them involving ctypes. I'm a bit loath to add a dependency on ctypes though - even though it's in the Python standard library, I worry that it might not be available on some architectures.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Hitting `_csv.Error: field larger than field limit (131072)` 807817197  
778843362 https://github.com/simonw/sqlite-utils/issues/229#issuecomment-778843362 https://api.github.com/repos/simonw/sqlite-utils/issues/229 MDEyOklzc3VlQ29tbWVudDc3ODg0MzM2Mg== simonw 9599 2021-02-14T21:17:53Z 2021-02-14T21:17:53Z OWNER

Same issue as #227.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Hitting `_csv.Error: field larger than field limit (131072)` 807817197  
778811746 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811746 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODgxMTc0Ng== simonw 9599 2021-02-14T17:39:30Z 2021-02-14T21:16:54Z OWNER

I'm going to detach this from the #131 column types idea.

The three things I need to handle here are:

  • The CSV file doesn't have a header row at all, so I need to specify what the column names should be
  • The CSV file DOES have a header row but I want to ignore it and use alternative column names
  • The CSV doesn't have a header row at all and I want to automatically use unknown1,unknown2... so I can start exploring it as quickly as possible.

Here's a potential design that covers the first two:

  • --replace-header="foo,bar,baz" - ignore whatever is in the first row and pretend it was this instead
  • --add-header="foo,bar,baz" - add a first row with these details, to use as the header

It doesn't cover the "give me unknown column names" case though.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  
778843086 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778843086 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODg0MzA4Ng== simonw 9599 2021-02-14T21:15:43Z 2021-02-14T21:15:43Z OWNER

I'm not convinced the .has_header() rules are useful for the kind of CSV files I work with: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L383

```python
def has_header(self, sample):
    # Creates a dictionary of types of data in each column. If any
    # column is of a single type (say, integers), *except* for the first
    # row, then the first row is presumed to be labels. If the type
    # can't be determined, it is assumed to be a string in which case
    # the length of the string is the determining factor: if all of the
    # rows except for the first are the same length, it's a header.
    # Finally, a 'vote' is taken at the end for each column, adding or
    # subtracting from the likelihood of the first row being a header.
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  
778842982 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778842982 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODg0Mjk4Mg== simonw 9599 2021-02-14T21:15:11Z 2021-02-14T21:15:11Z OWNER

Implementation tip: I have code that reads the first row and uses it as headers here: https://github.com/simonw/sqlite-utils/blob/8f042ae1fd323995d966a94e8e6df85cc843b938/sqlite_utils/cli.py#L689-L691

So if I want to use unknown1,unknown2... I can do that by reading the first row, counting the number of columns, generating headers based on that count and then continuing to build that generator (maybe with itertools.chain() to replay the record we already read).
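
A rough sketch of that idea (not the actual implementation; the function name is illustrative):

```python
import csv
import itertools

def rows_with_generated_headers(fp):
    reader = csv.reader(fp)
    first_row = next(reader)  # consume one record to count the columns
    headers = ["unknown{}".format(i + 1) for i in range(len(first_row))]
    # Replay the consumed record ahead of the remaining rows
    return headers, itertools.chain([first_row], reader)
```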

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  
778841704 https://github.com/simonw/sqlite-utils/issues/227#issuecomment-778841704 https://api.github.com/repos/simonw/sqlite-utils/issues/227 MDEyOklzc3VlQ29tbWVudDc3ODg0MTcwNA== simonw 9599 2021-02-14T21:05:20Z 2021-02-14T21:05:20Z OWNER

This has also been reported in #229.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Error reading csv files with large column data 807174161  
778841547 https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778841547 https://api.github.com/repos/simonw/sqlite-utils/issues/225 MDEyOklzc3VlQ29tbWVudDc3ODg0MTU0Nw== simonw 9599 2021-02-14T21:04:13Z 2021-02-14T21:04:13Z OWNER

I added a test and fixed this in #234 - thanks for the fix.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
fix for problem in Table.insert_all on search for columns per chunk of rows 797159961  
778841278 https://github.com/simonw/sqlite-utils/issues/234#issuecomment-778841278 https://api.github.com/repos/simonw/sqlite-utils/issues/234 MDEyOklzc3VlQ29tbWVudDc3ODg0MTI3OA== simonw 9599 2021-02-14T21:02:11Z 2021-02-14T21:02:11Z OWNER

I managed to replicate this in a test:

```python
def test_insert_all_with_extra_columns_in_later_chunks(fresh_db):
    chunk = [
        {"record": "Record 1"},
        {"record": "Record 2"},
        {"record": "Record 3"},
        {"record": "Record 4", "extra": 1},
    ]
    fresh_db["t"].insert_all(chunk, batch_size=2, alter=True)
    assert list(fresh_db["t"].rows) == [
        {"record": "Record 1", "extra": None},
        {"record": "Record 2", "extra": None},
        {"record": "Record 3", "extra": None},
        {"record": "Record 4", "extra": 1},
    ]
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
.insert_all() fails if subsequent chunks contain additional columns 808046597  
778834504 https://github.com/simonw/sqlite-utils/pull/225#issuecomment-778834504 https://api.github.com/repos/simonw/sqlite-utils/issues/225 MDEyOklzc3VlQ29tbWVudDc3ODgzNDUwNA== simonw 9599 2021-02-14T20:09:30Z 2021-02-14T20:09:30Z OWNER

Thanks for this. I'm going to try to get the test suite to run on Windows on GitHub Actions.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
fix for problem in Table.insert_all on search for columns per chunk of rows 797159961  
778829456 https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778829456 https://api.github.com/repos/simonw/sqlite-utils/issues/231 MDEyOklzc3VlQ29tbWVudDc3ODgyOTQ1Ng== simonw 9599 2021-02-14T19:37:52Z 2021-02-14T19:37:52Z OWNER

I'm going to add limit and offset to the following methods (see the sketch after this list):

  • rows_where()
  • search_sql()
  • search()
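
A minimal sketch of what that could look like (simplified; not the library's actual code):

```python
def add_limit_and_offset(sql, limit=None, offset=None):
    # Append limit/offset clauses to an existing SELECT statement
    if limit is not None:
        sql += " limit {}".format(int(limit))
    if offset is not None:
        sql += " offset {}".format(int(offset))
    return sql
```
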
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
limit=X, offset=Y parameters for more Python methods 808028757  
778828758 https://github.com/simonw/sqlite-utils/issues/231#issuecomment-778828758 https://api.github.com/repos/simonw/sqlite-utils/issues/231 MDEyOklzc3VlQ29tbWVudDc3ODgyODc1OA== simonw 9599 2021-02-14T19:33:14Z 2021-02-14T19:33:14Z OWNER

The limit= parameter is currently only available on the .search() method - it would make sense to add this to other methods as well.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
limit=X, offset=Y parameters for more Python methods 808028757  
778828495 https://github.com/simonw/sqlite-utils/pull/224#issuecomment-778828495 https://api.github.com/repos/simonw/sqlite-utils/issues/224 MDEyOklzc3VlQ29tbWVudDc3ODgyODQ5NQ== simonw 9599 2021-02-14T19:31:06Z 2021-02-14T19:31:06Z OWNER

I'm going to add a offset= parameter to support this case. Thanks for the suggestion!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Add fts offset docs. 792297010  
778827570 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778827570 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgyNzU3MA== simonw 9599 2021-02-14T19:24:20Z 2021-02-14T19:24:20Z OWNER

Here's the implementation in Python: https://github.com/python/cpython/blob/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/csv.py#L204-L225

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778824361 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778824361 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgyNDM2MQ== simonw 9599 2021-02-14T18:59:22Z 2021-02-14T18:59:22Z OWNER

I think I've got it. I can use io.BufferedReader() to get an object I can run .peek(2048) on, then wrap THAT in io.TextIOWrapper:

```python
encoding = encoding or "utf-8"
buffered = io.BufferedReader(json_file, buffer_size=4096)
decoded = io.TextIOWrapper(buffered, encoding=encoding, line_buffering=True)
if pk and len(pk) == 1:
    pk = pk[0]
if csv or tsv:
    if sniff:
        # Read first 2048 bytes and use that to detect
        first_bytes = buffered.peek(2048)
        print('first_bytes', first_bytes)
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778821403 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778821403 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgyMTQwMw== simonw 9599 2021-02-14T18:38:16Z 2021-02-14T18:38:16Z OWNER

There are two code paths here that matter:

  • For a regular file, I can read the first 2048 bytes, then .seek(0) before continuing. That's easy.
  • stdin is harder. I need to read and buffer the first 2048 bytes, then pass an object to csv.reader() which will replay that chunk and then play the rest of stdin.

I'm a bit stuck on the second one. Ideally I could use something like itertools.chain() but I can't find an alternative for file-like objects.
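
One workaround, since csv.reader() accepts any iterable of lines rather than requiring a file object: buffer the sample as whole lines and use itertools.chain() to replay them. A sketch under that assumption:

```python
import csv
import itertools

def sniffing_csv_reader(fp, sample_size=2048):
    # Collect whole lines until we have roughly sample_size characters
    sample_lines = []
    collected = 0
    for line in fp:
        sample_lines.append(line)
        collected += len(line)
        if collected >= sample_size:
            break
    dialect = csv.Sniffer().sniff("".join(sample_lines))
    # Replay the sampled lines, then continue with the rest of the stream
    return csv.reader(itertools.chain(sample_lines, fp), dialect)
```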

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778818639 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778818639 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgxODYzOQ== simonw 9599 2021-02-14T18:22:38Z 2021-02-14T18:22:38Z OWNER

Maybe I shouldn't be using StreamReader at all - https://www.python.org/dev/peps/pep-0400/ suggests that it should be deprecated in favour of io.TextIOWrapper. I'm using StreamReader due to this line: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L667-L668
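
For reference, the io.TextIOWrapper equivalent is roughly this (a sketch):

```python
import io
import sys

# Wraps a binary stream in a decoding text stream, replacing the
# codecs-based StreamReader approach PEP 400 discusses:
reader = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
```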

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778817494 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778817494 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgxNzQ5NA== simonw 9599 2021-02-14T18:16:06Z 2021-02-14T18:16:06Z OWNER

Types involved:

```
(Pdb) type(json_file.raw)
<class '_io.FileIO'>
(Pdb) type(json_file)
<class 'encodings.utf_8.StreamReader'>
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778816333 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778816333 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgxNjMzMw== simonw 9599 2021-02-14T18:08:44Z 2021-02-14T18:08:44Z OWNER

No, you can't .seek(0) on stdin:

```
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py", line 678, in insert_upsert_implementation
    json_file.raw.seek(0)
OSError: [Errno 29] Illegal seek
```

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778815740 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778815740 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgxNTc0MA== simonw 9599 2021-02-14T18:05:03Z 2021-02-14T18:05:03Z OWNER

The challenge here is how to read the first 2048 bytes and then reset the incoming file.

The Python docs example looks like this:

```python
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
```

Here's the relevant code in sqlite-utils: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/cli.py#L671-L679

The challenge is going to be having the --sniff option work with the progress bar. Here's how file_progress() works: https://github.com/simonw/sqlite-utils/blob/726219c3503e77440975cd15b74d006639feb0f8/sqlite_utils/utils.py#L106-L113

If file.raw is stdin, can I do the equivalent of csvfile.seek(0) on it?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778812684 https://github.com/simonw/sqlite-utils/issues/230#issuecomment-778812684 https://api.github.com/repos/simonw/sqlite-utils/issues/230 MDEyOklzc3VlQ29tbWVudDc3ODgxMjY4NA== simonw 9599 2021-02-14T17:45:16Z 2021-02-14T17:45:16Z OWNER

Running this could take any CSV (or TSV) file and automatically detect the delimiter. If no header row is detected it could add unknown1,unknown2 headers:

```
sqlite-utils insert db.db data file.csv --sniff
```

(Using --sniff would imply --csv)

This could be called --sniffer instead but I like --sniff better.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--sniff option for sniffing delimiters 808008305  
778812050 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778812050 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODgxMjA1MA== simonw 9599 2021-02-14T17:41:30Z 2021-02-14T17:41:30Z OWNER

I just spotted that csv.Sniffer in the Python standard library has a .has_header(sample) method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer
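
Usage is straightforward (a sketch; the filename is hypothetical):

```python
import csv

with open("data.csv", newline="") as fp:
    sample = fp.read(2048)
print(csv.Sniffer().has_header(sample))  # True if row 1 looks like labels
```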

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  
778811934 https://github.com/simonw/sqlite-utils/issues/228#issuecomment-778811934 https://api.github.com/repos/simonw/sqlite-utils/issues/228 MDEyOklzc3VlQ29tbWVudDc3ODgxMTkzNA== simonw 9599 2021-02-14T17:40:48Z 2021-02-14T17:40:48Z OWNER

Another pattern that might be useful is to generate a header that is just "unknown1,unknown2,unknown3" for each of the columns in the rest of the file. This makes it easy to e.g. facet-explore within Datasette to figure out the correct names, then use sqlite-utils transform --rename to rename the columns.

I needed to do that for the https://bl.iro.bl.uk/work/ns/3037474a-761c-456d-a00c-9ef3c6773f4c example.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
--no-headers option for CSV and TSV 807437089  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);