{"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623865250", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623865250, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg2NTI1MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T05:38:16Z", "updated_at": "2020-05-05T05:38:16Z", "author_association": "MEMBER", "body": "It looks like `groups.content_string` often has a null byte in it. I should clean this up as part of the import.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623863902", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623863902, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg2MzkwMg==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T05:31:53Z", "updated_at": "2020-05-05T05:31:53Z", "author_association": "MEMBER", "body": "Yes! 
Turning those `rowid` values into `id` with this script did the job:\r\n```python\r\nimport sqlite3\r\nimport sqlite_utils\r\n\r\nconn = sqlite3.connect(\r\n \"/Users/simon/Pictures/Photos Library.photoslibrary/database/search/psi.sqlite\"\r\n)\r\n\r\n\r\ndef all_rows(table):\r\n result = conn.execute(\"select rowid as id, * from {}\".format(table))\r\n cols = [c[0] for c in result.description]\r\n for row in result.fetchall():\r\n yield dict(zip(cols, row))\r\n\r\n\r\nif __name__ == \"__main__\":\r\n db = sqlite_utils.Database(\"psi_copy.db\")\r\n for table in (\"assets\", \"collections\", \"ga\", \"gc\", \"groups\"):\r\n db[table].upsert_all(all_rows(table), pk=\"id\", alter=True)\r\n```\r\nThen I ran this query:\r\n```sql\r\nselect \r\n json_object('img_src', 'https://photos.simonwillison.net/i/' || photos.sha256 || '.' || photos.ext || '?w=400') as photo,\r\n group_concat(strip_null_chars(groups.content_string), ' ') as words, assets.uuid_0, assets.uuid_1, to_uuid(assets.uuid_0, assets.uuid_1) as uuid\r\nfrom assets join ga on assets.id = ga.assetid\r\njoin groups on ga.groupid = groups.id\r\njoin photos on photos.uuid = to_uuid(assets.uuid_0, assets.uuid_1)\r\nwhere groups.category = 2024\r\ngroup by assets.id\r\norder by random() limit 10\r\n```\r\nAnd got these results!\r\n\"psi_copy__select_json_object__img_src____https___photos_simonwillison_net_i______photos_sha256___________photos_ext______w_400___as_photo__group_concat_strip_null_chars_groups_content_string________as_words__assets_uuid_0__assets_uuid_1__to\"\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623857417", "issue_url": 
"https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623857417, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg1NzQxNw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T05:01:47Z", "updated_at": "2020-05-05T05:01:47Z", "author_association": "MEMBER", "body": "Even that didn't work - it didn't copy across the rowid values. I'm pretty sure that's what's wrong here:\r\n```\r\nsqlite3 /Users/simon/Pictures/Photos\\ Library.photoslibrary/database/search/psi.sqlite 'select rowid, uuid_0, uuid_1 from assets limit 10' \r\n1619605|-9205353363298198838|4814875488794983828\r\n1641378|-9205348195631362269|390804289838822030\r\n1634974|-9205331524553603243|-3834026796261633148\r\n1619083|-9205326176986145401|7563404215614709654\r\n22131|-9205315724827218763|8370531509591906734\r\n1645633|-9205247376092758131|-1311540150497601346\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623855885", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623855885, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg1NTg4NQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T04:54:39Z", "updated_at": "2020-05-05T04:54:53Z", "author_association": "MEMBER", "body": "Trying this import mechanism instead:\r\n`sqlite3 /Users/simon/Pictures/Photos\\ Library.photoslibrary/database/search/psi.sqlite .dump | grep -v 'CREATE INDEX' | grep -v 'CREATE TRIGGER' | grep -v 'CREATE VIRTUAL TABLE' | sqlite3 search.db`", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, 
"label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623855841", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623855841, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg1NTg0MQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T04:54:28Z", "updated_at": "2020-05-05T04:54:28Z", "author_association": "MEMBER", "body": "Things were not matching up for me correctly:\r\n\r\n\"search__select_json_object__img_src____https___photos_simonwillison_net_i______photos_sha256___________photos_ext______w_400___as_photo__groups_content_string__assets_uuid_0__assets_uuid_1__to_uuid_assets_uuid_0__assets_uuid_1__as_uuid__pho\"\r\n\r\nI think that's because my import script didn't correctly import the existing `rowid` values.", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623846880", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623846880, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg0Njg4MA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T04:06:08Z", "updated_at": "2020-05-05T04:06:08Z", "author_association": "MEMBER", "body": "This function seems to convert them into UUIDs that match my photos:\r\n```python\r\ndef to_uuid(uuid_0, uuid_1):\r\n b = uuid_0.to_bytes(8, 'little', signed=True) + uuid_1.to_bytes(8, 'little', signed=True)\r\n return str(uuid.UUID(bytes=b)).upper()\r\n```", "reactions": "{\"total_count\": 1, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 1, \"confused\": 0, \"heart\": 
0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623845014", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623845014, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzg0NTAxNA==", "user": {"value": 41546558, "label": "RhetTbull"}, "created_at": "2020-05-05T03:55:14Z", "updated_at": "2020-05-05T03:56:24Z", "author_association": "CONTRIBUTOR", "body": "I'm traveling w/o access to my Mac so can't help with any code right now. I suspected ZSCENEIDENTIFIER was a foreign key into one of these psi.sqlite tables. But it looks like you're on to something connecting groups to assets. As for the UUID, I think there are two ints because each is 64 bits but UUIDs are 128 bits. Thus they need to be combined to get the 128-bit UUID. You might be able to use Apple's [NSUUID](https://developer.apple.com/documentation/foundation/nsuuid?language=objc), for example, by wrapping with PyObjC. Here's one [example](https://github.com/ronaldoussoren/pyobjc/blob/881c82a7ba90f193934b52b44143360c80dce5e5/pyobjc-framework-Cocoa/PyObjCTest/test_nsuuid.py) of using this in PyObjC's test suite. Interesting it's stored this way instead of a UUIDString as in Photos.sqlite. 
Perhaps it's for faster indexing.\r\n\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623811131", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623811131, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzgxMTEzMQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T03:16:18Z", "updated_at": "2020-05-05T03:16:18Z", "author_association": "MEMBER", "body": "Here's how to convert two integers into a UUID using Java. Not sure if it's the solution I need though (or how to do the same thing in Python):\r\n\r\nhttps://repl.it/repls/EuphoricSomberClasslibrary\r\n\r\n\"Repl_it_-_EuphoricSomberClasslibrary\"\r\n\r\n```java\r\nimport java.util.UUID;\r\n\r\nclass Main {\r\n public static void main(String[] args) {\r\n java.util.UUID uuid = new java.util.UUID(\r\n 2544182952487526660L,\r\n -3640314103732024685L\r\n );\r\n System.out.println(\r\n uuid\r\n );\r\n }\r\n}\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623807568", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623807568, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzgwNzU2OA==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T02:56:06Z", "updated_at": "2020-05-05T02:56:06Z", "author_association": "MEMBER", "body": "I'm pretty sure this is what I'm 
after. The `groups` table has what looks like identified labels in the rows with category = 2025:\r\n\r\n\"words__groups__2_528_rows_where_where_category___2025\"\r\n\r\nThen there's a `ga` table that maps groups to assets:\r\n\r\n\"words__ga__633_653_rows\"\r\n\r\nAnd an `assets` table which looks like it has one row for every one of my photos:\r\n\r\n\"words__assets__40_419_rows\"\r\n\r\nOne major challenge: these UUIDs are split into two integer numbers, `uuid_0` and `uuid_1` - but the main photos database uses regular UUIDs like this:\r\n\r\n![image](https://user-images.githubusercontent.com/9599/81031481-39164280-8e41-11ea-983b-005ced641a18.png)\r\n\r\nI need to figure out how to match up these two different UUID representations. I asked on Twitter if anyone has any ideas: https://twitter.com/simonw/status/1257500689019703296", "reactions": "{\"total_count\": 1, \"+1\": 1, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623806687", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623806687, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzgwNjY4Nw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T02:51:16Z", "updated_at": "2020-05-05T02:51:16Z", "author_association": "MEMBER", "body": "Running datasette against it directly doesn't work:\r\n```\r\nsimon@Simons-MacBook-Pro search % datasette psi.sqlite\r\nServe! 
files=('psi.sqlite',) (immutables=()) on port 8001\r\nUsage: datasette serve [OPTIONS] [FILES]...\r\n\r\nError: Connection to psi.sqlite failed check: no such tokenizer: PSITokenizer\r\n```\r\nInstead, I created a new SQLite database with a copy of some of the key tables, like this:\r\n```\r\nsqlite-utils rows psi.sqlite groups | sqlite-utils insert /tmp/search.db groups -\r\nsqlite-utils rows psi.sqlite assets | sqlite-utils insert /tmp/search.db assets -\r\nsqlite-utils rows psi.sqlite ga | sqlite-utils insert /tmp/search.db ga -\r\nsqlite-utils rows psi.sqlite collections | sqlite-utils insert /tmp/search.db collections -\r\nsqlite-utils rows psi.sqlite gc | sqlite-utils insert /tmp/search.db gc -\r\nsqlite-utils rows psi.sqlite lookup | sqlite-utils insert /tmp/search.db lookup -\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623806533", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623806533, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzgwNjUzMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T02:50:16Z", "updated_at": "2020-05-05T02:50:16Z", "author_association": "MEMBER", "body": "I figured there must be a separate database that Photos uses to store the text of the identified labels.\r\n\r\nI used \"Open Files and Ports\" in Activity Monitor against the Photos app to try and spot candidates... 
and found `/Users/simon/Pictures/Photos Library.photoslibrary/database/search/psi.sqlite` - a 53MB SQLite database file.\r\n\r\n\"Item-0_and_Item-0_and_Item-0_and_Item-0\"\r\n\r\nHere's the schema of that file:\r\n```\r\n$ sqlite3 psi.sqlite .schema\r\nCREATE TABLE word_embedding(word TEXT, extended_word TEXT, score DOUBLE);\r\nCREATE INDEX word_embedding_index ON word_embedding(word);\r\nCREATE VIRTUAL TABLE word_embedding_prefix USING fts5(extended_word)\r\n/* word_embedding_prefix(extended_word) */;\r\nCREATE TABLE IF NOT EXISTS 'word_embedding_prefix_data'(id INTEGER PRIMARY KEY, block BLOB);\r\nCREATE TABLE IF NOT EXISTS 'word_embedding_prefix_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;\r\nCREATE TABLE IF NOT EXISTS 'word_embedding_prefix_content'(id INTEGER PRIMARY KEY, c0);\r\nCREATE TABLE IF NOT EXISTS 'word_embedding_prefix_docsize'(id INTEGER PRIMARY KEY, sz BLOB);\r\nCREATE TABLE IF NOT EXISTS 'word_embedding_prefix_config'(k PRIMARY KEY, v) WITHOUT ROWID;\r\nCREATE TABLE groups(category INT2, owning_groupid INT, content_string TEXT, normalized_string TEXT, lookup_identifier TEXT, token_ranges_0 INT8, token_ranges_1 INT8, UNIQUE(category, owning_groupid, content_string, lookup_identifier, token_ranges_0, token_ranges_1));\r\nCREATE TABLE assets(uuid_0 INT, uuid_1 INT, creationDate INT, UNIQUE(uuid_0, uuid_1));\r\nCREATE TABLE ga(groupid INT, assetid INT, PRIMARY KEY(groupid, assetid));\r\nCREATE TABLE collections(uuid_0 INT, uuid_1 INT, startDate INT, endDate INT, title TEXT, subtitle TEXT, keyAssetUUID_0 INT, keyAssetUUID_1 INT, typeAndNumberOfAssets INT32, sortDate DOUBLE, UNIQUE(uuid_0, uuid_1));\r\nCREATE TABLE gc(groupid INT, collectionid INT, PRIMARY KEY(groupid, collectionid));\r\nCREATE VIRTUAL TABLE prefix USING fts5(content='groups', normalized_string, category UNINDEXED, tokenize = 'PSITokenizer');\r\nCREATE TABLE IF NOT EXISTS 'prefix_data'(id INTEGER PRIMARY KEY, block BLOB);\r\nCREATE TABLE IF NOT EXISTS 
'prefix_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;\r\nCREATE TABLE IF NOT EXISTS 'prefix_docsize'(id INTEGER PRIMARY KEY, sz BLOB);\r\nCREATE TABLE IF NOT EXISTS 'prefix_config'(k PRIMARY KEY, v) WITHOUT ROWID;\r\nCREATE TABLE lookup(identifier TEXT PRIMARY KEY, category INT2);\r\nCREATE TRIGGER trigger_groups_insert AFTER INSERT ON groups BEGIN INSERT INTO prefix(rowid, normalized_string, category) VALUES (new.rowid, new.normalized_string, new.category); END;\r\nCREATE TRIGGER trigger_groups_delete AFTER DELETE ON groups BEGIN INSERT INTO prefix(prefix, rowid, normalized_string, category) VALUES('delete', old.rowid, old.normalized_string, old.category); END;\r\nCREATE INDEX group_pk ON groups(category, content_string, normalized_string, lookup_identifier);\r\nCREATE INDEX asset_pk ON assets(uuid_0, uuid_1);\r\nCREATE INDEX ga_assetid ON ga(assetid, groupid);\r\nCREATE INDEX collection_pk ON collections(uuid_0, uuid_1);\r\nCREATE INDEX gc_collectionid ON gc(collectionid);\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623806085", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623806085, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzgwNjA4NQ==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T02:47:18Z", "updated_at": "2020-05-05T02:47:18Z", "author_association": "MEMBER", "body": "In https://github.com/RhetTbull/osxphotos/issues/121#issuecomment-623249263 Rhet Turnbull spotted a table called `ZSCENEIDENTIFIER` which looked like it might have the right data, but the columns in it aren't particularly 
helpful:\r\n```\r\nZ_PK,Z_ENT,Z_OPT,ZSCENEIDENTIFIER,ZASSETATTRIBUTES,ZCONFIDENCE\r\n8,49,1,731,5,0.11834716796875\r\n9,49,1,684,6,0.0233648251742125\r\n10,49,1,1702,1,0.026153564453125\r\n```\r\nI love the look of those confidence scores, but what do the numbers mean?", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null} {"html_url": "https://github.com/dogsheep/dogsheep-photos/issues/16#issuecomment-623805823", "issue_url": "https://api.github.com/repos/dogsheep/dogsheep-photos/issues/16", "id": 623805823, "node_id": "MDEyOklzc3VlQ29tbWVudDYyMzgwNTgyMw==", "user": {"value": 9599, "label": "simonw"}, "created_at": "2020-05-05T02:45:56Z", "updated_at": "2020-05-05T02:45:56Z", "author_association": "MEMBER", "body": "I filed an issue with `osxphotos` about this here: https://github.com/RhetTbull/osxphotos/issues/121", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 612287234, "label": "Import machine-learning detected labels (dog, llama etc) from Apple Photos"}, "performed_via_github_app": null}