{"html_url": "https://github.com/simonw/sqlite-utils/issues/491#issuecomment-1264218914", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/491", "id": 1264218914, "node_id": "IC_kwDOCGYnMM5LWnMi", "user": {"value": 7908073, "label": "chapmanjacobd"}, "created_at": "2022-10-01T03:18:36Z", "updated_at": "2023-06-14T22:14:24Z", "author_association": "CONTRIBUTOR", "body": "> some good concrete use-cases in mind\r\n\r\nI actually found myself wanting something like this the past couple of days. The use-case was databases with slightly different schemas but the same table names.\r\n\r\nHere is a full script:\r\n\r\n```python\r\nimport argparse\r\nfrom pathlib import Path\r\n\r\nfrom sqlite_utils import Database\r\n\r\n\r\ndef connect(args, conn=None, **kwargs) -> Database:\r\n    db = Database(conn or args.database, **kwargs)\r\n    with db.conn:\r\n        db.conn.execute(\"PRAGMA main.cache_size = 8000\")\r\n    return db\r\n\r\n\r\ndef parse_args() -> argparse.Namespace:\r\n    parser = argparse.ArgumentParser()\r\n    parser.add_argument(\"database\")\r\n    parser.add_argument(\"dbs_folder\")\r\n    parser.add_argument(\"--db\", \"-db\", help=argparse.SUPPRESS)\r\n    parser.add_argument(\"--verbose\", \"-v\", action=\"count\", default=0)\r\n    args = parser.parse_args()\r\n\r\n    if args.db:\r\n        args.database = args.db\r\n    Path(args.database).touch()\r\n    args.db = connect(args)\r\n\r\n    return args\r\n\r\n\r\ndef merge_db(args, source_db):\r\n    source_db = str(Path(source_db).resolve())\r\n\r\n    s_db = connect(argparse.Namespace(database=source_db, verbose=args.verbose))\r\n    for table in s_db.table_names():\r\n        data = s_db[table].rows\r\n        args.db[table].insert_all(data, alter=True, replace=True)\r\n\r\n    args.db.conn.commit()\r\n\r\n\r\ndef merge_directory():\r\n    args = parse_args()\r\n    source_dbs = list(Path(args.dbs_folder).glob('*.db'))\r\n    for s_db in source_dbs:\r\n        merge_db(args, s_db)\r\n\r\n\r\nif __name__ == '__main__':\r\n    merge_directory()\r\n```\r\n\r\nedit: I've made some
improvements to this and put it on PyPI:\r\n\r\n```\r\n$ pip install xklb\r\n$ lb merge-db -h\r\nusage: library merge-dbs DEST_DB SOURCE_DB ... [--only-target-columns] [--only-new-rows] [--upsert] [--pk PK ...] [--table TABLE ...]\r\n\r\n Merge-DBs will insert new rows from source dbs to target db, table by table. If primary key(s) are provided,\r\n and there is an existing row with the same PK, the default action is to delete the existing row and insert the new row\r\n replacing all existing fields.\r\n\r\n Upsert mode will update matching PK rows such that if a source row has a NULL field and\r\n the destination row has a value then the value will be preserved instead of changed to the source row's NULL value.\r\n\r\n Ignore mode (--only-new-rows) will insert only rows which don't already exist in the destination db\r\n\r\n Test first by using temp databases as the destination db.\r\n Try out different modes / flags until you are satisfied with the behavior of the program\r\n\r\n library merge-dbs --pk path (mktemp --suffix .db) tv.db movies.db\r\n\r\n Merge database data and tables\r\n\r\n library merge-dbs --upsert --pk path video.db tv.db movies.db\r\n library merge-dbs --only-target-columns --only-new-rows --table media,playlists --pk path audio-fts.db audio.db\r\n\r\n library merge-dbs --pk id --only-tables subreddits reddit/81_New_Music.db audio.db\r\n library merge-dbs --only-new-rows --pk subreddit,path --only-tables reddit_posts reddit/81_New_Music.db audio.db -v\r\n\r\npositional arguments:\r\n database\r\n source_dbs\r\n```\r\n\r\nAlso, if you want to dedupe a table based on a \"business key\" which isn't explicitly your primary key(s), you can run this:\r\n\r\n```\r\n$ lb dedupe-db -h\r\nusage: library dedupe-dbs DATABASE TABLE --bk BUSINESS_KEYS [--pk PRIMARY_KEYS] [--only-columns COLUMNS]\r\n\r\n Dedupe your database (not to be confused with the dedupe subcommand)\r\n\r\n It should not need to be said but *backup* your database before trying this
tool!\r\n\r\n Dedupe-DB will help remove duplicate rows based on non-primary-key business keys\r\n\r\n library dedupe-db ./video.db media --bk path\r\n\r\n If --primary-keys is not provided table metadata primary keys will be used\r\n If --only-columns is not provided all non-primary and non-business key columns will be upserted\r\n\r\npositional arguments:\r\n database\r\n table\r\n\r\noptions:\r\n -h, --help show this help message and exit\r\n --skip-0\r\n --only-columns ONLY_COLUMNS\r\n Comma separated column names to upsert\r\n --primary-keys PRIMARY_KEYS, --pk PRIMARY_KEYS\r\n Comma separated primary keys\r\n --business-keys BUSINESS_KEYS, --bk BUSINESS_KEYS\r\n Comma separated business keys\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1383646615, "label": "Ability to merge databases and tables"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/491#issuecomment-1258712931", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/491", "id": 1258712931, "node_id": "IC_kwDOCGYnMM5LBm9j", "user": {"value": 25778, "label": "eyeseast"}, "created_at": "2022-09-26T22:31:58Z", "updated_at": "2022-09-26T22:31:58Z", "author_association": "CONTRIBUTOR", "body": "Right. The backup command will copy tables completely, but in the case of conflicting table names, the destination gets overwritten silently. That might not be what you want here. 
", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1383646615, "label": "Ability to merge databases and tables"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/491#issuecomment-1258508215", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/491", "id": 1258508215, "node_id": "IC_kwDOCGYnMM5LA0-3", "user": {"value": 25778, "label": "eyeseast"}, "created_at": "2022-09-26T19:22:14Z", "updated_at": "2022-09-26T19:22:14Z", "author_association": "CONTRIBUTOR", "body": "This might be fairly straightforward using SQLite's backup utility: https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.backup\r\n\r\n", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1383646615, "label": "Ability to merge databases and tables"}, "performed_via_github_app": null} {"html_url": "https://github.com/simonw/sqlite-utils/issues/491#issuecomment-1256858763", "issue_url": "https://api.github.com/repos/simonw/sqlite-utils/issues/491", "id": 1256858763, "node_id": "IC_kwDOCGYnMM5K6iSL", "user": {"value": 7908073, "label": "chapmanjacobd"}, "created_at": "2022-09-24T04:50:59Z", "updated_at": "2022-09-24T04:52:08Z", "author_association": "CONTRIBUTOR", "body": "Instead of outputting binary data to stdout the interface might be better like this\r\n\r\n```\r\nsqlite-utils merge animals.db cats.db dogs.db\r\n```\r\n\r\nsimilar to `zip`, `ogr2ogr`, etc\r\n\r\nActually I think this might already be possible within `ogr2ogr`. 
I don't believe spatial data is a requirement, though it might add an `ogc_id` column or something:\r\n\r\n```\r\ncp cats.db animals.db\r\nogr2ogr -append animals.db dogs.db\r\nogr2ogr -append animals.db another.db\r\n```", "reactions": "{\"total_count\": 0, \"+1\": 0, \"-1\": 0, \"laugh\": 0, \"hooray\": 0, \"confused\": 0, \"heart\": 0, \"rocket\": 0, \"eyes\": 0}", "issue": {"value": 1383646615, "label": "Ability to merge databases and tables"}, "performed_via_github_app": null}