)]}'
{
  "commit": "52fe41ff1cd6e1f0b67d4e864e718d949e225f30",
  "tree": "eec2fdbd162e5797eb509ce0d39cdfa21a5345a8",
  "parents": [
    "efdd2f0d4c4d6b1b1090171b6428919038bd2980"
  ],
  "author": {
    "name": "Derrick Stolee",
    "email": "dstolee@microsoft.com",
    "time": "Fri Sep 25 12:33:36 2020 +0000"
  },
  "committer": {
    "name": "Junio C Hamano",
    "email": "gitster@pobox.com",
    "time": "Fri Sep 25 10:53:04 2020 -0700"
  },
  "message": "maintenance: add incremental-repack task\n\nThe previous change cleaned up loose objects using the\n\u0027loose-objects\u0027 task that can be run safely in the background. Add a\nsimilar job that performs similar cleanups for pack-files.\n\nOne issue with running \u0027git repack\u0027 is that it is designed to\nrepack all pack-files into a single pack-file. While this is the\nmost space-efficient way to store object data, it is not time or\nmemory efficient. This becomes extremely important if the repo is\nso large that a user struggles to store two copies of the pack on\ntheir disk.\n\nInstead, perform an \"incremental\" repack by collecting a few small\npack-files into a new pack-file. The multi-pack-index facilitates\nthis process ever since \u0027git multi-pack-index expire\u0027 was added in\n19575c7 (multi-pack-index: implement \u0027expire\u0027 subcommand,\n2019-06-10) and \u0027git multi-pack-index repack\u0027 was added in ce1e4a1\n(midx: implement midx_repack(), 2019-06-10).\n\nThe \u0027incremental-repack\u0027 task runs the following steps:\n\n1. \u0027git multi-pack-index write\u0027 creates a multi-pack-index file if\n   one did not exist, and otherwise will update the multi-pack-index\n   with any new pack-files that appeared since the last write. This\n   is particularly relevant with the background fetch job.\n\n   When the multi-pack-index sees two copies of the same object, it\n   stores the offset data into the newer pack-file. This means that\n   some old pack-files could become \"unreferenced\" which I will use\n   to mean \"a pack-file that is in the pack-file list of the\n   multi-pack-index but none of the objects in the multi-pack-index\n   reference a location inside that pack-file.\"\n\n2. \u0027git multi-pack-index expire\u0027 deletes any unreferenced pack-files\n   and updates the multi-pack-index to drop those pack-files from the\n   list. This is safe to do as concurrent Git processes will see the\n   multi-pack-index and not open those packs when looking for object\n   contents. (Similar to the \u0027loose-objects\u0027 job, there are some Git\n   commands that open pack-files regardless of the multi-pack-index,\n   but they are rarely used. Further, a user that self-selects to\n   use background operations would likely refrain from using those\n   commands.)\n\n3. \u0027git multi-pack-index repack --batch-size\u003d\u003csize\u003e\u0027 collects a set\n   of pack-files that are listed in the multi-pack-index and creates\n   a new pack-file containing the objects whose offsets are listed\n   by the multi-pack-index to be in those pack-files. The set of pack-\n   files is selected greedily by sorting the pack-files by modified\n   time and adding a pack-file to the set if its \"expected size\" is\n   smaller than the batch size until the total expected size of the\n   selected pack-files is at least the batch size. The \"expected\n   size\" is calculated by taking the size of the pack-file divided\n   by the number of objects in the pack-file and multiplied by the\n   number of objects from the multi-pack-index with offset in that\n   pack-file. The expected size approximates how much data from that\n   pack-file will contribute to the resulting pack-file size. The\n   intention is that the resulting pack-file will be close in size\n   to the provided batch size.\n\n   The next run of the incremental-repack task will delete these\n   repacked pack-files during the \u0027expire\u0027 step.\n\n   In this version, the batch size is set to \"0\" which ignores the\n   size restrictions when selecting the pack-files. It instead\n   selects all pack-files and repacks all packed objects into a\n   single pack-file. This will be updated in the next change, but\n   it requires doing some calculations that are better isolated to\n   a separate change.\n\nThese steps are based on a similar background maintenance step in\nScalar (and VFS for Git) [1]. This was incredibly effective for\nusers of the Windows OS repository. After using the same VFS for Git\nrepository for over a year, some users had _thousands_ of pack-files\nthat combined to up to 250 GB of data. We noticed a few users were\nrunning into the open file descriptor limits (due in part to a bug\nin the multi-pack-index fixed by af96fe3 (midx: add packs to\npacked_git linked list, 2019-04-29)).\n\nThese pack-files were mostly small since they contained the commits\nand trees that were pushed to the origin in a given hour. The GVFS\nprotocol includes a \"prefetch\" step that asks for pre-computed pack-\nfiles containing commits and trees by timestamp. These pack-files\nwere grouped into \"daily\" pack-files once a day for up to 30 days.\nIf a user did not request prefetch packs for over 30 days, then they\nwould get the entire history of commits and trees in a new, large\npack-file. This led to a large number of pack-files that had poor\ndelta compression.\n\nBy running this pack-file maintenance step once per day, these repos\nwith thousands of packs spanning 200+ GB dropped to dozens of pack-\nfiles spanning 30-50 GB. This was done all without removing objects\nfrom the system and using a constant batch size of two gigabytes.\nOnce the work was done to reduce the pack-files to small sizes, the\nbatch size of two gigabytes means that not every run triggers a\nrepack operation, so the following run will not expire a pack-file.\nThis has kept these repos in a \"clean\" state.\n\n[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs\n\nSigned-off-by: Derrick Stolee \u003cdstolee@microsoft.com\u003e\nSigned-off-by: Junio C Hamano \u003cgitster@pobox.com\u003e\n",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "fc95eb594f9908adb069989cc043c00edf39e0df",
      "old_mode": 33188,
      "old_path": "Documentation/git-maintenance.txt",
      "new_id": "3f5d8946b4d9ba25c39c21dbfe20f8ef64c8f983",
      "new_mode": 33188,
      "new_path": "Documentation/git-maintenance.txt"
    },
    {
      "type": "modify",
      "old_id": "4403827481592417af431dc99f5cd4735dc28b9e",
      "old_mode": 33188,
      "old_path": "builtin/gc.c",
      "new_id": "5f877b097ad15e4861f935fd0f0bafce10b1c326",
      "new_mode": 33188,
      "new_path": "builtin/gc.c"
    },
    {
      "type": "modify",
      "old_id": "ec87f616c6a9b2488914b64f1ff8337cd705d735",
      "old_mode": 33261,
      "old_path": "t/t5319-multi-pack-index.sh",
      "new_id": "2f942ee1fa4ccf556ed76c1db0826e75b8917a47",
      "new_mode": 33261,
      "new_path": "t/t5319-multi-pack-index.sh"
    },
    {
      "type": "modify",
      "old_id": "27565c55a2b3b9506320467c4698aef417dac5a5",
      "old_mode": 33261,
      "old_path": "t/t7900-maintenance.sh",
      "new_id": "a2db2291b0bd23f2e13ada83983ed6c9fc82099c",
      "new_mode": 33261,
      "new_path": "t/t7900-maintenance.sh"
    }
  ]
}
