)]}'
{
  "commit": "d5d2e93577e2b9f4a251f97116259346f0bead1e",
  "tree": "856743cea4c9501323aa66c995c4600a08c9c670",
  "parents": [
    "4f6d26b16703e59e009fe5dde923b87793c5f561"
  ],
  "author": {
    "name": "Derrick Stolee",
    "email": "dstolee@microsoft.com",
    "time": "Wed Jan 16 10:25:59 2019 -0800"
  },
  "committer": {
    "name": "Junio C Hamano",
    "email": "gitster@pobox.com",
    "time": "Thu Jan 17 13:44:42 2019 -0800"
  },
  "message": "revision: implement sparse algorithm\n\nWhen enumerating objects to place in a pack-file during \u0027git\npack-objects --revs\u0027, we discover the \"frontier\" of commits\nthat we care about and the boundary with commit we find\nuninteresting. From that point, we walk trees to discover which\ntrees and blobs are uninteresting. Finally, we walk trees from the\ninteresting commits to find the interesting objects that are\nplaced in the pack.\n\nThis commit introduces a new, \"sparse\" way to discover the\nuninteresting trees. We use the perspective of a single user trying\nto push their topic to a large repository. That user likely changed\na very small fraction of the paths in their working directory, but\nwe spend a lot of time walking all reachable trees.\n\nThe way to switch the logic to work in this sparse way is to start\ncaring about which paths introduce new trees. While it is not\npossible to generate a diff between the frontier boundary and all\nof the interesting commits, we can simulate that behavior by\ninspecting all of the root trees as a whole, then recursing down\nto the set of trees at each path.\n\nWe already had taken the first step by passing an oidset to\nmark_trees_uninteresting_sparse(). We now create a dictionary\nwhose keys are paths and values are oidsets. We consider the set\nof trees that appear at each path. While we inspect a tree, we\nadd its subtrees to the oidsets corresponding to the tree entry\u0027s\npath. We also mark trees as UNINTERESTING if the tree we are\nparsing is UNINTERESTING.\n\nTo actually improve the performance, we need to terminate our\nrecursion. If the oidset contains only UNINTERESTING trees, then\nwe do not continue the recursion. This avoids walking trees that\nare likely to not be reachable from interesting trees. If the\noidset contains only interesting trees, then we will walk these\ntrees in the final stage that collects the intersting objects to\nplace in the pack. Thus, we only recurse if the oidset contains\nboth interesting and UNINITERESTING trees.\n\nThere are a few ways that this is not a universally better option.\n\nFirst, we can pack extra objects. If someone copies a subtree\nfrom one tree to another, the first tree will appear UNINTERESTING\nand we will not recurse to see that the subtree should also be\nUNINTERESTING. We will walk the new tree and see the subtree as\na \"new\" object and add it to the pack. A test is modified to\ndemonstrate this behavior and to verify that the new logic is\nbeing exercised.\n\nSecond, we can have extra memory pressure. If instead of being a\nsingle user pushing a small topic we are a server sending new\nobjects from across the entire working directory, then we will\ngain very little (the recursion will rarely terminate early) but\nwill spend extra time maintaining the path-oidset dictionaries.\n\nDespite these potential drawbacks, the benefits of the algorithm\nare clear. By adding a counter to \u0027add_children_by_path\u0027 and\n\u0027mark_tree_contents_uninteresting\u0027, I measured the number of\nparsed trees for the two algorithms in a variety of repos.\n\nFor git.git, I used the following input:\n\n\tv2.19.0\n\t^v2.19.0~10\n\n Objects to pack: 550\nWalked (old alg): 282\nWalked (new alg): 130\n\nFor the Linux repo, I used the following input:\n\n\tv4.18\n\t^v4.18~10\n\n Objects to pack:   518\nWalked (old alg): 4,836\nWalked (new alg):   188\n\nThe two repos above are rather \"wide and flat\" compared to\nother repos that I have used in the past. As a comparison,\nI tested an old topic branch in the Azure DevOps repo, which\nhas a much deeper folder structure than the Linux repo.\n\n Objects to pack:    220\nWalked (old alg): 22,804\nWalked (new alg):    129\n\nI used the number of walked trees the main metric above because\nit is consistent across multiple runs. When I ran my tests, the\nperformance of the pack-objects command with the same options\ncould change the end-to-end time by 10x depending on the file\nsystem being warm. However, by repeating the same test on repeat\nI could get more consistent timing results. The git.git and\nLinux tests were too fast overall (less than 0.5s) to measure\nan end-to-end difference. The Azure DevOps case was slow enough\nto see the time improve from 15s to 1s in the warm case. The\ncold case was 90s to 9s in my testing.\n\nThese improvements will have even larger benefits in the super-\nlarge Windows repository. In our experiments, we see the\n\"Enumerate objects\" phase of pack-objects taking 60-80% of the\nend-to-end time of non-trivial pushes, taking longer than the\nnetwork time to send the pack and the server time to verify the\npack.\n\nSigned-off-by: Derrick Stolee \u003cdstolee@microsoft.com\u003e\nSigned-off-by: Junio C Hamano \u003cgitster@pobox.com\u003e\n",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "60421f3a106de5d700aac775abbda0dbd33b8a7d",
      "old_mode": 33188,
      "old_path": "revision.c",
      "new_id": "5de739384a0bfc9b61f7c0052debca0283c6f1b5",
      "new_mode": 33188,
      "new_path": "revision.c"
    },
    {
      "type": "modify",
      "old_id": "30aef6498af9e76abac668aad99e92f7105b26a6",
      "old_mode": 33261,
      "old_path": "t/t5322-pack-objects-sparse.sh",
      "new_id": "9f2a6e5d31e3a53877851e8893da3d1e3e1cf6d2",
      "new_mode": 33261,
      "new_path": "t/t5322-pack-objects-sparse.sh"
    }
  ]
}
