|  | Git Sparse-Index Design Document | 
|  | ================================ | 
|  |  | 
|  | The sparse-checkout feature allows users to focus a working directory on | 
|  | a subset of the files at HEAD. The cone mode patterns, enabled by | 
|  | `core.sparseCheckoutCone`, allow for very fast pattern matching to | 
|  | discover which files at HEAD belong in the sparse-checkout cone. | 
|  |  | 
|  | Three important scale dimensions for a Git working directory are: | 
|  |  | 
|  | * `HEAD`: How many files are present at `HEAD`? | 
|  |  | 
|  | * Populated: How many files are within the sparse-checkout cone. | 
|  |  | 
|  | * Modified: How many files has the user modified in the working directory? | 
|  |  | 
|  | We will use big-O notation -- O(X) -- to denote how expensive certain | 
|  | operations are in terms of these dimensions. | 
|  |  | 
|  | These dimensions are ordered by their magnitude: users (typically) modify | 
|  | fewer files than are populated, and we can only populate files at `HEAD`. | 
|  |  | 
|  | Problems occur if there is an extreme imbalance in these dimensions. For | 
|  | example, if `HEAD` contains millions of paths but the populated set has | 
|  | only tens of thousands, then commands like `git status` and `git add` can | 
|  | be dominated by operations that require O(`HEAD`) operations instead of | 
|  | O(Populated). Primarily, the cost is in parsing and rewriting the index, | 
|  | which is filled primarily with files at `HEAD` that are marked with the | 
|  | `SKIP_WORKTREE` bit. | 
|  |  | 
|  | The sparse-index intends to take these commands that read and modify the | 
|  | index from O(`HEAD`) to O(Populated). To do this, we need to modify the | 
|  | index format in a significant way: add "sparse directory" entries. | 
|  |  | 
|  | With cone mode patterns, it is possible to detect when an entire | 
|  | directory will have its contents outside of the sparse-checkout definition. | 
|  | Instead of listing all of the files it contains as individual entries, a | 
|  | sparse-index contains an entry with the directory name, referencing the | 
|  | object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit. | 
|  | If we need to discover the details for paths within that directory, we | 
|  | can parse trees to find that list. | 
|  |  | 
|  | At time of writing, sparse-directory entries violate expectations about the | 
|  | index format and its in-memory data structure. There are many consumers in | 
|  | the codebase that expect to iterate through all of the index entries and | 
|  | see only files. In fact, these loops expect to see a reference to every | 
|  | staged file. One way to handle this is to parse trees to replace a | 
|  | sparse-directory entry with all of the files within that tree as the index | 
|  | is loaded. However, parsing trees is slower than parsing the index format, | 
|  | so that is a slower operation than if we left the index alone. The plan is | 
|  | to make all of these integrations "sparse aware" so this expansion through | 
|  | tree parsing is unnecessary and they use fewer resources than when using a | 
|  | full index. | 
|  |  | 
|  | The implementation plan below follows four phases to slowly integrate with | 
|  | the sparse-index. The intention is to incrementally update Git commands to | 
|  | interact safely with the sparse-index without significant slowdowns. This | 
|  | may not always be possible, but the hope is that the primary commands that | 
|  | users need in their daily work are dramatically improved. | 
|  |  | 
|  | Phase I: Format and initial speedups | 
|  | ------------------------------------ | 
|  |  | 
|  | During this phase, Git learns to enable the sparse-index and safely parse | 
|  | one. Protections are put in place so that every consumer of the in-memory | 
|  | data structure can operate with its current assumption of every file at | 
|  | `HEAD`. | 
|  |  | 
|  | At first, every index parse will call a helper method, | 
|  | `ensure_full_index()`, which scans the index for sparse-directory entries | 
|  | (pointing to trees) and replaces them with the full list of paths (with | 
|  | blob contents) by parsing tree objects. This will be slower in all cases. | 
|  | The only noticeable change in behavior will be that the serialized index | 
|  | file contains sparse-directory entries. | 
|  |  | 
|  | To start, we use a new required index extension, `sdir`, to allow | 
|  | inserting sparse-directory entries into indexes with file format | 
|  | versions 2, 3, and 4. This prevents Git versions that do not understand | 
|  | the sparse-index from operating on one, while allowing tools that do not | 
|  | understand the sparse-index to operate on repositories as long as they do | 
|  | not interact with the index. A new format, index v5, will be introduced | 
|  | that includes sparse-directory entries by default. It might also | 
|  | introduce other features that have been considered for improving the | 
|  | index, as well. | 
|  |  | 
|  | Next, consumers of the index will be guarded against operating on a | 
|  | sparse-index by inserting calls to `ensure_full_index()` or | 
|  | `expand_index_to_path()`. If a specific path is requested, then those will | 
|  | be protected from within the `index_file_exists()` and `index_name_pos()` | 
|  | API calls: they will call `ensure_full_index()` if necessary. The | 
|  | intention here is to preserve existing behavior when interacting with a | 
|  | sparse-checkout. We don't want a change to happen by accident, without | 
|  | tests. Many of these locations may not need any change before removing the | 
|  | guards, but we should not do so without tests to ensure the expected | 
|  | behavior happens. | 
|  |  | 
|  | It may be desirable to _change_ the behavior of some commands in the | 
|  | presence of a sparse index or more generally in any sparse-checkout | 
|  | scenario. In such cases, these should be carefully communicated and | 
|  | tested. No such behavior changes are intended during this phase. | 
|  |  | 
|  | During a scan of the codebase, not every iteration of the cache entries | 
|  | needs an `ensure_full_index()` check. The basic reasons include: | 
|  |  | 
|  | 1. The loop is scanning for entries with non-zero stage. These entries | 
|  | are not collapsed into a sparse-directory entry. | 
|  |  | 
|  | 2. The loop is scanning for submodules. These entries are not collapsed | 
|  | into a sparse-directory entry. | 
|  |  | 
|  | 3. The loop is part of the index API, especially around reading or | 
|  | writing the format. | 
|  |  | 
|  | 4. The loop is checking for correct order of cache entries and that is | 
|  | correct if and only if the sparse-directory entries are in the correct | 
|  | location. | 
|  |  | 
|  | 5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is | 
|  | otherwise already aware of sparse directory entries. | 
|  |  | 
|  | 6. The sparse-index is disabled at this point when using the split-index | 
|  | feature, so no effort is made to protect the split-index API. | 
|  |  | 
|  | Even after inserting these guards, we will keep expanding sparse-indexes | 
|  | for most Git commands using the `command_requires_full_index` repository | 
|  | setting. This setting will be on by default and disabled one builtin at a | 
|  | time until we have sufficient confidence that all of the index operations | 
|  | are properly guarded. | 
|  |  | 
|  | To complete this phase, the commands `git status` and `git add` will be | 
|  | integrated with the sparse-index so that they operate with O(Populated) | 
|  | performance. They will be carefully tested for operations within and | 
|  | outside the sparse-checkout definition. | 
|  |  | 
|  | Phase II: Careful integrations | 
|  | ------------------------------ | 
|  |  | 
|  | This phase focuses on ensuring that all index extensions and APIs work | 
|  | well with a sparse-index. This requires significant increases to our test | 
|  | coverage, especially for operations that interact with the working | 
|  | directory outside of the sparse-checkout definition. Some of these | 
|  | behaviors may not be the desirable ones, such as some tests already | 
|  | marked for failure in `t1092-sparse-checkout-compatibility.sh`. | 
|  |  | 
|  | The index extensions that may require special integrations are: | 
|  |  | 
|  | * FS Monitor | 
|  | * Untracked cache | 
|  |  | 
|  | While integrating with these features, we should look for patterns that | 
|  | might lead to better APIs for interacting with the index. Coalescing | 
|  | common usage patterns into an API call can reduce the number of places | 
|  | where sparse-directories need to be handled carefully. | 
|  |  | 
|  | Phase III: Important command speedups | 
|  | ------------------------------------- | 
|  |  | 
|  | At this point, the patterns for testing and implementing sparse-directory | 
|  | logic should be relatively stable. This phase focuses on updating some of | 
|  | the most common builtins that use the index to operate as O(Populated). | 
|  | Here is a potential list of commands that could be valuable to integrate | 
|  | at this point: | 
|  |  | 
|  | * `git commit` | 
|  | * `git checkout` | 
|  | * `git merge` | 
|  | * `git rebase` | 
|  |  | 
|  | Hopefully, commands such as `git merge` and `git rebase` can benefit | 
|  | instead from merge algorithms that do not use the index as a data | 
|  | structure, such as the merge-ORT strategy. As these topics mature, we | 
|  | may enable the ORT strategy by default for repositories using the | 
|  | sparse-index feature. | 
|  |  | 
|  | Along with `git status` and `git add`, these commands cover the majority | 
|  | of users' interactions with the working directory. In addition, we can | 
|  | integrate with these commands: | 
|  |  | 
|  | * `git grep` | 
|  | * `git rm` | 
|  |  | 
|  | These have been proposed as some whose behavior could change when in a | 
|  | repo with a sparse-checkout definition. It would be good to include this | 
|  | behavior automatically when using a sparse-index. Some clarity is needed | 
|  | to make the behavior switch clear to the user. | 
|  |  | 
|  | This phase is the first where parallel work might be possible without too | 
|  | much conflicts between topics. | 
|  |  | 
|  | Phase IV: The long tail | 
|  | ----------------------- | 
|  |  | 
|  | This last phase is less a "phase" and more "the new normal" after all of | 
|  | the previous work. | 
|  |  | 
|  | To start, the `command_requires_full_index` option could be removed in | 
|  | favor of expanding only when hitting an API guard. | 
|  |  | 
|  | There are many Git commands that could use special attention to operate as | 
|  | O(Populated), while some might be so rare that it is acceptable to leave | 
|  | them with additional overhead when a sparse-index is present. | 
|  |  | 
|  | Here are some commands that might be useful to update: | 
|  |  | 
|  | * `git sparse-checkout set` | 
|  | * `git am` | 
|  | * `git clean` | 
|  | * `git stash` |