 Git Sparse-Index Design Document
 ================================

 The sparse-checkout feature allows users to focus a working directory on
 a subset of the files at HEAD. The cone mode patterns, enabled by
 `core.sparseCheckoutCone`, allow for very fast pattern matching to
 discover which files at HEAD belong in the sparse-checkout cone.

 Three important scale dimensions for a Git working directory are:

 * `HEAD`: How many files are present at `HEAD`?

 * Populated: How many files are within the sparse-checkout cone?

 * Modified: How many files has the user modified in the working directory?

 We will use big-O notation -- O(X) -- to denote how expensive certain
 operations are in terms of these dimensions.

 These dimensions are ordered by their magnitude: users (typically) modify
 fewer files than are populated, and we can only populate files at `HEAD`.

 Problems occur if there is an extreme imbalance in these dimensions. For
 example, if `HEAD` contains millions of paths but the populated set has
 only tens of thousands, then commands like `git status` and `git add` can
 be dominated by steps that cost O(`HEAD`) instead of O(Populated). Most of
 that cost is in parsing and rewriting the index, which is filled mostly
 with files at `HEAD` that are marked with the `SKIP_WORKTREE` bit.

 The sparse-index intends to take these commands that read and modify the
 index from O(`HEAD`) to O(Populated). To do this, we need to modify the
 index format in a significant way: add "sparse directory" entries.

 With cone mode patterns, it is possible to detect when an entire
 directory will have its contents outside of the sparse-checkout definition.
 Instead of listing all of the files it contains as individual entries, a
 sparse-index contains an entry with the directory name, referencing the
 object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
 If we need to discover the details for paths within that directory, we
 can parse trees to find that list.
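
 The collapse described above can be sketched with a toy model. This is
 an illustration only, not Git's implementation: entries are plain path
 strings, a name ending in `/` stands for a sparse-directory entry, and
 the helper names `outside_cone()` and `collapse()` are hypothetical. A
 real sparse-directory entry also carries the tree's object ID and the
 `SKIP_WORKTREE` bit.

```python
def outside_cone(directory, cone_dirs):
    """True if `directory` (ending in "/") is neither equal to, inside,
    nor an ancestor of any recursively included cone directory."""
    return not any(
        c.startswith(directory) or directory.startswith(c) for c in cone_dirs
    )


def collapse(full_paths, cone_dirs):
    """Toy collapse: replace every file under a directory that lies wholly
    outside the cone with one entry for the topmost such directory."""
    sparse = []
    seen = set()
    for path in sorted(full_paths):
        prefix = ""
        collapsed = None
        # Walk the leading directories from the root downwards.
        for part in path.split("/")[:-1]:
            prefix += part + "/"
            if outside_cone(prefix, cone_dirs):
                collapsed = prefix
                break
        if collapsed is None:
            sparse.append(path)          # file stays as its own entry
        elif collapsed not in seen:
            seen.add(collapsed)
            sparse.append(collapsed)     # one entry for the whole directory
    return sparse


FULL = [
    "README",
    "docs/a.txt",
    "docs/img/b.png",
    "src/Makefile",
    "src/app/main.c",
    "src/lib/util.c",
]
SPARSE = collapse(FULL, cone_dirs={"src/app/"})
```

 In this toy example the six file entries shrink to five entries, two of
 which (`docs/` and `src/lib/`) stand for whole directories. The saving
 grows with the number of files outside the cone.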

 At time of writing, sparse-directory entries violate expectations about the
 index format and its in-memory data structure. There are many consumers in
 the codebase that expect to iterate through all of the index entries and
 see only files. In fact, these loops expect to see a reference to every
 staged file. One way to handle this is to parse trees to replace a
 sparse-directory entry with all of the files within that tree as the index
 is loaded. However, parsing trees is slower than parsing the index format,
 so this expansion costs more than reading a full index would. The plan is
 to make all of these integrations "sparse aware" so this expansion through
 tree parsing is unnecessary and they use fewer resources than when using a
 full index.

 The implementation plan below follows four phases to slowly integrate with
 the sparse-index. The intention is to incrementally update Git commands to
 interact safely with the sparse-index without significant slowdowns. This
 may not always be possible, but the hope is that the primary commands that
 users need in their daily work are dramatically improved.

 Phase I: Format and initial speedups
 ------------------------------------

 During this phase, Git learns to enable the sparse-index and safely parse
 one. Protections are put in place so that every consumer of the in-memory
 data structure can operate with its current assumption of every file at
 `HEAD`.

 At first, every index parse will call a helper method,
 `ensure_full_index()`, which scans the index for sparse-directory entries
 (pointing to trees) and replaces them with the full list of paths (with
 blob contents) by parsing tree objects. This will be slower in all cases.
 The only noticeable change in behavior will be that the serialized index
 file contains sparse-directory entries.
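
 As a rough sketch of what such an expansion does, consider the toy
 model below. It is not the code of `ensure_full_index()`: the object
 store `TREES`, the object IDs, and the function names `walk_tree()` and
 `ensure_full_index_sketch()` are all made up for illustration, and
 entries are reduced to `(path, object ID)` pairs.

```python
# Toy object store: tree ID -> list of (name, kind, object ID).
# All IDs are invented placeholders, not real object names.
TREES = {
    "tree-docs": [("a.txt", "blob", "blob-1"), ("img", "tree", "tree-img")],
    "tree-img": [("b.png", "blob", "blob-2")],
}


def walk_tree(tree_id, prefix):
    """Yield (path, blob ID) for every blob reachable from `tree_id`."""
    for name, kind, oid in TREES[tree_id]:
        if kind == "tree":
            yield from walk_tree(oid, prefix + name + "/")
        else:
            yield (prefix + name, oid)


def ensure_full_index_sketch(entries):
    """Replace each sparse-directory entry (path ends in "/", ID names a
    tree) with one file entry per blob found by parsing that tree."""
    full = []
    for path, oid in entries:
        if path.endswith("/"):
            full.extend(walk_tree(oid, path))
        else:
            full.append((path, oid))
    return full


SPARSE = [("README", "blob-0"), ("docs/", "tree-docs"), ("src/main.c", "blob-3")]
FULL = ensure_full_index_sketch(SPARSE)
```

 Each sparse-directory entry costs at least one tree parse per nested
 directory, which is why expanding on every load is slower than reading
 an index that was full to begin with.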

 To start, we use a new required index extension, `sdir`, to allow
 inserting sparse-directory entries into indexes with file format
 versions 2, 3, and 4. This prevents Git versions that do not understand
 the sparse-index from operating on one, while allowing tools that do not
 understand the sparse-index to operate on repositories as long as they do
 not interact with the index. A new format, index v5, will be introduced
 that includes sparse-directory entries by default. It might also
 introduce other features that have been considered for improving the
 index, as well.

 Next, consumers of the index will be guarded against operating on a
 sparse-index by inserting calls to `ensure_full_index()` or
 `expand_index_to_path()`. Lookups of a specific path will be protected
 from within the `index_file_exists()` and `index_name_pos()` API calls,
 which will call `ensure_full_index()` if necessary. The
 intention here is to preserve existing behavior when interacting with a
 sparse-checkout. We don't want a change to happen by accident, without
 tests. Many of these locations may not need any change before removing the
 guards, but we should not do so without tests to ensure the expected
 behavior happens.
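
 One way to picture a guarded lookup is the toy class below. It is a
 sketch under assumptions, not the actual `index_name_pos()`: the class
 `ToyIndex`, its `expansions` table, and the reading of "if necessary"
 as "the requested path falls inside a sparse-directory entry" are all
 illustrative choices.

```python
from bisect import bisect_left


class ToyIndex:
    def __init__(self, paths, expansions):
        self.paths = sorted(paths)      # entry names, kept in sorted order
        self.expansions = expansions    # sparse dir -> files it stands for
        self.expand_count = 0           # how often we had to expand

    def ensure_full_index(self):
        """Replace every sparse-directory entry with its files."""
        full = []
        for p in self.paths:
            full.extend(self.expansions.get(p, [p]))
        self.paths = sorted(full)
        self.expansions = {}
        self.expand_count += 1

    def name_pos(self, path):
        """Return the position of `path`, or -(insert position) - 1 if it
        is absent. Expand first if a sparse directory hides the path."""
        i = bisect_left(self.paths, path)
        if i < len(self.paths) and self.paths[i] == path:
            return i
        if i > 0:
            before = self.paths[i - 1]
            # Guard: the entry just before the insertion point is a sparse
            # directory containing `path`, so expand and search again.
            if before.endswith("/") and path.startswith(before):
                self.ensure_full_index()
                return self.name_pos(path)
        return -i - 1


idx = ToyIndex(
    ["README", "docs/", "src/main.c"],
    {"docs/": ["docs/a.txt", "docs/img/b.png"]},
)
pos_populated = idx.name_pos("src/main.c")   # no expansion needed
expansions_before = idx.expand_count
pos_hidden = idx.name_pos("docs/a.txt")      # triggers one expansion
```

 Callers that only ask about populated paths never pay for the
 expansion, while a lookup inside a collapsed directory still returns
 the same answer a full index would give.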

 It may be desirable to _change_ the behavior of some commands in the
 presence of a sparse index or more generally in any sparse-checkout
 scenario. In such cases, these should be carefully communicated and
 tested. No such behavior changes are intended during this phase.

 A scan of the codebase shows that not every loop over the cache entries
 needs an `ensure_full_index()` check. The basic reasons include:

 1. The loop is scanning for entries with non-zero stage. These entries
    are not collapsed into a sparse-directory entry.

 2. The loop is scanning for submodules. These entries are not collapsed
    into a sparse-directory entry.

 3. The loop is part of the index API, especially around reading or
    writing the format.

 4. The loop is checking for correct order of cache entries and that is
    correct if and only if the sparse-directory entries are in the correct
    location.

 5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
    otherwise already aware of sparse directory entries.

 6. The sparse-index is disabled at this point when using the split-index
    feature, so no effort is made to protect the split-index API.
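
 Reasons 1 and 5 can be illustrated with a small toy model. The `Entry`
 class and both helper functions are hypothetical and only mirror the
 fields mentioned in the text. The point is that such loops produce the
 same result on a full index and on its sparse counterpart.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    path: str                    # a trailing "/" marks a sparse directory
    stage: int = 0               # non-zero while a path is in conflict
    skip_worktree: bool = False


def conflicted_paths(entries):
    """Reason 1: only entries with non-zero stage matter, and those are
    never collapsed into a sparse-directory entry."""
    return sorted({e.path for e in entries if e.stage != 0})


def populated_paths(entries):
    """Reason 5: entries with SKIP_WORKTREE are ignored, which also skips
    every sparse-directory entry."""
    return [e.path for e in entries if not e.skip_worktree]


FULL = [
    Entry("README"),
    Entry("docs/a.txt", skip_worktree=True),
    Entry("docs/b.txt", skip_worktree=True),
    Entry("src/main.c", stage=2),
    Entry("src/main.c", stage=3),
]
SPARSE = [
    Entry("README"),
    Entry("docs/", skip_worktree=True),
    Entry("src/main.c", stage=2),
    Entry("src/main.c", stage=3),
]
```

 Because neither loop looks at the entries under `docs/`, replacing them
 with a single sparse-directory entry does not change what the loops
 report.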

 Even after inserting these guards, we will keep expanding sparse-indexes
 for most Git commands using the `command_requires_full_index` repository
 setting. This setting will be on by default and disabled one builtin at a
 time until we have sufficient confidence that all of the index operations
 are properly guarded.
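
 The opt-out flow can be sketched as follows. This is a toy model, not
 Git's code: only the setting name `command_requires_full_index` comes
 from the text, while `RepoSettings`, `expand()`, and
 `read_index_for_command()` are hypothetical names.

```python
def expand(entries, trees):
    """Replace each sparse-directory entry (name ends in "/") with the
    files it stands for."""
    full = []
    for path in entries:
        full.extend(trees.get(path, [path]))
    return full


class RepoSettings:
    def __init__(self):
        # On by default: a command gets a full index unless it opts out.
        self.command_requires_full_index = True


def read_index_for_command(settings, on_disk, trees):
    """Return the in-memory entries a command would see."""
    if settings.command_requires_full_index:
        return expand(on_disk, trees)
    return list(on_disk)


TREES = {"docs/": ["docs/a.txt", "docs/b.txt"]}
ON_DISK = ["README", "docs/", "src/main.c"]

unconverted = RepoSettings()                    # builtin not yet integrated
converted = RepoSettings()                      # builtin known to be safe
converted.command_requires_full_index = False

full_view = read_index_for_command(unconverted, ON_DISK, TREES)
sparse_view = read_index_for_command(converted, ON_DISK, TREES)
```

 An unconverted builtin always sees a full index, so it cannot trip over
 a sparse-directory entry. A converted builtin keeps the smaller index
 and relies on the API guards for any path it has not been taught to
 handle.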

 To complete this phase, the commands `git status` and `git add` will be
 integrated with the sparse-index so that they operate with O(Populated)
 performance. They will be carefully tested for operations within and
 outside the sparse-checkout definition.

 Phase II: Careful integrations
 ------------------------------

 This phase focuses on ensuring that all index extensions and APIs work
 well with a sparse-index. This requires significant increases to our test
 coverage, especially for operations that interact with the working
 directory outside of the sparse-checkout definition. Some of the current
 behaviors may not be desirable, as shown by tests already marked as
 expected failures in `t1092-sparse-checkout-compatibility.sh`.

 The index extensions that may require special integrations are:

 * FS Monitor
 * Untracked cache

 While integrating with these features, we should look for patterns that
 might lead to better APIs for interacting with the index. Coalescing
 common usage patterns into an API call can reduce the number of places
 where sparse-directories need to be handled carefully.

 Phase III: Important command speedups
 -------------------------------------

 At this point, the patterns for testing and implementing sparse-directory
 logic should be relatively stable. This phase focuses on updating some of
 the most common builtins that use the index to operate as O(Populated).
 Here is a potential list of commands that could be valuable to integrate
 at this point:

 * `git commit`
 * `git checkout`
 * `git merge`
 * `git rebase`

 Hopefully, commands such as `git merge` and `git rebase` can benefit
 instead from merge algorithms that do not use the index as a data
 structure, such as the merge-ORT strategy. As these topics mature, we
 may enable the ORT strategy by default for repositories using the
 sparse-index feature.

 Along with `git status` and `git add`, these commands cover the majority
 of users' interactions with the working directory. In addition, we can
 integrate with these commands:

 * `git grep`
 * `git rm`

 These commands have been proposed as candidates whose behavior could
 change in a repository with a sparse-checkout definition. It would be
 good to enable the new behavior automatically when using a sparse-index.
 The switch in behavior needs to be communicated clearly to the user.

 This phase is the first where parallel work might be possible without too
 many conflicts between topics.

 Phase IV: The long tail
 -----------------------

 This last phase is less a "phase" and more "the new normal" after all of
 the previous work.

 To start, the `command_requires_full_index` option could be removed in
 favor of expanding only when hitting an API guard.

 There are many Git commands that could use special attention to operate as
 O(Populated), while some might be so rare that it is acceptable to leave
 them with additional overhead when a sparse-index is present.

 Here are some commands that might be useful to update:

 * `git sparse-checkout set`
 * `git am`
 * `git clean`
 * `git stash`