Parallel Checkout Design Notes
==============================

The "Parallel Checkout" feature attempts to use multiple processes to
parallelize the work of uncompressing the blobs, applying in-core
filters, and writing the resulting contents to the working tree during a
checkout operation. It can be used by all checkout-related commands,
such as `clone`, `checkout`, `reset`, `sparse-checkout`, and others.

These commands share the following basic structure:

* Step 1: Read the current index file into memory.

* Step 2: Modify the in-memory index based upon the command, and
  temporarily mark all cache entries that need to be updated.

* Step 3: Populate the working tree to match the new candidate index.
  This includes iterating over all of the to-be-updated cache entries
  and deleting, creating, or overwriting the associated files in the
  working tree.

* Step 4: Write the new index to disk.

Step 3 is the focus of the "parallel checkout" effort described here.
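
To make this concrete, the following sketch shows roughly how these
steps map onto Git's internals. It is not a real codepath: the calls
are simplified stand-ins loosely modeled on Git's internal APIs, with
most parameters and all error handling omitted.

----------------------------------------------
/*
 * Rough sketch only; simplified stand-ins for internal APIs,
 * with most parameters and all error handling omitted.
 */
repo_read_index(the_repository);        /* Step 1: load the index */
unpack_trees(nr_trees, trees, &opts);   /* Steps 2 and 3: mark the
                                         * to-be-updated entries and
                                         * populate the working tree */
write_locked_index(istate, &lock_file, COMMIT_LOCK); /* Step 4 */
----------------------------------------------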

Sequential Implementation
-------------------------

For the purposes of discussion here, the current sequential
implementation of Step 3 is divided into three parts, each one
implemented in its own function:

* Step 3a: `unpack-trees.c:check_updates()` contains a series of
  sequential loops iterating over the index's `cache_entry` array. The
  main loop in this function calls the Step 3b function for each of the
  to-be-updated entries.

* Step 3b: `entry.c:checkout_entry()` examines the existing working tree
  for file conflicts, collisions, and unsaved changes. It removes files
  and creates leading directories as necessary. It calls the Step 3c
  function for each entry to be written.

* Step 3c: `entry.c:write_entry()` loads the blob into memory, smudges
  it if necessary, creates the file in the working tree, writes the
  smudged contents, calls `fstat()` or `lstat()`, and updates the
  associated `cache_entry` struct with the stat information gathered.
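
A condensed sketch of this call chain may help. The bodies below are
heavily simplified pseudocode; the real functions take more parameters
and do much more error handling.

----------------------------------------------
/* Heavily condensed sketch of the sequential Steps 3a-3c. */
static int check_updates(...)                    /* Step 3a */
{
	int i, errs = 0;
	for (i = 0; i < index->cache_nr; i++) {
		struct cache_entry *ce = index->cache[i];
		if (ce->ce_flags & CE_UPDATE)    /* marked at Step 2 */
			errs |= checkout_entry(ce, &state, NULL, NULL);
	}
	return errs;
}

int checkout_entry(...)                          /* Step 3b */
{
	/* check for conflicts, collisions, and unsaved changes;
	 * remove blocking files; create leading directories ... */
	return write_entry(...);                 /* Step 3c */
}
----------------------------------------------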

It wouldn't be safe to perform Step 3b in parallel, as there could be
race conditions between file creations and removals. Instead, the
parallel checkout framework lets the sequential code handle Step 3b,
and uses parallel workers to replace the sequential
`entry.c:write_entry()` calls from Step 3c.

Rejected Multi-Threaded Solution
--------------------------------

The most "straightforward" implementation would be to spread the set of
to-be-updated cache entries across multiple threads. But due to the
thread-unsafe functions in the object database code, we would have to
use locks to coordinate the parallel operation. An early prototype of
this solution showed that multi-threaded checkout does bring performance
improvements over the sequential code, but lock contention remained too
high: profiling with `perf` indicated that around 20% of the runtime of
a local Linux clone (on an SSD) was spent in locking functions. For this
reason, this approach was rejected in favor of using multiple child
processes, which led to better performance.

Multi-Process Solution
----------------------

Parallel checkout alters the aforementioned Step 3 to use multiple
`checkout--worker` background processes to distribute the work. The
long-running worker processes are controlled by the foreground Git
command using the existing run-command API.
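
As a minimal illustration, a worker could be spawned roughly as below.
The actual setup in `parallel-checkout.c` is more involved (the worker
options and the packet-based protocol over the pipes are omitted here),
so treat this only as a sketch of the run-command usage.

----------------------------------------------
/* Sketch: spawning one `checkout--worker` child process. */
struct child_process cp = CHILD_PROCESS_INIT;

cp.git_cmd = 1;
cp.in = -1;   /* pipe to send queued entries to the worker */
cp.out = -1;  /* pipe to receive the worker's results */
strvec_push(&cp.args, "checkout--worker");
if (start_command(&cp))
	die("failed to spawn checkout worker");
----------------------------------------------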

Overview
~~~~~~~~

Step 3b is only slightly altered; for each entry to be checked out, the
main process performs the following steps:

* M1: Check whether there is any untracked or unclean file in the
  working tree which would be overwritten by this entry, and decide
  whether to proceed (removing the file(s)) or not.

* M2: Create the leading directories.

* M3: Load the conversion attributes for the entry's path.

* M4: Check, based on the entry's type and conversion attributes,
  whether the entry is eligible for parallel checkout (more on this
  later). If it is eligible, enqueue the entry and the loaded
  attributes to later write the entry in parallel. If not, write the
  entry right away, using the default sequential code.

Note: we save the conversion attributes associated with each entry
because the workers don't have access to the main process' index state,
so they can't load the attributes by themselves (and the attributes are
needed to properly smudge the entry). Additionally, this has a positive
impact on performance as (1) we don't need to load the attributes twice
and (2) the attributes machinery is optimized to handle paths in
sequential order.
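
In rough code, steps M3 and M4 could look like the sketch below.
`convert_attrs()` is the real attribute loader, but the enqueueing and
eligibility checks are shown with simplified, illustrative signatures.

----------------------------------------------
/*
 * Sketch of steps M3 and M4 inside checkout_entry(), with
 * simplified signatures. The loaded conversion attributes are
 * passed along because the workers cannot load them themselves.
 */
struct conv_attrs ca;

convert_attrs(state->istate, &ca, ce->name);      /* M3 */
if (parallel_checkout_enabled() &&
    is_eligible_for_parallel_checkout(ce, &ca))   /* M4 */
	return enqueue_checkout(ce, &ca);         /* write later */
return write_entry(ce, path, &ca, state, 0);      /* write now */
----------------------------------------------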

After all entries have passed through the above steps, the main process
checks if the number of enqueued entries is sufficient to spread among
the workers. If not, it just writes them sequentially. Otherwise, it
spawns the workers and distributes the queued entries uniformly in
contiguous chunks. This aims to minimize the chances of two workers
writing to the same directory simultaneously, which could increase lock
contention in the kernel.
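
For illustration, such a contiguous and nearly uniform distribution can
be computed as follows. This is a standalone sketch;
`send_items_to_worker()` is a hypothetical helper, not Git's actual
code.

----------------------------------------------
/*
 * Sketch: split nr_items queued items among `workers` processes
 * in contiguous, nearly equal chunks.
 */
size_t base = nr_items / workers;   /* minimum chunk size */
size_t extra = nr_items % workers;  /* first `extra` workers get +1 */
size_t offset = 0;

for (size_t w = 0; w < workers; w++) {
	size_t chunk = base + (w < extra ? 1 : 0);
	send_items_to_worker(w, items + offset, chunk); /* hypothetical */
	offset += chunk;
}
----------------------------------------------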

Then, for each assigned item, each worker:

* W1: Checks if there is any non-directory file in the leading part of
  the entry's path, or if there already exists a file at the entry's
  path. If so, it marks the entry with `PC_ITEM_COLLIDED` and skips it
  (more on this later).

* W2: Creates the file (with O_CREAT and O_EXCL).

* W3: Loads the blob into memory (inflating and delta reconstructing
  it).

* W4: Applies any required in-process filter, like end-of-line
  conversion and re-encoding.

* W5: Writes the result to the file descriptor opened at W2.

* W6: Calls `fstat()` or `lstat()` on the just-written path, and sends
  the result back to the main process, together with the end status of
  the operation and the item's identification number.

Note that, when possible, steps W3 to W5 are delegated to the streaming
machinery, removing the need to keep the entire blob in memory.

If the worker fails to read the blob or to write it to the working tree,
it removes the created file to avoid leaving empty files behind. This is
the *only* time a worker is allowed to remove a file.
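
The following is a condensed sketch of steps W3 to W6, including the
failure cleanup just described. `load_and_smudge_blob()` is a
hypothetical stand-in for the blob loading, filtering, and streaming
code, `item` is a hypothetical per-entry struct, and the `PC_ITEM_*`
statuses other than `PC_ITEM_COLLIDED` are illustrative.

----------------------------------------------
/* Sketch of W3-W6, after the file was created at W2. */
if (load_and_smudge_blob(item, fd)) {    /* W3-W5 */
	close(fd);
	unlink(item->path);  /* the only removal a worker may do */
	item->status = PC_ITEM_FAILED;
	return;
}
fstat(fd, &item->st);                    /* W6: gather stat data */
close(fd);
item->status = PC_ITEM_WRITTEN;
/* W6: report status, stat data, and item id to the main process */
----------------------------------------------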

As mentioned earlier, it is the responsibility of the main process to
remove any file that blocks the checkout operation (or abort if the
removal(s) would cause data loss and the user didn't ask to `--force`).
This is crucial to avoid race conditions and also to properly detect
path collisions at Step W1.

After the workers finish writing the items and sending back the required
information, the main process handles the results in two steps:

- First, it updates the in-memory index with the `lstat()` information
  sent by the workers. (This must be done first as this information
  might be required in the following step.)

- Then it writes the items which collided on disk (i.e. items marked
  with `PC_ITEM_COLLIDED`). More on this below.
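
In rough code, these two steps could look like this.
`fill_stat_cache_info()` is Git's real helper for copying `struct stat`
data into a cache entry; the loops are pseudocode in the same style as
the API example at the end of this document.

----------------------------------------------
/* Sketch of the main process handling the workers' results. */
for (each result received from a worker)
	if (item->status == PC_ITEM_WRITTEN)
		fill_stat_cache_info(istate, item->ce, &item->st);

for (each item marked with PC_ITEM_COLLIDED)
	err |= checkout_entry(item->ce, &state, NULL, NULL);
----------------------------------------------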

Path Collisions
---------------

Path collisions happen when two different paths correspond to the same
entry in the file system. E.g. the paths 'a' and 'A' would collide in a
case-insensitive file system.

The sequential checkout deals with collisions in the same way that it
deals with files that were already present in the working tree before
checkout. Basically, it checks if the path that it wants to write
already exists on disk, makes sure the existing file doesn't have
unsaved data, and then overwrites it. (To be more pedantic: it deletes
the existing file and creates the new one.) So, if there are multiple
colliding files to be checked out, the sequential code will write each
one of them but only the last will actually survive on disk.

Parallel checkout aims to reproduce the same behavior. However, we
cannot let the workers racily write to the same file on disk. Instead,
the workers detect when the entry that they want to check out would
collide with an existing file, and mark it with `PC_ITEM_COLLIDED`.
Later, the main process can sequentially feed these entries back to
`checkout_entry()` without the risk of race conditions. On clone, this
also has the effect of marking the colliding entries to later emit a
warning for the user, like the classic sequential checkout does.

The workers are able to detect both collisions among the entries being
concurrently written and collisions between a parallel-eligible entry
and an ineligible entry. The general idea for collision detection is
quite straightforward: for each parallel-eligible entry, the main
process must remove all files that prevent this entry from being written
(before enqueueing it). This includes any non-directory file in the
leading path of the entry. Later, when a worker gets assigned the entry,
it looks again for the non-directory files and for an already existing
file at the entry's path. If any of these checks finds something, the
worker knows that there was a path collision.
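
Concretely, the worker-side checks (Step W1, plus the O_EXCL file
creation at W2) can be sketched as below. `has_dirs_only_path()` is
Git's real cached-lstat helper for checking that every leading
component of a path is a directory; the rest is simplified.

----------------------------------------------
/*
 * Sketch of the worker-side collision detection. Anything the
 * main process removed before enqueueing this entry can only be
 * back if another checked-out entry collided with it.
 */
if (!has_dirs_only_path(path, strlen(path), base_dir_len)) {
	/* A non-directory file in the leading path: collision. */
	item->status = PC_ITEM_COLLIDED;
	return;
}
fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);   /* W2 */
if (fd < 0 && (errno == EEXIST || errno == EISDIR)) {
	/* Something already occupies the entry's path: collision. */
	item->status = PC_ITEM_COLLIDED;
	return;
}
----------------------------------------------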

Because parallel checkout can distinguish path collisions from the case
where the file was already present in the working tree before checkout,
we could alternatively choose to skip the checkout of colliding entries.
However, each entry that doesn't get written would have null `lstat()`
fields on the index. This could cause performance penalties for
subsequent commands that need to refresh the index, as they would have
to go to the file system to see if the entry is dirty. Thus, if we have
N entries in a colliding group and we decide to write and `lstat()` only
one of them, every subsequent `git status` will have to read, convert,
and hash the written file N - 1 times. By checking out all colliding
entries (like the sequential code does), we only pay the overhead once,
during checkout.

Eligible Entries for Parallel Checkout
--------------------------------------

As previously mentioned, not all entries passed to `checkout_entry()`
will be considered eligible for parallel checkout. More specifically, we
exclude:

- Symbolic links; to avoid race conditions that, in combination with
  path collisions, could cause workers to write files at the wrong
  place. For example, if we were to concurrently check out a symlink
  'a' -> 'b' and a regular file 'A/f' in a case-insensitive file system,
  we could potentially end up writing the file 'A/f' at 'a/f', due to a
  race condition.

- Regular files that require external filters (either "one shot"
  filters or long-running process filters). These filters are black
  boxes to Git and may have their own internal locking or
  non-concurrent assumptions. So it might not be safe to run multiple
  instances in parallel.
+
Besides, long-running filters may use the delayed checkout feature to
postpone the return of some filtered blobs. The delayed checkout queue
and the parallel checkout queue are not compatible and should remain
separate.
+
Note: regular files that only require internal filters, like end-of-line
conversion and re-encoding, are eligible for parallel checkout.

Ineligible entries are checked out by the classic sequential codepath
*before* spawning workers.
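
The eligibility test itself can be sketched as below. The `CA_CLASS_*`
values come from Git's conversion-classification API, but the helper's
exact name and the function body here are condensed and may differ from
the real code.

----------------------------------------------
/*
 * Condensed sketch of the eligibility test: only regular files
 * whose conversion is fully in-process (or streamable) qualify.
 */
static int is_eligible_for_parallel_checkout(const struct cache_entry *ce,
					     const struct conv_attrs *ca)
{
	if (!S_ISREG(ce->ce_mode))
		return 0;  /* e.g. symlinks are excluded */

	switch (classify_conversion_attrs(ca)) {
	case CA_CLASS_INCORE:
	case CA_CLASS_STREAMABLE:
		return 1;  /* internal filters only: eligible */
	default:
		return 0;  /* external or long-running filters */
	}
}
----------------------------------------------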

Note: submodules' files are also eligible for parallel checkout (as
long as they don't fall into any of the excluding categories mentioned
above). But since each submodule is checked out in its own child
process, we don't mix the superproject's and the submodules' files in
the same parallel checkout process or queue.

The API
-------

The parallel checkout API was designed with the goal of minimizing
changes to the current users of the checkout machinery. This means that
they don't have to call a different function for sequential or parallel
checkout. As already mentioned, `checkout_entry()` will automatically
insert the given entry in the parallel checkout queue when this feature
is enabled and the entry is eligible; otherwise, it will just write the
entry right away, using the sequential code. In general, callers of the
parallel checkout API should look similar to this:

----------------------------------------------
int pc_workers, pc_threshold, err = 0;
struct checkout state;

get_parallel_checkout_configs(&pc_workers, &pc_threshold);

/*
 * This check is not strictly required, but it
 * should save some time in sequential mode.
 */
if (pc_workers > 1)
	init_parallel_checkout();

for (each cache_entry ce to-be-updated)
	err |= checkout_entry(ce, &state, NULL, NULL);

err |= run_parallel_checkout(&state, pc_workers, pc_threshold, NULL, NULL);
----------------------------------------------
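
Note that the caller doesn't need to handle the sequential fallback
itself: when the feature is disabled or the number of enqueued entries
doesn't reach `pc_threshold`, the queued entries are simply written
sequentially, so the call sequence above works the same way in both
modes. (The trailing `NULL` arguments in the example stand for optional
parameters that are not relevant here.)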