Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for Git that
allows Git to function without having a complete copy of the repository.
The goal of this work is to allow Git to better handle extremely large
repositories.

During clone and fetch operations, Git downloads the complete contents
and history of the repository. This includes all commits, trees, and
blobs for the complete life of the repository. For extremely large
repositories, clones can take hours (or days) and consume 100+GiB of disk
space.

Often in these repositories there are many blobs and trees that the user
does not need, such as:

  1. files outside of the user's work area in the tree. For example, in
     a repository with 500K directories and 3.5M files in every commit,
     we can avoid downloading many objects if the user only needs a
     narrow "cone" of the source tree.

  2. large binary assets. For example, in a repository where large build
     artifacts are checked into the tree, we can avoid downloading all
     previous versions of these non-mergeable binary assets and only
     download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects *in
advance* during clone and fetch operations and thereby reduce download
times and disk usage. Missing objects can later be "demand fetched"
if/when needed.

A remote that can later provide the missing objects is called a
promisor remote, as it promises to send the objects when
requested. Initially Git supported only one promisor remote, the origin
remote from which the user cloned and that was configured in the
"extensions.partialClone" config option. Support for more than
one promisor remote was implemented later.

Use of partial clone requires that the user be online and the origin
remote or other promisor remotes be available for on-demand fetching
of missing objects. This may or may not be problematic for the user.
For example, if the user can stay within the pre-selected subset of
the source tree, they may not encounter any missing objects.
Alternatively, the user could try to pre-fetch various objects if they
know that they are going offline.


Non-Goals
---------

Partial clone is a mechanism to limit the number of blobs and trees downloaded
*within* a given range of commits -- and is therefore independent of and not
intended to conflict with existing DAG-level mechanisms to limit the set of
requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.
+
This uses the existing capability discovery mechanism.
See "filter" in linkgit:gitprotocol-pack[5].

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.
+
There are various filters available to accommodate different situations.
See "--filter=<filter-spec>" in Documentation/rev-list-options.adoc.

- On the server pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.
+
These filtered packfiles are *incomplete* in the traditional sense because
they may contain objects that reference objects not contained in the
packfile and that the client doesn't already have. For example, the
filtered packfile may contain trees or tags that reference missing blobs
or commits that reference missing trees.

- On the client these incomplete packfiles are marked as "promisor packfiles"
  and treated differently by various commands.

- On the client a repository extension is added to the local config to
  prevent older versions of git from failing mid-operation because of
  missing objects that they cannot handle.
  See `extensions.partialClone` in linkgit:git-config[1].
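

The effect of a filter-spec on packfile construction can be pictured with a
small model. The sketch below is Python, not Git's implementation: the `Obj`
type, the `apply_filter` and `dangling_refs` helpers, and the sample objects
are all hypothetical. Only the filter-spec names `blob:none` and
`blob:limit=<n>` are taken from Git, and the size suffixes Git accepts are
deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str          # object name, shortened for the example
    kind: str         # "commit", "tree", or "blob"
    size: int = 0     # size in bytes (only used for blobs here)
    refs: tuple = ()  # oids that this object references


def apply_filter(objects, filter_spec):
    """Return the objects a server would pack under a simplified filter."""
    if filter_spec == "blob:none":
        keep = lambda o: o.kind != "blob"
    elif filter_spec.startswith("blob:limit="):
        # Assumption: a plain byte count; blobs of at least this size are omitted.
        limit = int(filter_spec[len("blob:limit="):])
        keep = lambda o: o.kind != "blob" or o.size < limit
    else:
        raise ValueError(f"filter-spec not modeled in this sketch: {filter_spec}")
    return [o for o in objects if keep(o)]


def dangling_refs(packed):
    """Oids referenced from the pack but not contained in it."""
    present = {o.oid for o in packed}
    return sorted({r for o in packed for r in o.refs} - present)


repo = [
    Obj("c1", "commit", refs=("t1",)),
    Obj("t1", "tree", refs=("b1", "b2")),
    Obj("b1", "blob", size=120),
    Obj("b2", "blob", size=5_000_000),
]

pack = apply_filter(repo, "blob:limit=1024")
print([o.oid for o in pack])  # objects that were sent
print(dangling_refs(pack))    # references the pack alone cannot satisfy
```

A non-empty result from `dangling_refs` is what makes such a packfile
"incomplete" in the sense described above: the tree is present, but one of
the blobs it names is not.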


Handling Missing Objects
------------------------

- An object may be missing due to a partial clone or fetch, or missing
  due to repository corruption. To differentiate these cases, the
  local repository specially indicates such filtered packfiles
  obtained from promisor remotes as "promisor packfiles".
+
These promisor packfiles consist of a "<name>.promisor" file with
arbitrary contents (like the "<name>.keep" files), in addition to
their "<name>.pack" and "<name>.idx" files.

- The local repository considers a "promisor object" to be an object that
  it knows (to the best of its ability) that promisor remotes have promised
  that they have, either because the local repository has that object in one of
  its promisor packfiles, or because another promisor object refers to it.
+
When Git encounters a missing object, Git can see if it is a promisor object
and handle it appropriately. If not, Git can report a corruption.
+
This means that there is no need for the client to explicitly maintain an
expensive-to-modify list of missing objects.[a]

- Since almost all Git code currently expects any referenced object to be
  present locally and because we do not want to force every command to do
  a dry-run first, a fallback mechanism is added to allow Git to attempt
  to dynamically fetch missing objects from promisor remotes.
+
When the normal object lookup fails to find an object, Git invokes
promisor_remote_get_direct() to try to get the object from a promisor
remote and then retry the object lookup. This allows objects to be
"faulted in" without complicated prediction algorithms.
+
For efficiency reasons, no check as to whether the missing object is
actually a promisor object is performed.
+
Dynamic object fetching tends to be slow as objects are fetched one at
a time.

- `checkout` (and any other command using `unpack-trees`) has been taught
  to bulk pre-fetch all required missing blobs in a single batch.

- `rev-list` has been taught to print missing objects.
+
This can be used by other commands to bulk prefetch objects.
For example, a "git log -p A..B" may internally want to first do
something like "git rev-list --objects --quiet --missing=print A..B"
and prefetch those objects in bulk.

- `fsck` has been updated to be fully aware of promisor objects.

- `repack` in GC has been updated to not touch promisor packfiles at all,
  and to only repack other objects.

- The global variable "fetch_if_missing" is used to control whether an
  object lookup will attempt to dynamically fetch a missing object or
  report an error.
+
We are not happy with this global variable and would like to remove it,
but that requires significant refactoring of the object code to pass an
additional flag.
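

The promisor-object rule and the fetch-on-miss fallback described in this
section can be modeled in a few lines. This is a Python sketch under
simplifying assumptions, not Git's C code: the `ObjectStore` class and its
methods are hypothetical, the promisor remote is a plain dictionary, and
"another promisor object refers to it" is approximated by looking only at
objects stored in promisor packfiles. The `fetch_if_missing` attribute is
named after the global variable mentioned above.

```python
class ObjectStore:
    """Toy object store: each oid maps to the oids it references."""

    def __init__(self, fetch_if_missing=True):
        self.objects = {}
        self.in_promisor_pack = set()
        self.fetch_if_missing = fetch_if_missing

    def add(self, oid, refs=(), promisor=False):
        self.objects[oid] = tuple(refs)
        if promisor:
            self.in_promisor_pack.add(oid)

    def is_promisor_object(self, oid):
        """True if oid is in a promisor pack or referenced from one."""
        if oid in self.in_promisor_pack:
            return True
        return any(oid in self.objects[p] for p in self.in_promisor_pack)

    def lookup(self, oid, remote):
        """Normal lookup first; on a miss, fetch from the remote and retry."""
        if oid in self.objects:
            return self.objects[oid]
        if not self.fetch_if_missing:
            raise KeyError(f"missing object {oid}")
        # Stand-in for promisor_remote_get_direct(); as in the text, no
        # is_promisor_object() check is made before fetching.
        if oid not in remote:
            raise KeyError(f"promisor remote does not have {oid}")
        self.add(oid, remote[oid], promisor=True)
        return self.objects[oid]


store = ObjectStore()
store.add("commit1", refs=["tree1"], promisor=True)
store.add("tree1", refs=["blob1", "blob2"], promisor=True)
remote = {"blob1": (), "blob2": ()}

print(store.is_promisor_object("blob1"))  # missing locally, but promised
print(store.is_promisor_object("blobX"))  # missing and unreferenced
print(store.lookup("blob1", remote))      # faulted in on first access
```

In this model a missing object for which `is_promisor_object` returns false
is the case a checker such as `fsck` would report as corruption, while a
store created with `fetch_if_missing=False` reports every miss as an error
instead of contacting the remote.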


Fetching Missing Objects
------------------------

- Fetching of objects is done by invoking a "git fetch" subprocess.

- The local repository sends a request with the hashes of all requested
  objects, and does not perform any packfile negotiation.
  It then receives a packfile.

- Because we are reusing the existing fetch mechanism, fetching
  currently fetches all objects referred to by the requested objects, even
  though they are not necessary.

- Fetching with `--refetch` will request a complete new filtered packfile from
  the remote, which can be used to change a filter without needing to
  dynamically fetch missing objects.


Using many promisor remotes
---------------------------

Many promisor remotes can be configured and used.

This allows a user, for example, to have multiple geographically-close
cache servers for fetching missing blobs while continuing to do
filtered `git-fetch` commands from the central server.

When fetching objects, promisor remotes are tried one after the other
until all the objects have been fetched.

Remotes that are considered "promisor" remotes are those specified by
the following configuration variables:

- `extensions.partialClone = <name>`

- `remote.<name>.promisor = true`

- `remote.<name>.partialCloneFilter = ...`

Only one promisor remote can be configured using the
`extensions.partialClone` config variable. This promisor remote will
be the last one tried when fetching objects.

We decided to make it the last one we try, because it is likely that
someone using many promisor remotes is doing so because the other
promisor remotes are better for some reason (maybe they are closer or
faster for some kind of objects) than the origin, and the origin is
likely to be the remote specified by extensions.partialClone.

This justification is not very strong, but one choice had to be made,
and anyway the long term plan should be to make the order somehow
fully configurable.

For now though the other promisor remotes will be tried in the order
they appear in the config file.
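

The ordering rule above can be written down as a short model. The Python
below is a sketch, not Git's config parser: the function names, the remote
names (`cache-eu`, `cache-us`), and the representation of the config as a
list of key/value pairs in file order are all assumptions made for
illustration.

```python
def promisor_remote_order(config):
    """Return promisor remote names in the order they would be tried.

    config is a list of (key, value) pairs in config-file order.
    """
    order = []
    last = None
    for key, value in config:
        if key == "extensions.partialClone":
            last = value
        elif key.startswith("remote."):
            # Remote names may contain dots, so split on the final one.
            name, _, var = key[len("remote."):].rpartition(".")
            is_promisor = (var == "promisor" and value == "true") \
                or var == "partialCloneFilter"
            if is_promisor and name not in order:
                order.append(name)
    if last is not None:
        # The extensions.partialClone remote is always tried last.
        if last in order:
            order.remove(last)
        order.append(last)
    return order


def fetch_missing(oids, order, remote_contents):
    """Try remotes in order until every oid has been found."""
    remaining = set(oids)
    source = {}
    for name in order:
        if not remaining:
            break
        found = remaining & remote_contents.get(name, set())
        for oid in found:
            source[oid] = name
        remaining -= found
    return source, remaining


config = [
    ("extensions.partialClone", "origin"),
    ("remote.origin.promisor", "true"),
    ("remote.cache-eu.promisor", "true"),
    ("remote.cache-us.partialCloneFilter", "blob:none"),
]
order = promisor_remote_order(config)
print(order)

contents = {"cache-eu": {"b1"}, "origin": {"b1", "b2"}}
print(fetch_missing(["b1", "b2"], order, contents))
```

Here `b1` is served by the cache even though the origin also has it, and the
origin is contacted only for `b2`, which no earlier remote could provide.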


Current Limitations
-------------------

- It is not possible to specify the order in which the promisor
  remotes are tried in other ways than the order in which they appear
  in the config file.
+
It is also not possible to specify an order to be used when fetching
from one remote and a different order when fetching from another
remote.

- It is not possible to push only specific objects to a promisor
  remote.
+
It is not possible to push at the same time to multiple promisor
remotes in a specific order.

- Dynamic object fetching will only ask promisor remotes for missing
  objects. We assume that promisor remotes have a complete view of the
  repository and can satisfy all such requests.

- Repack essentially treats promisor and non-promisor packfiles as 2
  distinct partitions and does not mix them.

- Dynamic object fetching invokes fetch-pack once *for each item*
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work. This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack. The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.


Future Work
-----------

- Improve the way to specify the order in which promisor remotes are
  tried.
+
For example, this could allow explicitly specifying something like:
"When fetching from this remote, I want to use these promisor remotes
in this order, though, when pushing or fetching to that remote, I want
to use those promisor remotes in that order."

- Allow pushing to promisor remotes.
+
The user might want to work in a triangular workflow with multiple
promisor remotes that each have an incomplete view of the repository.

- Allow non-pathname-based filters to make use of packfile bitmaps (when
  present). This was just an omission during the initial implementation.

- Investigate use of a long-running process to dynamically fetch a series
  of objects, such as proposed in [5,6] to reduce process startup and
  overhead costs.
+
It would be nice if pack protocol V2 could allow that long-running
process to make a series of requests over a single long-running
connection.

- Investigate pack protocol V2 to avoid the info/refs broadcast on
  each connection with the server to dynamically fetch missing objects.

- Investigate the need to handle loose promisor objects.
+
Objects in promisor packfiles are allowed to reference missing objects
that can be dynamically fetched from the server. An assumption was
made that loose objects are only created locally and therefore should
not reference a missing object. We may need to revisit that assumption
if, for example, we dynamically fetch a missing tree and store it as a
loose object rather than a single object packfile.
+
This does not necessarily mean we need to mark loose objects as promisor;
it may be sufficient to relax the object lookup or is-promisor functions.


Non-Tasks
---------

- Every time the subject of "demand loading blobs" comes up it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.
+
No work has gone into actually doing that; we're just documenting that
it is a common suggestion. We're not sure how it would work and have
no plans to work on it.
+
It is valid for the server to send more objects than requested (even
for a dynamic object fetch), but we are not building on that.


Footnotes
---------

[a] expensive-to-modify list of missing objects: Earlier in the design of
    partial clone we discussed the need for a single list of missing objects.
    This would essentially be a sorted linear list of OIDs that were
    omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup.
It would need to be read, updated, and re-written (like the .git/index)
on every explicit "git fetch" command *and* on any dynamic object fetch.

The cost to read, update, and write this file could add significant
overhead to every command if there are many missing objects. For example,
if there are 100M missing blobs, this file would be at least 2GiB on disk.

With the "promisor" concept, we *infer* a missing object based upon the
type of packfile that references it.
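

The size estimate in this footnote can be reproduced with simple arithmetic.
The Python below assumes the list would store raw binary object names with no
per-entry overhead; that format is an assumption for illustration, since the
text does not specify one. The `list_size_bytes` helper is hypothetical.

```python
MISSING_BLOBS = 100_000_000  # the 100M missing blobs from the example above


def list_size_bytes(count, oid_bytes):
    """Size of a flat list of `count` object names, `oid_bytes` each."""
    return count * oid_bytes


# 20 bytes is a raw SHA-1 object name, 32 bytes a raw SHA-256 one.
for label, width in (("SHA-1", 20), ("SHA-256", 32)):
    size = list_size_bytes(MISSING_BLOBS, width)
    print(f"{label}: {size:,} bytes = {size / 10**9:.2f} GB = {size / 2**30:.2f} GiB")
```

With 20-byte names the list alone comes to 2,000,000,000 bytes, in line with
the order of magnitude quoted above, and any per-entry metadata or a textual
encoding would only make it larger.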


Related Links
-------------
[0] https://crbug.com/git/2
    Bug#2: Partial Clone

[1] https://lore.kernel.org/git/20170113155253.1644-1-benpeart@microsoft.com/ +
    Subject: [RFC] Add support for downloading blobs on demand +
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://lore.kernel.org/git/cover.1506714999.git.jonathantanmy@google.com/ +
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) +
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://lore.kernel.org/git/20170426221346.25337-1-jonathantanmy@google.com/ +
    Subject: Proposal for missing blob support in Git repos +
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://lore.kernel.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ +
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch +
    Date: Wed, 8 Mar 2017 18:50:29 +0000

[5] https://lore.kernel.org/git/20170505152802.6724-1-benpeart@microsoft.com/ +
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module +
    Date: Fri, 5 May 2017 11:27:52 -0400

[6] https://lore.kernel.org/git/20170714132651.170708-1-benpeart@microsoft.com/ +
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand +
    Date: Fri, 14 Jul 2017 09:26:50 -0400