| Git hash function transition | 
 | ============================ | 
 |  | 
 | Objective | 
 | --------- | 
 | Migrate Git from SHA-1 to a stronger hash function. | 
 |  | 
 | Background | 
 | ---------- | 
 | At its core, the Git version control system is a content addressable | 
 | filesystem. It uses the SHA-1 hash function to name content. For | 
 | example, files, directories, and revisions are referred to by hash | 
 | values unlike in other traditional version control systems where files | 
 | or versions are referred to via sequential numbers. The use of a hash | 
 | function to address its content delivers a few advantages: | 
 |  | 
 | * Integrity checking is easy. Bit flips, for example, are easily | 
 |   detected, as the hash of corrupted content does not match its name. | 
 | * Lookup of objects is fast. | 
 |  | 
 | Using a cryptographically secure hash function brings additional | 
 | advantages: | 
 |  | 
 | * Object names can be signed and third parties can trust the hash to | 
 |   address the signed object and all objects it references. | 
 | * Communication using Git protocol and out of band communication | 
 |   methods have a short reliable string that can be used to reliably | 
 |   address stored content. | 
 |  | 
 | Over time some flaws in SHA-1 have been discovered by security | 
 | researchers. On 23 February 2017 the SHAttered attack | 
 | (https://shattered.io) demonstrated a practical SHA-1 hash collision. | 
 |  | 
 | Git v2.13.0 and later subsequently moved to a hardened SHA-1 | 
 | implementation by default, which isn't vulnerable to the SHAttered | 
 | attack. | 
 |  | 
 | Thus Git has in effect already migrated to a new hash that isn't SHA-1 | 
 | and doesn't share its vulnerabilities, its new hash function just | 
 | happens to produce exactly the same output for all known inputs, | 
 | except two PDFs published by the SHAttered researchers, and the new | 
 | implementation (written by those researchers) claims to detect future | 
 | cryptanalytic collision attacks. | 
 |  | 
 | Regardless, it's considered prudent to move past any variant of SHA-1 | 
 | to a new hash. There's no guarantee that future attacks on SHA-1 won't | 
 | be published in the future, and those attacks may not have viable | 
 | mitigations. | 
 |  | 
 | If SHA-1 and its variants were to be truly broken, Git's hash function | 
 | could not be considered cryptographically secure any more. This would | 
 | impact the communication of hash values because we could not trust | 
 | that a given hash value represented the known good version of content | 
 | that the speaker intended. | 
 |  | 
 | SHA-1 still possesses the other properties such as fast object lookup | 
 | and safe error checking, but other hash functions are equally suitable | 
 | that are believed to be cryptographically secure. | 
 |  | 
 | Goals | 
 | ----- | 
 | 1. The transition to SHA-256 can be done one local repository at a time. | 
 |    a. Requiring no action by any other party. | 
 |    b. A SHA-256 repository can communicate with SHA-1 Git servers | 
 |       (push/fetch). | 
 |    c. Users can use SHA-1 and SHA-256 identifiers for objects | 
 |       interchangeably (see "Object names on the command line", below). | 
 |    d. New signed objects make use of a stronger hash function than | 
 |       SHA-1 for their security guarantees. | 
 | 2. Allow a complete transition away from SHA-1. | 
 |    a. Local metadata for SHA-1 compatibility can be removed from a | 
 |       repository if compatibility with SHA-1 is no longer needed. | 
 | 3. Maintainability throughout the process. | 
 |    a. The object format is kept simple and consistent. | 
 |    b. Creation of a generalized repository conversion tool. | 
 |  | 
 | Non-Goals | 
 | --------- | 
 | 1. Add SHA-256 support to Git protocol. This is valuable and the | 
 |    logical next step but it is out of scope for this initial design. | 
 | 2. Transparently improving the security of existing SHA-1 signed | 
 |    objects. | 
 | 3. Intermixing objects using multiple hash functions in a single | 
 |    repository. | 
 | 4. Taking the opportunity to fix other bugs in Git's formats and | 
 |    protocols. | 
 | 5. Shallow clones and fetches into a SHA-256 repository. (This will | 
 |    change when we add SHA-256 support to Git protocol.) | 
 | 6. Skip fetching some submodules of a project into a SHA-256 | 
 |    repository. (This also depends on SHA-256 support in Git | 
 |    protocol.) | 
 |  | 
 | Overview | 
 | -------- | 
 | We introduce a new repository format extension. Repositories with this | 
 | extension enabled use SHA-256 instead of SHA-1 to name their objects. | 
 | This affects both object names and object content --- both the names | 
 | of objects and all references to other objects within an object are | 
 | switched to the new hash function. | 
 |  | 
 | SHA-256 repositories cannot be read by older versions of Git. | 
 |  | 
 | Alongside the packfile, a SHA-256 repository stores a bidirectional | 
 | mapping between SHA-256 and SHA-1 object names. The mapping is generated | 
 | locally and can be verified using "git fsck". Object lookups use this | 
 | mapping to allow naming objects using either their SHA-1 and SHA-256 names | 
 | interchangeably. | 
 |  | 
 | "git cat-file" and "git hash-object" gain options to display an object | 
 | in its sha1 form and write an object given its sha1 form. This | 
 | requires all objects referenced by that object to be present in the | 
 | object database so that they can be named using the appropriate name | 
 | (using the bidirectional hash mapping). | 
 |  | 
 | Fetches from a SHA-1 based server convert the fetched objects into | 
 | SHA-256 form and record the mapping in the bidirectional mapping table | 
 | (see below for details). Pushes to a SHA-1 based server convert the | 
 | objects being pushed into sha1 form so the server does not have to be | 
 | aware of the hash function the client is using. | 
 |  | 
 | Detailed Design | 
 | --------------- | 
 | Repository format extension | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | A SHA-256 repository uses repository format version `1` (see | 
 | Documentation/technical/repository-version.txt) with extensions | 
 | `objectFormat` and `compatObjectFormat`: | 
 |  | 
 | 	[core] | 
 | 		repositoryFormatVersion = 1 | 
 | 	[extensions] | 
 | 		objectFormat = sha256 | 
 | 		compatObjectFormat = sha1 | 
 |  | 
 | The combination of setting `core.repositoryFormatVersion=1` and | 
 | populating `extensions.*` ensures that all versions of Git later than | 
 | `v0.99.9l` will die instead of trying to operate on the SHA-256 | 
 | repository, instead producing an error message. | 
 |  | 
 | 	# Between v0.99.9l and v2.7.0 | 
 | 	$ git status | 
 | 	fatal: Expected git repo version <= 0, found 1 | 
 | 	# After v2.7.0 | 
 | 	$ git status | 
 | 	fatal: unknown repository extensions found: | 
 | 		objectformat | 
 | 		compatobjectformat | 
 |  | 
 | See the "Transition plan" section below for more details on these | 
 | repository extensions. | 
 |  | 
 | Object names | 
 | ~~~~~~~~~~~~ | 
 | Objects can be named by their 40 hexadecimal digit sha1-name or 64 | 
 | hexadecimal digit sha256-name, plus names derived from those (see | 
 | gitrevisions(7)). | 
 |  | 
 | The sha1-name of an object is the SHA-1 of the concatenation of its | 
 | type, length, a nul byte, and the object's sha1-content. This is the | 
 | traditional <sha1> used in Git to name objects. | 
 |  | 
 | The sha256-name of an object is the SHA-256 of the concatenation of its | 
 | type, length, a nul byte, and the object's sha256-content. | 
 |  | 
 | Object format | 
 | ~~~~~~~~~~~~~ | 
 | The content as a byte sequence of a tag, commit, or tree object named | 
 | by sha1 and sha256 differ because an object named by sha256-name refers to | 
 | other objects by their sha256-names and an object named by sha1-name | 
 | refers to other objects by their sha1-names. | 
 |  | 
 | The sha256-content of an object is the same as its sha1-content, except | 
 | that objects referenced by the object are named using their sha256-names | 
 | instead of sha1-names. Because a blob object does not refer to any | 
 | other object, its sha1-content and sha256-content are the same. | 
 |  | 
 | The format allows round-trip conversion between sha256-content and | 
 | sha1-content. | 
 |  | 
 | Object storage | 
 | ~~~~~~~~~~~~~~ | 
 | Loose objects use zlib compression and packed objects use the packed | 
 | format described in Documentation/technical/pack-format.txt, just like | 
 | today. The content that is compressed and stored uses sha256-content | 
 | instead of sha1-content. | 
 |  | 
 | Pack index | 
 | ~~~~~~~~~~ | 
 | Pack index (.idx) files use a new v3 format that supports multiple | 
 | hash functions. They have the following format (all integers are in | 
 | network byte order): | 
 |  | 
 | - A header appears at the beginning and consists of the following: | 
 |   - The 4-byte pack index signature: '\377t0c' | 
 |   - 4-byte version number: 3 | 
 |   - 4-byte length of the header section, including the signature and | 
 |     version number | 
 |   - 4-byte number of objects contained in the pack | 
 |   - 4-byte number of object formats in this pack index: 2 | 
 |   - For each object format: | 
 |     - 4-byte format identifier (e.g., 'sha1' for SHA-1) | 
 |     - 4-byte length in bytes of shortened object names. This is the | 
 |       shortest possible length needed to make names in the shortened | 
 |       object name table unambiguous. | 
 |     - 4-byte integer, recording where tables relating to this format | 
 |       are stored in this index file, as an offset from the beginning. | 
 |   - 4-byte offset to the trailer from the beginning of this file. | 
 |   - Zero or more additional key/value pairs (4-byte key, 4-byte | 
 |     value). Only one key is supported: 'PSRC'. See the "Loose objects | 
 |     and unreachable objects" section for supported values and how this | 
 |     is used.  All other keys are reserved. Readers must ignore | 
 |     unrecognized keys. | 
 | - Zero or more NUL bytes. This can optionally be used to improve the | 
 |   alignment of the full object name table below. | 
 | - Tables for the first object format: | 
 |   - A sorted table of shortened object names.  These are prefixes of | 
 |     the names of all objects in this pack file, packed together | 
 |     without offset values to reduce the cache footprint of the binary | 
 |     search for a specific object name. | 
 |  | 
 |   - A table of full object names in pack order. This allows resolving | 
 |     a reference to "the nth object in the pack file" (from a | 
 |     reachability bitmap or from the next table of another object | 
 |     format) to its object name. | 
 |  | 
 |   - A table of 4-byte values mapping object name order to pack order. | 
 |     For an object in the table of sorted shortened object names, the | 
 |     value at the corresponding index in this table is the index in the | 
 |     previous table for that same object. | 
 |  | 
 |     This can be used to look up the object in reachability bitmaps or | 
 |     to look up its name in another object format. | 
 |  | 
 |   - A table of 4-byte CRC32 values of the packed object data, in the | 
 |     order that the objects appear in the pack file. This is to allow | 
 |     compressed data to be copied directly from pack to pack during | 
 |     repacking without undetected data corruption. | 
 |  | 
 |   - A table of 4-byte offset values. For an object in the table of | 
 |     sorted shortened object names, the value at the corresponding | 
 |     index in this table indicates where that object can be found in | 
 |     the pack file. These are usually 31-bit pack file offsets, but | 
 |     large offsets are encoded as an index into the next table with the | 
 |     most significant bit set. | 
 |  | 
 |   - A table of 8-byte offset entries (empty for pack files less than | 
 |     2 GiB). Pack files are organized with heavily used objects toward | 
 |     the front, so most object references should not need to refer to | 
 |     this table. | 
 | - Zero or more NUL bytes. | 
 | - Tables for the second object format, with the same layout as above, | 
 |   up to and not including the table of CRC32 values. | 
 | - Zero or more NUL bytes. | 
 | - The trailer consists of the following: | 
 |   - A copy of the 20-byte SHA-256 checksum at the end of the | 
 |     corresponding packfile. | 
 |  | 
 |   - 20-byte SHA-256 checksum of all of the above. | 
 |  | 
 | Loose object index | 
 | ~~~~~~~~~~~~~~~~~~ | 
 | A new file $GIT_OBJECT_DIR/loose-object-idx contains information about | 
 | all loose objects. Its format is | 
 |  | 
 |   # loose-object-idx | 
 |   (sha256-name SP sha1-name LF)* | 
 |  | 
 | where the object names are in hexadecimal format. The file is not | 
 | sorted. | 
 |  | 
 | The loose object index is protected against concurrent writes by a | 
 | lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose | 
 | object: | 
 |  | 
 | 1. Write the loose object to a temporary file, like today. | 
 | 2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. | 
 | 3. Rename the loose object into place. | 
 | 4. Open loose-object-idx with O_APPEND and write the new object | 
 | 5. Unlink loose-object-idx.lock to release the lock. | 
 |  | 
 | To remove entries (e.g. in "git pack-refs" or "git-prune"): | 
 |  | 
 | 1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the | 
 |    lock. | 
 | 2. Write the new content to loose-object-idx.lock. | 
 | 3. Unlink any loose objects being removed. | 
 | 4. Rename to replace loose-object-idx, releasing the lock. | 
 |  | 
 | Translation table | 
 | ~~~~~~~~~~~~~~~~~ | 
 | The index files support a bidirectional mapping between sha1-names | 
 | and sha256-names. The lookup proceeds similarly to ordinary object | 
 | lookups. For example, to convert a sha1-name to a sha256-name: | 
 |  | 
 |  1. Look for the object in idx files. If a match is present in the | 
 |     idx's sorted list of truncated sha1-names, then: | 
 |     a. Read the corresponding entry in the sha1-name order to pack | 
 |        name order mapping. | 
 |     b. Read the corresponding entry in the full sha1-name table to | 
 |        verify we found the right object. If it is, then | 
 |     c. Read the corresponding entry in the full sha256-name table. | 
 |        That is the object's sha256-name. | 
 |  2. Check for a loose object. Read lines from loose-object-idx until | 
 |     we find a match. | 
 |  | 
 | Step (1) takes the same amount of time as an ordinary object lookup: | 
 | O(number of packs * log(objects per pack)). Step (2) takes O(number of | 
 | loose objects) time. To maintain good performance it will be necessary | 
 | to keep the number of loose objects low. See the "Loose objects and | 
 | unreachable objects" section below for more details. | 
 |  | 
 | Since all operations that make new objects (e.g., "git commit") add | 
 | the new objects to the corresponding index, this mapping is possible | 
 | for all objects in the object store. | 
 |  | 
 | Reading an object's sha1-content | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | The sha1-content of an object can be read by converting all sha256-names | 
 | its sha256-content references to sha1-names using the translation table. | 
 |  | 
 | Fetch | 
 | ~~~~~ | 
 | Fetching from a SHA-1 based server requires translating between SHA-1 | 
 | and SHA-256 based representations on the fly. | 
 |  | 
 | SHA-1s named in the ref advertisement that are present on the client | 
 | can be translated to SHA-256 and looked up as local objects using the | 
 | translation table. | 
 |  | 
 | Negotiation proceeds as today. Any "have"s generated locally are | 
 | converted to SHA-1 before being sent to the server, and SHA-1s | 
 | mentioned by the server are converted to SHA-256 when looking them up | 
 | locally. | 
 |  | 
 | After negotiation, the server sends a packfile containing the | 
 | requested objects. We convert the packfile to SHA-256 format using | 
 | the following steps: | 
 |  | 
 | 1. index-pack: inflate each object in the packfile and compute its | 
 |    SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against | 
 |    objects the client has locally. These objects can be looked up | 
 |    using the translation table and their sha1-content read as | 
 |    described above to resolve the deltas. | 
 | 2. topological sort: starting at the "want"s from the negotiation | 
 |    phase, walk through objects in the pack and emit a list of them, | 
 |    excluding blobs, in reverse topologically sorted order, with each | 
 |    object coming later in the list than all objects it references. | 
 |    (This list only contains objects reachable from the "wants". If the | 
 |    pack from the server contained additional extraneous objects, then | 
 |    they will be discarded.) | 
 | 3. convert to sha256: open a new (sha256) packfile. Read the topologically | 
 |    sorted list just generated. For each object, inflate its | 
 |    sha1-content, convert to sha256-content, and write it to the sha256 | 
 |    pack. Record the new sha1<->sha256 mapping entry for use in the idx. | 
 | 4. sort: reorder entries in the new pack to match the order of objects | 
 |    in the pack the server generated and include blobs. Write a sha256 idx | 
 |    file | 
 | 5. clean up: remove the SHA-1 based pack file, index, and | 
 |    topologically sorted list obtained from the server in steps 1 | 
 |    and 2. | 
 |  | 
 | Step 3 requires every object referenced by the new object to be in the | 
 | translation table. This is why the topological sort step is necessary. | 
 |  | 
 | As an optimization, step 1 could write a file describing what non-blob | 
 | objects each object it has inflated from the packfile references. This | 
 | makes the topological sort in step 2 possible without inflating the | 
 | objects in the packfile for a second time. The objects need to be | 
 | inflated again in step 3, for a total of two inflations. | 
 |  | 
 | Step 4 is probably necessary for good read-time performance. "git | 
 | pack-objects" on the server optimizes the pack file for good data | 
 | locality (see Documentation/technical/pack-heuristics.txt). | 
 |  | 
 | Details of this process are likely to change. It will take some | 
 | experimenting to get this to perform well. | 
 |  | 
 | Push | 
 | ~~~~ | 
 | Push is simpler than fetch because the objects referenced by the | 
 | pushed objects are already in the translation table. The sha1-content | 
 | of each object being pushed can be read as described in the "Reading | 
 | an object's sha1-content" section to generate the pack written by git | 
 | send-pack. | 
 |  | 
 | Signed Commits | 
 | ~~~~~~~~~~~~~~ | 
 | We add a new field "gpgsig-sha256" to the commit object format to allow | 
 | signing commits without relying on SHA-1. It is similar to the | 
 | existing "gpgsig" field. Its signed payload is the sha256-content of the | 
 | commit object with any "gpgsig" and "gpgsig-sha256" fields removed. | 
 |  | 
 | This means commits can be signed | 
 | 1. using SHA-1 only, as in existing signed commit objects | 
 | 2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig | 
 |    fields. | 
 | 3. using only SHA-256, by only using the gpgsig-sha256 field. | 
 |  | 
 | Old versions of "git verify-commit" can verify the gpgsig signature in | 
 | cases (1) and (2) without modifications and view case (3) as an | 
 | ordinary unsigned commit. | 
 |  | 
 | Signed Tags | 
 | ~~~~~~~~~~~ | 
 | We add a new field "gpgsig-sha256" to the tag object format to allow | 
 | signing tags without relying on SHA-1. Its signed payload is the | 
 | sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP | 
 | SIGNATURE-----" delimited in-body signature removed. | 
 |  | 
 | This means tags can be signed | 
 | 1. using SHA-1 only, as in existing signed tag objects | 
 | 2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body | 
 |    signature. | 
 | 3. using only SHA-256, by only using the gpgsig-sha256 field. | 
 |  | 
 | Mergetag embedding | 
 | ~~~~~~~~~~~~~~~~~~ | 
 | The mergetag field in the sha1-content of a commit contains the | 
 | sha1-content of a tag that was merged by that commit. | 
 |  | 
 | The mergetag field in the sha256-content of the same commit contains the | 
 | sha256-content of the same tag. | 
 |  | 
 | Submodules | 
 | ~~~~~~~~~~ | 
 | To convert recorded submodule pointers, you need to have the converted | 
 | submodule repository in place. The translation table of the submodule | 
 | can be used to look up the new hash. | 
 |  | 
 | Loose objects and unreachable objects | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | Fast lookups in the loose-object-idx require that the number of loose | 
 | objects not grow too high. | 
 |  | 
 | "git gc --auto" currently waits for there to be 6700 loose objects | 
 | present before consolidating them into a packfile. We will need to | 
 | measure to find a more appropriate threshold for it to use. | 
 |  | 
 | "git gc --auto" currently waits for there to be 50 packs present | 
 | before combining packfiles. Packing loose objects more aggressively | 
 | may cause the number of pack files to grow too quickly. This can be | 
 | mitigated by using a strategy similar to Martin Fick's exponential | 
 | rolling garbage collection script: | 
 | https://gerrit-review.googlesource.com/c/gerrit/+/35215 | 
 |  | 
 | "git gc" currently expels any unreachable objects it encounters in | 
 | pack files to loose objects in an attempt to prevent a race when | 
 | pruning them (in case another process is simultaneously writing a new | 
 | object that refers to the about-to-be-deleted object). This leads to | 
 | an explosion in the number of loose objects present and disk space | 
 | usage due to the objects in delta form being replaced with independent | 
 | loose objects.  Worse, the race is still present for loose objects. | 
 |  | 
 | Instead, "git gc" will need to move unreachable objects to a new | 
 | packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see | 
 | below). To avoid the race when writing new objects referring to an | 
 | about-to-be-deleted object, code paths that write new objects will | 
 | need to copy any objects from UNREACHABLE_GARBAGE packs that they | 
 | refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects). | 
 | UNREACHABLE_GARBAGE are then safe to delete if their creation time (as | 
 | indicated by the file's mtime) is long enough ago. | 
 |  | 
 | To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be | 
 | combined under certain circumstances. If "gc.garbageTtl" is set to | 
 | greater than one day, then packs created within a single calendar day, | 
 | UTC, can be coalesced together. The resulting packfile would have an | 
 | mtime before midnight on that day, so this makes the effective maximum | 
 | ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, | 
 | then we divide the calendar day into intervals one-third of that ttl | 
 | in duration. Packs created within the same interval can be coalesced | 
 | together. The resulting packfile would have an mtime before the end of | 
 | the interval, so this makes the effective maximum ttl equal to the | 
 | garbageTtl * 4/3. | 
 |  | 
 | This rule comes from Thirumala Reddy Mutchukota's JGit change | 
 | https://git.eclipse.org/r/90465. | 
 |  | 
 | The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack | 
 | index. More generally, that field indicates where a pack came from: | 
 |  | 
 |  - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network | 
 |  - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight | 
 |    "gc --auto" operation | 
 |  - 3 (PACK_SOURCE_GC) for a pack created by a full gc | 
 |  - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage | 
 |    discovered by gc | 
 |  - 5 (PACK_SOURCE_INSERT) for locally created objects that were | 
 |    written directly to a pack file, e.g. from "git add ." | 
 |  | 
 | This information can be useful for debugging and for "gc --auto" to | 
 | make appropriate choices about which packs to coalesce. | 
 |  | 
 | Caveats | 
 | ------- | 
 | Invalid objects | 
 | ~~~~~~~~~~~~~~~ | 
 | The conversion from sha1-content to sha256-content retains any | 
 | brokenness in the original object (e.g., tree entry modes encoded with | 
 | leading 0, tree objects whose paths are not sorted correctly, and | 
 | commit objects without an author or committer). This is a deliberate | 
 | feature of the design to allow the conversion to round-trip. | 
 |  | 
 | More profoundly broken objects (e.g., a commit with a truncated "tree" | 
 | header line) cannot be converted but were not usable by current Git | 
 | anyway. | 
 |  | 
 | Shallow clone and submodules | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | Because it requires all referenced objects to be available in the | 
 | locally generated translation table, this design does not support | 
 | shallow clone or unfetched submodules. Protocol improvements might | 
 | allow lifting this restriction. | 
 |  | 
 | Alternates | 
 | ~~~~~~~~~~ | 
 | For the same reason, a sha256 repository cannot borrow objects from a | 
 | sha1 repository using objects/info/alternates or | 
 | $GIT_ALTERNATE_OBJECT_REPOSITORIES. | 
 |  | 
 | git notes | 
 | ~~~~~~~~~ | 
 | The "git notes" tool annotates objects using their sha1-name as key. | 
 | This design does not describe a way to migrate notes trees to use | 
 | sha256-names. That migration is expected to happen separately (for | 
 | example using a file at the root of the notes tree to describe which | 
 | hash it uses). | 
 |  | 
 | Server-side cost | 
 | ~~~~~~~~~~~~~~~~ | 
 | Until Git protocol gains SHA-256 support, using SHA-256 based storage | 
 | on public-facing Git servers is strongly discouraged. Once Git | 
 | protocol gains SHA-256 support, SHA-256 based servers are likely not | 
 | to support SHA-1 compatibility, to avoid what may be a very expensive | 
 | hash re-encode during clone and to encourage peers to modernize. | 
 |  | 
 | The design described here allows fetches by SHA-1 clients of a | 
 | personal SHA-256 repository because it's not much more difficult than | 
 | allowing pushes from that repository. This support needs to be guarded | 
 | by a configuration option --- servers like git.kernel.org that serve a | 
 | large number of clients would not be expected to bear that cost. | 
 |  | 
 | Meaning of signatures | 
 | ~~~~~~~~~~~~~~~~~~~~~ | 
 | The signed payload for signed commits and tags does not explicitly | 
 | name the hash used to identify objects. If some day Git adopts a new | 
 | hash function with the same length as the current SHA-1 (40 | 
 | hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the | 
 | intent behind the PGP signed payload in an object signature is | 
 | unclear: | 
 |  | 
 | 	object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 | 
 | 	type commit | 
 | 	tag v2.12.0 | 
 | 	tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 | 
 |  | 
 | 	Git 2.12 | 
 |  | 
 | Does this mean Git v2.12.0 is the commit with sha1-name | 
 | e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with | 
 | new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? | 
 |  | 
 | Fortunately SHA-256 and SHA-1 have different lengths. If Git starts | 
 | using another hash with the same length to name objects, then it will | 
 | need to change the format of signed payloads using that hash to | 
 | address this issue. | 
 |  | 
 | Object names on the command line | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | To support the transition (see Transition plan below), this design | 
 | supports four different modes of operation: | 
 |  | 
 |  1. ("dark launch") Treat object names input by the user as SHA-1 and | 
 |     convert any object names written to output to SHA-1, but store | 
 |     objects using SHA-256.  This allows users to test the code with no | 
 |     visible behavior change except for performance.  This allows | 
 |     allows running even tests that assume the SHA-1 hash function, to | 
 |     sanity-check the behavior of the new mode. | 
 |  | 
 |  2. ("early transition") Allow both SHA-1 and SHA-256 object names in | 
 |     input. Any object names written to output use SHA-1. This allows | 
 |     users to continue to make use of SHA-1 to communicate with peers | 
 |     (e.g. by email) that have not migrated yet and prepares for mode 3. | 
 |  | 
 |  3. ("late transition") Allow both SHA-1 and SHA-256 object names in | 
 |     input. Any object names written to output use SHA-256. In this | 
 |     mode, users are using a more secure object naming method by | 
 |     default.  The disruption is minimal as long as most of their peers | 
 |     are in mode 2 or mode 3. | 
 |  | 
 |  4. ("post-transition") Treat object names input by the user as | 
 |     SHA-256 and write output using SHA-256. This is safer than mode 3 | 
 |     because there is less risk that input is incorrectly interpreted | 
 |     using the wrong hash function. | 
 |  | 
 | The mode is specified in configuration. | 
 |  | 
 | The user can also explicitly specify which format to use for a | 
 | particular revision specifier and for output, overriding the mode. For | 
 | example: | 
 |  | 
 | git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} | 
 |  | 
 | Choice of Hash | 
 | -------------- | 
 | In early 2005, around the time that Git was written, Xiaoyun Wang, | 
 | Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 | 
 | collisions in 2^69 operations. In August they published details. | 
 | Luckily, no practical demonstrations of a collision in full SHA-1 were | 
 | published until 10 years later, in 2017. | 
 |  | 
 | Git v2.13.0 and later subsequently moved to a hardened SHA-1 | 
 | implementation by default that mitigates the SHAttered attack, but | 
 | SHA-1 is still believed to be weak. | 
 |  | 
 | The hash to replace this hardened SHA-1 should be stronger than SHA-1 | 
 | was: we would like it to be trustworthy and useful in practice for at | 
 | least 10 years. | 
 |  | 
 | Some other relevant properties: | 
 |  | 
 | 1. A 256-bit hash (long enough to match common security practice; not | 
 |    excessively long to hurt performance and disk usage). | 
 |  | 
 | 2. High quality implementations should be widely available (e.g., in | 
 |    OpenSSL and Apple CommonCrypto). | 
 |  | 
 | 3. The hash function's properties should match Git's needs (e.g. Git | 
 |    requires collision and 2nd preimage resistance and does not require | 
 |    length extension resistance). | 
 |  | 
 | 4. As a tiebreaker, the hash should be fast to compute (fortunately | 
 |    many contenders are faster than SHA-1). | 
 |  | 
 | We choose SHA-256. | 
 |  | 
 | Transition plan | 
 | --------------- | 
 | Some initial steps can be implemented independently of one another: | 
 | - adding a hash function API (vtable) | 
 | - teaching fsck to tolerate the gpgsig-sha256 field | 
 | - excluding gpgsig-* from the fields copied by "git commit --amend" | 
 | - annotating tests that depend on SHA-1 values with a SHA1 test | 
 |   prerequisite | 
 | - using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ | 
 |   consistently instead of "unsigned char *" and the hardcoded | 
 |   constants 20 and 40. | 
 | - introducing index v3 | 
 | - adding support for the PSRC field and safer object pruning | 
 |  | 
 |  | 
 | The first user-visible change is the introduction of the objectFormat | 
 | extension (without compatObjectFormat). This requires: | 
 | - teaching fsck about this mode of operation | 
 | - using the hash function API (vtable) when computing object names | 
 | - signing objects and verifying signatures | 
 | - rejecting attempts to fetch from or push to an incompatible | 
 |   repository | 
 |  | 
 | Next comes introduction of compatObjectFormat: | 
 | - implementing the loose-object-idx | 
 | - translating object names between object formats | 
 | - translating object content between object formats | 
 | - generating and verifying signatures in the compat format | 
 | - adding appropriate index entries when adding a new object to the | 
 |   object store | 
 | - --output-format option | 
 | - ^{sha1} and ^{sha256} revision notation | 
 | - configuration to specify default input and output format (see | 
 |   "Object names on the command line" above) | 
 |  | 
 | The next step is supporting fetches and pushes to SHA-1 repositories: | 
 | - allow pushes to a repository using the compat format | 
 | - generate a topologically sorted list of the SHA-1 names of fetched | 
 |   objects | 
 | - convert the fetched packfile to sha256 format and generate an idx | 
 |   file | 
 | - re-sort to match the order of objects in the fetched packfile | 
 |  | 
 | The infrastructure supporting fetch also allows converting an existing | 
 | repository. In converted repositories and new clones, end users can | 
 | gain support for the new hash function without any visible change in | 
 | behavior (see "dark launch" in the "Object names on the command line" | 
 | section). In particular this allows users to verify SHA-256 signatures | 
 | on objects in the repository, and it should ensure the transition code | 
 | is stable in production in preparation for using it more widely. | 
 |  | 
 | Over time projects would encourage their users to adopt the "early | 
 | transition" and then "late transition" modes to take advantage of the | 
 | new, more futureproof SHA-256 object names. | 
 |  | 
 | When objectFormat and compatObjectFormat are both set, commands | 
 | generating signatures would generate both SHA-1 and SHA-256 signatures | 
 | by default to support both new and old users. | 
 |  | 
 | In projects using SHA-256 heavily, users could be encouraged to adopt | 
 | the "post-transition" mode to avoid accidentally making implicit use | 
 | of SHA-1 object names. | 
 |  | 
 | Once a critical mass of users have upgraded to a version of Git that | 
 | can verify SHA-256 signatures and have converted their existing | 
 | repositories to support verifying them, we can add support for a | 
 | setting to generate only SHA-256 signatures. This is expected to be at | 
 | least a year later. | 
 |  | 
 | That is also a good moment to advertise the ability to convert | 
 | repositories to use SHA-256 only, stripping out all SHA-1 related | 
 | metadata. This improves performance by eliminating translation | 
 | overhead and security by avoiding the possibility of accidentally | 
 | relying on the safety of SHA-1. | 
 |  | 
 | Updating Git's protocols to allow a server to specify which hash | 
 | functions it supports is also an important part of this transition. It | 
 | is not discussed in detail in this document but this transition plan | 
 | assumes it happens. :) | 
 |  | 
 | Alternatives considered | 
 | ----------------------- | 
 | Upgrading everyone working on a particular project on a flag day | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | Projects like the Linux kernel are large and complex enough that | 
 | flipping the switch for all projects based on the repository at once | 
 | is infeasible. | 
 |  | 
 | Not only would all developers and server operators supporting | 
 | developers have to switch on the same flag day, but supporting tooling | 
 | (continuous integration, code review, bug trackers, etc) would have to | 
 | be adapted as well. This also makes it difficult to get early feedback | 
 | from some project participants testing before it is time for mass | 
 | adoption. | 
 |  | 
 | Using hash functions in parallel | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | (e.g. https://lore.kernel.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) | 
 | Objects newly created would be addressed by the new hash, but inside | 
 | such an object (e.g. commit) it is still possible to address objects | 
 | using the old hash function. | 
 | * You cannot trust its history (needed for bisectability) in the | 
 |   future without further work | 
 | * Maintenance burden as the number of supported hash functions grows | 
 |   (they will never go away, so they accumulate). In this proposal, by | 
 |   comparison, converted objects lose all references to SHA-1. | 
 |  | 
 | Signed objects with multiple hashes | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | Instead of introducing the gpgsig-sha256 field in commit and tag objects | 
 | for sha256-content based signatures, an earlier version of this design | 
 | added "hash sha256 <sha256-name>" fields to strengthen the existing | 
 | sha1-content based signatures. | 
 |  | 
 | In other words, a single signature was used to attest to the object | 
 | content using both hash functions. This had some advantages: | 
 | * Using one signature instead of two speeds up the signing process. | 
 | * Having one signed payload with both hashes allows the signer to | 
 |   attest to the sha1-name and sha256-name referring to the same object. | 
 | * All users consume the same signature. Broken signatures are likely | 
 |   to be detected quickly using current versions of git. | 
 |  | 
 | However, it also came with disadvantages: | 
 | * Verifying a signed object requires access to the sha1-names of all | 
 |   objects it references, even after the transition is complete and | 
 |   translation table is no longer needed for anything else. To support | 
 |   this, the design added fields such as "hash sha1 tree <sha1-name>" | 
 |   and "hash sha1 parent <sha1-name>" to the sha256-content of a signed | 
 |   commit, complicating the conversion process. | 
 | * Allowing signed objects without a sha1 (for after the transition is | 
 |   complete) complicated the design further, requiring a "nohash sha1" | 
 |   field to suppress including "hash sha1" fields in the sha256-content | 
 |   and signed payload. | 
 |  | 
 | Lazily populated translation table | 
 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 
 | Some of the work of building the translation table could be deferred to | 
 | push time, but that would significantly complicate and slow down pushes. | 
 | Calculating the sha1-name at object creation time at the same time it is | 
 | being streamed to disk and having its sha256-name calculated should be | 
 | an acceptable cost. | 
 |  | 
 | Document History | 
 | ---------------- | 
 |  | 
 | 2017-03-03 | 
 | bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, | 
 | sbeller@google.com | 
 |  | 
 | Initial version sent to | 
 | http://lore.kernel.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com | 
 |  | 
 | 2017-03-03 jrnieder@gmail.com | 
 | Incorporated suggestions from jonathantanmy and sbeller: | 
 | * describe purpose of signed objects with each hash type | 
 | * redefine signed object verification using object content under the | 
 |   first hash function | 
 |  | 
 | 2017-03-06 jrnieder@gmail.com | 
 | * Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] | 
 | * Make sha3-based signatures a separate field, avoiding the need for | 
 |   "hash" and "nohash" fields (thanks to peff[3]). | 
 | * Add a sorting phase to fetch (thanks to Junio for noticing the need | 
 |   for this). | 
 | * Omit blobs from the topological sort during fetch (thanks to peff). | 
 | * Discuss alternates, git notes, and git servers in the caveats | 
 |   section (thanks to Junio Hamano, brian m. carlson[4], and Shawn | 
 |   Pearce). | 
 | * Clarify language throughout (thanks to various commenters, | 
 |   especially Junio). | 
 |  | 
 | 2017-09-27 jrnieder@gmail.com, sbeller@google.com | 
 | * use placeholder NewHash instead of SHA3-256 | 
 | * describe criteria for picking a hash function. | 
 | * include a transition plan (thanks especially to Brandon Williams | 
 |   for fleshing these ideas out) | 
 | * define the translation table (thanks, Shawn Pearce[5], Jonathan | 
 |   Tan, and Masaya Suzuki) | 
 | * avoid loose object overhead by packing more aggressively in | 
 |   "git gc --auto" | 
 |  | 
 | Later history: | 
 |  | 
 |  See the history of this file in git.git for the history of subsequent | 
 |  edits. This document history is no longer being maintained as it | 
 |  would now be superfluous to the commit log | 
 |  | 
 | [1] http://lore.kernel.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ | 
 | [2] http://lore.kernel.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ | 
 | [3] http://lore.kernel.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ | 
 | [4] http://lore.kernel.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net | 
 | [5] https://lore.kernel.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ |