When manipulating a PACK file, it may be necessary to gather certain information before extracting objects (such as the size of the buffers that need to be allocated for the extraction). The PACK file can therefore be analysed as a stream, and this First_pass module implements that analysis.
More concretely, when a user downloads the state of a Git repository (git fetch), a PACK file is transmitted. This analysis can be applied while the PACK file is being received, and the state can then be saved in the .git folder.
A PACK file is always associated with a signature that verifies the integrity of the entire PACK file. As far as Git is concerned, this signature uses the SHA1 hash algorithm. Carton allows another algorithm to be used if required, provided that the user gives it a digest value allowing verification of the integrity of the PACK file segment by segment.
Here's an example of how to propose an algorithm to Carton with Digestif:
let sha1 =
  let open Digestif.SHA1 in
  let feed bstr ctx = feed_bigstring ctx bstr in
  { feed; serialize= get; length= digest_size }

let sha1 = Digest (sha1, Digestif.SHA1.empty)
An object stored in a PACK file can be identified by a unique reference. In the case of Git, this reference is a SHA1 hash resulting from the type, size and content of the object. For the first phase of analysis, it is possible to identify certain objects (more specifically "base" objects).
Here's an example of how to calculate the identifier of a Git object:
let identify =
  let open Digestif in
  let kind_to_string = function
    | `A -> "commit"
    | `B -> "tree"
    | `C -> "blob"
    | `D -> "tag"
  in
  (* The hash starts with the Git header: "<type> <size>\000". *)
  let init kind (len : Carton.Size.t) =
    let hdr =
      Format.asprintf "%s %d\000" (kind_to_string kind) (len :> int)
    in
    let ctx = SHA1.empty in
    SHA1.feed_string ctx hdr
  in
  (* The content of the object is then fed chunk by chunk. *)
  let feed bstr ctx = SHA1.feed_bigstring ctx bstr in
  let serialize ctx =
    SHA1.get ctx |> SHA1.to_raw_string |> Carton.Uid.unsafe_of_string
  in
  { Carton.First_pass.init; feed; serialize }
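An entry in a PACK file falls into one of two categories: a base object, which is a simple compression of the object, or an object that can be built using another object (which may ultimately be a base or another object that needs another source).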
For the second category, the source can be found via a cursor (OBJ_OFS_DELTA) or a unique identifier (OBJ_REF_DELTA).
An Ofs entry is a patch that requires a source to be built. This source is stored earlier in the PACK file, and its position is the current position of the entry minus the sub value. The patch gives the actual size of the target object and the expected size of the source.
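The position arithmetic described above can be sketched as follows. This is a minimal illustration that relies only on the standard library; the function `source_offset` and its labelled arguments `cursor` and `sub` are hypothetical names chosen for this example and are not part of Carton's API.

```ocaml
(* Hypothetical helper: locate the source of an Ofs entry.
   [cursor] is the absolute position of the entry in the PACK file and
   [sub] is the relative distance stored in the entry. *)
let source_offset ~cursor ~sub =
  if sub <= 0 || sub > cursor then
    invalid_arg "source_offset: the source must be located before the entry"
  else cursor - sub

let () =
  (* An entry at position 1024 with [sub = 100] has its source at 924. *)
  Printf.printf "%d\n" (source_offset ~cursor:1024 ~sub:100)
```

Rejecting a non-positive or too large `sub` is a choice made for this sketch: it simply makes the "source comes before the entry" rule explicit.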
A Ref entry is a patch that also needs a source to be built. The patch gives the actual size of the target object and the expected size of the source. The source is found through the given ptr value, which is the unique identifier of the source object (as far as Git is concerned, this identifier corresponds to what git hash-object computes).
A delta-object, an object which requires a source.
As explained above, entries can be a simple compression of the object or a "patch" requiring the source to be an object. The entry in a PACK file can refer to its source using its position or a unique identifier.
The case where an entry depends on a reference only arises for thin PACK files. These sources are often available elsewhere than in the PACK file. A registered PACK file should not contain references, but it is possible to transmit a PACK file with references to objects existing in other PACK files. One step in recording a PACK file is to canonicalise it: in other words, to ensure that the PACK file is sufficient in itself to extract all the objects.
The first pass is useful for identifying whether a PACK file is thin or not. Objects are not extracted but identified. It is then up to the user to decide whether or not to canonicalise the PACK file.
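One possible way to use the result of the first pass for this decision is sketched below. The `kind` type is a hypothetical, simplified stand-in for what the first pass reports (Carton's real entry type carries more information), and `unresolved_refs` is a name invented for this example. The check is also conservative: a Ref whose source is itself a patch stored in the same PACK file is reported as unresolved, since only base objects are identified at this stage.

```ocaml
(* Hypothetical, simplified view of the entries seen by the first pass. *)
type kind =
  | Base of string option (* identifier of the object, when computed *)
  | Ofs of int (* relative distance to the source *)
  | Ref of string (* identifier of the source *)

module Uids = Set.Make (String)

(* Returns the identifiers required by Ref entries that were not seen
   among the identified base objects. A non-empty result suggests that
   the PACK file is thin. *)
let unresolved_refs kinds =
  let known =
    List.fold_left
      (fun acc -> function Base (Some uid) -> Uids.add uid acc | _ -> acc)
      Uids.empty kinds
  in
  List.filter_map
    (function
      | Ref uid when not (Uids.mem uid known) -> Some uid
      | Base _ | Ofs _ | Ref _ -> None)
    kinds

let () =
  let kinds = [ Base (Some "aaa"); Ofs 12; Ref "aaa"; Ref "bbb" ] in
  (* Only "bbb" has no identified base in this PACK file. *)
  List.iter print_endline (unresolved_refs kinds)
```

Returning the missing identifiers rather than a boolean leaves the decision to the caller, who can look them up in other PACK files before canonicalising.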
Checksum of the entry (header plus the deflated object).
Type of a PACK entry.
Note: The size given by the entry is not necessarily the actual size of the object. It is when the object is a Base. However, if the object is a patch (Ofs or Ref), the size of the patch is given. Ofs and Ref give the real size of the object via the target field.
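The distinction made in this note can be illustrated with a short sketch. The types below are a hypothetical, simplified mirror of the entry description: the field names `sub`, `ptr` and `target` follow the text, while `source`, `size`, `real_size` and `max_object_size` are assumptions made for the example and may not match Carton's definitions.

```ocaml
(* Hypothetical, simplified mirror of an entry: [size] is the size
   announced by the entry itself; for patches, [target] is the real size
   of the reconstructed object. *)
type kind =
  | Base
  | Ofs of { sub : int; source : int; target : int }
  | Ref of { ptr : string; source : int; target : int }

type entry = { kind : kind; size : int }

(* Real size of the object described by an entry. *)
let real_size entry =
  match entry.kind with
  | Base -> entry.size
  | Ofs { target; _ } | Ref { target; _ } -> target

(* Largest object size found in a list of entries, which is the kind of
   information one may want to aggregate before allocating buffers. *)
let max_object_size entries =
  List.fold_left (fun acc entry -> max acc (real_size entry)) 0 entries
```

For example, an Ofs entry with `size = 7` and `target = 42` describes a 7-byte patch that produces a 42-byte object, so `real_size` returns 42 for it.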
of_seq ~output ~allocate ~ref_length ~digest seq analyses a PACK stream given by seq and returns a stream of all the entries in the given PACK stream as well as the final signature of the PACK stream. Several values are expected:
output is a temporary buffer used to decompress the entries
allocate is a function used to allocate a window needed for decompression
ref_length is the size (in bytes) of the unique identifier of an object (for Git, this size is 20, the size of a SHA1 hash)
digest is the algorithm used to check the integrity of the PACK stream (for Git, the algorithm is SHA1, see hash)