When manipulating a PACK file, it may be necessary to aggregate certain information before extracting objects (such as the size of the buffers that need to be allocated to extract objects). The PACK file can therefore be analysed as a stream, and this analysis is implemented in the First_pass module.
More concretely, when a user downloads the state of a Git repository (git fetch), a PACK file is transmitted. This analysis can be applied while the PACK file is being received, and the state can then be saved in the .git folder.
$ git clone ...
remote: Enumerating objects: 105, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 305 (delta 41), reused 75 (delta 23), pack-reused 200
Receiving objects: 100% (305/305), 100.00 KiB | 0 bytes/s, done. <-- first pass
The advantage of this module is that it can aggregate information at the same time as receiving the PACK file from the network.
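As a rough illustration of this kind of aggregation, the sketch below folds over a stream of already-decoded entries and keeps the largest inflated size, which is the buffer size needed to extract any single object. The entry record and the max_size function are hypothetical, simplified stand-ins (they are not Carton's definitions), and the stream is an ordinary Seq.t.

```ocaml
(* Hypothetical, simplified entry: only the fields this sketch needs. *)
type entry = { offset : int; size : int }

(* Largest inflated size seen in the stream. Entries are consumed one at a
   time, so this can run while the PACK file is still being received. *)
let max_size (entries : entry Seq.t) : int =
  Seq.fold_left (fun acc e -> max acc e.size) 0 entries

let example =
  List.to_seq
    [ { offset = 12; size = 40 }
    ; { offset = 60; size = 128 }
    ; { offset = 200; size = 7 } ]
```

Because the fold never looks back at earlier entries, nothing needs to be kept in memory beyond the running maximum.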
type 'ctx hash = {
feed_bytes : bytes -> off:int -> len:int -> 'ctx -> 'ctx;
feed_bigstring : De.bigstring -> 'ctx -> 'ctx;
serialize : 'ctx -> string;
length : int;
}
A PACK file is always associated with a signature that verifies the integrity of the entire PACK file. As far as Git is concerned, this signature uses the SHA1 hash algorithm. Carton allows another algorithm to be used if required, provided that the user gives it a digest value allowing verification of the integrity of the PACK file segment by segment.
$ head -c -20 pack-2d9d562730d25620c12799c0bf0d5baf9fd00896.pack | sha1sum
2d9d562730d25620c12799c0bf0d5baf9fd00896 -
$ xxd -p -l 20 -seek -20 pack-2d9d562730d25620c12799c0bf0d5baf9fd00896.pack
2d9d562730d25620c12799c0bf0d5baf9fd00896
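The same check can be sketched in OCaml: split the file into its payload and its trailing signature, hash the payload, and compare. This is only an illustration under assumptions: split_trailer and verify are hypothetical helpers, and the standard library's MD5 (Digest.string, 16 bytes) stands in for SHA1 (20 bytes) so that the example needs no external library.

```ocaml
(* Split a PACK-like string into (payload, trailer), where the trailer is the
   last [length] bytes. Returns None if the input is too short. *)
let split_trailer ~length (pack : string) : (string * string) option =
  let n = String.length pack in
  if n < length then None
  else Some (String.sub pack 0 (n - length), String.sub pack (n - length) length)

(* [hash] and [length] are parameters so any algorithm can be plugged in. *)
let verify ~hash ~length pack =
  match split_trailer ~length pack with
  | None -> false
  | Some (payload, trailer) -> String.equal (hash payload) trailer

(* A toy "pack": some payload followed by its own MD5 digest (a stand-in for
   the SHA1 signature that Git uses). *)
let payload = "PACK....payload"
let toy = payload ^ Digest.string payload
```

With this toy input, verify ~hash:Digest.string ~length:16 toy holds, and it stops holding as soon as any byte is appended or changed.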
Here's an example of how to propose an algorithm to Carton with Digestif:
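let sha1 =
  let open Digestif.SHA1 in
  (* Adapt Digestif's argument order to the fields of the hash record. *)
  let feed_bytes buf ~off ~len ctx = feed_bytes ctx ~off ~len buf in
  let feed_bigstring bstr ctx = feed_bigstring ctx bstr in
  (* serialize must return the raw digest as a string. *)
  let serialize ctx = to_raw_string (get ctx) in
  { Carton.First_pass.feed_bytes; feed_bigstring; serialize; length= digest_size }
let sha1 = Digest (sha1, Digestif.SHA1.empty)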
type 'ctx identify = {
init : Kind.t -> Size.t -> 'ctx;
feed : De.bigstring -> 'ctx -> 'ctx;
serialize : 'ctx -> Uid.t;
}
An object stored in a PACK file can be identified by a unique reference. In the case of Git, this reference is a SHA1 hash resulting from the type, size and content of the object. For the first phase of analysis, it is possible to identify certain objects (more specifically "base" objects).
Here's an example of how to calculate the identifier of a Git object:
let identify =
  let open Digestif in
  let kind_to_string = function
    | `A -> "commit"
    | `B -> "tree"
    | `C -> "blob"
    | `D -> "tag"
  in
  let init kind (len : Carton.Size.t) =
    (* Git hashes the header "<kind> <size>\000" before the content. *)
    let hdr =
      Format.asprintf "%s %d\000" (kind_to_string kind) (len :> int)
    in
    let ctx = SHA1.empty in
    SHA1.feed_string ctx hdr
  in
  let feed bstr ctx = SHA1.feed_bigstring ctx bstr in
  let serialize ctx =
    SHA1.get ctx |> SHA1.to_raw_string |> Carton.Uid.unsafe_of_string
  in
  { Carton.First_pass.init; feed; serialize }
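To make the hashed input concrete, here is a small sketch that builds only the header and the full byte string that would be fed to the hash, using the standard library alone. The names git_header and hashed_input are hypothetical helpers introduced for illustration; the kind tags mirror those in the example above.

```ocaml
(* Polymorphic variants mirroring the `A..`D kinds used in the example above. *)
let kind_to_string = function
  | `A -> "commit"
  | `B -> "tree"
  | `C -> "blob"
  | `D -> "tag"

(* Git hashes "<kind> <size>\000" followed by the raw content. *)
let git_header kind len = Printf.sprintf "%s %d\000" (kind_to_string kind) len

(* The complete byte string whose SHA1 is the object's identifier. *)
let hashed_input kind content =
  git_header kind (String.length content) ^ content
```

For a blob containing "hello", the hashed input is the 12-byte string "blob 5\000hello".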
Type of PACK entries.
Entries in a PACK file can be:
a simple compression of the object;
a patch that requires a source object to be built.
For the second category, the source can be found via a cursor (OBJ_OFS_DELTA) or a unique identifier (OBJ_REF_DELTA).
An Ofs entry is a patch that requires a source to be built. This source is available upstream of the entry, and its position is the current position of the entry minus the sub value. The patch gives the actual size of the object (target) and the expected size of the source.
A Ref entry is a patch that also needs a source to be built. The patch gives the actual size of the object (target) and the expected size of the source. The source can be found using the given ptr value, which is the unique identifier of the source object (as far as Git is concerned, this identifier corresponds to what git hash-object returns).
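The two ways of locating a source can be sketched as follows. The kind and source types below are simplified stand-ins written for this illustration (sizes are plain integers and identifiers plain strings); they are not Carton's actual definitions, and locate_source is a hypothetical helper.

```ocaml
(* Simplified stand-in for the entry kinds described above. *)
type kind =
  | Base
  | Ofs of { sub : int; source : int; target : int }
  | Ref of { ptr : string; source : int; target : int }

type source = No_source | At_offset of int | By_uid of string

(* [offset] is the absolute position of the entry in the PACK file. *)
let locate_source ~offset = function
  | Base -> No_source
  | Ofs { sub; _ } -> At_offset (offset - sub) (* source lies upstream *)
  | Ref { ptr; _ } -> By_uid ptr (* source must be looked up by identifier *)
```

An Ofs entry at offset 1000 with sub = 250 therefore has its source at offset 750 in the same PACK file.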
As explained above, an entry can be a simple compression of the object or a "patch" that requires a source object. An entry in a PACK file can refer to its source using its position or a unique identifier.
The case where an entry depends on a reference only arises for thin PACK files. These sources are often available elsewhere than in the PACK file. A registered PACK file should not contain references, but it is possible to transmit a PACK file with references to objects existing in other PACK files. One step in recording a PACK file is to canonicalise it: in other words, to ensure that the PACK file is sufficient in itself to extract all the objects.
The first pass is useful for identifying whether a PACK file is thin or not. Objects are not extracted but identified. It is then up to the user to decide whether or not to canonicalise the PACK file.
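Following the premise above that references only arise for thin PACK files, a minimal sketch of that check collects the identifiers that Ref entries point to. The kind type is again a simplified stand-in, and external_candidates and is_thin are hypothetical names, not part of Carton.

```ocaml
(* Simplified stand-in: only what this sketch needs from each entry kind. *)
type kind = Base | Ofs of { sub : int } | Ref of { ptr : string }

(* Identifiers that Ref entries point to. In a thin PACK file these sources
   may live outside the file. *)
let external_candidates (kinds : kind Seq.t) : string list =
  Seq.filter_map
    (function Ref { ptr } -> Some ptr | Base | Ofs _ -> None)
    kinds
  |> List.of_seq

(* Under the assumption stated in the text, any Ref entry marks a thin PACK. *)
let is_thin kinds = external_candidates kinds <> []
```

The list of candidates is what a caller would need to fetch from elsewhere if it chose to canonicalise the PACK file.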
type entry = {
  offset : int;  (** Absolute offset into the given PACK file. *)
  kind : kind;  (** Kind of the object. *)
  size : Size.t;  (** Length of the inflated object. *)
  consumed : int;  (** Length of the deflated object (as it is stored in the PACK file). *)
  crc : Optint.t;  (** Checksum of the entry (header plus the deflated object). *)
}
The type for input sources. With a `Manual source the client must provide input with src.
val number_of_objects : decoder -> int
number_of_objects decoder returns the number of objects available in the PACK file.
val version : decoder -> int
version decoder returns the version of the PACK file (should be 2).
val counter_of_objects : decoder -> int
counter_of_objects decoder returns the counter of the entry currently being processed by the decoder.
hash decoder returns the current signature of the PACK file, as computed by the decoder.
val src_rem : decoder -> int
src_rem decoder returns how many bytes have not yet been processed by the given decoder.
val src : decoder -> De.bigstring -> int -> int -> decoder
val of_seq :
  output:De.bigstring ->
  allocate:(int -> De.window) ->
  ref_length:int ->
  digest:digest ->
  identify:'ctx identify ->
  string Seq.t ->
  [ `Number of int | `Entry of entry | `Hash of string ] Seq.t
of_seq ~output ~allocate ~ref_length ~digest ~identify seq analyses the PACK stream given by seq and returns a stream of all the entries in that PACK stream as well as its final signature. Several values are expected:
output is a temporary buffer used to decompress the inputs;
allocate is a function used to allocate a window needed for decompression;
ref_length is the size (in bytes) of the unique identifier of an object (for Git, this size is 20, the size of a SHA1 hash);
digest is the algorithm used to check the integrity of the PACK stream (for Git, the algorithm is SHA1, see hash);
identify is the set of functions used to compute the unique identifier of an object (see identify).
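A consumer of the resulting stream could look like the sketch below, which folds over the items and checks that the number of entries seen matches the announced number. The item type is a simplified stand-in (the real `Entry carries a full entry record, reduced here to an offset), and summarize and complete are hypothetical helpers. No assumption is made about the order in which the items arrive.

```ocaml
(* Simplified stand-in for the items produced by of_seq. *)
type item = [ `Number of int | `Entry of int | `Hash of string ]

type summary = { expected : int option; seen : int; hash : string option }

(* Accumulate what the stream announces and what it actually delivers. *)
let summarize (items : item Seq.t) : summary =
  Seq.fold_left
    (fun acc -> function
      | `Number n -> { acc with expected = Some n }
      | `Entry _ -> { acc with seen = acc.seen + 1 }
      | `Hash h -> { acc with hash = Some h })
    { expected = None; seen = 0; hash = None }
    items

(* The analysis is complete once every announced entry was seen and the
   final signature was delivered. *)
let complete s = s.expected = Some s.seen && s.hash <> None
```

For example, a stream announcing two objects, yielding two entries and then a signature is complete, while a stream cut off after the first entry is not.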