carton 1.0.0 · OCaml Package

First-pass of a PACK file.

When manipulating a PACK file, it may be necessary to aggregate certain information before extracting objects (such as the size of the buffers that need to be allocated to extract objects). Analysis of the PACK file in the form of a stream is therefore possible and is implemented in this First_pass module.

More concretely, when a user downloads the state of a Git repository (git fetch), a PACK file is transmitted. This analysis can be performed while the PACK file is being received, and the state can then be saved in the .git folder.

$ git clone ...
remote: Enumerating objects: 105, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 305 (delta 41), reused 75 (delta 23), pack-reused 200
Receiving objects: 100% (305/305), 100.00 KiB | 0 bytes/s, done. <-- first pass

The advantage of this module is that it can aggregate information at the same time as receiving the PACK file from the network.

type 'ctx hash = {
  feed_bytes : bytes -> off:int -> len:int -> 'ctx -> 'ctx;
  feed_bigstring : De.bigstring -> 'ctx -> 'ctx;
  serialize : 'ctx -> string;
  length : int;
}

A PACK file is always associated with a signature that verifies the integrity of the entire PACK file. As far as Git is concerned, this signature uses the SHA1 hash algorithm. Carton allows another algorithm to be used if required, provided that the user gives it a digest value allowing verification of the integrity of the PACK file segment by segment.

$ head -c -20 pack-2d9d562730d25620c12799c0bf0d5baf9fd00896.pack | sha1sum
2d9d562730d25620c12799c0bf0d5baf9fd00896  -
$ xxd -p -l 20 -seek -20 pack-2d9d562730d25620c12799c0bf0d5baf9fd00896.pack
2d9d562730d25620c12799c0bf0d5baf9fd00896
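
The check behind this shell session can be sketched in a few lines: hash everything except the trailing signature and compare the result with the trailer. This is only an illustration that works on a PACK file already loaded in memory, not the streaming verification done by this module. The names verify_trailer, hash and length are hypothetical and not part of Carton's API; for Git, hash would be SHA1 and length would be 20.

```ocaml
(* [hash] maps a string to its raw digest and [length] is the size of that
   digest in bytes. Both are parameters of this sketch, not Carton values. *)
let verify_trailer ~hash ~length pack =
  let n = String.length pack in
  if n < length then false
  else
    (* Everything but the last [length] bytes is covered by the signature. *)
    let payload = String.sub pack 0 (n - length) in
    let trailer = String.sub pack (n - length) length in
    String.equal (hash payload) trailer
```

Any function from string to raw digest can be plugged in, which mirrors the way Carton lets the user choose the algorithm.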

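Here's an example of how to provide an algorithm to Carton with Digestif:

let sha1 =
  let open Digestif.SHA1 in
  let feed_bytes buf ~off ~len ctx = feed_bytes ctx ~off ~len buf in
  let feed_bigstring bstr ctx = feed_bigstring ctx bstr in
  (* get returns a Digestif.SHA1.t; serialize must return its raw bytes. *)
  let serialize ctx = to_raw_string (get ctx) in
  { Carton.First_pass.feed_bytes; feed_bigstring; serialize; length= digest_size }

let sha1 = Carton.First_pass.Digest (sha1, Digestif.SHA1.empty)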

type digest =
  | Digest : 'ctx hash * 'ctx -> digest

type 'ctx identify = {
  init : Kind.t -> Size.t -> 'ctx;
  feed : De.bigstring -> 'ctx -> 'ctx;
  serialize : 'ctx -> Uid.t;
}

An object stored in a PACK file can be identified by a unique reference. In the case of Git, this reference is a SHA1 hash resulting from the type, size and content of the object. For the first phase of analysis, it is possible to identify certain objects (more specifically "base" objects).

Here's an example of how to calculate the identifier of a Git object:

let identify =
  let open Digestif in
  let kind_to_string = function
    | `A -> "commit"
    | `B -> "tree"
    | `C -> "blob"
    | `D -> "tag"
  in
  let init kind (len : Carton.Size.t) =
    let hdr =
      Printf.sprintf "%s %d\000" (kind_to_string kind) (len :> int)
    in
    let ctx = SHA1.empty in
    SHA1.feed_string ctx hdr
  in
  let feed bstr ctx = SHA1.feed_bigstring ctx bstr in
  let serialize ctx =
    SHA1.get ctx |> SHA1.to_raw_string |> Carton.Uid.unsafe_of_string
  in
  { Carton.First_pass.init; feed; serialize }

type kind =
  | Base of Kind.t * Uid.t
  | Ofs of {
      sub : int;
      source : Size.t;
      target : Size.t;
    }
  | Ref of {
      ptr : Uid.t;
      source : Size.t;
      target : Size.t;
    }

Type of PACK entries.

Entries in a PACK file can be:

a compressed object as a base with its type
an object that can be built using another object (which may ultimately be a base or another object that needs another source)

For the second category, the source can be found either through its position in the PACK file (OBJ_OFS_DELTA) or through its unique identifier (OBJ_REF_DELTA).

An Ofs type entry is a patch that requires a source to be built. This source is available upstream of the entry, and its position can be calculated using the current position of the entry minus the sub value. The patch gives information about the actual size of the object target and the expected size of the source.
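
As a small illustration of this computation, the sketch below derives the absolute position of the source from the position of the entry and its sub value. The function source_offset is hypothetical and not part of Carton; it works on plain integers and only rejects values that cannot designate an earlier position.

```ocaml
(* [offset] is the absolute position of the Ofs entry in the PACK file and
   [sub] is the distance back to its source, as carried by the entry. *)
let source_offset ~offset ~sub =
  let src = offset - sub in
  if sub <= 0 || src < 0 then
    invalid_arg "source_offset: the source must lie before the entry"
  else src
```

For an entry found at offset 1024 with sub = 200, the source starts at offset 824.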

A Ref type entry is a patch that also needs a source to build itself. The patch informs you of the actual size of the object target and the size of the expected source. The source can be found thanks to the ptr value given, which corresponds to the unique identifier of the source object (as far as Git is concerned, this identifier corresponds to what git hash-object can give).

A delta-object, an object which requires a source.

As explained above, entries can be a simple compression of the object or a "patch" requiring the source to be an object. The entry in a PACK file can refer to its source using its position or a unique identifier.

The case where an entry depends on a reference only arises for thin PACK files. These sources are often available elsewhere than in the PACK file. A registered PACK file should not contain references, but it is possible to transmit a PACK file with references to objects existing in other PACK files. One step in recording a PACK file is to canonicalise it: in other words, to ensure that the PACK file is sufficient in itself to extract all the objects.

The first pass is useful for identifying whether a PACK file is thin or not. Objects are not extracted but identified. It is then up to the user to decide whether or not to canonicalise the PACK file.
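
A minimal sketch of this check is given below. It assumes that the kinds of all entries have already been collected during the first pass, and it uses a simplified stand-in for the kind type (identifiers are plain strings, sizes are omitted) so that it is self-contained. The functions references and is_thin are hypothetical helpers, not part of Carton.

```ocaml
(* Simplified stand-in for the kind of an entry: identifiers are plain
   strings here and the source/target sizes are left out. *)
type kind =
  | Base of string
  | Ofs of { sub : int }
  | Ref of { ptr : string }

(* Identifiers pointed to by Ref entries, sorted and without duplicates. *)
let references kinds =
  kinds
  |> List.filter_map (function
       | Ref { ptr } -> Some ptr
       | Base _ | Ofs _ -> None)
  |> List.sort_uniq String.compare

(* Following the description above, a PACK file that contains at least one
   Ref entry is considered thin. *)
let is_thin kinds = references kinds <> []
```

The list returned by references is also what a canonicalisation step would need: the identifiers of the sources to look for.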

type entry = {
  offset : int;  (* Absolute offset into the given PACK file. *)
  kind : kind;  (* Kind of the object. *)
  size : Size.t;  (* Length of the inflated object. *)
  consumed : int;  (* Length of the deflated object (as it is stored in the PACK file). *)
  crc : Optint.t;  (* Checksum of the entry (header plus the deflated object). *)
}

Type of a PACK entry.

Note: The size field of an entry is not necessarily the actual size of the object. It is when the entry is a Base. However, if the entry is a patch (Ofs or Ref), size is the size of the patch. Ofs and Ref give the real size of the object via the target field.
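
The rule above can be sketched as follows, together with the kind of aggregation mentioned in the introduction (finding the size of the largest object in order to allocate buffers). The types are simplified mirrors of kind and entry in which sizes are plain integers; object_size and max_object_size are hypothetical helpers, not part of Carton.

```ocaml
(* Simplified mirrors of the kind and entry types: only the fields needed
   to reason about sizes are kept, and sizes are plain integers. *)
type kind =
  | Base
  | Ofs of { sub : int; source : int; target : int }
  | Ref of { ptr : string; source : int; target : int }

type entry = { kind : kind; size : int }

(* Size of the object once extracted: [size] for a base object, [target]
   for a patch, since [size] is then only the size of the patch. *)
let object_size entry =
  match entry.kind with
  | Base -> entry.size
  | Ofs { target; _ } | Ref { target; _ } -> target

(* Size of the largest extracted object among [entries], 0 if there is none. *)
let max_object_size entries =
  List.fold_left (fun acc entry -> max acc (object_size entry)) 0 entries
```

With such a value aggregated during the first pass, a buffer large enough for any object of the PACK file can be allocated once before extraction.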

type decoder

The type for decoders.

type src = [
  | `Channel of in_channel
  | `String of string
  | `Manual
]

The type for input sources. With a `Manual source the client must provide input with src.

type decode = [
  | `Await of decoder
  | `Peek of decoder
  | `Entry of entry * decoder
  | `End of string
  | `Malformed of string
]

val decoder : 
  output:De.bigstring ->
  allocate:(int -> De.window) ->
  ref_length:int ->
  digest:digest ->
  identify:'ctx identify ->
  src ->
  decoder

val decode : decoder -> decode

val number_of_objects : decoder -> int

number_of_objects decoder returns the number of objects available in the PACK file.

val version : decoder -> int

version decoder returns the version of the PACK file (should be 2).

val counter_of_objects : decoder -> int

counter_of_objects decoder returns the decoder's current entry counter, that is, how far the decoder has progressed through the entries of the PACK file.

val hash : decoder -> digest

hash decoder returns the current signature of the PACK file, as computed by the decoder.

val src_rem : decoder -> int

src_rem decoder returns the number of bytes that have not yet been processed by the given decoder.

val src : decoder -> De.bigstring -> int -> int -> decoder

val of_seq : 
  output:De.bigstring ->
  allocate:(int -> De.window) ->
  ref_length:int ->
  digest:digest ->
  identify:'ctx identify ->
  string Seq.t ->
  [ `Number of int | `Entry of entry | `Hash of string ] Seq.t

of_seq ~output ~allocate ~ref_length ~digest ~identify seq analyses a PACK stream given by seq and returns a stream of all the entries in the given PACK stream as well as the final signature of the PACK stream. Several values are expected:

output is a temporary buffer used to decompress the inputs
allocate is a function used to allocate a window needed for decompression
ref_length is the size (in bytes) of the unique identifier of an object (for Git, this size is 20, the size of a SHA1 hash)
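digest is the algorithm used to check the integrity of the PACK stream (for Git, the algorithm is SHA1, see hash)
identify is the set of functions used to compute the unique identifier of the base objects (see identify)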
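
Since of_seq consumes a string Seq.t, the input has to be presented as a sequence of chunks. Here is a minimal sketch, using only the standard library, of how such a sequence might be built from an input channel. The function seq_of_channel and its chunk_size parameter are hypothetical and not provided by Carton.

```ocaml
(* Lazily read [ic] as a sequence of chunks of at most [chunk_size] bytes.
   The sequence reads from the channel as it is consumed, so it must be
   traversed only once and the channel must stay open in the meantime. *)
let seq_of_channel ?(chunk_size = 0x1000) ic : string Seq.t =
  let buf = Bytes.create chunk_size in
  let rec next () =
    match input ic buf 0 chunk_size with
    | 0 -> Seq.Nil (* end of file *)
    | len -> Seq.Cons (Bytes.sub_string buf 0 len, next)
  in
  next
```

The resulting sequence can be given as the last argument of of_seq; the same shape works for chunks received from the network.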
