package carton

Decoder of a PACK file.

This module implements what is needed to decode a PACKv2 file. It is independent of any scheduler. For cooperation issues, we recommend that you refer to the documentation for Cachet, the library used to read a PACK file. More specifically, Carton is based on the use of Unix.map_file (or equivalent). Access to a block device or file does not block, but it can take time. In short, cooperation points have to be added by the user, because an atomic operation such as reading the PACK file (via Unix.map_file) cannot be interleaved with cooperation points.
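A minimal sketch of what adding cooperation points between extractions could look like. Both `extract` and `yield` are hypothetical parameters supplied by the caller and are not part of Carton: `extract` stands for any atomic extraction, and `yield` for whatever the user's scheduler provides.

```ocaml
(* [extract_all ~yield ~extract cursors] runs [extract] on each cursor, in
   order, and calls [yield ()] after every extraction. Both functions are
   supplied by the caller; nothing here is Carton's API. *)
let extract_all ~yield ~extract cursors =
  let step acc cursor =
    let value = extract cursor in
    (* The extraction above is atomic; the scheduler can only run here. *)
    yield ();
    value :: acc
  in
  List.rev (List.fold_left step [] cursors)
```

The scheduler only gets a chance to run between two objects, never in the middle of one, which is the granularity the paragraph above describes.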

The module is divided into 3 parts:

  • the first consists of a First_pass module for analysing a PACK stream. This is useful when the PACK file is transmitted over the network: the analysis can then be applied directly to the stream.
  • the second consists of extracting objects from a PACK file, the contents of which can be made available via a Unix.map_file function.
  • the third consists of a few types that may be useful for checking a PACK file. Documentation is provided to explain how to use these types.
module H = H
module Zh = Zh
val bigstring_of_string : string -> Cachet.bigstring
module Kind : sig ... end

A PACK file contains several types of object. According to Git, it contains commits (`A), trees (`B), blobs (`C) and tags (`D). Carton is far enough removed from Git to abstract itself from the actual type of these objects.
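As an illustration of the mapping described above, the following sketch uses a local polymorphic variant as a stand-in for the four kinds. The type `kind` and the function `git_name` are names chosen for this example only; they are not taken from the Kind module.

```ocaml
(* Stand-in for the four kinds of object; in Carton these live in [Kind]. *)
type kind = [ `A | `B | `C | `D ]

(* Git's reading of each kind, as stated in the text above. *)
let git_name : kind -> string = function
  | `A -> "commit"
  | `B -> "tree"
  | `C -> "blob"
  | `D -> "tag"
```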

module Size : sig ... end

The size is a non-negative number which corresponds to the size of a Blob in memory in bytes.

module Uid : sig ... end

An object can be identified by a unique identifier that needs to be calculated by an algorithm such as a hash algorithm. These identifiers can be used to refer to a possible source when we have a First_pass.kind.Ref entry.

module First_pass : sig ... end

Extracting objects from a PACK file.

Once it is possible to use Unix.map_file (or equivalent) on a PACK file (i.e. once it is available in a file system), it is possible to extract all the objects in this PACK file.

Extraction consists of either:

  • decompressing a First_pass.kind.Base entry
  • decompressing a patch and reconstructing the object from a source

In both cases, we use bigstrings. Their advantage is that they are not relocated by the OCaml GC. Their disadvantage is their allocation (via malloc()), which can take a long time.
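For readers unfamiliar with bigstrings, here is a small sketch using only the standard library's Bigarray module. The type alias `bigstring` and the function `copy_to_bigstring` are names invented for this example; Carton's own bigstring types come from Cachet and may be defined differently.

```ocaml
(* A bigstring is a one-dimensional Bigarray of bytes. Its storage lives
   outside the OCaml heap, so the GC never moves it. *)
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Allocate a bigstring of the same length as [s] and copy [s] into it. *)
let copy_to_bigstring (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```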

Memory usage is also a disadvantage. If an object is 1 GB in size, we are obliged to allocate a bigstring of 1 GB (or more). It is not possible to stream out all objects: only First_pass.kind.Base objects can be streamed out.

To limit the use of bigstrings, there are various functions that let you know in advance:

  • the true size of the object requested
  • the size needed to store the requested object and all the objects needed to rebuild it if it is stored as a patch
  • the list of objects that need to be rebuilt to construct the requested object
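The three quantities listed above can be illustrated on a toy model. Everything below is an assumption made for the sake of the example: the `entry` type is not Carton's, and the policy used by `buffer_size` (the largest object along the chain) is only one plausible way to size a reusable buffer; Carton's own computation may differ.

```ocaml
(* Toy model: an entry is either a base or a patch over another entry.
   [size] is the size of the object once decompressed or rebuilt. *)
type entry =
  | Base of { size : int }
  | Patch of { size : int; source : entry }

(* True size of the requested object. *)
let actual_size = function Base { size } | Patch { size; _ } -> size

(* Objects that must be rebuilt first, nearest source first. *)
let rec sources = function
  | Base _ -> []
  | Patch { source; _ } -> source :: sources source

(* Size of a buffer able to hold the object or any one of its sources. *)
let buffer_size entry =
  List.fold_left
    (fun acc e -> max acc (actual_size e))
    (actual_size entry) (sources entry)
```

In this model the buffer size can exceed the true size of the requested object, which is why the two are reported separately.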

As far as patch entries are concerned (First_pass.kind.Ofs and First_pass.kind.Ref), their source can also be an object from a patch which itself requires an object from a patch. This is referred to as the depth of the object in the PACK file. The maximum depth is 50: in other words, it may be necessary to reconstruct 49 objects upstream in order to build the requested object.
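The notion of depth can be sketched with a toy PACK described by a function from a position to the position of its source. The function `depth`, its `source_of` parameter and the locally defined exceptions are illustrative only; they mirror the names used in this documentation but are not Carton's implementation.

```ocaml
exception Cycle
exception Too_deep

(* Toy PACK: [source_of pos] is [None] for a base entry, or [Some src] for
   a patch whose source lives at position [src]. A base has depth 1. *)
let depth ?(max_depth = 50) (source_of : int -> int option) cursor =
  let rec go visited depth cursor =
    (* A position seen twice means the chain loops on itself. *)
    if List.mem cursor visited then raise Cycle;
    if depth > max_depth then raise Too_deep;
    match source_of cursor with
    | None -> depth
    | Some src -> go (cursor :: visited) (depth + 1) src
  in
  go [] 1 cursor
```

With the default limit, an object of depth 50 has 49 objects upstream, matching the figure given above.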

The advantage is, of course, the compression ratio. In addition to compressing the entries with zlib, some objects are just patches compared to other objects. For example, if the PACK file contains a blob with content A and another blob with content A+B, the latter could be a patch containing only +B and requiring our first blob as a source.
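The A and A+B example can be sketched as follows. The `hunk` type is a simplified patch format assumed for this illustration (copy a range from the source, or insert new bytes); it is not the binary encoding used in a PACK file.

```ocaml
(* Toy patch format: a patch is a list of hunks that either copy a range
   of the source or insert literal bytes. *)
type hunk =
  | Copy of { off : int; len : int }
  | Insert of string

(* Rebuild an object by applying [patch] to [source]. *)
let apply ~source patch =
  let buf = Buffer.create (String.length source) in
  List.iter
    (function
      | Copy { off; len } -> Buffer.add_substring buf source off len
      | Insert s -> Buffer.add_string buf s)
    patch;
  Buffer.contents buf
```

Here the second blob is stored as `[ Copy { off = 0; len = 4 }; Insert "BB" ]` rather than as its full content, and the first blob serves as its source.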

For simple use, the user must first calculate the size of the buffers needed to store the object in memory. They then need to allocate a Blob to hold the object. Finally, the object can be reconstructed according to its position (cursor) in the PACK file or according to its unique identifier if the user has the IDX file that allows the position of the object in the PACK file to be associated with its identifier (see Classeur).

let t = Carton.make ~map fd ~z ~allocate ~ref_length where in
(* The initial Size.t passed here is assumed to be Carton.Size.zero. *)
let size = Carton.size_of_offset t ~cursor Carton.Size.zero in
let blob = Carton.Blob.make ~size in
Carton.of_offset t blob ~cursor
type 'fd t

Type representing a PACK file and used to extract objects.

val make : ?pagesize:int -> ?cachesize:int -> map:'fd Cachet.map -> 'fd -> z:Zl.bigstring -> allocate:(int -> Zl.window) -> ref_length:int -> (Uid.t -> int) -> 'fd t

make ~map fd ~z ~allocate ~ref_length where creates a representation of the PACK file whose read access is managed by the map function. A few arguments are required so that Carton does not allocate buffers arbitrarily but gives the user fine-grained control over its allocation policy (since it essentially involves allocating bigstrings).

  • a temporary buffer z is required to store a deflated entry
  • an allocate function is required to get the Zl.window needed to inflate entries
  • it is necessary to know the size ref_length of the unique identifiers that can be used to refer to patches. In the case of Git, this value is 20 (the size of a SHA1 hash)
  • lastly, a function where may be required to find out the position of an object according to its unique identifier (see Classeur).

Note: If where is provided and exhaustive, the *of_uid* functions can be used.
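A minimal sketch of a where function backed by a table. Identifiers are modelled as plain strings here (a stand-in for Uid.t), and raising Not_found on an unknown identifier is an assumption of this example; the source does not state what where should do in that case.

```ocaml
(* Stand-in for [Uid.t]: in this sketch an identifier is just a string. *)
type uid = string

(* Build a [where] function from (uid, position) pairs. It raises
   [Not_found] for unknown identifiers, so it is only exhaustive if the
   pairs cover every object of the PACK file. *)
let where_of_pairs (pairs : (uid * int) list) : uid -> int =
  let tbl = Hashtbl.create (List.length pairs) in
  List.iter (fun (uid, pos) -> Hashtbl.replace tbl uid pos) pairs;
  fun uid -> Hashtbl.find tbl uid
```

In practice the pairs would come from an IDX file (see Classeur) rather than from a hand-written list.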

make calls Cachet.make with the cachesize and pagesize arguments. These must be multiples of 2. For more details about these arguments and map, please refer to the Cachet documentation.

val of_cache : 'fd Cachet.t -> z:Zl.bigstring -> allocate:(int -> Zl.window) -> ref_length:int -> (Uid.t -> int) -> 'fd t

of_cache cache ~z ~allocate ~ref_length where is equivalent to make but uses the cache already available and initialised by the user.

val copy : 'fd t -> 'fd t

copy t makes a copy of the PACK file representation, which implies a new empty cache and a copy of the internal buffers. In this way, the result of this copy can be used in parallel safely, even if our first value t attempts to extract objects at the same time.

val fd : 'fd t -> 'fd

fd t returns the file-descriptor given by the user to make the representation of the PACK file t.

val cache : 'fd t -> 'fd Cachet.t

cache t returns the cache used to access pages in the PACK file.

val allocate : 'fd t -> int -> Zl.window
val tmp : 'fd t -> De.bigstring
val ref_length : 'fd t -> int
val map : 'fd t -> cursor:int -> consumed:int -> Cachet.Bstr.t
val with_index : 'fd t -> (Uid.t -> int) -> 'fd t
module Blob : sig ... end

The Blob is a tuple of temporary buffers used to store an object that has been decompressed or reconstructed using a patch and a source.

module Visited : sig ... end
exception Cycle
exception Too_deep
exception Bad_type
val size_of_offset : 'fd t -> ?visited:Visited.t -> cursor:int -> Size.t -> Size.t

size_of_offset pack ?visited ~cursor size returns the size of the buffers (see Blob) required to extract the object located at cursor from the PACK file. This does not correspond to the size of the object.

val size_of_uid : 'fd t -> ?visited:Visited.t -> uid:Uid.t -> Size.t -> Size.t

size_of_uid pack ?visited ~uid size returns the size of the buffers (see Blob) required to extract the object identified by uid from the PACK file. This does not correspond to the size of the object.

The given pack must be able to recognize the object's position based on its unique identifier. In other words, pack must be constructed with an exhaustive where function for all the identifiers in the PACK file.

val actual_size_of_offset : 'fd t -> cursor:int -> Size.t

actual_size_of_offset pack ~cursor returns the true size of the object located at cursor in the given PACK file pack.

module Value : sig ... end
val of_offset : 'fd t -> Blob.t -> cursor:int -> Value.t

of_offset pack blob ~cursor is the object at the offset cursor in the given pack.

Note: This function does not allocate large resources (or, at least, only the allocate function given to t is able to allocate a large resource). blob (which should be created with the associated Size.t given by size_of_offset) is enough to extract the object.

Note: This function is not tail-recursive: it discovers, step by step, the patches needed to rebuild the object. Even though a well-formed PACK file should not contain objects deeper than 50, if you want to be sure that the reconstruction is tail-recursive, you need to calculate the object's Path.t first (see of_offset_with_path).

val of_uid : 'fd t -> Blob.t -> uid:Uid.t -> Value.t

Like of_offset, of_uid pack blob ~uid is the object identified by uid in the given pack.

The given pack must be able to recognize the object's position based on its unique identifier. In other words, pack must be constructed with an exhaustive where function for all the identifiers in the PACK file.

Path of object.

Because of_offset/of_uid are not tail-recursive, another solution exists to extract an object from the PACK file. However, this solution requires a Path.t as metadata to be able to extract an object.

A Path.t is the delta-chain of the object. It assumes that a delta-chain cannot be longer than 50 (see Git assumptions). From it, the way to construct an object is well known and the step of discovering whether an object depends on another one is eliminated: the reconstruction is bounded by the Path.t.
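The separation between discovery and reconstruction can be sketched on a toy PACK. The `entry` type, `path_of` and `of_path` are hypothetical names for this illustration, and a patch is simplified to "append a suffix to the source"; none of this is Carton's actual Path module.

```ocaml
(* Toy PACK: a base holds its content; a patch appends [suffix] to the
   object stored at position [source]. *)
type entry = Base of string | Patch of { source : int; suffix : string }

(* Discovery: walk from [cursor] down to its base, visiting at most
   [max_depth] entries. Positions are returned from the base up to the
   requested object. *)
let path_of ?(max_depth = 50) (pack : int -> entry) cursor =
  let rec go acc n cursor =
    if n > max_depth then failwith "too deep";
    match pack cursor with
    | Base _ -> cursor :: acc
    | Patch { source; _ } -> go (cursor :: acc) (n + 1) source
  in
  go [] 1 cursor

(* Reconstruction: a left fold over the path, with no further discovery.
   The number of steps is exactly the length of the path. *)
let of_path pack path =
  List.fold_left
    (fun source cursor ->
      match pack cursor with
      | Base s -> s
      | Patch { suffix; _ } -> source ^ suffix)
    "" path
```

Once the path is known, reconstruction is a bounded loop, which is the property the text above attributes to Path.t.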

module Path : sig ... end
val path_of_offset : ?max_depth:int -> 'fd t -> cursor:int -> Path.t
val path_of_uid : 'fd t -> Uid.t -> Path.t
val of_offset_with_path : 'fd t -> path:Path.t -> Blob.t -> cursor:int -> Value.t
val of_offset_with_source : 'fd t -> Value.t -> cursor:int -> Value.t

of_offset_with_source t source ~cursor is the object available at cursor in t. This function is tail-recursive and uses the given source if the requested object is a patch.

type identify =
  | Identify : 'ctx First_pass.identify -> identify
    (* Carton can be asked to calculate the identifier of an object but does not require the algorithm used (SHA1 or SHA256 for example) to be known. It only handles the result of this calculation, which is represented by a Uid.t. For more details on how to implement identify, please refer to what is explained in the first phase of analysing a PACK file. You then simply need to "surround" your value with Carton.Identify to completely abstract the algorithm used to calculate the object identifier. *)
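The "surround" step relies on an existential type: the hashing context 'ctx is hidden once the value is wrapped. The sketch below reproduces the idea with local types. The record `hasher` and its field names are assumptions made for this example (the real record is First_pass.identify, whose fields are not shown here), and the standard library's MD5 Digest stands in for SHA1 or SHA256.

```ocaml
(* Hypothetical shape of an identify record, parameterised by its hashing
   context ['ctx]. Field names are assumptions made for this sketch. *)
type 'ctx hasher = {
  init : unit -> 'ctx;
  feed : string -> 'ctx -> 'ctx;
  serialize : 'ctx -> string;
}

(* Existential wrapper: hides ['ctx] so callers never see the algorithm. *)
type identify = Identify : 'ctx hasher -> identify

(* One possible instance, using MD5 from the standard library. *)
let md5 =
  Identify
    {
      init = (fun () -> Buffer.create 64);
      feed = (fun chunk buf -> Buffer.add_string buf chunk; buf);
      serialize =
        (fun buf -> Digest.to_hex (Digest.string (Buffer.contents buf)));
    }

(* Works for any wrapped algorithm, without knowing its context type. *)
let uid_of_chunks identify chunks =
  match identify with
  | Identify h ->
      h.serialize
        (List.fold_left (fun ctx c -> h.feed c ctx) (h.init ()) chunks)
```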
val uid_of_offset : identify:identify -> 'fd t -> Blob.t -> cursor:int -> Kind.t * Uid.t
val uid_of_offset_with_source : identify:identify -> 'fd t -> kind:Kind.t -> Blob.t -> depth:int -> cursor:int -> Uid.t
type children = cursor:int -> uid:Uid.t -> int list
type where = cursor:int -> int
type oracle = {
  identify : identify;
  children : children;
  where : where;
  size : cursor:int -> Size.t;
  checksum : cursor:int -> Optint.t;
  is_base : pos:int -> int option;
  number_of_objects : int;
  hash : string;
}
type status =
  | Unresolved_base of {
      cursor : int;
    }
  | Unresolved_node
  | Resolved_base of {
      cursor : int;
      uid : Uid.t;
      crc : Optint.t;
      kind : Kind.t;
    }
  | Resolved_node of {
      cursor : int;
      uid : Uid.t;
      crc : Optint.t;
      kind : Kind.t;
      depth : int;
      parent : Uid.t;
    }