package bap-std

  1. Overview
  2. Docs

Overview

Layered Architecture

The BAP library has the layered architecture consisting of four layers. Although the layers are not really observable from outside of the library, they make it easier to learn the library as they introduce new concepts sequentially. On top of these layers, the Project module is defined that consolidates all information about a target of analysis. The Project module may be viewed as an entry point to the library.

        +-----------------------------------------------------+
        | +--------+   +-----------------------------------+  |
        | |        |   |                                   |  |
        | |        |   |       Foundation Library          |  |
        | |        |   |                                   |  |
        | |        |   +-----------------------------------+  |
        | |   P    |                                          |
        | |        |   +-----------------------------------+  |
        | |   R    |   |                                   |  |
        | |        |   |          Memory Model             |  |
        | |   O    |   |                                   |  |
        | |        |   +-----------------------------------+  |
        | |   J    |                                          |
        | |        |   +-----------------------------------+  |
        | |   E    |   |                                   |  |
        | |        |   |           Disassembly             |  |
        | |   C    |   |                                   |  |
        | |        |   +-----------------------------------+  |
        | |   T    |                                          |
        | |        |   +-----------------------------------+  |
        | |        |   |                                   |  |
        | |        |   |        Semantic Analysis          |  |
        | |        |   |                                   |  |
        | +--------+   +-----------------------------------+  |
        +-----------------------------------------------------+

The Foundation library defines BAP Instruction language data types, as well as other useful data structures, like Value, Trie, Vector, etc. The Memory model layer is responsible for loading and parsing binary objects and representing them in a computer memory. It also defines a few useful data structures that are used extensively by later layers, e.g., Table and Memmap. The next layer performs disassembly and lifting to BIL. Finally, the semantic analysis layer transforms a binary into an IR representation, that is suitable for writing analysis.

Plugin Architecture

The standard library tries to be as extensible as possible. We are aware, that there are not good solutions for some problems, so we don't want to force our way of doing things. In short, we're trying to provide mechanisms, not policies. We achive this by employing the dependency injection principle. By inversing the dependency we allow the library to depend on a user code. For example, a user code can teach the library how to disassemble the binary or even how to reconstruct the CFG. In fact, the library by itself doesn't contain the disassembler or lifter, or any architecture specific code. Everything is injected later by corresponding plugins.

The library defines a fixed set of extension points. (Other libraries, that constitute the Platform and follow the same principle, can define their own extension points, so the following set is not complete):

  • loader - add new file formats (see Image.register_backend or Project.Input);
  • target - add new architecture (see register_target);
  • disassembler - plug in a disassembler (see 'disasm.hpp' for c++ disassembler interface);
  • attributes - extend the attribute type (Value.Tag.register);
  • symbolizer - add names to functions (see Symbolizer);
  • rooter - find function starts (see Rooter);
  • brancher - resolve jump destinations (see Brancher)
  • reconstructor - CFG reconstruction algorithm (see Reconstructor);
  • analysis - write your own arbitrary analysis pass (see Project.register_pass)

The Regular.Std library, that forms a foundation for the BAP Standard Library, also follows the dependency injection principle, so every data type that implements regular interface, can be dynamically extended with:

  • pretty printing function;
  • serialization subroutines;
  • caching.

Writing the analysis

A common use case, is to write some analysis that will take the program in some representation and then either output result of analysis in a human or machine readable way, or transform the program, in a way that can be employed by other analysis. Following a naming convention of a more established community of compiler writers, we name such analysis a _pass_.

The library itself doesn't run any analysis, it part of the job of a frontend to run it. In particular, the bap frontend, will run the analyses based on a command line specification. See bap --help for more information.

We use Project data structure to represent a program and all associated knowledge that we were capable to infer. To learn how to use the project data structure continue to Working with project.

Foundation Library

At this layer we define (Binary Instruction language) and few other useful data structures:

  • arch - describes computer architecture;
  • size - word and register sizes;
  • var - BIL variable;
  • typ - BIL type system;
  • exp - BIL expression sub-language;
  • stmt - BIL statements;
  • bitvector - a bitvector data structure to represent immediate data, used usually by their aliases word and addr;
  • value - an extensible variant type;
  • dict - an extensible record;
  • vector - an array that can grow;
  • Trie - prefix trees;

Most of the types implement the Regular interface. This interface is very similar to Core's Identifiable, and is supposed to represent a type that is as common as a built-in type. One should expect to find any function that is implemented for such types as int, string, char, etc. Namely, this interface includes:

  • comparison functions: (<, >, <= , >= , compare, between, ...);
  • each type defines a polymorphic Map with keys of type t;
  • each type provides a Set with values of type t;
  • hashtable is exposed via Table module;
  • hashset is available under Hash_set name
  • sexpable and binable interface;
  • to_string, str, pp, ppo, pps functions for pretty-printing.

It is a convention, that for each type, there is a module with the same name that implements its interface. For example, type exp is a type abbreviation for Exp.t, and module Exp contains all functions and types related to type exp. For example, to create a hashtable of statements, just type:

let table = Exp.Table.create ()

If a type is a variant type (i.e., it defines constructors) then for each constructor named Name, there exists a corresponding function named name that will accept the same number of arguments as the arity of the constructor (also named a _functional constructor_). For example, a Bil.Int can be constructed with the Bil.int function that has type word -> exp. If a constructor has several arguments of the same type we usually disambiguate using labels, e.g., Bil.Load of (exp,exp,endian,size) has function Bil.load with type: mem:exp -> addr:exp -> endian -> size -> exp

Value

Universal values can be viewed as extensible variants on steroids. Not only they maybe extended, but they also can be serialized, compared with user-defined comparison function and even pretty printed.

Dict

Like value is an extensible sum type, dict can be viewed as an extensible product type. Dict is a sequence of values of type Value, with tags used as field names. Of course, fields are unique.

Vector

Vector is an implementation of C++ STL like vectors with logarithmic push back.

Tries

The Foundation library also defines a prefix tree data structure that proves to be useful for binary analysis applications. Tries in BAP is a functor that derives a polymorphic trie data structure for a given Key.

For convenience we support instantiating tries for most of our data structures. For example, Word has several tries inside.

For the common string trie, there's Trie.String.

Memory model

This layer is responsible for the representation of binaries. It provides interfaces for the memory objects:

  • mem - a contiguous array of bytes, indexed with absolute addresses;
  • 'a table - a mapping from a memory regions to arbitrary data (no duplicates or intersections);
  • a memmap - a mapping from memory region to arbitrary data with duplicates and intersections allowed, aka segment tree or interval map;
  • image - represents a binary object with all its symbols, segments, sections and other meta information.

The Image module uses the plugin system to load binary objects. In order to add new loader, one should implement the Backend.t loader function and register it with the Image.register_backend function.

Disassembler

This layer defines interfaces for disassemblers. Two interfaces are provided:

  • Disasm - a regular interface that hides all complexities, but may not always be very flexible.
  • Disasm_expert - an expert interface that provides access to a low-level representation. It is very flexible and fast, but harder to use.

To disassemble files or data with the regular interface, use one of the following functions:

  • Disasm.of_mem - to disassemble a region of memory;
  • Disasm.of_image - to disassemble a loaded binary object;
  • Disasm.of_file - to disassemble file.

All these functions perform disassembly by recursive descent, reconstruct the control flow graph, and perform lifting.

The result of disassembly is represented by the abstract value of type disasm. Two main data structures that are used to represent disassembled program are:

  • insn - a machine instruction;
  • block - a basic block, i.e., a linear sequence of instructions.

The following figure shows the relationship between basic data structures of the disassembled program.

        +-----------------+
        | +-------------+ |
        | |   disasm    | |
        | +-------------+ |
        |        |        |
        |        | *      |
        | +-------------+ |
        | |    block    | |
        | +-------------+ |
        |        |        |
        |        | *      |
        | +-------------+ |
        | |     insn    | |
        | +-------------+ |
        |        |        |
        |        | *      |
        | +-------------+ |
        | |     stmt    | |
        | +-------------+ |
        +-----------------+

A disassembled program is represented as a set of interconnected basic blocks, called a whole program control flow graph (CFG) and it is indeed represented as a graph Graphs.Cfg. See graphlib for more information on graphs.

Each block is a container to a sequence of machine instructions. It is guaranteed that there's at least one instruction in the block, thus the Block.leader and Block.terminator functions are total.

Each machine instruction is represented by its opcode, name and array of operands (these are machine and disassembler specific), a set of predicates (that approximates instruction semantics on a very high level), and a sequence of BIL statements that precisely define the semantics of the instruction.

The expert interface exposes low level interface that provides facilities for building custom implementations of disassemblers. The interface to the disassembler backend is exposed via the Disasm_expert.Basic module. New backends can be added by implementing the 'disasm.hpp' interface.

Modules of type CPU provide a high level abstraction of the machine CPU and allow one to reason about the instruction semantics independently from the target platform. The module type Target brings CPU and ABI together. To get an instance of this module, you can use the target_of_arch function. Architecture specific implementations of the Target interface may (and usually do) provide more information, see corresponding support libraries for ARM and x86 architectures.

Semantic Analysis

On the semantic level the disassembled program is lifted into the BAP Intermediate Representation (BIR). BIR is a semi-graphical representation of BIL (where BIL represents a program as Abstract Syntax Tree). The BIR provides mechanisms to express richer relationships between program terms and it also easier to use for most use cases, especially for data dependency analysis.

The program in IR is build of terms. In fact the program itself is also a term. There're only 7 kinds of terms:

  • program - the program in whole;
  • sub - subroutine;
  • arg - subroutine argument;
  • blk - basic block;
  • def - definition of a variable;
  • phi - phi-node in the SSA form;
  • jmp - a transfer of control.

Unlike expressions and statements in BIL, IR's terms are concrete entities. Concrete entity is such entity that can change in time and space, as well as come in and out of existence. Contrary, abstract entity is eternal and unchangeable. Identity denotes the sameness of a concrete entity as it changes in time. Abstract entities don't have an identity since they are immutable. Program is built of concrete entities called terms. Terms have attributes that can change in time, without affecting the identity of a term. Attributes are abstract entities. In each particular point of space and time a term is represented by a snapshot of all its attributes, colloquially called value. Functions that change the value of a term in fact return a new value with different set of attributes. For example, def term has two attributes: left hand side (lhs), that associates definition with abstract variable, and right hand side (rhs) that associates def with an abstract expression. Suppose, that the definition was:

# let d_1 = Def.create x Bil.(var y + var z);;
val d_1 : Def.t = 00000001: x := y + z

To change the right hand side of a definition we use Def.with_rhs that returns the same definition but with different value:

# let d_2 = Def.with_rhs d_1 Bil.(int Word.b1);;
val d_2 : Def.t = 00000001: x := true

d_1 and d_2 is different values

# Def.equal d_1 d_2;;
- : bool = false

of the same term

# Term.same d_1 d_2;;
- : bool = true

The identity of this terms is denoted by the term identifier (tid). In the textual representation term identifiers are printed as ordinal numbers.

Terms, can contain other terms. But unlike BIL expressions or statements, this relation is not truly recursive, since the structure of program term is fixed: arg, phi, def, jmp are leaf terms; sub can only contain arg's or blk's; blk consists of phi, def and jmp sequences of terms, as pictured in the figure below. Although, the term structure is closed to changes, you still can extend particular term with attributes, using set_attr and get_attr functions of the Term module. This functions are using extensible variant type to encode attributes.

        +--------------------------------------------------------+
        |                +-------------------+                   |
        |                |      program      |                   |
        |                +---------+---------+                   |
        |                          |*                            |
        |                +---------+---------+                   |
        |                |        sub        |                   |
        |                +---------+---------+                   |
        |                          |                             |
        |        +-----------------+---------------+             |
        |        |*                                |*            |
        |  +-----+-------+                 +-------+-------+     |
        |  |    arg      |                 |      blk      |     |
        |  +-------------+                 +-------+-------+     |
        |                                          |             |
        |           +---------------+--------------+             |
        |           |*              |*             | *           |
        |     +-----+-----+   +-----+-----+   +----+-----+       |
        |     |    phi    |   |    def    |   |   jmp    |       |
        |     +-----------+   +-----------+   +----------+       |
        +--------------------------------------------------------+

Working with project

There're two general approaches to obtain a value of type project:

  • create it manually using Project.create function;
  • write a plugin to the bap frontend.

Although the first approach is simplistic and gives you a full control, we still recommend to use the latter.

To write a program analysis plugin (or pass in short) you need to implement a function with one of the following interfaces:

  • project -> project and register it with register_pass;
  • project -> unit and register it with register_pass';

Once loaded from the bap frontend (see bap --help) this function will be invoked with a value of type project that provides access to all information gathered from the input source. If the registered function returns a non unit type, then it can functionally update the project state, e.g., add annotations, discover new symbols, transform program representation, etc.

Example

The following plugin prints all sections in a file:

open Core_kernel.Std
open Bap.Std
open Format

let print_sections p =
  Project.memory p |> Memmap.to_sequence |> Seq.iter ~f:(fun (mem,x) ->
      Option.iter (Value.get Image.section x) ~f:(fun name ->
          printf "Section: %s@.%a@." name Memory.pp mem))

let () = Project.register_pass' print_sections

Note: this functionality is provided by the print plugin.

Passing information between passes

To pass data from one pass to another in a type safe manner, we use universal values. Values can be attached to a particular memory region, IR terms, or put into the storage dictionary. For the first case we use the memmap data structure. It is an interval tree containing all the memory regions that are used during analysis. For the storage we use Dict data structure. Also, each program term, has its own dictionary.

Memory annotations

By default the memory is annotated with the following attributes:

  • section -- for regions of memory that had a particular name in the original binary. For example, in ELF, sections have names that annotate a corresponding memory region. If project was created from memory object, then the overall memory will be marked as a "bap.user" section.
  • segment -- if the binary data was loaded from a binary format that contains segments, then the corresponding memory regions are be marked. Segments provide access to permission information.

BAP API

module Integer : sig ... end

Abstract integral type.

module Legacy : sig ... end

Legacy

OCaml

Innovation. Community. Security.