Overview
Layered Architecture
The BAP library has the layered architecture consisting of four layers. Although the layers are not really observable from outside of the library, they make it easier to learn the library as they introduce new concepts sequentially. On top of these layers, the Project module is defined that consolidates all information about a target of analysis. The Project
module may be viewed as an entry point to the library.
+-----------------------------------------------------+
| +--------+ +-----------------------------------+ |
| | | | | |
| | | | Foundation Library | |
| | | | | |
| | | +-----------------------------------+ |
| | P | |
| | | +-----------------------------------+ |
| | R | | | |
| | | | Memory Model | |
| | O | | | |
| | | +-----------------------------------+ |
| | J | |
| | | +-----------------------------------+ |
| | E | | | |
| | | | Disassembly | |
| | C | | | |
| | | +-----------------------------------+ |
| | T | |
| | | +-----------------------------------+ |
| | | | | |
| | | | Semantic Analysis | |
| | | | | |
| +--------+ +-----------------------------------+ |
+-----------------------------------------------------+
The Foundation library defines BAP Instruction language data types, as well as other useful data structures, like Value
, Trie
, Vector
, etc. The Memory model layer is responsible for loading and parsing binary objects and representing them in a computer memory. It also defines a few useful data structures that are used extensively by later layers, e.g., Table
and Memmap
. The next layer performs disassembly and lifting to BIL. Finally, the semantic analysis layer transforms a binary into an IR representation, that is suitable for writing analysis.
Plugin Architecture
The standard library tries to be as extensible as possible. We are aware, that there are not good solutions for some problems, so we don't want to force our way of doing things. In short, we're trying to provide mechanisms, not policies. We achive this by employing the dependency injection principle. By inversing the dependency we allow the library to depend on a user code. For example, a user code can teach the library how to disassemble the binary or even how to reconstruct the CFG. In fact, the library by itself doesn't contain the disassembler or lifter, or any architecture specific code. Everything is injected later by corresponding plugins.
The library defines a fixed set of extension points. (Other libraries, that constitute the Platform and follow the same principle, can define their own extension points, so the following set is not complete):
- loader - add new file formats (see
Image.register_backend
or Project.Input
); - target - add new architecture (see
register_target
); - disassembler - plug in a disassembler (see 'disasm.hpp' for c++ disassembler interface);
- attributes - extend the attribute type (
Value.Tag.register
); - symbolizer - add names to functions (see
Symbolizer
); - rooter - find function starts (see
Rooter
); - brancher - resolve jump destinations (see
Brancher
) - reconstructor - CFG reconstruction algorithm (see
Reconstructor
); - analysis - write your own arbitrary analysis pass (see
Project.register_pass
)
The Regular.Std
library, that forms a foundation for the BAP Standard Library, also follows the dependency injection principle, so every data type that implements regular interface, can be dynamically extended with:
- pretty printing function;
- serialization subroutines;
- caching.
Writing the analysis
A common use case, is to write some analysis that will take the program in some representation and then either output result of analysis in a human or machine readable way, or transform the program, in a way that can be employed by other analysis. Following a naming convention of a more established community of compiler writers, we name such analysis a _pass_.
The library itself doesn't run any analysis, it part of the job of a frontend to run it. In particular, the bap
frontend, will run the analyses based on a command line specification. See bap
--help
for more information.
We use Project
data structure to represent a program and all associated knowledge that we were capable to infer. To learn how to use the project data structure continue to Working with project.
Foundation Library
At this layer we define (Binary Instruction language) and few other useful data structures:
- arch - describes computer architecture;
- size - word and register sizes;
- var - BIL variable;
- typ - BIL type system;
- exp - BIL expression sub-language;
- stmt - BIL statements;
- bitvector - a bitvector data structure to represent immediate data, used usually by their aliases
word
and addr
; - value - an extensible variant type;
- dict - an extensible record;
- vector - an array that can grow;
- Trie - prefix trees;
Most of the types implement the Regular interface. This interface is very similar to Core's Identifiable
, and is supposed to represent a type that is as common as a built-in type. One should expect to find any function that is implemented for such types as int
, string
, char
, etc. Namely, this interface includes:
- comparison functions: (
<, >, <= , >= , compare, between, ...
); - each type defines a polymorphic
Map
with keys of type t
; - each type provides a
Set
with values of type t
; - hashtable is exposed via
Table
module; - hashset is available under
Hash_set
name - sexpable and binable interface;
to_string
, str
, pp
, ppo
, pps
functions for pretty-printing.
It is a convention, that for each type, there is a module with the same name that implements its interface. For example, type exp
is a type abbreviation for Exp.t
, and module Exp
contains all functions and types related to type exp
. For example, to create a hashtable of statements, just type:
let table = Exp.Table.create ()
If a type is a variant type (i.e., it defines constructors) then for each constructor named Name
, there exists a corresponding function named name
that will accept the same number of arguments as the arity of the constructor (also named a _functional constructor_). For example, a Bil.Int
can be constructed with the Bil.int
function that has type word ->
exp
. If a constructor has several arguments of the same type we usually disambiguate using labels, e.g., Bil.Load of
(exp,exp,endian,size)
has function Bil.load with type: mem:exp -> addr:exp -> endian -> size -> exp
Value
Universal values can be viewed as extensible variants on steroids. Not only they maybe extended, but they also can be serialized, compared with user-defined comparison function and even pretty printed.
Dict
Like value is an extensible sum type, dict can be viewed as an extensible product type. Dict is a sequence of values of type Value, with tags used as field names. Of course, fields are unique.
Vector
Vector
is an implementation of C++ STL like vectors with logarithmic push back.
Tries
The Foundation library also defines a prefix tree data structure that proves to be useful for binary analysis applications. Tries in BAP is a functor that derives a polymorphic trie data structure for a given Key.
For convenience we support instantiating tries for most of our data structures. For example, Word has several tries inside.
For the common string trie, there's Trie.String
.
Memory model
This layer is responsible for the representation of binaries. It provides interfaces for the memory objects:
- mem - a contiguous array of bytes, indexed with absolute addresses;
- 'a table - a mapping from a memory regions to arbitrary data (no duplicates or intersections);
- a memmap - a mapping from memory region to arbitrary data with duplicates and intersections allowed, aka segment tree or interval map;
- image - represents a binary object with all its symbols, segments, sections and other meta information.
The Image
module uses the plugin system to load binary objects. In order to add new loader, one should implement the Backend.t loader function and register it with the Image.register_backend function.
Disassembler
This layer defines interfaces for disassemblers. Two interfaces are provided:
- Disasm - a regular interface that hides all complexities, but may not always be very flexible.
- Disasm_expert - an expert interface that provides access to a low-level representation. It is very flexible and fast, but harder to use.
To disassemble files or data with the regular interface, use one of the following functions:
Disasm.of_mem
- to disassemble a region of memory;Disasm.of_image
- to disassemble a loaded binary object;Disasm.of_file
- to disassemble file.
All these functions perform disassembly by recursive descent, reconstruct the control flow graph, and perform lifting.
The result of disassembly is represented by the abstract value of type disasm. Two main data structures that are used to represent disassembled program are:
- insn - a machine instruction;
- block - a basic block, i.e., a linear sequence of instructions.
The following figure shows the relationship between basic data structures of the disassembled program.
+-----------------+
| +-------------+ |
| | disasm | |
| +-------------+ |
| | |
| | * |
| +-------------+ |
| | block | |
| +-------------+ |
| | |
| | * |
| +-------------+ |
| | insn | |
| +-------------+ |
| | |
| | * |
| +-------------+ |
| | stmt | |
| +-------------+ |
+-----------------+
A disassembled program is represented as a set of interconnected basic blocks, called a whole program control flow graph (CFG) and it is indeed represented as a graph Graphs.Cfg
. See graphlib for more information on graphs.
Each block is a container to a sequence of machine instructions. It is guaranteed that there's at least one instruction in the block, thus the Block.leader and Block.terminator functions are total.
Each machine instruction is represented by its opcode
, name
and array
of operands (these are machine and disassembler specific), a set of predicates (that approximates instruction semantics on a very high level), and a sequence of BIL statements that precisely define the semantics of the instruction.
The expert interface exposes low level interface that provides facilities for building custom implementations of disassemblers. The interface to the disassembler backend is exposed via the Disasm_expert.Basic
module. New backends can be added by implementing the 'disasm.hpp' interface.
Modules of type CPU provide a high level abstraction of the machine CPU and allow one to reason about the instruction semantics independently from the target platform. The module type Target brings CPU
and ABI
together. To get an instance of this module, you can use the target_of_arch function. Architecture specific implementations of the Target
interface may (and usually do) provide more information, see corresponding support libraries for ARM
and x86 architectures.
Semantic Analysis
On the semantic level the disassembled program is lifted into the BAP Intermediate Representation (BIR). BIR is a semi-graphical representation of BIL (where BIL represents a program as Abstract Syntax Tree). The BIR provides mechanisms to express richer relationships between program terms and it also easier to use for most use cases, especially for data dependency analysis.
The program in IR is build of terms. In fact the program itself is also a term. There're only 7 kinds of terms:
- program - the program in whole;
- sub - subroutine;
- arg - subroutine argument;
- blk - basic block;
- def - definition of a variable;
- phi - phi-node in the SSA form;
- jmp - a transfer of control.
Unlike expressions and statements in BIL, IR's terms are concrete entities. Concrete entity is such entity that can change in time and space, as well as come in and out of existence. Contrary, abstract entity is eternal and unchangeable. Identity denotes the sameness of a concrete entity as it changes in time. Abstract entities don't have an identity since they are immutable. Program is built of concrete entities called terms. Terms have attributes that can change in time, without affecting the identity of a term. Attributes are abstract entities. In each particular point of space and time a term is represented by a snapshot of all its attributes, colloquially called value. Functions that change the value of a term in fact return a new value with different set of attributes. For example, def
term has two attributes: left hand side (lhs), that associates definition with abstract variable, and right hand side (rhs) that associates def
with an abstract expression. Suppose, that the definition was:
# let d_1 = Def.create x Bil.(var y + var z);;
val d_1 : Def.t = 00000001: x := y + z
To change the right hand side of a definition we use Def.with_rhs
that returns the same definition but with different value:
# let d_2 = Def.with_rhs d_1 Bil.(int Word.b1);;
val d_2 : Def.t = 00000001: x := true
d_1
and d_2
is different values
# Def.equal d_1 d_2;;
- : bool = false
of the same term
# Term.same d_1 d_2;;
- : bool = true
The identity of this terms is denoted by the term identifier (tid
). In the textual representation term identifiers are printed as ordinal numbers.
Terms, can contain other terms. But unlike BIL expressions or statements, this relation is not truly recursive, since the structure of program term is fixed: arg
, phi
, def
, jmp
are leaf terms; sub
can only contain arg
's or blk
's; blk
consists of phi
, def
and jmp
sequences of terms, as pictured in the figure below. Although, the term structure is closed to changes, you still can extend particular term with attributes, using set_attr
and get_attr
functions of the Term module. This functions are using extensible variant type to encode attributes.
+--------------------------------------------------------+
| +-------------------+ |
| | program | |
| +---------+---------+ |
| |* |
| +---------+---------+ |
| | sub | |
| +---------+---------+ |
| | |
| +-----------------+---------------+ |
| |* |* |
| +-----+-------+ +-------+-------+ |
| | arg | | blk | |
| +-------------+ +-------+-------+ |
| | |
| +---------------+--------------+ |
| |* |* | * |
| +-----+-----+ +-----+-----+ +----+-----+ |
| | phi | | def | | jmp | |
| +-----------+ +-----------+ +----------+ |
+--------------------------------------------------------+
Working with project
There're two general approaches to obtain a value of type project:
- create it manually using
Project.create
function; - write a plugin to the
bap
frontend.
Although the first approach is simplistic and gives you a full control, we still recommend to use the latter.
To write a program analysis plugin (or pass in short) you need to implement a function with one of the following interfaces:
project -> project
and register it with register_pass;project -> unit
and register it with register_pass';
Once loaded from the bap
frontend (see bap --help
) this function will be invoked with a value of type project that provides access to all information gathered from the input source. If the registered function returns a non unit
type, then it can functionally update the project state, e.g., add annotations, discover new symbols, transform program representation, etc.
Example
The following plugin prints all sections in a file:
open Core_kernel.Std
open Bap.Std
open Format
let print_sections p =
Project.memory p |> Memmap.to_sequence |> Seq.iter ~f:(fun (mem,x) ->
Option.iter (Value.get Image.section x) ~f:(fun name ->
printf "Section: %s@.%a@." name Memory.pp mem))
let () = Project.register_pass' print_sections
Note: this functionality is provided by the print
plugin.
To pass data from one pass to another in a type safe manner, we use universal values. Values can be attached to a particular memory region, IR terms, or put into the storage
dictionary. For the first case we use the memmap data structure. It is an interval tree containing all the memory regions that are used during analysis. For the storage
we use Dict
data structure. Also, each program term, has its own dictionary.
Memory annotations
By default the memory is annotated with the following attributes:
- section -- for regions of memory that had a particular name in the original binary. For example, in ELF, sections have names that annotate a corresponding memory region. If project was created from memory object, then the overall memory will be marked as a
"bap.user"
section.
- segment -- if the binary data was loaded from a binary format that contains segments, then the corresponding memory regions are be marked. Segments provide access to permission information.
BAP API