Chapter 21 Profiling (ocamlprof, perf)

This chapter describes different ways to profile the execution of OCaml programs.

1 Profiling execution counts with ocamlprof

This section describes how the execution of OCaml programs can be profiled, by recording how many times functions are called, and branches of conditionals are taken. This approach applies to OCaml programs compiled with the bytecode and native backends.

1.1 Compiling for profiling

Before profiling an execution, the program must be compiled in profiling mode, using the ocamlcp front-end to the ocamlc compiler (see chapter 13) or the ocamloptp front-end to the ocamlopt compiler (see chapter 16). When compiling modules separately, ocamlcp or ocamloptp must be used when compiling the modules (production of .cmo or .cmx files), and can also be used (though this is not strictly necessary) when linking them together.

Note

If a module (.ml file) doesn’t have a corresponding interface (.mli file), then compiling it with ocamlcp will produce object files (.cmi and .cmo) that are not compatible with the ones produced by ocamlc, which may lead to problems (if the .cmi or .cmo is still around) when switching between profiling and non-profiling compilations. To avoid this problem, you should always have a .mli file for each .ml file. The same problem exists with ocamloptp.

Note

To make sure your programs can be compiled in profiling mode, avoid using any identifier that begins with __ocaml_prof.

The amount of profiling information can be controlled through the -P option to ocamlcp or ocamloptp, followed by one or several letters indicating which parts of the program should be profiled:

a
all options
f
function calls : a count point is set at the beginning of each function body
i
if …then …else … : count points are set in both then branch and else branch
l
while, for loops: a count point is set at the beginning of the loop body
m
match branches: a count point is set at the beginning of the body of each branch
t
try …with … branches: a count point is set at the beginning of the body of each branch

For instance, compiling with ocamlcp -P film profiles function calls, if…then…else…, loops and pattern matching.

Calling ocamlcp or ocamloptp without the -P option defaults to -P fm, meaning that only function calls and pattern matching are profiled.

Note

For compatibility with previous releases, ocamlcp also accepts the -p option, with the same arguments and behaviour as -P.

The ocamlcp and ocamloptp commands also accept all the options of the corresponding ocamlc or ocamlopt compiler, except the -pp (preprocessing) option.

1.2 Profiling an execution

Running an executable that has been compiled with ocamlcp or ocamloptp records the execution counts for the specified parts of the program and saves them in a file called ocamlprof.dump in the current directory.

If the environment variable OCAMLPROF_DUMP is set when the program exits, its value is used as the file name instead of ocamlprof.dump.

The dump file is written only if the program terminates normally (by calling exit or by falling through). It is not written if the program terminates with an uncaught exception.

If a compatible dump file already exists in the current directory, then the profiling information is accumulated in this dump file. This allows, for instance, the profiling of several executions of a program on different inputs. Note that dump files produced by byte-code executables (compiled with ocamlcp) are compatible with the dump files produced by native executables (compiled with ocamloptp).

1.3 Printing profiling information

The ocamlprof command produces a source listing of the program modules where execution counts have been inserted as comments. For instance,

        ocamlprof foo.ml

prints the source code for the foo module, with comments indicating how many times the functions in this module have been called. Naturally, this information is accurate only if the source file has not been modified after it was compiled.

The following options are recognized by ocamlprof:

-args filename
Read additional newline-terminated command line arguments from filename.
-args0 filename
Read additional null character terminated command line arguments from filename.
-f dumpfile
Specifies an alternate dump file of profiling information to be read.
-F string
Specifies an additional string to be output with profiling information. By default, ocamlprof will annotate programs with comments of the form (* n *) where n is the counter value for a profiling point. With option -F s, the annotation will be (* sn *).
-impl filename
Process the file filename as an implementation file, even if its extension is not .ml.
-intf filename
Process the file filename as an interface file, even if its extension is not .mli.
-keywords version+list
Set keywords according to the version+list specification.
-version
Print version string and exit.
-vnum
Print short version number and exit.
-help or --help
Display a short usage summary and exit.

Profiling with ocamlprof only records execution counts, not the actual time spent within each function. There is currently no way to perform time profiling on bytecode programs generated by ocamlc.

2 Time profiling with Linux perf

This section describes how to use perf to profile OCaml programs. Linux Performance Events (aka perf) is a suite of tools for performance observability included with the Linux kernel. The main features covered here are recording events using perf-record(1), and visualising recorded events using perf-report(1). perf has many additional profiling and visualising features, for more comprehensive documentation, see (man perf(1), perf wiki or Brendan Gregg’s Blog).

2.1 Background

Time profiling is typically performed by sampling the CPU call graph at frequent intervals to gather statistics on the code paths that are consuming CPU resources. To profile OCaml code, perf needs to understand the call graph of an OCaml program. perf supports multiple options for recording call graphs:

Frame pointer based call graphs use a convention where a register called the frame pointer holds the address for the beginning of the stack frame, and the previous values for the frame pointer are stored at a known offset from the current frame pointer. Using this information the call graph for the current function can be calculated purely based on the current stack, a process called unwinding. On x86_64 the register used for storing the frame pointer is $rbp, while ARM64 uses the register x29. OCaml 5 introduced non-contiguous stacks as part of the implementation of effects, see Retrofitting effect handlers onto OCaml (Section 5.5). Due to this implementation detail, frame pointers work better with the copying nature of perf.

DWARF based call graphs use the DWARF CFI information to perform unwinding. However this interacts poorly with the copying nature of perf, often leading to truncated call graphs where not enough of the stack has been copied by perf. It also produces larger trace files that are more costly to capture and process. Finally it requires including CFI debugging information in your program resulting in slightly larger binaries.

Hardware Last Branch Record (LBR) uses a processor provided method to record call graphs. This has the limitations of restricted availability (only on certain Intel CPUs) and a limited stack depth (16 on Haswell and 32 since Skylake).

Of these options, frame pointers are recommended for profiling OCaml code. It has the following advantages:

2.2 Compiling for Profiling

The process for compiling an executable for time profiling depends on the OCaml version. For OCaml versions 4.14 and earlier, either frame pointers or DWARF can be used, while for OCaml 5.0 and later, enabling frame pointers is recommended.

To enable frame pointers, configure the compiler with --enable-frame-pointers. Alternatively, you can install an opam switch with frame pointers enabled, as follows:

        opam switch create <YOUR-SWITCH-NAME-HERE> ocaml-option-fp

Frame pointer support for OCaml is available on x86_64 architecture for Linux starting with OCaml 4.12 and on macOS from OCaml 5.3. ARM64 architecture is supported on Linux and macOS from OCaml 5.4, while other Tier-1 architectures (POWER, RISC-V, and s390x) are currently unsupported.

2.3 Profiling an Execution

The basic perf command for profiling is:

        perf record -F 99 --call-graph fp PROGRAM

The -F 99 option sets perf to sample at 99Hz, reducing excessive data generation during longer runs and minimising overlap with other periodic activities. The --call-graph fp instructs perf to use frame pointers to get the call graph, followed by the OCaml executable you want to profile. This command creates a perf.data file in the current directory. Alternatively use --output to choose a more descriptive filename.

The perf record command works by copying a segment of the call stack at each sample and recording this data into a perf.data file. These samples can then be processed after recording using perf report to reconstruct the profiled program’s call stack at every sample.

perf uses the symbols present in an OCaml executable, so it helps to understand OCaml’s name mangling scheme to map names to OCaml source locations. OCaml 5.4 onwards uses a name mangling scheme of caml<MODULE_NAME>$<FUNCTION_NAME>_<NNN>, where NNN is a randomly generated number. In OCaml 5.3 the separator is a dot caml<MODULE_NAME>.<FUNCTION_NAME>_<NNN>, and before OCaml 5.1 the separator is double underscores. All formats are supported by perf.

Consider the following program:

(* fib.ml *) module Compute = struct let rec fib n = if n = 0 then 0 else if n = 1 then 1 else fib (n-1) + fib (n-2) end let main () = let r = Compute.fib 20 in Printf.printf "fib(20) = %d" r let _ = main ()

This program produces the names camlFib$main_274 for the main function and camlFib$fib_271 for the fib function in the Compute module.

2.4 Displaying profiling information

The perf report command summarises data in the perf.data file. The basic perf report command is:

        perf report -f --no-children -i perf.data

This command provides an interactive interface where you can navigate through the accumulated call graphs and select functions and threads for detailed information. Alternatively --stdio will output similar data using a text based report written to stdout. Note that if stack traces appear broken, it may be due to software not having frame pointer support enabled.

Consider the following program which calculates the Tak function.

let (x,y,z) = try let x = int_of_string Sys.argv.(1) in let y = int_of_string Sys.argv.(2) in let z = int_of_string Sys.argv.(3) in (x,y,z) with _ -> (18,12,6) let rec tak x y z = if x > y then tak (tak (x-1) y z) (tak (y-1) z x) (tak (z-1) x y) else z let main () = let r = tak x y z in Printf.printf "tak %d %d %d = %d\n" x y z r let _ = main ()

The perf report for this might resemble the following.

Samples: 809  of event 'cycles', Event count (approx.): 24701952617
  Overhead  Command     Shared Object      Symbol
-  100.00%  tak-fp.exe  tak-fp.exe         [.] Tak\$tak_402
     Tak\$tak_402
     Tak\$tak_402
   + Tak\$tak_402
     0.00%  tak-fp.exe  [kernel.kallsyms]  [k] 0xffffb9a5ff79d854
     0.00%  perf-exec   [kernel.kallsyms]  [k] 0xffffb9a5ff719c34

For a more interesting program implementing a Sudoku solver using Norvig’s Backtracking algorithm, the output might resemble.

Samples: 7K of event 'cycles:P', Event count (approx.): 141874642467
  Overhead  Command          Shared Object      Symbol
-   35.90%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig._pow_747
   - 35.29% Sudoku_norvig$_pow_747
      + 17.70% Sudoku_norvig$eliminate_value_760
      + 15.74% Sudoku_norvig$_pow_747
      + 1.03% Sudoku_norvig$fun_1435
     0.61% Sudoku_norvig$eliminate_value_760
-   12.99%  sudoku_norvig.e  sudoku_norvig.exe  [.] BatList$findnext_1543
   - 12.00% BatList$findnext_1543
      - 11.71% BatList$find_all_1539
         + 8.07% Sudoku_norvig$fun_1421
         + 2.73% Sudoku_norvig$eliminate_948
         + 0.90% Sudoku_norvig$assign_949
-   10.63%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$eliminate_value_760
   + 8.29% Sudoku_norvig$eliminate_value_760
   + 1.93% Sudoku_norvig$_pow_747
+    9.50%  sudoku_norvig.e  [kernel.kallsyms]  [k] asm_exc_nmi
+    7.24%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1435
+    5.06%  sudoku_norvig.e  sudoku_norvig.exe  [.] caml_modify
+    3.10%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1360
+    2.11%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$eliminate_948
+    2.00%  sudoku_norvig.e  sudoku_norvig.exe  [.] Stdlib.List$iter_366
+    1.54%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1421
+    1.47%  sudoku_norvig.e  sudoku_norvig.exe  [.] caml_apply2
+    1.42%  sudoku_norvig.e  sudoku_norvig.exe  [.] BatList$find_all_1539
+    1.20%  sudoku_norvig.e  sudoku_norvig.exe  [.] Sudoku_norvig$fun_1366
+    1.09%  sudoku_norvig.e  sudoku_norvig.exe  [.] Stdlib.List$fold_left_380

Note this implementation clearly has some optimsation possible in ‘_pow_747‘ a recursive Power of 2 function, and parts of the OCaml runtime ‘caml_modify‘ are present.

Profiling data can also be visualised as Flame Graphs, which highlight the most frequent code paths in stack traces. The original scripts scripts stackcollapse-perf.pl and flamegraph.pl can be found at Brendan Gregg’s Flame Graphs web page and work as follows:

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph

# Collect the results into perf.data
perf record -F 99 --call-graph fp PROGRAM
perf script -i perf.data | ./stackcollapse-perf.pl out.folded
flamegraph.pl out.folded > flamegraph.svg ## Create the FlameGraph svg

Alternatively inferno is a Rust port of the Flame Graphs tools which works in a similar way and is faster to process large perf files.

cargo install inferno

# Collect the results into perf.data
perf script -i perf.data | inferno-collapse-perf > stacks.folded
cat stacks.folded | inferno-flamegraph > flamegraph.svg

Some perf tools (e.g., perf-report(1) and perf-annotate(1)) use DWARF debugging information to associate symbols with source code locations, if you need these features, the program and it’s dependencies need to be compiled with -g to include debugging information in the executable.

Captured profile data can also be processed using perf script in various ways or with online tools like speedscope.app and profiler.firefox.com, or any other tool that supports perf-formatted data.

2.5 Conclusion

For Time profiling of native code, standard tools such as perf, eBPF, DTrace, or Instruments (on macOS) are recommended. Compiling with frame pointers enabled is often necessary for these tools to work effectively. Profiling with gprof is no longer supported.

Enabling frame pointers can impact performance on certain architectures (up to 10% performance cost on x86_64 has been measured). Users of this feature are encouraged to benchmark their own applications to assess this impact.

3 Time profiling with macOS Instruments

Instruments is an application performance analyzer and visualizer provided in XCode on macOS. It includes a CLI tool xctrace that can capture Time Profiling information about OCaml programs, there is also a GUI application for running and visualising profiles. Here we only cover capturing time profiles with xctrace, for more information on Instruments consult Apple’s documentation.

The same prerequisites hold for Time profiling with macOS Instruments as for Linux perf, consult Background21.2.1) from the previous section for an explanation of call graphs and follow the setup from Compiling for Profiling21.2.2).

The basic xctrace command for time profiling is:

$ xcrun xctrace record --output . --template "Time Profiler" \
  --target-stdout - --launch -- PROGRAM

The --template Time Profiler selects pre-defined template for time profiling which will create a trace file with extension ‘.trace‘ that can be opened with Instruments. Use the --output to select a name for the trace file, by default xctrace will create a new file each run rather than overwriting the existing file.

The resulting file can be opened with Instruments to view the profile, and it maybe be possible to generate Flame Graphs using inferno or the scripts from https://github.com/brendangregg/FlameGraph.

Note

Sometimes Instruments requires adding entitlements to an OCaml binary before it is able to profile it. This typically shows as a failure to launch the process or as a permission error. To work around this issue run the following command, replacing ’PROGRAM’ with the program to trace.

$ codesign -s - -v -f --entitlements =(echo -n
'<?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
   "https://www.apple.com/DTDs/PropertyList-1.0.dtd"\>
 <plist version="1.0">
    <dict>
        <key>com.apple.security.get-task-allow</key>
        <true/>
    </dict>
 </plist>') PROGRAM

For more details on entitlements and codesigning consult Apple’s documentation.

4 Glossary

The following terminology is used in this chapter of the manual.