OCANNL is sponsored by Ahrefs! Visit the Ahrefs website.
The long-term goal is to provide several "low-level" backends, drawing inspiration from projects such as tinygrad, TVM, and Luminal.
Computation is ultimately expressed as simple code built from assignments and for loops. The library users can compile any amount of code into a routine (i.e. a compilation unit). The user decides explicitly what the scope of a compilation unit is, by putting together the corresponding code, depending on the use case.
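For illustration, here is a hedged sketch of choosing the scope of a compilation unit. Train.grad_update, Train.to_routine, Train.run and IDX.empty are modeled on the OCANNL documentation but should be treated as assumptions here, and ctx stands for an already-initialized backend context:

```ocaml
(* Compile a whole gradient update (forward pass + backprop) as a single
   routine; the helper names are assumptions modeled on the OCANNL docs. *)
let update = Train.grad_update scalar_loss in
let routine = Train.to_routine (module Backend) ctx IDX.empty update in
(* Alternatively, compile the parameter update (e.g. SGD) as a separate
   routine and schedule the two routines in sequence. *)
Train.run routine
```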
Tensor axes are split into kinds: batch, input and output. Tensor dimensions have optional labels.
OCANNL has full support for a significantly extended einsum notation, integrated with shape inference. Supports static indexing, with a built-in operation to take a slice of the batch axes, integrated with shape inference. Extensible to more static indexing patterns as needs arise.
OCANNL offers two main levels of abstraction:

Tensor expressions as differentiable computations, centered around the %op syntax extension. %op stands for "operation"; it's meant to express tensors (Tensor.t) and tensor functions.

Plain computations, centered around the %cd syntax extension. It integrates the arrayjit backend library with shape inference. %cd stands for "code"; it's meant to express assignment computations (Assignments.comp).

Fully supports mixed-precision computations, with bidirectional precision inference.
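As a taste of the two levels, a minimal sketch, assuming operator and accessor names from the OCANNL syntax extensions documentation (relu, the string-literal parameter syntax, the =- accumulation, and .grad access); exact details may differ between versions:

```ocaml
(* %op level: a differentiable tensor function. String literals such as "w"
   and "b" introduce trainable parameters, with shapes inferred. *)
let%op mlp_layer x = relu (("w" * x) + "b")

(* %cd level: a plain assignment computation, e.g. a bare SGD step.
   [=-] accumulates by subtraction; [p.grad] is p's gradient node.
   This is a sketch modeled on Train's SGD helpers, not their exact code. *)
let%cd sgd_step ~learning_rate p = p =- learning_rate *. p.grad
```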
The CUDA backend requires at least CUDA version 12.8. The Metal backend requires at least MSL version 3.1.
API documentation entry point.
A possible route to learning OCANNL:
Context.
To use the debug logging enabled by configuring Utils.settings.debug_log_from_routines <- true with the cuda backend, you need to wrap the code scheduling tasks and synchronizing cuda devices with Utils.capture_stdout_logs. The reason is that CUDA kernels are allowed to use printf, but not fprintf: the driver dumps the printing buffer of a device to stdout at certain times (e.g. when synchronizing the device). For an example, see the implementation of Train.example_train_loop. Specifically, it wraps two sections: the call to Train.parallel_update, and the body of the returned infer_callback.
NOTE: debug logging from CUDA in complex settings is a bit tricky; it involves another thread (domain) intercepting and filtering stdout. If you face issues, try the setting never_capture_stdout=true (see ocannl_config.example).
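Schematically, the wrapping looks like this. Only Utils.settings.debug_log_from_routines and Utils.capture_stdout_logs come from the text above (assuming the latter takes a thunk); run_step and sync_device are placeholders for your own scheduling and synchronization code:

```ocaml
let () =
  Utils.settings.debug_log_from_routines <- true;
  (* Assumed signature: capture_stdout_logs : (unit -> 'a) -> 'a *)
  Utils.capture_stdout_logs (fun () ->
      run_step ();    (* schedules work on the cuda device *)
      sync_device () (* synchronizing flushes the device's printf buffer *))
```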
This is very tentative.
0.6.2: Shape inference improvements, convolution NNs, toy generative transformer.
0.7: CPU-style performance and memory efficiency.
0.7.1: Real-life transformer inference. HIP backend (AMD hardware) and WebGPU backend.
0.8: GPU-style performance -- low-hanging fruit.
0.8.1: Shape understanding and manipulation enhancements.
0.9: Optimize performance: program search.
1.0: Few documentation gaps, some degree of feature completeness, ergonomics, safety, and the issues labeled "explore".

For more details, see CHANGES.
0.6.1: Syntax extension improvements, transformer building blocks.
0.6: More precisions, initialization, counter-based randomness, strided iteration.
0.5.3: Apple Metal backend.
0.5.2: More primitive operations. %cd and %op support both curried and uncurried operator application syntax. Improved %cd syntax (better projections inference).
0.5.0: Stream-to-stream synchronization at the buffer level. Condition-based events for CPU backends.
0.4.1: Half precision, mixed precision, CUDA virtual devices (virtual devices renamed to streams in 0.5.0).
v0.3: Shape inference, jitted routines: a major rewrite of the whole project.
v0.2: Inching toward GPU.
v0.1: GCCJIT backend: Gccjit backend, single and double precision floats, code compiled as a monolithic update step function.

OCANNL follows different design choices than OWL. For example:
Some aspects are more centralized in OCANNL than in OWL and form the "infrastructure":

Tensor implements "putting pieces together".
Train has the optimization "frontend" and utilities.
arrayjit, which may one day become a standalone library: generates the code, performs backend-agnostic optimizations (virtual nodes whose computation is inlined), implements the backends.

Some aspects that are more core to OWL are less encapsulated in OCANNL, so it should be more natural to extend them.
Although the project is called ocannl, the main package is called neural_nets_lib, to avoid the (opam linter's) complaint that the name can be confused with other packages. This also clarifies that ocannl is composed of arrayjit and neural_nets_lib.
The dependency on cudajit is optional, so you have to install it explicitly to enable the CUDA backend. The dependency on metal is macOS-specific but automatic.
The codebase is organized to separate user-facing recipes from framework internals:

lib/: User-facing recipes and utilities
  train.ml - Training utilities and optimizers
  nn_blocks.ml - Neural network building blocks (transformers, attention, convolution, etc.)
  ocannl.ml - Re-exports for backward compatibility
tensor/: Framework internals (separate library ocannl_tensor)
  tensor.ml/mli - Core tensor type and operations
  shape.ml/mli - Shape inference system
  operation.ml - Tensor operations and DSL modules
  ppx_*.ml - Syntax extensions implementation
arrayjit/: Low-level optimizing compiler with multiple backends

NOTE TO POTENTIAL CONTRIBUTORS: while I might be slowly starting to work with PRs in separate branches rather than just a stream of commits on the main branch, design migrations will be broken into small PRs to avoid main (master) branch staleness, and many changes will still be commits on the main branch. We allow failing tests on the main branch, although going forward this should hopefully happen less. Tagged, i.e. released, versions of the code are guaranteed to work as well as the given stage of the project permitted. The policy is that for releases, all tests must pass with the sync_cc backend, and all other backends must exhibit the behavior expected of a backend. We try to minimize discrepancy across backends, but prefer more stringent tests even if some backends only pass them "in spirit" rather than matching the exact expectations of the sync_cc backend.
OCANNL uses ppx_minidebug for debugging. We have migrated to a per-file opt-in scheme for enabling ppx_minidebug at compile time (via environment variables; see the top of the .ml files in question), combined with a unified log level configuration (ocannl_log_level) for tuning logging at runtime. Because the per-file settings take effect at compile time, run dune clean after setting/exporting one of these environment variables.
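For instance, a runtime logging setup could look like the following in an ocannl_config file; ocannl_log_level and never_capture_stdout appear above, while using debug_log_from_routines as a config key is an assumption to be checked against ocannl_config.example:

```
ocannl_log_level=2
debug_log_from_routines=true
never_capture_stdout=false
```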