- `cc`.
- Merge buffers representational abstraction (one per virtual device).
- `cuda-gdb` and `compute-sanitizer` (pass the right arguments to `cudajit`).
- `%cd` syntax.
- `Sync_backend`, creating CPU backends with a single device only, where all calls are synchronous. (It's a baseline and helps debugging.)
- `%op` syntax: when under a `~config` parameter, refine the inline declared params' labels with `config.label`.
- `%op` syntax: incorporate the input tensor's (if any) label in the resulting tensor's label.
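To make the two `%op` labeling rules above concrete, here is a plain-OCaml sketch of the intended label flow. It only illustrates the described behavior, not the PPX's actual implementation; the `config` record, the helper names, and the concatenation order are assumptions.

```ocaml
(* Hypothetical sketch of the labeling rules above, in plain OCaml.
   Labels are taken to be lists of strings (an assumption of this sketch). *)
type config = { label : string list }

(* Rule 1: an inline-declared param's label is refined with [config.label]. *)
let param_label ~(config : config) name = name :: config.label

(* Rule 2: the resulting tensor's label incorporates the input tensor's label. *)
let result_label ~op_label ~input_label = op_label @ input_label

let () =
  let config = { label = [ "mlp_layer" ] } in
  assert (param_label ~config "w" = [ "w"; "mlp_layer" ]);
  assert (result_label ~op_label:[ "relu" ] ~input_label:[ "x" ] = [ "relu"; "x" ])
```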
- Further terminology refactoring: Renamed `Low_level.compile` to `Low_level.lower`; `Low_level.compiled` to `Low_level.optimized`, making it a record.
- Further refactoring of the `Backends` API: the `device` type split into a virtual `device` and a `physical_device`; `merge` removed, instead relying on merge buffers.
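As a rough illustration of the merge-buffer abstraction mentioned above (one buffer per virtual device, in place of a dedicated `merge` operation), here is a hypothetical sketch; the types and function names are invented for exposition and are not the actual `Backends` signatures.

```ocaml
(* Hypothetical sketch: one merge buffer per virtual device. A backend only
   needs to support filling that buffer from another device's data; the
   "merge" itself is an ordinary accumulation that reads from the buffer. *)
type buffer = float array

type virtual_device = {
  id : int;
  mutable merge_buffer : buffer option;  (* at most one per virtual device *)
}

(* Device-to-device transfer: stage the source data in dst's merge buffer. *)
let stage_merge ~(dst : virtual_device) (src : buffer) =
  dst.merge_buffer <- Some (Array.copy src)

(* Accumulate the staged data into dst's own copy of the tensor,
   e.g. when merging gradients across devices. *)
let merge_add ~(dst : virtual_device) (dst_data : buffer) =
  Option.iter
    (fun staged ->
      Array.iteri (fun i v -> dst_data.(i) <- dst_data.(i) +. v) staged)
    dst.merge_buffer
```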
- `Tnode.debug_name` function.
- `Utils.settings.log_level` unified with ppx_minidebug's log levels.
- `link` (resp. `link_batch`), since they won't get introduced into the context. It is the responsibility of helper functions (such as those in `Train`) to ensure the check.
- `%op` syntax: lift `~config` applications out of (tensor) functions.
- `randomness_lib`: currently only `stdlib` and `for_tests`.
- `cuda_backend.missing.ml`.
- `ocannl` renamed to `neural_nets_lib`, to prevent the opam linter from complaining about a confusing name.
- `let%cd _ =` (and `let%op _ =`?) do not affect root tracking (intended for adding shape constraints).
- `beg_dims` and rightmost axes `dims`.
- `IDX` and `CDSL` moved to `Train`, and `Tensor.O` to the more precise `Operation.At`.
- `Tensor.mli`, to reduce "the user learning surface".
- `Shape.mli`.
- The `npy` package while we wait for a PR.
- `cudajit` moved to `depopts`.
- `beg_dims` in constraints.
- `outer_sum`: like `einsum` but simpler, addition everywhere.
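To clarify "addition everywhere": where a generalized einsum multiplies the operand cells and then sums over the reduced axes, the outer sum uses addition in both places. A plain-OCaml sketch over 2D arrays, purely illustrative and not the OCANNL API:

```ocaml
(* c.(i).(j) = sum over k of a.(i).(k) *. b.(k).(j)  -- einsum "ik;kj=>ij" *)
let einsum_ik_kj a b =
  let n = Array.length a and k = Array.length b and m = Array.length b.(0) in
  Array.init n (fun i ->
      Array.init m (fun j ->
          let acc = ref 0.0 in
          for l = 0 to k - 1 do
            acc := !acc +. (a.(i).(l) *. b.(l).(j))
          done;
          !acc))

(* c.(i).(j) = sum over k of a.(i).(k) +. b.(k).(j)  -- "addition everywhere" *)
let outer_sum_ik_kj a b =
  let n = Array.length a and k = Array.length b and m = Array.length b.(0) in
  Array.init n (fun i ->
      Array.init m (fun j ->
          let acc = ref 0.0 in
          for l = 0 to k - 1 do
            acc := !acc +. (a.(i).(l) +. b.(l).(j))
          done;
          !acc))
```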
- `Tensor.consume_forward_code` and `consume_backprop_code`, (optionally but by default) used from `Train`.

Major rewrite. Abandoning the design choices of 0.1 and 0.2.

- `Train` module as an optimization "frontend".
- `ppx_minidebug` (the `debug_log_from_routines` setting).
- `arrayjit` library / package containing compilation (former Ndarray, Node, Code).
- `Formula` -> `Tensor`. No more "form vs. non-form" formulas / tensors.
- The `%nn_rs`, `%nn_dt` syntaxes and the `Synthetic` fetch primitive.
- `%nn_op` renamed to `%op` and `%nn_cd` to `%cd`.
- `gccjit` moved into a separate repository.
- `cudajit` moved into a separate repository.
- `zero_out` renamed to `initialize_neutral`, to prepare for arbitrary accumulation operations.
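The point of that rename is that "zeroing out" is just the special case of initializing an accumulator to the neutral element of its accumulation operation. A small illustrative sketch (not the library's code):

```ocaml
(* Each accumulation operation starts from its identity element, so that
   accumulating into a freshly initialized array is a no-op. *)
type accum = Add | Mul | Max | Min

let neutral = function
  | Add -> 0.0            (* x +. 0.0 = x *)
  | Mul -> 1.0            (* x *. 1.0 = x *)
  | Max -> neg_infinity   (* max x neg_infinity = x *)
  | Min -> infinity       (* min x infinity = x *)

let initialize_neutral op arr = Array.fill arr 0 (Array.length arr) (neutral op)
```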
- `Node` -> `Lazy_array` -> `Tnode` (tensor node).

The Cuda backend.

- `Exec_as_cuda` backend where the dedicated `Task_id` axis parallelizes over blocks, and a new dedicated `Sample_num` axis parallelizes over threads in a block.
- `.cu` code and the assembly `.ptx` code.
- `Zero_out` low-level-code primitive using `memset`.
- `Staged_compilation` low-level-code primitive: a (stateful) callback for use by backends.
- `.npz` (`.npy` archive) files (untested).
- Low-level code based optimizations: `ToPowOf` with integer exponent.
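One such optimization, "`ToPowOf` with integer exponent", amounts to unrolling whole-number powers into repeated multiplication. A schematic sketch on an invented mini-IR, not the actual `Low_level` representation:

```ocaml
type expr =
  | Const of float
  | Get of string
  | Mul of expr * expr
  | PowOf of expr * float

(* Rewrite x ** n, for whole-number n >= 1, into x * x * ... * x. *)
let rec optimize = function
  | PowOf (e, p) when Float.is_integer p && p >= 1.0 ->
      let e = optimize e in
      let rec unroll k acc = if k <= 1 then acc else unroll (k - 1) (Mul (acc, e)) in
      unroll (int_of_float p) e
  | PowOf (e, p) -> PowOf (optimize e, p)
  | Mul (a, b) -> Mul (optimize a, optimize b)
  | (Const _ | Get _) as e -> e
```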
- `axis_index`: simplified the axes-related types.
- `'a low_level` split into monomorphic `unit_low_level` and `float_low_level`.
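Schematically, that change replaces a single type-indexed (GADT) representation with two mutually recursive monomorphic types; the constructors below are invented for illustration and are not the real `Low_level` definitions.

```ocaml
(* Before: one GADT covering both statements (unit) and float expressions. *)
type _ low_level_gadt =
  | Lines : unit low_level_gadt array -> unit low_level_gadt
  | Set : string * float low_level_gadt -> unit low_level_gadt
  | Get : string -> float low_level_gadt
  | Binop : string * float low_level_gadt * float low_level_gadt -> float low_level_gadt

(* After: two plain, mutually recursive monomorphic types. *)
type unit_low_level =
  | Lines of unit_low_level array
  | Set of string * float_low_level

and float_low_level =
  | Get of string
  | Binop of string * float_low_level * float_low_level
```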
- `Node` + `NodeUI` into `Ndarray` + `Node`.
- `NaN` or `infinity`.
- `Task_id` functionality: removes `If_task_id_is` and `Global Task_id`; removes parallelism from `interpret_code`; removes `task_id_func` vs `unit_func` duplication.
- `task_id` and `sample_num` bindings.
- `PrintBox_utils` benchmark table cells.
- The Gccjit backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
- Three / four synchronization heuristics:
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- (`memcpy`).
- `%nn_rs` ("postprocess results" variant of `%nn_dt`) for computations that should happen at the end of task execution / refresh step. It's meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
- The `Rebalance` constructor is kept around, but it is not playing any role.
- `debug_virtual_nodes`, which was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead.
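A toy illustration of that rule using plain OCaml arrays (not OCANNL tensors): the outermost array plays the role of the `Parallel` axis, which is preserved, while the dynamic index applies to the next axis to the right.

```ocaml
(* t has shape [|tasks; n; m|]; indexing with [i] keeps the tasks axis and
   selects along the n axis, yielding shape [|tasks; m|]. *)
let dynamic_index (t : float array array array) i : float array array =
  Array.map (fun per_task -> per_task.(i)) t

let () =
  let t = Array.init 4 (fun _ -> Array.make_matrix 3 2 0.) in
  let slice = dynamic_index t 1 in
  assert (Array.length slice = 4 && Array.length slice.(0) = 2)
```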
- Thread-local parameter `task_id` for automated iteration over a dimension `Parallel`.
- `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
- `Gccjit`.
- `OCaml` backend.
- `%nn_dt` and `%nn_op` shape specification: allows identifiers.
- `Gccjit` execution by printing the comments.
- `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times to be executed at arbitrary other times within a session.
- `Interpreter` backend that can, for example, log all individual tensor modifications.
- `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).