- `randomness_lib`: currently only `stdlib` and `for_tests`.
- … `cuda_backend.missing.ml`.
- Renamed the main package from `ocannl` to `neural_nets_lib`, to prevent the opam linter from complaining about a confusing name.
- `let%cd _ =` (and `let%op _ =`?) do not affect root tracking (intended for adding shape constraints).
- … `beg_dims` and rightmost axes `dims`.
- Moved `IDX` and `CDSL` to `Train`, and `Tensor.O` to the more precise `Operation.At`.
- … `Tensor.mli`, to reduce "the user learning surface".
- … `Shape.mli`.
- … `npy` package while we wait for a PR.
- Moved `cudajit` to `depopts`.
- … `beg_dims` in constraints.
- `outer_sum`: like `einsum` but simpler, addition everywhere (see the sketch after this list).
- `Tensor.consume_forward_code` and `consume_backprop_code`, (optionally but by default) used from `Train`.
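The following is a minimal, hypothetical plain-OCaml sketch of what "addition everywhere" can mean for `outer_sum`: where an einsum-style outer product would multiply the paired entries, the outer sum adds them. It does not use OCANNL's API; the function name and the array-based representation are illustrative assumptions only.

```ocaml
(* Hypothetical sketch, not OCANNL's API: an "outer sum" of two vectors.
   An einsum-style outer product would compute a.(i) *. b.(j);
   the "addition everywhere" variant computes a.(i) +. b.(j) instead. *)
let outer_sum_sketch (a : float array) (b : float array) : float array array =
  Array.init (Array.length a) (fun i ->
      Array.init (Array.length b) (fun j -> a.(i) +. b.(j)))

let () =
  let r = outer_sum_sketch [| 1.; 2. |] [| 10.; 20.; 30. |] in
  (* Prints:
     11 21 31
     12 22 32 *)
  Array.iter
    (fun row ->
      Array.iter (Printf.printf "%g ") row;
      print_newline ())
    r
```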
Major rewrite. Abandoning the design choices of 0.1 and 0.2.
- The `Train` module as an optimization "frontend".
- `ppx_minidebug` (the `debug_log_from_routines` setting).
- The `arrayjit` library / package containing compilation (former Ndarray, Node, Code).
- `Formula` -> `Tensor`.
- No more "form vs. non-form" formulas / tensors.
- Removed the `%nn_rs`, `%nn_dt` syntaxes and the `Synthetic` fetch primitive.
- Renamed `%nn_op` to `%op` and `%nn_cd` to `%cd`.
- Moved `gccjit` into a separate repository.
- Moved `cudajit` into a separate repository.
- Renamed `zero_out` to `initialize_neutral`, to prepare arbitrary accumulation operations (see the sketch after this list).
- `Node` -> `Lazy_array` -> `Tnode` (tensor node).
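A small, hypothetical OCaml sketch (not OCANNL's actual code or types) of why the rename matters: an accumulation buffer must be initialized with the neutral element of its accumulation operator, and that element is 0 only for addition.

```ocaml
(* Hypothetical sketch, not OCANNL's implementation: "initialize neutral"
   generalizes "zero out" to arbitrary accumulation operators. *)
type accum = Add | Mul | Max | Min

(* The neutral (identity) element of each accumulation operator. *)
let neutral = function
  | Add -> 0.
  | Mul -> 1.
  | Max -> neg_infinity
  | Min -> infinity

let apply op acc x =
  match op with
  | Add -> acc +. x
  | Mul -> acc *. x
  | Max -> Float.max acc x
  | Min -> Float.min acc x

(* Initialize the destination with the neutral element, then accumulate. *)
let reduce op xs = Array.fold_left (apply op) (neutral op) xs

let () =
  Printf.printf "sum = %g, max = %g\n"
    (reduce Add [| 1.; 2.; 3. |])
    (reduce Max [| 1.; 2.; 3. |])
```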
The Cuda backend.
- The `Exec_as_cuda` backend, where the dedicated `Task_id` axis parallelizes over blocks, and a new dedicated `Sample_num` axis parallelizes over threads in a block.
- … the `.cu` code and the assembly `.ptx` code.
- A `Zero_out` low-level-code primitive using `memset`.
- A `Staged_compilation` low-level-code primitive: a (stateful) callback for use by backends.
- … `.npz` (`.npy` archive) files (untested).
- Low-level code based optimizations, including `ToPowOf` with an integer exponent.
- … `axis_index`, simplified the axes-related types.
- Split `'a low_level` into monomorphic `unit_low_level` and `float_low_level`.
- `Node` + `NodeUI` -> `Ndarray` + `Node`.
- … `NaN` or `infinity`.
- Removed the `Task_id` functionality: removes `If_task_id_is` and `Global Task_id`; removes parallelism from `interpret_code`; removes the `task_id_func` vs `unit_func` duplication.
- … `task_id` and `sample_num` bindings.
- … `PrintBox_utils` benchmark tables cells.
- The Gccjit backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
- Three / four synchronization heuristics: …
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- … (`memcpy`).
- `%nn_rs` (a "postprocess results" variant of `%nn_dt`) for computations that should happen at the end of task execution / refresh step. It is meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
- Kept the `Rebalance` constructor around, but it is not playing any role.
- Removed `debug_virtual_nodes`: it was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead (see the sketch at the end of this list).
- Thread-local parameter `task_id` for automated iteration over a dimension `Parallel`.
- … `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
- … `Gccjit`.
- … `OCaml` backend.
- `%nn_dt` and `%nn_op` shape specification: allows identifiers.
- … `Gccjit` execution by printing the comments.
- … `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times, to be executed at arbitrary other times within a session.
- An `Interpreter` backend that can, for example, log all individual tensor modifications.
- A `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
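As referenced in the dynamic-indexing item above, here is a hypothetical illustration with plain nested OCaml arrays (not OCANNL's tensors) of indexing that skips a parallel axis: the leading "parallel" axis is preserved in the slice, and the dynamic index is applied to the next axis to its right.

```ocaml
(* Hypothetical sketch, not OCANNL's implementation. [t] has axes
   [parallel; rows; cols]; indexing with [i] skips the parallel axis:
   it is kept in the result, and [i] selects along the rows axis instead. *)
let index_skipping_parallel (t : float array array array) (i : int) :
    float array array =
  Array.map (fun per_task -> per_task.(i)) t

let () =
  (* 2 parallel slots, 3 rows, 2 cols. *)
  let t =
    Array.init 2 (fun p ->
        Array.init 3 (fun r ->
            Array.init 2 (fun c -> float_of_int ((100 * p) + (10 * r) + c))))
  in
  let slice = index_skipping_parallel t 1 in
  (* The parallel axis (size 2) is preserved; row 1 is selected:
     prints "10 11" and "110 111". *)
  Array.iter
    (fun row ->
      Array.iter (Printf.printf "%g ") row;
      print_newline ())
    slice
```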