- `randomness_lib`: currently only `stdlib` and `for_tests`.
- `cuda_backend.missing.ml`.
- Renamed the `ocannl` package to `neural_nets_lib`, to prevent the opam linter from complaining about a confusing name.
- `let%cd _ =` (and `let%op _ =`?) do not affect root tracking (intended for adding shape constraints).
- `beg_dims` and rightmost axes dims.
- Moved `IDX` and `CDSL` to `Train`, and `Tensor.O` to the more precise `Operation.At`.
- Added `Tensor.mli` to reduce "the user learning surface".
- Added `Shape.mli`.
- Vendored the `npy` package while we wait for a PR.
- Moved `cudajit` to depopts.
- `beg_dims` in constraints.
- `outer_sum`: like einsum but simpler, addition everywhere (see the sketch after this list).
- `Tensor.consume_forward_code` and `consume_backprop_code`, (optionally but by default) used from `Train`.
- Major rewrite. Abandoning the design choices of 0.1 and 0.2.
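A minimal plain-OCaml sketch of what an "addition everywhere" einsum could mean, assuming both the pointwise combination and the reduction use `(+.)`; the function `outer_sum_ij_jk_ik`, the fixed `"ij;jk=>ik"` spec, and the bare float arrays are illustrative and not the OCANNL API:

```ocaml
(* Illustrative only: an einsum-style "ij;jk=>ik" contraction where both the
   pointwise combination and the reduction are addition, in contrast with the
   usual multiply-then-sum einsum. Plain OCaml arrays, not OCANNL tensors. *)
let outer_sum_ij_jk_ik a b =
  let i_dim = Array.length a in
  let j_dim = Array.length b in
  let k_dim = Array.length b.(0) in
  Array.init i_dim (fun i ->
      Array.init k_dim (fun k ->
          let acc = ref 0.0 in
          for j = 0 to j_dim - 1 do
            (* Addition where einsum would multiply, addition for the reduction. *)
            acc := !acc +. (a.(i).(j) +. b.(j).(k))
          done;
          !acc))

let () =
  let a = [| [| 1.; 2. |]; [| 3.; 4. |] |] in
  let b = [| [| 10.; 20. |]; [| 30.; 40. |] |] in
  let c = outer_sum_ij_jk_ik a b in
  Printf.printf "c.(0).(0) = %g\n" c.(0).(0)  (* (1.+10.) +. (2.+30.) = 43. *)
```

For these inputs the `(0,0)` entry is `(1+10) + (2+30) = 43`, whereas a multiply-then-sum einsum would give `1*10 + 2*30 = 70`.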
- The `Train` module as an optimization "frontend".
- Debug logging via `ppx_minidebug` (the `debug_log_from_routines` setting).
- The `arrayjit` library / package containing compilation (former `Ndarray`, `Node`, `Code`).
- Renamed `Formula` -> `Tensor`.
- No more "form vs. non-form" formulas / tensors.
- Removed the `%nn_rs`, `%nn_dt` syntaxes and the `Synthetic` fetch primitive.
- Renamed `%nn_op` to `%op` and `%nn_cd` to `%cd`.
- Moved `gccjit` into a separate repository.
- Moved `cudajit` into a separate repository.
- Renamed `zero_out` to `initialize_neutral`, to prepare for arbitrary accumulation operations.
- Renamed `Node` -> `Lazy_array` -> `Tnode` (tensor node).
- The Cuda backend.
- The `Exec_as_cuda` backend, where the dedicated `Task_id` axis parallelizes over blocks and a new dedicated `Sample_num` axis parallelizes over threads within a block (a sketch of the mapping follows below).
- The generated `.cu` code and the assembly `.ptx` code.
- The `Zero_out` low-level-code primitive, using `memset`.
- The `Staged_compilation` low-level-code primitive: a (stateful) callback for use by backends.
- `.npz` (`.npy` archive) files (untested).
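A rough illustration of the block/thread mapping described above, in plain OCaml rather than generated CUDA. Recovering the flat position as `task_id * num_samples + sample_num` (mirroring `blockIdx.x * blockDim.x + threadIdx.x`) is an assumption of this sketch, not a statement about the backend's exact indexing:

```ocaml
(* Illustrative mapping only: num_tasks "blocks" times num_samples "threads",
   executed sequentially here. *)
let simulate_grid ~num_tasks ~num_samples ~body =
  for task_id = 0 to num_tasks - 1 do        (* Task_id axis: one block each. *)
    for sample_num = 0 to num_samples - 1 do (* Sample_num axis: one thread each. *)
      body ~task_id ~sample_num ~flat:((task_id * num_samples) + sample_num)
    done
  done

let () =
  simulate_grid ~num_tasks:2 ~num_samples:4
    ~body:(fun ~task_id ~sample_num ~flat ->
      Printf.printf "block %d, thread %d -> element %d\n" task_id sample_num flat)
```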
- Low-level code based optimizations:
  - `ToPowOf` with integer exponent.
- `axis_index`; simplified the axes-related types.
- Split `'a low_level` into monomorphic `unit_low_level` and `float_low_level`.
- Reorganized `Node` + `NodeUI` into `Ndarray` + `Node`.
- NaN or infinity.
- Removed the `Task_id` functionality: removes `If_task_id_is` and `Global Task_id`; removes parallelism from `interpret_code`; removes the `task_id_func` vs `unit_func` duplication.
- `task_id` and `sample_num` bindings.
- `PrintBox_utils` benchmark tables cells.
- The `Gccjit` backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention (a rough analogue follows below).
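A rough OCaml analogue of the "on device" scheme for the `Gccjit` backend item above. In the real backend the copies live on the C function's stack; OCaml arrays are heap-allocated, so only the copy-in / compute / copy-out structure is mirrored here, and `scaled_add` and its arrays are invented for illustration:

```ocaml
(* Rough analogue only: the generated function works on local copies of the
   arrays it touches and writes results back at the end, instead of reading
   and writing the shared host arrays directly. *)
let scaled_add ~dst ~src ~alpha =
  let local_src = Array.copy src in                     (* copy-in *)
  let local_dst = Array.copy dst in
  Array.iteri
    (fun i s -> local_dst.(i) <- local_dst.(i) +. (alpha *. s))
    local_src;                                          (* compute on locals *)
  Array.blit local_dst 0 dst 0 (Array.length dst)       (* copy-out *)

let () =
  let dst = [| 1.; 1.; 1. |] and src = [| 1.; 2.; 3. |] in
  scaled_add ~dst ~src ~alpha:0.5;
  Array.iter (Printf.printf "%g ") dst;  (* prints: 1.5 2 2.5 *)
  print_newline ()
```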
- Three / four synchronization heuristics:
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- `memcpy`).
- `%nn_rs` (a "postprocess results" variant of `%nn_dt`) for computations that should happen at the end of a task execution / refresh step. It is meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
- Kept the `Rebalance` constructor around, but it is not playing any role.
- Removed `debug_virtual_nodes`; it was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead (see the sketch after this list).
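A plain-OCaml sketch of the dynamic-indexing rule just described, using an illustrative nested array with axes `[Parallel; 3; 2]` (not OCANNL tensors): the leftmost `Parallel` axis is kept and the next axis to the right is the one indexed.

```ocaml
(* Illustrative only: "dynamic indexing" at position i of an array whose axes
   are [Parallel; 3; 2]. The leftmost Parallel axis is preserved and the
   next-right axis is indexed into, so the slice has axes [Parallel; 2]. *)
let slice_skipping_parallel (t : float array array array) i =
  Array.map (fun per_task -> per_task.(i)) t

let () =
  let t =
    [| [| [| 0.; 1. |]; [| 2.; 3. |]; [| 4.; 5. |] |];      (* task 0 *)
       [| [| 10.; 11. |]; [| 12.; 13. |]; [| 14.; 15. |] |] (* task 1 *) |]
  in
  let s = slice_skipping_parallel t 1 in
  Printf.printf "%g %g\n" s.(0).(0) s.(1).(1)  (* 2 13 *)
```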
- Thread-local parameter `task_id` for automated iteration over a `Parallel` dimension (see the sketch at the end of these notes).
- `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
- `Gccjit`. OCaml backend.
- `%nn_dt` and `%nn_op` shape specification, allows identifiers.
- `Gccjit` execution by printing the comments.
- `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times, to be executed at arbitrary other times within a session.
- An `Interpreter` backend that can, for example, log all individual tensor modifications.
- A `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
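A rough sketch of how a thread-local `task_id` can drive iteration over a `Parallel` dimension. OCaml 5 domains stand in here for whatever parallel executor a backend provides; `run_parallel` and the slice-per-task layout are assumptions for illustration, not the library's API.

```ocaml
(* Rough sketch only: each task is handed a task_id and processes just its own
   slice of the Parallel (leftmost) dimension. *)
let run_parallel ~num_tasks ~(body : task_id:int -> unit) =
  let workers =
    List.init num_tasks (fun task_id -> Domain.spawn (fun () -> body ~task_id))
  in
  List.iter Domain.join workers

let () =
  let data = [| [| 1.; 2. |]; [| 3.; 4. |]; [| 5.; 6. |] |] in
  let sums = Array.make (Array.length data) 0.0 in
  run_parallel ~num_tasks:(Array.length data) ~body:(fun ~task_id ->
      (* Each task touches only data.(task_id) and its own output cell. *)
      sums.(task_id) <- Array.fold_left ( +. ) 0.0 data.(task_id));
  Array.iter (Printf.printf "%g ") sums;  (* 3 7 11 *)
  print_newline ()
```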