- Implemented the previously-mocked support for half precision (FP16).
- `Ctypes.bigarray_start`.
- `Tnode.update_prec`.
- `nn_blocks.ml` hinting at an intended design pattern for model components.
- `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turns out to be very, very slow.
- `%cd` block comment syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
- `Train.sgd_one` slightly more thrifty: `p =- learning_rate *. sgd_delta` --> `p =- learning_rate * sgd_delta ~logic:"."` without the inline tensor expression.
- Log levels related de-confusion:
  - `log_level` and inform about its setting.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- `Multicore_backend`: `await` was not checking queue emptiness, the worker's `Condition.broadcast` was non-atomically guarded (it doesn't need to be), possible dead loop due to the lockfree queue -- now replaced with `saturn_lockfree` (see the sketch after this list).
- `c_compile_and_load`, propagating compilation errors now instead of an infinite loop on error.
- `C_syntax` backend builder.
- `let%expect_test`, to make them more deterministic.
- `cc`.
- Merge buffers representational abstraction (one per virtual device):
- `cuda-gdb` and `compute-sanitizer` (pass the right arguments to `cudajit`).
- `%cd` syntax.
- `Sync_backend` creating CPU backends with a single device only, where all calls are synchronous. (It's a baseline and helps debugging.)
- `%op` syntax: when under a `~config` parameter, refine the inline declared params' labels with `config.label`.
- `%op` syntax: incorporate the input tensor's (if any) label in the resulting tensor's label.
- `~~`.
- Further terminology refactoring: renamed `Low_level.compile` to `Low_level.lower`; `Low_level.compiled` to `Low_level.optimized`, making it a record.
- Further refactoring of the `Backends` API:
  - `device` type into virtual `device` and `physical_device`,
  - `merge`, instead relying on merge buffers.
- `Tnode.debug_name` function.
- `Utils.settings.log_level` unified with `ppx_minidebug`'s log levels.
- `link` (resp. `link_batch`), since they won't get introduced into the context. It is the responsibility of helper functions (such as those in `Train`) to ensure the check.
- `%op` syntax: lift `~config` applications out of (tensor) functions.
- `randomness_lib`: currently only `stdlib` and `for_tests`.
- `cuda_backend.missing.ml`.
- `ocannl` to `neural_nets_lib`, to prevent the opam linter from complaining about a confusing name.
- `let%cd _ =` (and `let%op _ =`?) do not affect root tracking (intended for adding shape constraints).
- `beg_dims` and rightmost axes dims.
- `IDX` and `CDSL` to `Train`, and `Tensor.O` to the more precise `Operation.At`.
- `Tensor.mli` to reduce "the user learning surface".
- `Shape.mli`.
- `npy` package while we wait for a PR.
- `cudajit` to depopts.
- `beg_dims` in constraints.
- `outer_sum`: like einsum but simpler, addition everywhere.
- `Tensor.consume_forward_code` and `consume_backprop_code`, (optionally but by default) used from `Train`.
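As context for the `Multicore_backend` item above: an `await` that waits on a condition variable has to re-check the work queue's emptiness in a loop while holding the mutex, otherwise a wakeup that races with `schedule` (or a spurious wakeup) can leave it blocked or return too early. The following is a minimal, stdlib-only sketch of that pattern; the type and function names are illustrative, not the actual backend code.

```ocaml
(* Illustrative only: a stdlib-based device queue showing why [await] must
   re-check emptiness under the mutex instead of waiting unconditionally. *)
type 'task device = {
  tasks : 'task Queue.t;    (* pending work for the worker thread *)
  mutex : Mutex.t;
  work_done : Condition.t;  (* broadcast by the worker when it runs out of work *)
}

let schedule dev task =
  Mutex.lock dev.mutex;
  Queue.push task dev.tasks;
  Mutex.unlock dev.mutex

(* Blocks until the worker has emptied the queue. The [while] loop is the fix:
   the emptiness check and the wait happen atomically with respect to the
   queue updates, which all take place under the mutex. *)
let await dev =
  Mutex.lock dev.mutex;
  while not (Queue.is_empty dev.tasks) do
    Condition.wait dev.work_done dev.mutex
  done;
  Mutex.unlock dev.mutex

(* Worker side: pop one task under the mutex, run it outside the mutex. *)
let worker_step dev ~run =
  Mutex.lock dev.mutex;
  let task =
    if Queue.is_empty dev.tasks then None else Some (Queue.pop dev.tasks)
  in
  Mutex.unlock dev.mutex;
  match task with
  | Some t -> run t
  | None ->
    (* The broadcast itself does not need to be guarded by the mutex: the
       queue only changes under the mutex, so [await]'s re-check cannot miss
       the state it is waiting for. *)
    Condition.broadcast dev.work_done
```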
Major rewrite. Abandoning the design choices of 0.1 and 0.2.

- `Train` module as an optimization "frontend".
- `ppx_minidebug` (the `debug_log_from_routines` setting).
- `arrayjit` library / package containing compilation (former `Ndarray`, `Node`, `Code`).
- `Formula` -> `Tensor`.
- No more "form vs. non-form" formulas / tensors.
- `%nn_rs`, `%nn_dt` syntaxes and `Synthetic` fetch primitive.
- `%nn_op` to `%op` and `%nn_cd` to `%cd`.
- `gccjit` into a separate repository.
- `cudajit` into a separate repository.
- `zero_out` to `initialize_neutral`, to prepare for arbitrary accumulation operations (see the sketch after this list).
- `Node` -> `Lazy_array` -> `Tnode` (tensor node).
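A hedged illustration of the reasoning behind the `zero_out` -> `initialize_neutral` rename: once an array can be accumulated into with operations other than addition, the correct "empty" starting value depends on the operation. The variant and helpers below are hypothetical, written only to show the idea; they are not `arrayjit`'s actual types.

```ocaml
(* Hypothetical accumulation operations and their neutral (identity) elements.
   Zeroing is only correct for [Add]; [Mul] needs 1.0 and [Max] needs -inf. *)
type accum = Add | Mul | Max | Min

let neutral_elem = function
  | Add -> 0.0
  | Mul -> 1.0
  | Max -> Float.neg_infinity
  | Min -> Float.infinity

(* "initialize_neutral": fill the target with the identity of the upcoming
   accumulation, so folding values into it starts from a no-op state. *)
let initialize_neutral accum (arr : float array) =
  Array.fill arr 0 (Array.length arr) (neutral_elem accum)

let accumulate accum (arr : float array) i v =
  let f =
    match accum with
    | Add -> ( +. ) | Mul -> ( *. ) | Max -> Float.max | Min -> Float.min
  in
  arr.(i) <- f arr.(i) v
```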
The Cuda backend.

- `Exec_as_cuda` backend where the dedicated `Task_id` axis parallelizes over blocks, and a new dedicated `Sample_num` axis parallelizes over threads in a block.
- `.cu` code and the assembly `.ptx` code.
- `Zero_out` low-level-code primitive using `memset`.
- `Staged_compilation` low-level-code primitive: a (stateful) callback for use by backends.
- `.npz` (`.npy` archive) files (untested).
- Low-level code based optimizations:
  - `ToPowOf` with integer exponent (see the sketch after this list),
- `axis_index`, simplified the axes-related types.
- `'a low_level` into monomorphic `unit_low_level` and `float_low_level`.
- `Node` + `NodeUI` into `Ndarray` + `Node`.
- `NaN` or infinity.
- `Task_id` functionality: removes `If_task_id_is` and `Global Task_id`; removes parallelism from `interpret_code`; removes `task_id_func` vs `unit_func` duplication.
- `task_id` and `sample_num` bindings.
- `PrintBox_utils` benchmark tables cells.
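For context on the `ToPowOf` item above: a common low-level optimization is to replace a power with a statically known integer exponent by repeated multiplication, avoiding a `pow` call. Below is a generic sketch of that rewrite on a toy expression type, not the actual `Low_level` representation.

```ocaml
(* Toy scalar expression language; [Pow] plays the role of a ToPowOf node. *)
type expr =
  | Const of float
  | Var of string
  | Mul of expr * expr
  | Pow of expr * float

(* If the exponent is a small non-negative integer, expand x ** n into
   repeated multiplication; otherwise keep the Pow node unchanged. *)
let rec optimize = function
  | (Const _ | Var _) as e -> e
  | Mul (a, b) -> Mul (optimize a, optimize b)
  | Pow (base, p) ->
    let base = optimize base in
    let n = int_of_float p in
    if Float.of_int n = p && n >= 0 && n <= 8 then
      if n = 0 then Const 1.0
      else
        (* Left-nested product: base * base * ... * base, n factors. *)
        let rec build acc k = if k = 0 then acc else build (Mul (acc, base)) (k - 1) in
        build base (n - 1)
    else Pow (base, p)
```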
The `Gccjit` backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.

- Three / four synchronization heuristics:
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- (`memcpy`).
- `%nn_rs` ("postprocess results" variant of `%nn_dt`) for computations that should happen at the end of task execution / refresh step. It's meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
- `Rebalance` constructor around, but it is not playing any role.
- `debug_virtual_nodes`, was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead (sketched below).
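A small, hypothetical sketch of the shape rule stated in the last item: when computing the dimensions of a dynamically indexed slice, a leading `Parallel` axis is kept and the index consumes the next axis to its right. The axis representation here is made up purely for illustration.

```ocaml
(* Hypothetical axis kinds; only the skipping rule matters for this sketch. *)
type axis = Parallel of int | Dim of int

(* Dynamic indexing removes the first non-parallel axis; parallel axes on the
   left are preserved in the resulting slice. *)
let slice_dims (dims : axis list) : axis list =
  let rec drop_first_dim = function
    | [] -> []
    | Parallel p :: rest -> Parallel p :: drop_first_dim rest
    | Dim _ :: rest -> rest  (* this is the axis the dynamic index consumes *)
  in
  drop_first_dim dims

(* Example: indexing a [Parallel 4; Dim 10; Dim 3] tensor yields a
   [Parallel 4; Dim 3] slice -- the Parallel axis is skipped, Dim 10 is indexed. *)
let _example = slice_dims [ Parallel 4; Dim 10; Dim 3 ]
```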
- Thread-local parameter `task_id` for automated iteration over a dimension `Parallel` (see the sketch after this list).
- `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
- `Gccjit`.
- `OCaml` backend.
- `%nn_dt` and `%nn_op` shape specification, allows identifiers.
- `Gccjit` execution by printing the comments.
- `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times to be executed at arbitrary other times within a session.
- `Interpreter` backend that can for example log all individual tensor modifications.
- `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
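To ground the `task_id` item at the top of this list: one straightforward way to iterate a `Parallel` dimension is to spawn one thread per slot and hand each thread its own `task_id`. The following is a generic OCaml `Thread`-based sketch under that assumption; it is not the project's session machinery, and `iter_parallel` is a made-up helper name.

```ocaml
(* Generic sketch: run [body] once per slot of a parallel dimension of size
   [parallel_dim], passing each thread its own task_id. Requires the threads
   library. *)
let iter_parallel ~parallel_dim body =
  let workers =
    List.init parallel_dim (fun task_id -> Thread.create (fun () -> body ~task_id) ())
  in
  List.iter Thread.join workers

(* Usage: each task processes the slice of the data selected by its task_id. *)
let () =
  iter_parallel ~parallel_dim:4 (fun ~task_id ->
      Printf.printf "task %d processing its slice\n%!" task_id)
```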