[0.6.1] -- 2025-01-12
Added
- Record-based syntax for inline tensor definitions in `%op` and `%cd` expressions
- `uniform1` variants for non-vectorized random number generation (`Uint4x32_to_prec_uniform1`)
- Support for uint32/uint64 precisions and `big_models` flag for indexing arithmetic
- Created docs landing page with automatic publishing action
- Group Relative Policy Optimization (GRPO) documentation in RL slides
- Counter-based randomness with lightweight (2-round) Threefry variant as default
- Claude GitHub Actions for automated code review and PR assistance
- Heterogeneous precision support for primitive operations
- Both zero-initialized and undefined-initialization buffer creation options
- More output options for the `ocannl_read_config` utility
- Added comprehensive RL/REINFORCE tutorial slides with concrete examples
- Added clear explanations of slipshow navigation semantics to CLAUDE.md
- Transformer architecture support with multi-head attention, layer normalization, and positional encodings
- CNN building blocks: conv2d, pooling operations (max/avg), and comprehensive migration guide
- Context API as simplified backend interface replacing stream-based parallelism
- Shape constraint provenance tracking for dramatically improved error messages with origins
- Dimension capture and equality constraints in einsum specifications via `set_dim` and `set_equal`
- New einsum operations: `einmax1` (unary max-reduce) and `tropical` (max-reduce with add) (see the sketch after this list)
- `%oc` anti-quotation syntax for improved OCaml integration in ppx extensions
- Tensor initialization operation with configurable strategies
- `offsets` convenience operation for index generation
- Comprehensive migration guide for PyTorch/TensorFlow users
- Shapes and einsum tutorial slides with slipshow presentation format
- Configurable limit on shape constraint provenance tracking
- Origin tracking in shape error messages
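For intuition about the new reduction-style einsum operations, here is a minimal OCaml sketch of the underlying math on plain float arrays (not the OCANNL einsum-spec syntax): `einmax1` max-reduces where a regular einsum would sum-reduce, and `tropical` combines max-reduction with addition, so a matrix-product-shaped spec computes C(i,k) = max over j of (A(i,j) + B(j,k)).

```ocaml
(* Conceptual sketch only: the semantics of a "tropical" (max-plus) product
   on plain float arrays; OCANNL expresses this via einsum specs instead. *)
let tropical_matmul a b =
  let n = Array.length a and k = Array.length b and m = Array.length b.(0) in
  Array.init n (fun i ->
      Array.init m (fun j ->
          (* max-reduce over the contracted axis, adding instead of multiplying *)
          let acc = ref neg_infinity in
          for l = 0 to k - 1 do
            acc := Float.max !acc (a.(i).(l) +. b.(l).(j))
          done;
          !acc))
```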
Changed
- Major fix to tensor initialization handling with uniform generation across TDSL, NTDSL, PDSL
- `%op` scope delimiting changed from `~config` to unit parameters for cleaner syntax
- Renamed `zero_initialized` to `zero_initialized_by_code` for clarity
- Split Threefry4x32 into crypto (20-round) and light (2-round) variants
- Changed `Operation.range` semantics to match Python's `range` function (see the sketch after this list)
- Improved precision handling in low-level operations with bidirectional inference
- Enhanced record syntax with field shortcuts (`o` -> `output_dims`, `i` -> `input_dims`, `b` -> `batch_dims`)
- More precise and thus more lenient rootness checks in `Tensor.consume_*` functions
- Generalized `guess_output_nodes` to `collect_nodes_guess_output` for more reuse
- Converted documentation slides to use slipshow for better updatability and navigation
- Updated CLAUDE.md with record syntax documentation and testing guidelines
- Major reorganization: moved tensor-related modules to dedicated `tensor/` directory
- Renamed einsum operator `*+` to `+*` for better consistency
- Refactored DSL modules into `Operation.DSL_modules` for cleaner API
- Removed stream-based parallelism in favor of simpler Context API
- Improved `%op` and `%cd` syntax extensions with better function application handling
- Enhanced shape inference with proper dimension staging (no closing at stage 2)
- Migrated documentation to `docs/` directory with pandoc rendering support
- Improved `.cd` file generation with clearer rendering of special operations
- Updated ppx_minidebug integration with log pruning for better performance
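As a reference point for the `Operation.range` change: Python's `range n` enumerates 0 through n-1 (end-exclusive, starting at 0). A minimal OCaml sketch of that semantics on plain lists, not the tensor-level API:

```ocaml
(* Python-style range semantics: 0, 1, ..., n-1 (end-exclusive). *)
let range n = List.init n Fun.id

let () = assert (range 4 = [0; 1; 2; 3])
```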
Fixed
- Unnecessary triggering of dune rules
- Precision handling for `Uint4x32_to_prec_uniform1` in scalar computations
- CUDA, Metal, and C backend fixes for various precision and initialization issues
- Zero-dimensional Bigarray indexing
- Test dependency on the `OCANNL_BACKEND` environment variable
- Build setup for the `ocannl_read_config` utility needed for tests
- Missing package dependencies and assignments in dune configuration
- Critical transformer bugs: mask handling, attention dimension specifications, position encodings
- C backend INFINITY macro usage (was using invalid inf literals)
- Shape inference bugs with dimension closing and constraint generation
- Dropout pseudo-random number splitting
- Layer normalization implementation in `nn_blocks.ml`
- Dimension inference for attention layers with hidden dimensions
- Pooling operations projection inference
- Division simplification for integer precision
- Various syntax extension edge cases and error handling
[0.6.0] -- 2025-08-19
Added
- Support for Brain float aka. bfloat16 aka. BF16, and for FP8.
- Support for convolution via affine indexing expressions in: projections, einsum notation, shape inference.
- MNIST and CIFAR10 datasets (borrowed from Raven).
- Names dataset with bigram use-case helper.
- Half-moons synthetic dataset.
- New precision `Uint4x32` that piggybacks on the `Complex.t` type for the `Bigarray` backing.
- New precision `Int64` for integer operations.
- New operation `Threefry4x32`, which is unusually and hopefully uniquely coarse-grained (requiring nontrivial implementation code for each backend that should conform to a common algorithm).
  - This way we avoid introducing multiple operations on bits.
- Support of counter-based randomness via the `Threefry4x32` operation and random seed tracking (see the sketch after this list).
  - The cascade of splits uses the Tnode id, the train step and the tensor cell position.
- Added a new operation `Uint4x32_to_prec_uniform` that converts the 128-bit random values to floating point uniform distributions efficiently.
- Vector operations support with `Set_from_vec` in low-level IR for efficient vectorized assignments.
- Added a field `params` to `Tensor.t` since we need to track parameters to properly initialize computations (see below).
- `Embed_self_id` operation for positional embeddings.
- Bidirectional precision inference (both top-down and bottom-up).
- Enhanced `%cd` syntax with support for `.forward`, `.backprop`, `.zero_grads` and automatic comment generation.
- Inline tensor declarations in `%cd` syntax for standalone expressions.
- `Train.init_params` for streamlined parameter initialization.
- Better configurability with the `inline_complex_computations` setting.
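To illustrate the counter-based randomness idea (this is not OCANNL's `Threefry4x32` implementation): the random bits are a pure function of a key and a counter, so any tensor cell's value can be recomputed from coordinates such as the tensor-node id, the train step, and the cell position, without carrying mutable RNG state. The sketch below uses a splitmix64-style mixing function purely for illustration; the names `mix64`, `cell_bits`, `to_uniform` and `sample` are hypothetical. The last conversion shows the standard trick, also behind `Uint4x32_to_prec_uniform`, of turning the high bits into a uniform float in [0, 1).

```ocaml
(* Conceptual sketch of counter-based randomness (not OCANNL's Threefry4x32):
   random bits are a pure function of (key, counter), so a cell's value can
   be regenerated from its coordinates instead of from mutable RNG state. *)
let mix64 (x : int64) : int64 =
  let open Int64 in
  let x = mul (logxor x (shift_right_logical x 30)) 0xbf58476d1ce4e5b9L in
  let x = mul (logxor x (shift_right_logical x 27)) 0x94d049bb133111ebL in
  logxor x (shift_right_logical x 31)

(* Derive the bits for one tensor cell from a seed and its "coordinates". *)
let cell_bits ~seed ~node_id ~step ~cell =
  mix64 (Int64.add seed (mix64 (Int64.add node_id (mix64 (Int64.add step cell)))))

(* Standard conversion of the top 53 bits to a uniform float in [0, 1). *)
let to_uniform bits =
  Int64.to_float (Int64.shift_right_logical bits 11) *. (1.0 /. 9007199254740992.0)

let sample ~seed ~node_id ~step ~cell =
  to_uniform (cell_bits ~seed ~node_id ~step ~cell)
```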
Changed
- Removed the ndarray initialization logic. Some of its functionality is now incorporated into `fetch_op`.
- Refactored `init_op` and the badly named `global_identifier` from `ops.ml` into `dedicated_access` in `low_level.ml` and a bigger `fetch_op` in `assignments.ml` (more meaningful file locations).
  - Also renamed the badly named `Get_global` to `Access`.
- Initialization now needs to be handled via running the corresponding code explicitly. In particular, `Tensor.init_params` will run the forward code of tensors from the `params` field.
- Virtual nodes and inlining now also work across routines. This required changing the API to pass the `optimize_ctx` optimization context.
- Made ppx_minidebug logging per-file opt-in at compile time for better control.
- Refactored Tensor API to reduce boilerplate and share parameter signatures.
- Renamed `float_t` to `scalar_t` throughout the codebase for consistency.
- Migrated from heap-local allocation to on-stack allocation by default.
- Improved shape inference with better Total_elems constraint handling and LUB (Least Upper Bound) support.
- Enhanced projections inference with better slot selection heuristics.
- More defensive handling of empty dimensions and zero-dimension scalars.
Fixed
- Memory leak in builtins.c.
- Context handling for constants initialized on devices.
- Zero-initialization that wasn't being performed on Linux (MacOS zero-initializes by default).
- Surjectivity and bijectivity checking in indexing operations.
- CUDA backend regressions and missing constructs.
- Duplicate Shape_rows constraints elimination.
- Precision inference issues with premature forcing.
- Bus error on large datasets.
- Session-level bugs that appeared only in specific backends.
- Identifier generation to not start with digits.
- Host-device synchronization issues with `devices_not_lagging_host` semantics.
- Shape inference corner cases with Total_elems and row constraints.
- Various issues with convolution and strided iteration support.
- Moved away from statically loading builtins.c from routines (kernels); all backends now prepend their builtins textually.
- Emulation of _Float16 (a.k.a. half precision) on systems whose C compilers don't support it.
[0.5.3] -- 2025-05-24
Added
- The Metal framework backend (Apple Silicon).
- Setting `debug_log_to_stream_files` to neatly keep logs from routine execution in their separate files.
- Settings `clean_up_artifacts_on_startup`, `prefer_backend_uniformity`.
- Tools directory and the `minised` tool: regexp-replacement file rewriting.
- Directory `arrayjit/bin` and executable `read_config` for extracting OCANNL configuration into txt files.
Changed
- Removed `initialize` and `is_initialized` from the backend API; instead, backends should be initialized on functor application. The functors now take `config` as argument.
- More descriptive identifier names in C-syntax code in case of name conflicts.
- Changed the backend config name `cc` to `multicore_cc` for consistency.
- Migrated out of `Stdlib.Format` to `PPrint` for all structured formatting.
- Migrated stdout capture to thread-based (actually domain-based); for Windows compatibility, but also much more robust for large logs.
Fixed
- Avoid conflicts with C math function names like `fma`.
- `Satur01_gate` had wrong semantics.
[0.5.2] -- 2025-04-07
Added
- Lots of new primitive ops (see the sketch after this list):
  - Unary: Satur01 | Exp | Log | Exp2 | Log2 | Sin | Cos | Sqrt | Recip | Recip_sqrt | Neg | Tanh_approx | Not
  - Binary: Satur01_gate | Max | Min | Mod | Cmplt | Cmpeq | Cmpne
  - Ternary: Where | FMA (non-accumulating)
- Ternary tensor operations.
  - A differentiable `where` operation.
- More flexible gradient construction via the `%cd` syntax (better projections inference).
- CC backend piggy-backing on OCaml's C compiler (consistent across OSes).
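Scalar semantics of the two new ternary primitives, as a minimal OCaml sketch (the actual ops act elementwise on tensors, and the nonzero-condition convention for `Where` is an assumption here): `Where` selects between two values based on a condition, and the non-accumulating `FMA` computes a * b + c with a single rounding, like the standard fma intrinsic.

```ocaml
(* Elementwise semantics of the ternary primitives, on scalars.
   Assumes the usual convention that a nonzero condition picks the first branch. *)
let where cond if_true if_false =
  if cond <> 0. then if_true else if_false

(* Fused multiply-add: a *. b +. c computed with a single rounding step. *)
let fma a b c = Float.fma a b c
```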
Changed
- Updated to printbox 0.12, with upstreamed graphing.
- `-pthread` -> `-lpthread` in `c_library_flags` in `dune` files.
- Removed Numpy support for easier compatibility on native Windows.
- Unary (primitive) ops and relu are now named, not operator syntax.
- Refactored `%cd` parsing of primitive ops.
- `%cd` and `%op` support both curried and uncurried operator application syntax.
- Updated to ppx_minidebug 2.2.0 with support for cross-run diffing.
Fixed
- Numbers text rendering (consistent across OSes).
- Moved closing row variables to stage 3, because stage 2 may need to process inequalities generating more LUBs.
- Don't unnecessarily prevent bytecode-only build targets.
[0.5.1] -- 2025-01-01
Added
- Automatic transfers to host from the context that most recently updated a node.
- Automatic transfers of routine's inputs from host to routine's context if the host array modification was not yet transferred.
Fixed
- Added `#` as an alternative to `~~` for comment lines in `ocannl_config` files, and fixed a bug in their parsing (see the example below).
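For illustration only, a hypothetical `ocannl_config` snippet showing both comment prefixes; the setting name is one mentioned elsewhere in this changelog, and the key=value layout is an assumption rather than a verbatim excerpt:

```
~~ comment lines could always start with ~~
# since 0.5.1 they can also start with #
debug_log_from_routines=false
```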
[0.5.0] -- 2024-12-18
Added
- Interface files for `Backends` and `Low_level`.
- Fixed #245: tracking of used memory. But there's room for improvement.
- Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.
Changed
- Migrated to cudajit 0.6.1.
- Verifying that code is linked with the right contexts, by tracking `embedded_nodes` with assignments.
- Renaming: (virtual) `device` -> `stream`, `physical_device` -> `device`.
- New files: split out `backend_intf.ml`, `backend_impl.ml`, `schedulers.ml` from `backends.ml`; moved `Tnode.task` to `task.ml`; renamed `backend_utils.ml` to `c_syntax.ml`.
- Removed half-static verification of merge buffer nodes inside `device_to_device`.
- Fixed #286: cross-stream-sharing incorporated into `Tnode.memory_mode`.
- Moved the multicore backend from a `device = stream` model to a single-device model.
- Got rid of `unsafe_cleanup`.
- Renamed `subordinal` to `stream_id`.
- Removed dependency on `core`, broke up dependency on `ppx_jane`.
- Huge refactoring of backend internal interfaces and API (not repeating the same code).
- Built per-tensor-node stream-to-stream synchronization into copying functions.
- Re-introduced whole-device blocking synchronization, which now is just a slight optimization as it also cleans up event book-keeping.
- Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
- Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
Fixed
- Validating merge nodes for the CUDA backend.
- Checking `is_released` on weak array retrieval.
[0.4.1] -- 2024-09-17
Added
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow.
  - We output half-precision-specific code from the CUDA backend.
- Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
- Slides for the Fun OCaml meetup: docs/Fun OCaml.
- New syntax: inline tensor declarations with a literal float as initial value.
Changed
- Removed the `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turns out to be very, very slow.
- Changed the `%cd` block comment syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
- Made `Train.sgd_one` slightly more thrifty: `p =- learning_rate *. sgd_delta` --> `p =- learning_rate * sgd_delta ~logic:"."` without the inline tensor expression.
Fixed
- Log levels related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (doesn't need to be), possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, propagating compilation errors now instead of an infinite loop on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the CUDA backend.
- Always reinitialize global state at the beginning of `let%expect_test`, to make them more deterministic.
[0.4.0] -- 2024-09-04
Added
- A new backend "cc": C based on a configurable C compiler command, defaulting to `cc`.
- Merge buffers representational abstraction (one per virtual device):
  - backends just need to support device-to-device transfers,
  - merging gets implemented in "user space".
- CUDA streaming multiprocessor parallelism via streams <-> virtual devices.
- Support for `cuda-gdb` and `compute-sanitizer` (pass the right arguments to cudajit).
- Inline declarations for (non-differentiable) tensors in the `%cd` syntax.
- A minimal wrapper `Sync_backend` creating CPU backends with a single device only, where all calls are synchronous. (It's a baseline and helps debugging.)
- In progress: proper (condition-variables-based) scheduler. The legacy scheduler (pipes based) kept for now as baseline and to help debugging.
- Documentation for the syntax extensions.
- `%op` syntax: when under a `~config` parameter, refine the inline declared params' labels with `config.label`.
- `%op` syntax: incorporate the input tensor's (if any) label in the resulting tensor's label.
- Comments in config files using the line prefix `~~`.
Changed
- Terminology in the API: Renamed almost all uses of "jit" into uses of "compile" and / or "link".
- Split the compile-to-ptx phase from the build-module and build-kernel-launcher phase.
- Migrated the CUDA backend to ppx_minidebug-based execution tracing.
- Fixes for mixed precision computations.
- Further terminology refactoring: Renamed `Low_level.compile` to `Low_level.lower`;
  - and `Low_level.compiled` to `Low_level.optimized`, making it a record.
- Further refactoring of the `Backends` API:
  - split the `device` type into virtual `device` and `physical_device`,
  - removed the direct support for `merge`, instead relying on merge buffers.
- Updated to cudajit 0.4.
- A template for C-syntax backends, refactoring CC and CUDA backends.
- Improvements to handling of tensor node labels, and to the `Tnode.debug_name` function.
- Output files generated by backends, and files generated by logging, in separate subdirectories.
- C-syntax logging: also output the pre-assignment value when logging an assignment.
- Migrated to ppx_minidebug 2.0 with the benefits it brings: no runtime passing, `Utils.settings.log_level` unified with ppx_minidebug's log levels.
Fixed
- Allow verifying that non-embedded tensor nodes of the tensor(s) associated with a linked code are already in the context passed to `link` (resp. `link_batch`), since they won't get introduced into the context. It is the responsibility of helper functions (such as those in `Train`) to ensure the check.
- Fixed both known and newly discovered shortcomings of the syntax extensions.
  - In particular, `%op` syntax: lift `~config` applications out of (tensor) functions.
- Multiple other tiny fixes.
[0.3.3] -- 2024-04-24
Added
- GitHub workflow for continuous integration and API docs.
- Randomness plug-ins via global config `randomness_lib`: currently only `stdlib` and `for_tests`.
Fixed
- A bit of code rot in the Cuda backend mock `cuda_backend.missing.ml`.
- NPY: Compatibility with OCaml 5.2.0.
- Renamed the main package from `ocannl` to `neural_nets_lib`, to prevent the opam linter from complaining about a confusing name.
[0.3.2] -- 2024-04-22
Added
- `let%cd _ =` (and `let%op _ =`?) do not affect root tracking (intended for adding shape constraints).
- More expressive shape constraints: allowing row variables to be sandwiched between leftmost axes `beg_dims` and rightmost axes `dims`.
- Einsum notation support for leftmost axes.
Changed
- Cleaned up "user-facing" API by moving `IDX` and `CDSL` to `Train`, and `Tensor.O` to more precise `Operation.At`.
- Added interface `Tensor.mli` to reduce "the user learning surface".
- Improved documentation and layout of `Shape.mli`.
- A more reasonable syntax for labels specifications and einsum notation. In particular, whitespace insensitive (except whitespace not allowed inside identifiers).
- Vendored the `npy` package while we wait for a PR.
Fixed
- Moved `cudajit` to `depopts`.
- Slice shape inference is now complete, by using leftmost axes `beg_dims` in constraints.
[0.3.1] -- 2024-04-15
Added
- Tensor parameters saving and restoring, Ndarray saving and restoring.
- An operation `outer_sum`: like `einsum` but simpler, addition everywhere (see the sketch after this list).
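A minimal OCaml sketch of what "addition everywhere" means for an outer-sum-style operation on plain arrays (vectors in, matrix out); this illustrates the math only, not the OCANNL einsum-spec API:

```ocaml
(* Outer sum on plain float arrays: c.(i).(j) = a.(i) +. b.(j).
   Where einsum would multiply and sum, outer_sum adds everywhere. *)
let outer_sum a b =
  Array.map (fun ai -> Array.map (fun bj -> ai +. bj) b) a

let () =
  let c = outer_sum [| 1.; 2. |] [| 10.; 20.; 30. |] in
  assert (c.(1).(2) = 32.)
```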
Changed
- Tweaks to make the project usable as a package (external library).
- Sanitizing code inclusion via code roots management: `Tensor.consume_forward_code` and `consume_backprop_code`, (optionally but by default) used from `Train`.
Fixed
- Shape inference in the presence of non-0 fixed indexing inside einsums was broken (because it was actually not implemented).
- Incompleteness of shape inference for slicing was leading to inferring shapes with no axes: constraint generation was intended to raise a shape error instead. Proper fix coming in 0.3.2 will make slice shape inference complete.
[0.3.0] -- 2024-03-31
Major rewrite. Abandoning the design choices of 0.1 and 0.2.
Added
- Optionally, inferring or checking tensor (batch) sizes from data (e.g. file) sizes.
- Static indexing. A "slice" operator to select individual batches.
- Established the backends API with first-class modules.
- The `Train` module as an optimization "frontend".
- Parallel optimization across devices.
- Global settings configurable via config files, environment variables, and commandline flags.
- Integration of backend logging with `ppx_minidebug` (the `debug_log_from_routines` setting).
Changed
- The Cuda backend is not supported for now. It is (optionally) buildable to reduce code rot.
- Dynamic indexing is not supported anymore (to reduce complexity). It might be reintroduced if needed.
- Factored out the `arrayjit` library / package containing compilation (former Ndarray, Node, Code).
- Renamed `Formula` -> `Tensor`.
- No more "form vs. non-form" formulas / tensors.
  - Formula/tensor roots are split into forward roots and backprop roots.
- No more `%nn_rs`, `%nn_dt` syntaxes and `Synthetic` fetch primitive.
- Renamed `%nn_op` to `%op` and `%nn_cd` to `%cd`.
- Migrated `gccjit` into a separate repository.
- Migrated `cudajit` into a separate repository.
- Massive rewrite of shape inference in a declarative style.
- Generalize `zero_out` to `initialize_neutral` to prepare arbitrary accumulation operation (see the sketch after this list).
- Renamed `Node` -> `Lazy_array` -> `Tnode` (tensor node).
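The point of generalizing `zero_out` to `initialize_neutral` is that the value an accumulator should start from depends on the accumulation operation. A hedged OCaml sketch with a hypothetical operation type (not OCANNL's actual constructors):

```ocaml
(* Hypothetical accumulation ops, only to illustrate "neutral element":
   starting an accumulator at the operation's identity generalizes zeroing out. *)
type accum = Add | Mul | Max | Min

let neutral = function
  | Add -> 0.           (* x +. 0. = x *)
  | Mul -> 1.           (* x *. 1. = x *)
  | Max -> neg_infinity (* max x neg_infinity = x *)
  | Min -> infinity     (* min x infinity = x *)
```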
[0.2.1] -- 2023-07-19
Added
- The Cuda backend:
  - The Cudajit interface based on Nvrtc and the Cuda driver API.
  - A naive `Exec_as_cuda` backend where the dedicated `Task_id` axis parallelizes over blocks, and a new dedicated `Sample_num` axis parallelizes over threads in a block.
  - When outputting debug files, stores the source `.cu` code and the assembly `.ptx` code.
  - Supports thread-only tensors, tensors with thread-local "replicated" working copies, constant tensors, and globally updated tensors.
  - The backend uses atomic adds for shared updates, and within-block synchronization to minimize update races and parameter staleness.
  - Debugging: full trace (for thread 0) by logging assignments with the assigned value and indices for the LHS tensor and the RHS tensors, the expression used to compute the assigned value, and values of subexpressions.
  - Cuda FFI for retrieving GPU specs and for getting and setting limits.
- `Zero_out` low-level-code primitive using `memset`.
- `Staged_compilation` low-level-code primitive: a (stateful) callback for use by backends.
- When outputting debug files, also stores the high-level code.
- Saving and restoring tensor content to `.npz` (`.npy` archive) files (untested).
- Low-level code based optimizations (see the sketch after this list):
  - unrolls `ToPowOf` with integer exponent,
  - simplifies local computations that are just expressions,
  - some arithmetic simplifications.
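A minimal OCaml sketch of the power-unrolling idea, illustrative only (not the low-level IR): the optimizer rewrites a power with a small integer exponent into repeated multiplication at code-generation time; the runtime-equivalent computation is:

```ocaml
(* Unrolling an integer power into repeated multiplication,
   e.g. pow_unrolled x 3 computes x *. x *. x without calling ( ** ). *)
let pow_unrolled (x : float) (n : int) : float =
  let rec go acc i = if i = 0 then acc else go (acc *. x) (i - 1) in
  go 1. n

let () = assert (pow_unrolled 2. 3 = 8.)
```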
Changed
- Monomorphic `axis_index`, simplified the axes-related types.
- Splits `'a low_level` into monomorphic `unit_low_level` and `float_low_level`.
- Removes integer bigarray types.
- Refactors `Node` + `NodeUI` into `Ndarray` + `Node`.
- Tensor printouts include whether a tensor contains `NaN` or `infinity`.
- Simplifies the `Task_id` functionality: removes `If_task_id_is` and `Global Task_id`; removes parallelism from `interpret_code`; removes `task_id_func` vs `unit_func` duplication.
Fixed
- "Non-diff" code inclusion.
- Ensures unique indices/symbols also for the `task_id` and `sample_num` bindings.
- Removes endlines from `PrintBox_utils` benchmark table cells.
[0.2.0] -- 2023-06-03
Added
- The Gccjit backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
- Three / four synchronization heuristics:
  - "parallel": a slice of the tensor is copied host-to-device at the beginning and device-to-host at the end, without interference because each task has a different slice.
  - "update on host": the tensor is copied host-to-device at the beginning; each write is an update that reads the old value from the host and updates it on the host. Thus each write is a synchronization point.
  - "replicated": the tensor is copied host-to-device at the beginning; only task 0 copies device-to-host.
  - "device-only": no copying to/from host.
- On-device-only tensors that are not materialized on the OCaml side.
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
  - Currently, for tensors processed in parallel, we only support processing of a contiguous tensor slice (copied "to device" using `memcpy`).
- A new syntax `%nn_rs` ("postprocess results" variant of `%nn_dt`) for computations that should happen at the end of task execution / refresh step. It's meant to prepare the data to be copied back to the host.
Changed
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
  - Keeping the `Rebalance` constructor around, but it is not playing any role.
- Got rid of `debug_virtual_nodes`, which was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead.
  - Removed the "indexing axes from-right" functionality for now (fails as not implemented).
  - Dynamic indexing now can produce virtual nodes.
Fixed
- Dynamic indexing fixes.
[0.1.2] -- 2023-05-12
Added
- Thread-local parameter `task_id` for automated iteration over a dimension `Parallel`.
  - This implements multicore SGD.
  - Rebalancing of computations that don't use `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
  - Already provides significant speedups in the interpreter (6-7x for me), but that's a moot point.
  - Giving up further work on this approach for now, because the bottleneck is the memory access with `Gccjit`.
  - Keeping the new representation capability around, maybe it will be a stepping stone to other things.
- Monolithic step update with "macrobatch" (multiple steps within one backend call).
Changed
- Streamlined the source code, e.g. removed the `OCaml` backend.
- Better syntax for `%nn_dt` and `%nn_op` shape specification, allows identifiers.
- Improved virtual node and scalar constant inlining.
- Better debugging, e.g. an option to "trace" `Gccjit` execution by printing the comments.
[0.1.1] -- 2023-05-06
Added
- An inline constants optimization that compile-time computes scalar constant subexpressions and inlines the values.
Changed
- Improved debuggability.
Fixed
- A last-minute breaking bug (would be nice to have a pre-release or a pre-publish hook to run tests!).
- The virtual nodes optimization is more robust, correct even with aggressive inlining settings (e.g. escaping variables check).
[0.1.0] -- 2023-05-04
Added
- The first changes-tracking release. Earlier development history is still somewhat documented via closed issues.
- Supports single and double precision floats, more precisions in the future.
- Generates a monolithic step update routine executed by `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times to be executed at arbitrary other times within a session.
- An `Interpreter` backend that can for example log all individual tensor modifications.
- A `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
- A virtual nodes (tensors) optimization that inlines computation of a cell in lieu of tensor accesses; can sometimes reduce memory consumption by 1/3.