OCaml Multicore - October 2020
Welcome to the October 2020 Multicore OCaml report, compiled by @shakthimaan, @kayceesrk and of course myself. The [previous monthly](https://discuss.ocaml.org/tag/multicore-monthly) updates are also available for your perusal.
OCaml 4.12.0-dev: The upstream OCaml tree has been branched for the 4.12 release, and the OCaml readiness team is busy stabilising it with the ecosystem. The 4.12.0 development stream has seen significant progress towards multicore support, especially in the runtime handling of naked pointers. The release will ship with a dynamic checker for naked pointers that you can use to verify that your own codebase is clean of them, as this will be a prerequisite for OCaml 5.0 and multicore compatibility. The checker is activated via the `--enable-naked-pointers-checker` configure option.
Convergence with upstream and multicore trees: The multicore OCaml trees have seen significant robustness improvements as we've converged our trees with upstream OCaml (possible now that the upstream architectural changes are synched with the requirements of multicore). In particular, the handling of global C roots is much better in multicore now as it uses the upstream OCaml scheme, and the GC colour scheme also exactly matches upstream OCaml's. This means that community libraries from opam work increasingly well when built with multicore OCaml (using the `no-effects-syntax` branch).
Features: Multicore OCaml now uses domain local allocation buffers to simplify its internals. We are also working on benchmarking the IO subsystem: support for CPU parallelism has been added for the Lwt concurrency library, and the Asynchronous Effect-based IO (aeio) library has been refreshed to work with Multicore OCaml, Lwt, and httpaf in an http-effects library.
Benchmarking: The Sandmark benchmarking test suite has additional configuration options, and there are new proposals in that project to leverage the OCaml tools and ecosystem as much as possible.
As with previous updates, the ongoing and completed Multicore OCaml tasks are listed first, followed by improvements to the Sandmark benchmarking test suite. Finally, the upstream OCaml related work is mentioned for your reference.
Multicore OCaml
Ongoing
- ocaml-multicore/ocaml-multicore#422 Simplify minor heaps configuration logic and masking
  The PR is a step towards using domain local allocation buffers. A `Minor_heap_max` size is used to reserve the minor heaps area, and `Is_young` relies on a boundary check. `Minor_heap_max` can be overridden using the OCAMLRUNPARAM environment variable.
- ocaml-multicore/ocaml-multicore#426 Replace global roots implementation
  An effort to replace the existing global roots implementation so that it is in line with OCaml's `globroots`. The objective is also to have a per-domain skip list, and a global orphans list for when a domain is terminated.
- ocaml-multicore/ocaml-multicore#427 Garbage Collector colours change backport
  The Garbage Collector colour scheme changes in the major collector have now been backported to Multicore OCaml. `mark_entry` does not include `end`, `mark_stack_push` more closely resembles trunk, and `caml_shrink_mark_stack` has been adapted from trunk.
- ocaml-multicore/ocaml-multicore#429 Fix a STW interrupt race
  This PR fixes a STW interrupt race in `caml_try_run_on_all_domains_with_spin_work`. The race arose because the `enter_spin_callback` and `enter_spin_data` fields of `stw_request` are initialized after we interrupt other domains.
Completed
Systhreads support
- ocaml-multicore/ocaml-multicore#381 Reimplementing Systhreads with pthreads (Domain execution contexts)
  The re-implementation of Systhreads with pthreads has been completed for Multicore OCaml. It introduces the Domain Execution Context (DEC), which allows multiple threads to run atop a domain.
- ocaml-multicore/ocaml-multicore#410 systhreads: `caml_c_thread_register` and `caml_c_thread_unregister`
  The `caml_c_thread_register` and `caml_c_thread_unregister` functions have been reimported to systhreads. In Multicore OCaml, threads created by C code will be registered in the thread chain of domain 0.
Domain Local Storage
- ocaml-multicore/ocaml-multicore#404 Domain.DLS.new_key takes an initialiser
  `Domain.DLS.new_key` now accepts an initialiser argument that assigns an associated value to a key if it has not been initialised already. Also, `Domain.DLS.get` no longer returns an option value.
- ocaml-multicore/ocaml-multicore#405 Rework Domain.DLS.get search function such that it no longer allocates
  `Domain.DLS.get` has been updated so that it does not allocate memory if the key already exists in the domain local storage. The PR also changes the `search` function to accept all of its inputs as arguments, instead of capturing them in a closure from the environment.
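The initialiser-based API from #404 can be illustrated with a minimal sketch. It assumes the `Domain.DLS` signatures as they later shipped in the OCaml 5 standard library (`new_key` taking a `unit -> 'a` initialiser, `get` returning the value directly); the October 2020 multicore tree may differ in detail. The names `counter_key` and `bump` are hypothetical and exist only for this example.

```ocaml
(* Sketch only: assumes the Domain.DLS API of the OCaml 5 standard library. *)

(* Hypothetical key: every domain lazily gets its own counter, created by
   the initialiser the first time the key is read in that domain. *)
let counter_key : int ref Domain.DLS.key =
  Domain.DLS.new_key (fun () -> ref 0)

(* Increment the calling domain's counter and return the new value.
   Domain.DLS.get returns the value itself, not an option. *)
let bump () =
  let r = Domain.DLS.get counter_key in
  incr r;
  !r

let () =
  ignore (bump ());
  ignore (bump ());
  (* A freshly spawned domain starts from the initialiser's value. *)
  let in_child = Domain.join (Domain.spawn (fun () -> bump ())) in
  let in_parent = bump () in
  Printf.printf "child=%d parent=%d\n" in_child in_parent
```

Because each domain holds its own value, the count in the spawned domain is independent of the count in the parent, and no option matching is needed at the call site.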
Lwt
- ocaml-multicore/multicore-opam#33 Add lwt.5.3.0+multicore
  The Lwt.5.3.0 concurrency library has been added to support CPU parallelism with Multicore OCaml. A blog post introducing its installation and usage has been written by Sudha Parimala.
- The Asynchronous Effect-based IO builds with a recent Lwt, and the HTTP effects demo has been updated to work with Multicore OCaml, Lwt, and httpaf. The demo source code is available at the http-effects repo.
Sundries
- ocaml-multicore/ocaml-multicore#406 Remove ephemeron usage of RPC
  The inter-domain RPC mechanism is not required with the stop-the-world minor GC, and has therefore been removed from the ephemeron implementation. The PR also cleans up and simplifies the ephemeron data structure and code.
- ocaml-multicore/ocaml-multicore#411 Fix typo for presume and presume_arg in `internal_variable_names`
  A minor typo fix that renames `Presume` and `Presume_arg` in `internal_variable_names.ml`.
- ocaml-multicore/ocaml-multicore#414 Fix up `Ppoll` `semantics_of_primitives` entry
  The `semantics_of_primitives` entry for `Ppoll` has been fixed; the previous entry was causing flambda builds to remove poll points.
- ocaml-multicore/ocaml-multicore#416 Fix callback effect bug
  The PR fixes a bug in how C-to-OCaml callbacks prevent effects from crossing a C callback boundary. The stack parent is cleared before a callback, and restored afterwards. The PR also makes the stack parent a local root, so that the GC can see it inside the callback.
Benchmarking
Ongoing
Configuration
- ocaml-bench/ocaml-bench-scripts#12 Add support for parallel multibench targets and JSON input
  The `RUN_CONFIG_JSON` and `BUILD_BENCH_TARGET` variables have been added and are passed at run time for the execution of parallel benchmarks. Default values are specified so that the serial benchmarks can still run without setting them explicitly.
- ocaml-bench/sandmark#180 Notebook Refactoring and User changes
  A refactoring effort is underway to make the parallel benchmark interactive. The user accounts on The Littlest JupyterHub installation have direct access to the benchmark results produced from `ocaml-bench-scripts` on the system.
- ocaml-bench/sandmark#189 Add environment support for wrapper in JSON configuration file
  OCAMLRUNPARAM is now passed as an environment variable to the benchmarks at run time, so that different parameter values can be used to obtain multiple results for comparison. The use case and the discussion are available in the "Running benchmarks with varying OCAMLRUNPARAM" issue. The environment variables can be specified in the `run_config.json` file, as shown below:

      {
        "name": "orun_2M",
        "environment": "OCAMLRUNPARAM='s=2M'",
        "command": "orun -o %{output} -- taskset --cpu-list 5 %{command}"
      }
Proposals
- ocaml-bench/sandmark#159 Implement a better way to describe tasklet cpulist
  The discussion on a better way to obtain the taskset list of cores for a benchmark run is still in progress. This is required to be able to specify hyper-threaded cores, NUMA zones, and the specific cores to use for the parallel benchmarks.
- ocaml-bench/sandmark#179 [RFC] Classifying benchmarks based on running time
  A proposal to categorize the benchmarks based on their running time has been provided. The following classification types have been suggested:
  - `lt_1s`: Benchmarks that run for less than 1 second.
  - `lt_10s`: Benchmarks that run for at least 1 second, but less than 10 seconds.
  - `10s_100s`: Benchmarks that run for at least 10 seconds, but less than 100 seconds.
  - `gt_100s`: Benchmarks that run for at least 100 seconds.

  The corresponding PR is "Classification of benchmarks".
- We are exploring the use of an `opam-compiler` switch environment to build the Sandmark benchmark test suite. The merge of systhreads compatibility support now enables us to install dune natively inside the switch environment, along with the other benchmarks. With this approach, we hope to modularize our benchmarking test suite, and converge on fully using the OCaml tools and ecosystem.
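The running-time classes proposed in sandmark#179 can be read as half-open intervals on the measured wall-clock time. The following is a minimal sketch of that reading, not code from Sandmark itself; the type `run_class` and the functions `classify` and `tag` are hypothetical names introduced for illustration, and only the four tag strings come from the proposal.

```ocaml
(* Hypothetical illustration of the classification proposed in sandmark#179.
   Only the tag strings are taken from the proposal. *)
type run_class =
  | Lt_1s            (* less than 1 second *)
  | Lt_10s           (* at least 1 second, less than 10 seconds *)
  | From_10s_to_100s (* at least 10 seconds, less than 100 seconds *)
  | Gt_100s          (* at least 100 seconds *)

(* Map a running time in seconds to its class. *)
let classify (seconds : float) : run_class =
  if seconds < 1.0 then Lt_1s
  else if seconds < 10.0 then Lt_10s
  else if seconds < 100.0 then From_10s_to_100s
  else Gt_100s

(* Render a class using the tag names from the proposal. *)
let tag = function
  | Lt_1s -> "lt_1s"
  | Lt_10s -> "lt_10s"
  | From_10s_to_100s -> "10s_100s"
  | Gt_100s -> "gt_100s"

let () =
  List.iter
    (fun t -> Printf.printf "%6.1fs -> %s\n" t (tag (classify t)))
    [ 0.4; 1.0; 42.0; 100.0 ]
```

Note that the boundary values fall into the slower class, matching the "at least" wording of the proposal: a benchmark that takes exactly 1 second is `lt_10s`, and one that takes exactly 100 seconds is `gt_100s`.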
Sundries
- ocaml-bench/sandmark#181 Lock-free map bench
  An implementation of a lock-free, concurrent hash-array mapped trie, based on Prokopec, A. et al. (2011). This benchmark of the cache-aware implementation is currently under review.
- ocaml-bench/sandmark#183 Use crout_decomposition name for numerical analysis benchmark
  A couple of LU decomposition benchmarks exist in the Sandmark repository, and this PR renames the `numerical-analysis/lu_decomposition.ml` benchmark to `crout_decomposition.ml`. This addresses the issue "Rename lu_decomposition benchmark in numerical-analysis" and avoids any naming confusion between the two benchmarks, as their implementations are different.
Completed
- ocaml-bench/sandmark#177 Display raw baseline numbers in normalized graphs
  The raw baseline numbers are now included in the normalized graphs in the sequential notebook output. The graph for `maxrsskb` is one example.
- ocaml-bench/sandmark#178 Change to new Domain.DLS API with Initializer
  The `multicore-minilight` and `multicore-numerical` benchmarks have now been updated to use the new Domain.DLS API with initializer.
- ocaml-bench/sandmark#185 Clean up existing effect benchmarks
  The PR ensures that the code compiles without any warnings, and adds a `multicore_effects_run_config.json` configuration file and a `run_all_effect.sh` script to run these benchmarks.
- ocaml-bench/sandmark#186 Very simple effect microbenchmarks to cover code paths
  A set of four microbenchmarks to test the throughput of our effects system has now been added to the Sandmark test suite. These are `effect_throughput_clone`, `effect_throughput_val`, `effect_throughput_perform`, and `effect_throughput_perform_drop`.
- ocaml-bench/sandmark#187 Implementation of 'recursion' benchmarks for effects
  A collection of recursion benchmarks to measure the overhead of effects is now included in Sandmark. It is inspired by the [Manticore benchmarks](https://github.com/ManticoreProject/benchmark/).
OCaml
Ongoing
- ocaml/ocaml#9876 Do not cache young_limit in a processor register
  The PR removes the caching of `young_limit` in a register for the ARM64, PowerPC and RISC-V ports. The Sandmark benchmarks are presently being tested on the respective hardware.
- ocaml/ocaml#9934 Prefetching optimisations for sweeping
  Sandmark benchmark runs were used to analyse a couple of patches that optimise `sweep_slice` and make use of prefetching. The objective is to reduce cache misses during GC.
Completed
- ocaml/ocaml#9947 Add a naked pointers dynamic checker
  The check for "naked pointers" (dangerous out-of-heap pointers) is now done at run time. The PR also adds tests for the three modes: naked pointers; naked pointers with the dynamic checker; and no naked pointers.
- ocaml/ocaml#9951 Ensure that the mark stack push optimisation handles naked pointers
  The PR adds a precise check on whether to push an object onto the mark stack, to handle naked pointers.
We would like to thank all the OCaml users and developers in the community for their continued support, reviews and contribution to the project.
Acronyms
- AEIO: Asynchronous Effect-based IO
- API: Application Programming Interface
- ARM: Advanced RISC Machine
- CPU: Central Processing Unit
- DEC: Domain Execution Context
- DLS: Domain Local Storage
- GC: Garbage Collector
- HTTP: Hypertext Transfer Protocol
- JSON: JavaScript Object Notation
- NUMA: Non-Uniform Memory Access
- OPAM: OCaml Package Manager
- OS: Operating System
- PR: Pull Request
- RISC-V: Reduced Instruction Set Computing - V
- RPC: Remote Procedure Call
- STW: Stop-The-World