OCaml Planet
The OCaml Planet aggregates various blogs from the OCaml community. If you would like to be added, read the Planet syndication HOWTO.
Weekly News — OCaml Weekly News, Aug 25, 2015
 Destructive use of file descriptors
 Learning about compilerlibs
 POSTDOC position at NOVA LINCS Lisbon (background in Programming Languages and Tools)
 Gg 0.9.1
 PhD position announcement: Mixing Unproved and Proved subsystems through Contracts for Correct-by-Construction system design
 Vg 0.8.2
 Ocsigen releases
 Albatross: A Verifying Compiler v0.2
 Simple exception - different behaviour between toplevel and compiled
 Position at Wolfram Mathcore
 Size of .cmo / .cmi workload of the compiler
 Introduction to Functional Programming in OCaml
 Setting Pervasives' stderr to unbuffered as default?
 Other OCaml News
Tenth OCaml compiler hacking evening and OCaml/ML talks — OCaml Labs compiler hacking (OCaml Labs), Aug 20, 2015
We'll be meeting in the Computer Lab next Friday (28th August 2015) for another evening of compiler hacking. All welcome!
We'll also be having an afternoon of OCaml and ML-related talks beforehand, with titles suspiciously similar to talks at the ML Workshop and OCaml Workshop the following week.
If you're planning to come along to either the talks or to compiler hacking, please add yourself to the Doodle poll. Further updates, if any, will be posted to the compiler hacking mailing list.
Schedule
3.30pm Polymorphism, subtyping and type inference in MLsub
Stephen Dolan (with Alan Mycroft)
3.55pm The State of the OCaml Platform: September 2015
Anil Madhavapeddy (with Amir Chaudhry, Thomas Gazagnaire, Jeremy Yallop, David Sheets)
4.20pm Modular macros
Jeremy Yallop (with Leo White)
4.45pm Break
5.15pm Effective Concurrency through Algebraic Effects
KC Sivaramakrishnan (with Stephen Dolan, Leo White, Jeremy Yallop, Anil Madhavapeddy)
5.40pm A review of the growth of the OCaml community
Amir Chaudhry
6.05pm Persistent Networking with Irmin and MirageOS
Mindy Preston (with Magnus Skjegstad, Thomas Gazagnaire, Richard Mortier, Anil Madhavapeddy)
6.30pm Food
7.30pm Compiler hacking
Further details
Where: Room FW26, Computer Laboratory, Madingley Road
When: 3.30pm (workshop); 6.30pm (compiler hacking), Friday 28th August 2015
Who: anyone interested in improving OCaml. Knowledge of OCaml programming will obviously be helpful, but prior experience of working on OCaml internals isn't necessary.
What: fixing bugs, implementing new features, learning about OCaml internals.
Wiki: https://github.com/ocamllabs/compiler-hacking/wiki
We're defining "compiler" pretty broadly, to include anything that's part of the standard distribution, which means at least the standard library, runtime, tools (ocamldep, ocamllex, ocamlyacc, etc.), ocamlbuild, the documentation, and the compiler itself. We'll have suggestions for mini-projects for various levels of experience (see also some things we've done on previous evenings), but feel free to come along and work on whatever you fancy.
No (functional) experience required — Jane Street (Yaron Minsky), Aug 19, 2015
Jane Street is a serious functional programming shop. We use OCaml, a statically typed functional language, for almost everything and have what is probably the largest OCaml codebase anywhere.
This leads lots of people to think that they shouldn't even bother applying, under the assumption that we are only interested in hiring deep functional programming gurus. I think people get to this conclusion in part because they think of functional languages, especially those with fancy type systems, as arcane tools that can only be used effectively after deep study.
To the contrary, one of the reasons we started building production systems with OCaml was that it was relatively easy to understand, even for people with no formal CS background. Since then, we've had good experiences taking students with no functional experience at all and getting them to the point of being able to complete a project in just a few weeks. We also have a very successful "OCaml Bootcamp" program, where over four weeks, we train all of the incoming traders and many other non-developer hires on OCaml and our development tools and libraries. By the end, most of them are able to create useful applications.
All of this is to say that we don't go out of our way to hire people who are already familiar with functional programming. In practice, it's just not that hard for strong programmers to pick it up after they start.
That said, a solid majority of the developers we hire do come in with functional programming experience - but that's because of their preferences, not ours. Programmers with an interest in functional languages have an extra reason to want to work here, and so we get an unusually high number of good applicants from that pool.
There's a more general lesson here: using well-loved tools is a good way of attracting (and retaining) great developers.
Haskell Engineer at Wagon (Full-time) — Functional Jobs (FunctionalJobs.com), Aug 18, 2015
We’re a team of functional programmers writing apps and services in Haskell (and Javascript). Yes, it’s true: Haskell is our main backend language. We also use functional programming practices across our stack.
Wagon is a great place to do your best work. We love to teach and learn functional programming; our team is humble, hard working, and fun. We speak at the Bay Area Haskell Meetup, contribute to open source, and have weekly lunches with interesting people from the community.
Work on challenging engineering problems at Wagon. How to integrate Haskell with modern client- and server-side technologies, like Electron and Docker? How to deploy and manage distributed systems built with Haskell? Which pieces of our infrastructure should we open-source?
Learn more about our stack, how we combine Haskell, React, and Electron, and what it’s like working at a Haskell-powered startup.
Background
 love of functional programming
 personal project or production experience using Haskell, OCaml, Clojure, or Scala
 passionate (but practical) about software architecture
 interested in data processing, scaling, and performance challenges
 experience with databases (optional)
Role
 write Haskell for client and serverside applications
 integrate Haskell with modern tools like Docker, AWS, and Electron
 architect Wagon to work with analytic databases like Redshift, BigQuery, Spark, etc
 build systems and abstractions for streaming data processing and numeric computing
 work with libraries like Conduit, Warp, and Aeson
 use testing frameworks like QuickCheck and HSpec
 develop deployment and monitoring tools for distributed Haskell systems
Get information on how to apply for this position.
Announcing Lambdoc 1.0-beta4 — Dario Teixeira, Aug 17, 2015
I'm happy to announce the release of version 1.0-beta4 of Lambdoc, a library providing support for semantically rich documents in web applications. Though Lambdoc was designed with Ocsigen/Eliom integration in mind, it does not actually depend on the Ocsigen server or Eliom, and you may use it with other frameworks. In fact, you may find it useful outside the web application domain altogether.
An overview of Lambdoc's features may be found in previous posts announcing the beta1 and beta3 releases. Between beta3 and beta4, the most salient changes are as follows:

Introduction of Lambdoc_core_foldmap, a module for aiding the construction of functions for deep traversal and transformation of a document tree. The basic idea is inspired by the compiler's Ast_mapper module, so it should be widely familiar. Moreover, the foldmapper is the result of a functor parameterised by a custom monad, so it's easily integrated in an application using Lwt or Async if the foldmapping requires doing monadic I/O. The tutorial directory includes some examples using Lambdoc_core_foldmap:
 Tutorial 7 illustrates one of the simplest possible applications of this feature: a function that counts the number of bold sequences used in a document.
 Tutorial 8 depicts a link validator that uses Cohttp to verify that all external links are live. Note that it registers itself as a parsing post-processor, allowing any found errors to be reported together with other unrelated document errors. Moreover, it lives under the Lwt monad.
 Tutorial 9 implements a simple document transformer which replaces all instances of Eastasia with Eurasia and vice-versa.
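The fold-map pattern behind these tutorials can be illustrated generically. The sketch below uses a toy inline type invented for this post (not Lambdoc's actual document model or API) to show the essential idea: traverse the tree, rebuild it, and thread an accumulator along the way, which is how a bold-sequence counter like the one in Tutorial 7 can be written.

```ocaml
(* Toy inline tree; Lambdoc's real model is much richer. *)
type inline = Plain of string | Bold of inline list

(* Fold-map: count Bold nodes while rebuilding the (unchanged) tree.
   The accumulator travels left to right through the children. *)
let rec foldmap acc = function
  | Plain s -> (acc, Plain s)
  | Bold xs ->
      let acc, xs = List.fold_left_map foldmap (acc + 1) xs in
      (acc, Bold xs)

(* Keep only the accumulator: the number of bold sequences. *)
let count_bold doc = fst (foldmap 0 doc)
```

A monadic variant (as in Lambdoc's functorised fold-mapper) would replace the plain tuple threading with binds in the chosen monad.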

The addition of Lambdoc_core_foldmap enabled the simplification of the extension mechanism. Previous versions of Lambdoc feature hooks for reading/writing link and image URLs. All of those hooks are now gone.

Lambdoc documents may now carry information about the parsed source (location, etc.) in every attribute. I briefly entertained the possibility of making the attribute polymorphic, thus allowing documents to carry custom metadata. However, at this moment I have no practical need for this extra flexibility, and I am wary of increasing complexity in the name of hypothetical use cases.
Lambdoc 1.0-beta4 should be available in OPAM any moment now. Documentation is still a work in progress, and since OCamldoc gets terribly confused with Lambdoc's heavy use of module aliases, we may have to wait for Codoc before proper API docs can get generated. In a small effort to ameliorate this situation, the examples directory includes a tutorial with self-contained demos of Lambdoc's various features.
Just-in-Time Summoning of Unikernels (v0.2) — Magnus Skjegstad, Aug 17, 2015
Jitsu, or Just-in-Time Summoning of Unikernels, is a prototype DNS server that can boot virtual machines on demand. When Jitsu receives a DNS query, a virtual machine is booted automatically before the query response is sent back to the client. If the virtual machine is a unikernel, it can boot in milliseconds and be available as soon as the client receives the response. To the client it will look like it was on the whole time.
Jitsu can be used to run microservices that only exist after they have been resolved in DNS, and perhaps in the future can facilitate demand-driven clouds or extreme scaling with a unikernel per URL. Jitsu has also been used to boot unikernels in milliseconds on ARM devices.
A new version of Jitsu was just released and I'll summarize some of the old and new features here. This is the first version that supports both MirageOS and Rumprun unikernels and uses the distributed Irmin database to store state. A full list of changes is available here.
A Jitsu demo hosting a MirageOS unikernel runs here and nginx on Rumprun here. These demos should be up most of the time, but may occasionally be unstable or unavailable as they are also used to test new features.
For more technical details and up to date example configurations with MirageOS and Rumprun, see the Jitsu README.
Overview
How it works
The following figure shows what happens when Jitsu receives a DNS query for a unikernel that is not currently running.
First, the client sends a DNS request to Jitsu. The request is received and checked against a list of domains mapped to unikernels. If there is a match, Jitsu boots the corresponding unikernel VM. When the unikernel has started booting, Jitsu sends a DNS reply to the client containing the future IP address of the unikernel. When the DNS response is received by the client it initiates a TCP connection to the IP address in the DNS response.
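The lookup-and-boot step can be sketched in a few lines of OCaml. All names below are invented for illustration; the real Jitsu is asynchronous and drives an actual hypervisor backend rather than flipping a flag.

```ocaml
(* Illustrative sketch of the lookup-and-boot decision: a table maps
   domain names to unikernel records, and resolving a name is what
   triggers the boot. Not Jitsu's actual API. *)
type vm = { ip : string; mutable running : bool }

let vms : (string, vm) Hashtbl.t = Hashtbl.create 8

(* On a DNS query: if the name maps to a unikernel, boot it on demand
   and answer with its (future) IP address; otherwise, no answer. *)
let handle_query name =
  match Hashtbl.find_opt vms name with
  | None -> None
  | Some vm ->
      if not vm.running then vm.running <- true; (* boot on demand *)
      Some vm.ip
```

The essential point is the side effect during resolution: the DNS reply carries the IP the unikernel will have once its boot completes.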
Each DNS reply from Jitsu contains a time-to-live (TTL) value that tells the client for how long the reply is valid. This is a feature built into DNS that allows query results to be cached by the client or by other name servers. When the TTL expires, the client has to send a new DNS query to Jitsu to verify that a cached response is still correct. By using low TTL values (typically less than an hour), Jitsu can keep track of how often a unikernel is used and may automatically stop unikernels that have not been requested within a certain time period. By default, unikernels are stopped after a 2 x TTL timeout.
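The default stop policy reduces to a small pure function. This is a sketch with illustrative argument names, not Jitsu's actual API:

```ocaml
(* A unikernel is stopped once no DNS request has been seen for more
   than 2 x TTL seconds. Times are in seconds since some epoch. *)
let should_stop ~now ~last_request ~ttl =
  now -. last_request > 2.0 *. float_of_int ttl
```

With a 60-second TTL, a unikernel last requested at t = 0 would be kept at t = 100 but stopped at t = 130.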
Masking boot delays
When Jitsu boots a unikernel and returns the DNS response there is a race between the client and the unikernel. The unikernel has to be able to respond to a TCP request within the time it takes to send the DNS response back to the client and for the client to attempt to connect. What happens if the unikernel is unable to complete its boot process in time?
The problem is illustrated in this figure:
After being booted by Jitsu, the unikernel has about one round-trip time (RTT) to finish booting. This is the time it takes for the DNS reply to go back to the client and for the TCP handshake to be initiated with a SYN packet. For example, the typical boot time of a MirageOS unikernel on ARM is about 3-400 ms. It is then likely that the unikernel will not be ready in time and that the first SYN packet is lost. TCP is able to recover by retransmitting the SYN packet, but it requires a timeout and a retransmit. With recommended timeout values it will take a second or longer for the client to finally connect.
To mask this delay and avoid the SYN retransmission, Jitsu now supports three alternative mechanisms:
 Delay the DNS response by a fixed timeout (e.g. 150-200 ms) to let the unikernel complete the boot process
 Wait for the unikernel to signal Jitsu before sending the DNS response
 Cache the incoming SYN packet on behalf of the unikernel with Synjitsu
The simplest solution to set up is the fixed delay. Depending on the application, a delay of a few hundred milliseconds may be acceptable to the client. For a web application for example, this may not be noticeable. This mechanism does not require modifications to the unikernel itself. The downside is that the delay is fixed, so even if the unikernel starts faster than expected there will still be a delay for the client.
A more dynamic approach is to let the unikernel notify Jitsu when it is ready. This is currently done by waiting for a key to appear in Xenstore, Xen's shared information store. To write the key the unikernel only needs a working Xenstore client implementation. Jitsu will watch Xenstore and immediately send the DNS response when the key appears. While more dynamic, this mechanism doesn't allow Jitsu to send the DNS reply while the unikernel is booting, making the delay longer than necessary.
To be able to use both a dynamic delay and send the response back while the unikernel is booting, Jitsu has support for running a separate unikernel service that caches incoming SYNs until the unikernel is ready. We call this Synjitsu.
Synjitsu
Synjitsu is a unikernel service that handles TCP connections on behalf of unikernels that are completing their boot process. Synjitsu is always running and captures TCP SYN packets that appear on the network bridge and don't have a matching unikernel yet. The SYNs are then stored in the Xenstore database. When a new unikernel has booted, it will check Xenstore for cached SYNs that match its MAC and IP address. Every SYN it finds will then be processed as if it had been received over the network and trigger a SYN/ACK to complete the TCP three-way handshake. When the unikernel has finished booting, all incoming SYNs are ignored by Synjitsu and go directly to the unikernel as regular network traffic.
The process is shown above. A DNS query has already been sent to Jitsu DNS, a unikernel has been booted and a reply sent back to the client. The client now attempts to send a TCP SYN packet to initiate the TCP connection. Unfortunately, the unikernel is not ready yet and would be unable to reply. Synjitsu then silently stores the SYN for the remaining milliseconds while the unikernel completes its boot process. When the unikernel is ready it will retrieve the SYN and send a SYN/ACK back to the client.
For Synjitsu to work properly we also have to handle ARP traffic. ARP is used to find the MAC address that matches an IP address on the local network. Before the incoming TCP SYN can reach its destination, the router (usually the local gateway) has to know the MAC address it should be sent to. As the unikernel is still booting, it is unable to announce its address and IP; in fact, the IP is not really in use yet. To compensate for this, Jitsu will tell Synjitsu the MAC and IP address of every unikernel that is currently booting. Synjitsu then sends gratuitous ARP packets to announce to the network that it is handling the specified MAC and IP for now. As soon as the real unikernel finishes booting it sends its own gratuitous ARP packet to notify the network that it is ready.
Synjitsu is currently a highly experimental feature and requires a modified MirageOS TCP/IP stack. For more information about running Synjitsu, see our paper or ask on the mailing list. The source code is available here.
Irmin and Jitsu
Irmin is a distributed database with Git-like features, such as a full history of changes and support for branching and merging. Jitsu's internal state is now stored in an Irmin database which can be inspected using the Irmin tool. The database used to store the state of the demonstration is shown below.
$ irmin tree
/jitsu/vm/1ca...b60/config/disk/0........................."/dev/loop5@xvda"
/jitsu/vm/1ca...b60/config/disk/1........................."/dev/loop6@xvdb"
/jitsu/vm/1ca...b60/config/dns/0....................."www.rump.jitsu.v0.no"
/jitsu/vm/1ca...b60/config/ip/0............................."89.16.190.215"
/jitsu/vm/1ca...b60/config/kernel/0............................."nginx.bin"
/jitsu/vm/1ca...b60/config/memory/0................................."64000"
/jitsu/vm/1ca...b60/config/name/0................................."rumpxl"
/jitsu/vm/1ca...b60/config/nic/0......................................"br0"
/jitsu/vm/1ca...b60/config/response_delay/0..........................."0.9"
/jitsu/vm/1ca...b60/config/rumprun_config/0......................"json.cfg"
/jitsu/vm/1ca...b60/dns/www.rump.jitsu.v0.no/ttl......................."60"
/jitsu/vm/1ca...b60/ip......................................"89.16.190.215"
/jitsu/vm/1ca...b60/response_delay...................................."0.1"
/jitsu/vm/1ca...b60/stop_mode....................................."destroy"
/jitsu/vm/1ca...b60/use_synjitsu...................................."false"
/jitsu/vm/de4...ad3/config/dns/0.........................."www.jitsu.v0.no"
/jitsu/vm/de4...ad3/config/ip/0............................."89.16.190.214"
/jitsu/vm/de4...ad3/config/kernel/0..........................."mirwww.xen"
/jitsu/vm/de4...ad3/config/memory/0................................."64000"
/jitsu/vm/de4...ad3/config/name/0.................................."wwwxl"
/jitsu/vm/de4...ad3/config/nic/0......................................"br0"
/jitsu/vm/de4...ad3/config/response_delay/0..........................."0.1"
/jitsu/vm/de4...ad3/config/wait_for_key/0....................."data/status"
/jitsu/vm/de4...ad3/dns/www.jitsu.v0.no/ttl............................"60"
/jitsu/vm/de4...ad3/ip......................................"89.16.190.214"
/jitsu/vm/de4...ad3/response_delay...................................."0.1"
/jitsu/vm/de4...ad3/stop_mode....................................."destroy"
/jitsu/vm/de4...ad3/use_synjitsu...................................."false"
/jitsu/vm/de4...ad3/wait_for_key.............................."data/status"
The Jitsu database is currently read-only, but in the future the plan is to allow clients to create their own branch of the database, perform changes and then merge with Jitsu's master branch. This can then be used to control Jitsu while it is running and may, for example, allow unikernels to modify their own boot and DNS configuration. The Irmin database will also make it easier to split Jitsu into smaller components that cooperate and allow some features to run within separate unikernels (e.g. DNS).
New backends
Jitsu v0.2 includes support for several backends that can be used to manage the unikernel VMs. The original libvirt backend is still used by default, but libxl and XAPI are also supported (but not as well tested). If you encounter problems with the new backends, please report them here.
Tell me more!
This post has mainly focused on Jitsu, but if you are interested in unikernels and want more information about writing your own or hosting your web site with one, these links may be useful:
 Hello Mirage World describes how to create and compile a simple "Hello world" unikernel.
 In My first unikernel, Thomas Leonard describes how he set up a MirageOS REST service with a storage backend.
 Mindy Preston has documented how she runs her blog as a unikernel with Amazon EC2 in I Am Unikernel! (and So Can You!).
 From Jekyll site to Unikernel in fifty lines of code
 Tutorial: Serve a static website as a Unikernel explains how to run an nginx server in a Rumprun unikernel.
There are also many MirageOS application examples available here. The examples are kept up to date with the latest libraries.
If you experiment with Jitsu, please let us know how it went on the mailing list!
(Thanks to Daniel Bünzli and Amir Chaudhry for comments on previous versions of this post)
Summer Ocsigen releases — Ocsigen blog (Gabriel Radanne), Aug 17, 2015
We are happy to announce the releases of
We also welcome a new member in the ocsigen team, Vasilis Papavasileiou.
Key changes in the various releases:

PPX support for js_of_ocaml with OCaml >= 4.02.2. See documentation here.
This was also the occasion to introduce a new syntax for object literals, and to improve the Camlp4 syntax (w.r.t. locations). Both syntaxes emit the same code, and are perfectly compatible.

Support for dynlink in js_of_ocaml.

Logging improvements in Eliom and Server, in particular on the client side.

A healthy amount of bugfixes.
The next releases will probably see major changes. The current plan is:

Replace Server’s internals with cohttp, as part of our move towards Mirage-compatible libraries. See here.

Shared_react, which allows building reactive pages from the server side. See here.

PPX for Eliom.

Support for async/core in js_of_ocaml.
Have fun with Ocsigen!
Merging OCaml patches — GaGallium (Gabriel Scherer), Aug 07, 2015
In Merging OCaml Patches I wrote a description of my personal process to merge OCaml patches (usually submitted as github pull requests) in the upstream repository (currently SVN). This description may be useful for external contributors to understand the process, and maybe meet me halfway by doing a bit of the work upfront.
In particular, as an external contributor, you
 must add tests to the testsuite,
 should write a proper changelog entry,
 should rebase your PR into a good patch series when it converges, and
 could include authorship information in commit messages.
I'm busy writing a thesis manuscript right now (gave up on the ICFP programming contest this year!), and limiting my OCaml-related work to the minimum: merging stuff on my free time. I think it has been a good deal so far. Thanks to all for the good work!
OCaml Compiler Hacking: how to add a primitive — Cedeela (simon), Aug 06, 2015
I have been hacking on the OCaml compiler recently; in particular, I added some support for coloring warning/error messages. At some point during the discussion over this pull request, it became clear that colored output should only be enabled if stderr was an interactive terminal (as opposed to a regular file handle or whatnot). The compiler does not link with the Unix library, so I finally decided to add a primitive caml_sys_is_interactive. My purpose here is to explain how it works (from what I gathered) and how to do it.
I am not a compiler expert, so some of the explanations here might be wrong or misleading. If you spot mistakes, I will be happy to fix them.
What is a primitive?
OCaml allows C functions to be used directly as primitives, that is, as basic operations of the language. The stdlib is full of such functions.
A primitive (generally named caml_foo since C is not namespaced) is generally located in a C file in byterun/ (the directory containing the interpreter and runtime of OCaml). It is prefixed with CAMLprim and should take and return only C values of type value (that is, an OCaml value, as declared in byterun/caml/mlvalues.h).
Let's take an example: Sys.file_exists : string -> bool. The module sys.ml contains the following signature:
external file_exists : string -> bool = "caml_sys_file_exists"
and there is, in byterun/sys.c, the function caml_sys_file_exists (here in a simplified form):
CAMLprim value caml_sys_file_exists(value name)
{
  struct stat st;
  char *p = String_val(name);
  int ret = stat(p, &st);
  return Val_bool(ret == 0);
}
This function uses macros defined in byterun/caml/mlvalues.h to convert between OCaml values and C values, but this is not the point of this post.
note: I have no idea how CAMLprim works, but there is a lot of magical automation that extracts a list of all primitives, exports their names in C arrays, etc. A primitive is a C function shipped with the runtime or a library (such as Unix), whereas some other functions are CAMLexport or CAMLlocal (I don't know exactly what that means).
edit: @gasche points out that all the CAMLfoo primitives are documented in the official manual. In particular, writing C "stubs" (C functions that can be used directly from OCaml via an external declaration) requires, in general, using CAMLparam and CAMLlocal to ensure that the GC is aware that some values exist (either as parameters to the function, or as local variables).
How to add a primitive: Bootstrap!
It is a bad™ idea to add a primitive to, say, byterun/sys.c and use it in the compiler immediately. I tried it, and it failed to compile. The correct way, as I learnt from Jérémie Dimino (@diml) and Thomas Refis, is as follows:
 1. Add the primitive into the relevant .c file, but do not use it yet anywhere in the compiler.
 2. make world. This compiles the interpreter and bytecode compiler.
 3. make bootstrap. This updates the bytecode archives (in boot/) of ocamlc, ocamldep, and ocamllex. Then, commit those new archives, as they will be needed to compile the compiler.
 4. Make changes that use the new primitive, such as the fancy coloring system. Using the primitive first requires declaring it as external foo : 'a -> 'b = "caml_foo" in one or more files. (In other projects, ctypes can be used instead of writing primitives by hand, depending on the programmers' preference.)
 5. make world world.opt tests should now work properly.
 6. Argue with @gasche about whether the primitive is useful or not ;)
Provably considered harmful — Andrej Bauer, Aug 05, 2015
This is officially a rant and should be read as such.
Here is my pet peeve: theoretical computer scientists misuse the word “provably”. Stop it. Stop it!
Theoretical computer science is closer to mathematics than it is to computer science. There are definitions, theorems and proofs. Theoretical computer scientists must understand mathematical terminology. The words “proof” and “provable” are in the domain of mathematical logic. A statement is provable if it has a proof in a given formal system. It makes no sense to say “provable” without specifying or implying a specific proof system.
But theoretical computer scientists say things like (I just googled these randomly) “A Provably Optimal Algorithm for Crowdsourcing” and “A Provably Correct Learning Algorithm for Latent-Variable PCFGs” and even “provably true”.
So what is a “provably optimal algorithm” ? It is an algorithm for which there exists a proof of optimality. But why would you ever want to say that in the title of your paper? I can think of several reasons:
 You proved that there exists a proof of optimality but did not actually find the proof itself. This of course is highly unlikely, but that does not even matter, for as soon as we know there exists a proof, the algorithm is optimal. Just say “optimal algorithm” and the world will be a more exact place.
 Your paper is an intricate piece of logic which analyzes existence of proofs of optimality in some super-important formal systems. This of course is not what theoretical computer scientists do for the most part. Any paper which actually did this would instead say something like “$\mathrm{Ind}(\Sigma^0_1)$-provability of optimality”.
 You distinguish between algorithms which are optimal and those which are optimal and you proved they’re optimal. In that case we should turn the tables: if you ever publish a claim of optimality without having proved it, put that in your title and call it “Unproved optimal algorithm”.
 You proved that your algorithm is optimal, showed the proof to the anonymous referee and the editor, and then published your optimal algorithm without proof. You want to tease your readers by telling them “there is a proof but it’s a secret”. If this is what you meant to convey, then the title was well chosen.
To see that “provable” is just a buzzword, consider what it would mean to have the opposite, that is an “unprovably optimal algorithm”. That is an algorithm which is optimal, but there exists no proof of optimality. Such a thing can be manufactured by a logician, probably, but is certainly not found in everyday life.
As someone is going to say that “provably true” makes sense, let me also comment on that. A statement is true if it is valid in all models. So “provably true” would mean something like “there exists a proof in a given formal system that the statement is valid in all models”. Well, I will not deny that a situation might arise in which this sort of thing is precisely what you would consider, but I will also bet you that it is far more likely that “provably true” should just be replaced by either “provable” or “true”, depending on the particularities of the situation.
As a rule of thumb, unless you are a logician, if you feel like using the word “provably”, just skip it.
And as long as I am ranting, please stop introducing every single concept with “informally” and prefixing every single displayed formula with “formally”. Which is it,
 you think that “informal” descriptions are somehow unworthy or broken, and you should therefore alert the reader that you’re lying to them, or
 you think that “formal” descriptions are an unnecessary burden which must be included in the paper to satisfy the gods of mathematics?
If the former, stop lying to your readers, and if the latter stop doing theoretical computer science.
Now I will go on to refereeing that pile of POPL papers which must contain at least a dozen misuses of “provably” and two dozen false formal/informal dichotomies.
Release of OCaml-bitcoin 2.0 — OCamlCore Forge News (Dario Teixeira), Aug 05, 2015
Full Time: Software Developer (Functional Programming) at Jane Street in New York, NY; London, UK; Hong Kong — GitHub Jobs, Aug 03, 2015
Software Developer
Jane Street is a proprietary quantitative trading firm, focusing primarily on trading equities and equity derivatives. We use innovative technology, a scientific approach, and a deep understanding of markets to stay successful in our highly competitive field. We operate around the clock and around the globe, employing over 400 people in offices in New York, London and Hong Kong.
The markets in which we trade change rapidly, but our intellectual approach changes faster still. Every day, we have new problems to solve and new theories to test. Our entrepreneurial culture is driven by our talented team of traders and programmers. At Jane Street, we don't come to work wanting to leave. We come to work excited to test new theories, have thoughtprovoking discussions, and maybe sneak in a game of pingpong or two. Keeping our culture casual and our employees happy is of paramount importance to us.
We are looking to hire great software developers with an interest in functional programming. OCaml, a statically typed functional programming language with similarities to Haskell, Scheme, Erlang, F# and SML, is our language of choice. We've got the largest team of OCaml developers in any industrial setting, and probably the world's largest OCaml codebase. We use OCaml for running our entire business, supporting everything from research to systems administration to trading systems. If you're interested in seeing how functional programming plays out in the real world, there's no better place.
The atmosphere is informal and intellectual. There is a focus on education, and people learn about software and trading, both through formal classes and on the job. The work is challenging, and you get to see the practical impact of your efforts in quick and dramatic terms. Jane Street is also small enough that people have the freedom to get involved in many different areas of the business. Compensation is highly competitive, and there's a lot of room for growth.
You can learn more about Jane Street and our technology from our main site, janestreet.com. You can also look at a talk given at CMU about why Jane Street uses functional programming (http://ocaml.janestreet.com/?q=node/61), and our programming blog (http://ocaml.janestreet.com).
We also have extensive benefits, including:
 90% book reimbursement for work-related books
 90% tuition reimbursement for continuing education
 Excellent, zero-premium medical and dental insurance
 Free lunch delivered daily from a selection of restaurants
 Catered breakfasts and fresh brewed Peet's coffee
 An on-site, private gym in New York with towel service
 Kitchens fully stocked with a variety of snack choices
 Full company 401(k) match up to 6% of salary, vests immediately
 Three weeks of paid vacation for new hires in the US
 16 weeks fully paid maternity/paternity leave for primary caregivers, plus additional unpaid leave
More information at http://janestreet.com/culture/benefits/
Software Engineer, OCaml at Pegged Software (Full-time) — Functional Jobs (FunctionalJobs.com), Aug 03, 2015
Pegged is seeking passionate, collaborative, entrepreneurial engineers to help us transform and equalize how people are hired. We’re a small, early-stage, high-growth company that cares about producing a valuable product, growing together and using the right tool for the job. We are hiring at all levels of experience at our New York City and Baltimore offices.
Pegged Software is fixing how people get hired. We do this by applying modern data science to build predictive models that match individuals with roles where they can excel. We use traditional application data in some very innovative ways, and we’ve built our own data-gathering platforms in the form of applicant screenings. We’ve started by focusing our technology on the healthcare industry, and plan on expanding to other industries, and then to a direct-to-consumer product that will help individuals and companies find one another.
At the hospitals we work with, employee turnover is a major problem. This churn costs our clients thousands of dollars that could be better spent improving patient care. We’ve reduced employee turnover at our client hospitals by 45 to 75 percent, which has resulted in improved patient experience, company performance, and employee engagement.
This is just the beginning of a revolution in how employers use data to make better HR decisions, and how individuals are able to find fulfilling work beyond the bounds of bias. We have doubled in size over the last year, just completed a $7.5M round of financing, and are growing even more quickly in 2015.
What We Believe In
 Empathy with one another and our users. Inclusiveness, communication and collaboration are core tenets of how we work.
 Contribution to the open source community. We are built on open source technologies and we give back.
 Small feedback loops are critical to developing product and technology. Ownership gives individuals the agency to grow, and aligns outcomes.
What We Use
We are pragmatic and use whatever tool is best to get the job done. That being said, we most enjoy programming in OCaml on top of AWS, and make use of Python to glue things together.
Responsibilities
 Write server-side code for web-based applications and APIs.
 Create robust high-volume production applications and develop prototypes quickly.
 Work with Data Scientists to create a data pipeline that supports offline analysis, and online experimentation.
 Write automation code to deploy, monitor and alert to support your application.
Must Have
 An entrepreneurial mindset: you have a track record of easily learning new technologies. You are a self-directed learner who can adapt quickly to new tools to solve problems.
 A demonstrated ability to produce and support quality production code.
 Excellent communication skills.
 A collaborative approach to working on a team.
 A passion about making the world a better place.
Nice to Have
 Experience with Amazon Web Services (AWS)
 Experience with OCaml or functional programming in any environment
 Experience building highly available, fault tolerant systems
 Experience building a data pipeline (bonus points for knowing about the Lambda Architecture).
Perks
 The mission, the people and the work.
 The mission: work on a project that is making the world a better place.
 The people: work for a high growth company, and learn in an environment where people care about growth and work/life balance.
 The work: pragmatically use amazing technology.
 Competitive salary, benefits and equity.
Get information on how to apply for this position.
Formally verifying the complexity of OCaml programs with CFML - part 3 — GaGallium (Armaël Guéneau), Aug 03, 2015
In two previous blog posts (part 1 and interlude), I presented a set of notions to prove asymptotic complexity of OCaml programs, then an OCaml implementation of Okasaki's "binary random access lists". In this last blog post, I combine the two ideas, and present a formalization of binary random access lists, using CFML.
A too simple example
As an appetizer, let's try to prove the "with big-O" specification for the incr function of the first blog post:
Lemma incr_spec :
SpecO1 (fun F =>
Spec incr r R>> forall (i: int),
R (\$ F tt \* r ~~> i) (# r ~~> (i+1))).
The goal starts with SpecO1, which is a specialized form of SpecO for \(O(1)\) cost functions. Just like we used the xcf tactic for goals starting with Spec, we use here the xcfO tactic. It takes an argument, though: the expression of the cost function. Note that the domain of a \(O(1)\) cost function is unit: its value does not depend on any parameter of the program.
Proof.
xcfO (fun (_:unit) => 1).
Which turns our goal into:
============================
Spec incr r R>> forall (i: int),
R (\$ 1 \* r ~~> i) (# r ~~> (i+1)).
Oh well. This example is quite trivial, so the xcfO tactic automatically proved all the \(O\)-related subgoals, giving us the same spec as before, with a precise credit count.
A more interesting case study: binary random access lists
The OCaml implementation of binary random access lists I presented earlier has been proved correct in CFML, with big-O bounds, and is a more interesting example. In the rest of this post, I'll present some key points of the proof, and explain how they use the new definitions, lemmas and tactics of our extension of CFML.
The complete Coq script (quite raw at the moment) for the proof can be found here.
Invariants of the structure
As mentioned in the previous post, the OCaml implementation carries a number of implicit invariants: in the Coq formalization, we are going to express these invariants explicitly.
Knowing that a random access list structure satisfies these invariants is key for the complexity proof: it gives us information about its size, and the size of its subtrees. Then, to know that our structures actually satisfy these invariants, we need to prove functional correctness of the OCaml code, i.e. prove that the functions do not break the invariants of the structure.
Consequently, our Coq proof is twofold: it proves both functional correctness, and algorithmic complexity.
Predicates
CFML automatically generates the Coq counterpart of the OCaml datatypes, tree a and rlist a. We start the proof by defining three predicates, btree, inv and Rlist, that make explicit the invariants of the structure.
First, a btree predicate: btree p t L means that t is a complete (binary) tree of depth p which contains the sequence of elements in L.
Inductive btree : int -> tree a -> list a -> Prop :=
| btree_nil : forall x,
    btree 0 (Leaf x) (x :: nil)
| btree_cons : forall p p' n t1 t2 L1 L2 L',
    btree p t1 L1 ->
    btree p t2 L2 ->
    p' = p + 1 ->
    n = 2 ^ p' ->
    L' = L1 ++ L2 ->
    btree p' (Node n t1 t2) L'.
Then, an inv predicate: the invariant for the whole structure. inv p ts L means that ts is a rlist of complete trees of increasing depth, starting with depth p; L is the sequence of elements represented by ts. ts being a well-formed binary random access list corresponds to the case where p is equal to 0. It is useful to consider the cases where p is nonzero, though: reasoning by induction on ts will lead to such cases.
Inductive inv : int -> rlist a -> list a -> Prop :=
| inv_nil : forall p,
    p >= 0 ->
    inv p nil nil
| inv_cons : forall p (t: tree a) ts d L L' T,
    inv (p+1) ts L ->
    L' <> nil ->
    p >= 0 ->
    (match d with
     | Zero => L = L'
     | One t => btree p t T /\ L' = T ++ L
     end) ->
    inv p (d :: ts) L'.
Finally, the Rlist predicate corresponds to the p = 0 case: it describes a complete well-formed binary random access list.
Definition Rlist (s: rlist a) (L: list a) := inv 0 s L.
Bounds
Given structures verifying these invariants, we can deduce additional properties, in particular:
Lemma length_correct : forall t p L,
  btree p t L -> length L = 2^p.
Lemma ts_bound_log : forall ts p L,
  inv p ts L -> length ts <= Z.log2 (2 * (length L) + 1).
These lemmas will be key for proving our \(log\) complexity bounds, and constitute in fact our only mathematical analysis for this library.
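On paper, the ts_bound_log inequality can be seen as follows (my own sketch, not part of the quoted development; the Coq proof proceeds by induction). A One in cell \(i\) carries a complete tree of depth \(p+i\), hence \(2^{p+i}\) elements, and the last cell of ts must be a One (each step of the invariant requires a nonempty remaining sequence). Writing \(k\) for length ts:

\[
\mathrm{length}\,L \;\ge\; 2^{p+k-1} \;\ge\; 2^{k-1}
\quad\Longrightarrow\quad
k \;\le\; \log_2(\mathrm{length}\,L) + 1 \;\le\; \log_2(2 \cdot \mathrm{length}\,L + 1).
\]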
cons_tree: a first proof
Let us jump directly to the proof of the (internal) cons_tree function.
let rec cons_tree (t: 'a tree) = function
  | [] -> [One t]
  | Zero :: ts -> One t :: ts
  | One t' :: ts -> Zero :: cons_tree (link t t') ts
and link t1 t2 = Node (size t1 + size t2, t1, t2)
cons_tree t ts adds a new tree t to the rlist ts. It may recursively walk through the list, calling link (the process is very similar to incrementing an integer represented as a list of bits).
As link runs in constant time, cons_tree performs \(O(ts)\) operations. Moreover, we showed earlier that \(ts = O(log(L))\) where \(L\) is the list of elements contained in ts. Therefore, cons_tree performs "in \(O(log(n))\)" (we want to eventually express the complexities depending on the number of elements in the structure, here "\(n\)").
Our formal proof follows this two-step informal reasoning: first we prove a \(O(ts)\) complexity, reasoning by induction on ts to follow the flow of the OCaml program; then we use our ts_bound_log lemma to deduce a logarithmic bound depending on the number of elements stored in ts.
First step: an auxiliary spec
We therefore prove an auxiliary specification, as our first step. Let's walk through the proof.
Lemma cons_tree_spec_aux :
  SpecO (fun n => n) (fun F =>
    Spec cons_tree (t: tree a) (ts: rlist a) R>>
      forall p T L, btree p t T -> inv p ts L ->
      R (\$ F (length ts)) (fun ts' => \[inv p ts' (T++L)])).
To prove a SpecO goal, one must start by providing an explicit cost function. In this case however, we do not provide one right away:

- As cons_tree calls the link function, its cost function depends on link's one: we need to unpack link's spec in order to access its (abstract) cost function;
- The domain of our cost function (and our bounding function fun n => n) is Z. However, we are only interested in non-negative values, as is length ts. We can apply a lemma allowing us to prove our spec for values greater than some bound, with a different cost function that equals ours on this subdomain (and is basically equal to zero elsewhere).
Proof.
destruct link_spec as (link_cost & link_cost_nonneg & ? & ?).
applys @SpecO_of_SpecO_after 0.
specialize (link_cost_nonneg tt). (* Help the automated tactics. *)
The cost function can now be provided, using xcfO. As things stand, in order to "guess" the exact expression of the cost function, the usual method is to look at the OCaml program, and remember that a function consumes a credit each time it is called. Here, we give the exact number of credits needed, but we could give more, as long as the asymptotic bound still holds.
xcfO (fun n => 1 + (1 + (link_cost tt)) * n).
Our cost function is still relatively simple, so the additional goals (monotonicity, domination, ...) are automatically proven by xcfO.
The last quirk is due to SpecO_of_SpecO_after. The goal uses an abstract cost function F', and we are given a proof that forall x, 0 <= x -> F' x = 1 + (1 + (link_cost tt)) * x. In practice, we just add this proof to our context, and rewrite with it when needed.
intros F' FeqF'.
The rest of the proof uses standard CFML tactics, plus:

- xpay, when presented with a Pay; ... goal;
- csimpl, when presented with a heap implication involving credits: it turns the goal into an (in)equality between the quantities of credits.
xinduction (fun (t:tree a) (ts:rlist a) => LibList.length ts).
xcf. intros ? ts. introv IH Rt Rts. rewrites~ FeqF'.
inverts Rts.
 xpay. csimpl. xgo; hsimpl; constructors~.
 { xpay. csimpl. simpl_nonneg~.
xmatch.
 xret; hsimpl; constructors~; subst; splits~.
 unpack; subst. xapps~.
{ csimpl. rew_length. math_nia. }
intros. xapps~.
{ rewrites~ FeqF'. csimpl; rew_length; math_nia. }
intros. xret.
{ hsimpl. constructors~. rew_list~. } }
Qed.
Second step: the main specification
The main specification can then use a cost function in \(O(log(L))\), \(L\) being the list of elements in the structure.
Lemma cons_tree_spec :
  SpecO Z.log2 (fun F =>
    Spec cons_tree (t: tree a) (ts: rlist a) R>>
      forall p T L, btree p t T -> inv p ts L ->
      R (\$ F (length L)) (fun ts' => \[inv p ts' (T++L)])).
The proof is simple: we first reuse the cost function of the previous lemma cons_tree_spec_aux, feeding it with a sufficient number of credits, as indicated by the ts_bound_log lemma ("\(ts \leq log(2 \times L + 1)\)").
Proof.
destruct cons_tree_spec_aux
as (cons_tree_cost & cost_pos & cost_mon & cost_dom & cons_tree_spec).
xcfO (fun n => cons_tree_cost (Z.log2 (2 * n + 1))).
This time, we have to prove some additional goals by hand, produced by xcfO.
Monotonicity

- applys~ @monotonic_comp. monotonic_Z_auto~.

We first apply monotonic_comp: our cost function is monotonic as the composition of two monotonic functions. applys~ includes a bit of automation, so the monotonicity of cons_tree_cost (present in the context) is used automatically. It remains to prove that fun n => Z.log2 (2 * n + 1) is monotonic: the monotonic_Z_auto~ tactic is able to prove it automatically (and more generally solves monotonicity goals for most simple functions composed of +, *, Z.log and Z.pow).

Domination

- applys @idominated_transitive. applys~ @idominated_comp cost_dom. monotonic_Z_auto. monotonic_Z_auto~. simpl. idominated_Z_auto~.

Our initial goal is idominated _ _ (fun n => cons_tree_cost (Z.log2 (2 * n + 1))) Z.log2. We cannot directly apply a composition lemma; however, we know that cons_tree_cost is a \(O(n)\): we first invoke transitivity of idominated, then apply a composition lemma. The remaining goals are proved automatically, either by monotonic_Z_auto, which handles simple monotonicity goals, or by idominated_Z_auto, which handles simple idominated goals (here, idominated_Z_auto~ proves idominated _ _ (fun n => Z.log2 (2 * n + 1)) Z.log2).

The specification

- xweaken~. do 4 intro. intro spec. intros. xgc; [xapply~ spec ]; csimpl~. { apply~ cost_mon. apply~ ts_bound_log. } Qed.

The specification itself is proved by weakening of cons_tree_spec_aux, plus the following arguments:

- ts_bound_log: \(ts \leq log(2 \times L + 1)\);
- cons_tree_cost is monotonic: needed to apply the ts_bound_log inequality under cons_tree_cost.
lookup: how to deal with multiple parameters
As explained in the first post, things can get tricky when the cost function depends on multiple parameters. More precisely, the user has to specify which notion of "going to infinity" she's intending, by choosing the right filter for the domain (e.g. Z*Z for a cost function with two parameters).
Proving a specification for the lookup function involves precisely this kind of difficulty.
let rec lookup i = function
  | [] -> raise (Invalid_argument "lookup")
  | Zero :: ts -> lookup i ts
  | One t :: ts ->
      if i < size t
      then lookup_tree i t
      else lookup (i - size t) ts
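The lookup_tree helper used above comes from elsewhere in the implementation. A plausible sketch, following Okasaki's presentation (the exact code in the linked source may differ), descends into a complete tree, going left or right depending on the index:

```ocaml
type 'a tree = Leaf of 'a | Node of int * 'a tree * 'a tree

(* Sketch of lookup_tree: in a complete tree of w elements, the left
   subtree holds the first w/2 elements and the right one the rest. *)
let rec lookup_tree i = function
  | Leaf x -> if i = 0 then x else raise (Invalid_argument "lookup_tree")
  | Node (w, t1, t2) ->
      if i < w / 2 then lookup_tree i t1
      else lookup_tree (i - w / 2) t2
```

Such a lookup_tree runs in \(O(d)\) steps, \(d\) being the depth of the (complete) tree.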
We prove a lookup i ts specification by induction on ts. During the induction we have two parameters: \(ts\), and the depth \(p\) of ts's first tree (matching an inv p ts L invariant).
The respective status of these two parameters differs, though. Once the proof by induction is done, we'll want, as for cons_tree, to express the cost function depending on \(L\). \(ts\) will tend to infinity with \(L\), but \(p\) will be fixed to \(0\), as lookup is only supposed to be called on well-formed random access lists from the user's point of view.
When proving cons_tree, we did not have to provide any filter: the standard filter for Z was inferred. Here, we proceed as follows:

- We establish on paper a first asymptotic bound of \(O(p + ts)\);
- We provide a filter towards_infinity_xZ p on (a subset type of) Z*Z, which makes its second component tend to infinity, while the first is fixed to p (p is a parameter of the filter);
- We prove an intermediate specification using this filter, for any fixed p. Note that sadly, we cannot use SpecO to state our specification: to get a provable and useful specification, we need to quantify over p "in the middle of SpecO".

The result is a quite ugly intermediate specification, unfortunately: the result of unfolding SpecO and quantifying over p in the middle.
Lemma lookup_spec_ind :
  exists (F: Z * Z -> Z),
  (forall m n, 0 <= m -> 0 <= n -> 0 <= F (m, n)) /\
  (forall (p: Z),
    0 <= p ->
    monotonic (fixed_fst_le le p) le (fun p => F (proj1_sig p)) /\
    idominated (FO := fo_towards_infinity_xZ p) _ _
      (fun p => F (proj1_sig p))
      (fun p => let '(m, n) := proj1_sig p in m + n)) /\
  Spec lookup (i:int) (ts: rlist a) R>>
    forall p L, inv p ts L -> ZInbound i L ->
    R (\$F (p, length ts)) (fun x => \[ZNth i L x]).
The proof has the same spirit as those shown before, basically applying monotonic_* and idominated_* lemmas, just more involved.
Finally, we can write a nicer top-level specification to wrap it:
Lemma lookup_spec :
  SpecO Z.log2 (fun F =>
    Spec lookup (i: int) (ts: rlist a) R>>
      forall L, Rlist ts L -> ZInbound i L ->
      R (\$F (length L)) (fun x => \[ZNth i L x])).
After instantiating the cost function of our intermediate spec as lookup_spec_cost, and projecting the monotonicity and domination hypotheses with \(p = 0\), we provide the following cost function:
xcfO (fun n => lookup_spec_cost (0, Z.log2 (2 * n + 1))).
As fun (m, n) => lookup_spec_cost (m, n) is a \(O(m + n)\), by composition, our cost function is indeed a \(O(log(n))\).
Epilogue
The set of lemmas and tactics I just presented is, at the moment, only an unfinished attempt: it definitely has some rough edges, and can be unpleasant to use. However, I think an interesting aspect about it is that, trying to formalize reasoning on big-Os, it shows how much these tend to be informal when performed on paper. Thus, our bet is that the additional complexity is for the most part not artificial, but reflects actual subtleties in the proofs.
Intermediate truth values — Andrej Bauer, Jul 30, 2015
I have not written a blog post in a while, so I decided to write up a short observation about truth values in intuitionistic logic which sometimes seems a bit puzzling.
Let $\Omega$ be the set of truth values (in Coq this would be the setoid whose underlying type is $\mathsf{Prop}$ and equality is equivalence $\leftrightarrow$, while in HoTT it is the h-propositions). Call a truth value $p : \Omega$ intermediate if it is neither true nor false, i.e., $p \neq \bot$ and $p \neq \top$. Such a “third” truth value $p$ is proscribed by excluded middle.
The puzzle is to explain how the following two facts fit together:
 “There is no intermediate truth value” is an intuitionistic theorem.
 There are models of intuitionistic logic with many truth values.
Mind you, excluded middle says “every truth value is either $\bot$ or $\top$” while we are saying that there is no truth value different from $\bot$ and $\top$:
$$\lnot \exists p : \Omega \,.\, (p \neq \top) \land (p \neq \bot).$$
Coq proves this:
Theorem no_intermediate: ~ exists p : Prop, ~ (p <-> True) /\ ~ (p <-> False).
Proof. firstorder. Qed.
Note that we replaced $=$ with $\leftrightarrow$ because equality on $\Omega$ is equivalence on $\mathsf{Prop}$ in Coq (this would not be necessary if we used HoTT where h-propositions enjoy an extensionality principle). You should also try proving the theorem by yourself, it is easy.
A model of intuitionistic mathematics with many truth values is a sheaf topos $\mathsf{Sh}(X)$ over a topological space $X$, so long as $X$ has more than two open sets. The global points of the sheaf of truth values $\Omega$ are the open subsets of $X$, and more generally the elements of $\Omega(U)$ are the open subsets of $U$.
So, if we take an intermediate open set $\emptyset \subset U \subset X$, should it not be an intermediate truth value? Before reading on you should stop and think for yourself.
Really, stop reading.
Let us calculate which open set (truth value) is
$$(U \neq \top) \land (U \neq \bot).$$
Because $U = \top$ is equivalent to $U$ and $U = \bot$ to $\lnot U$ our formula is equivalent to
$$\lnot U \land \lnot\lnot U.$$
Remembering that negation is topological exterior we get
$$\mathsf{Ext}(U) \cap \mathsf{Ext}(\mathsf{Ext}(U))$$
which is empty,
$$\emptyset.$$
Indeed, $U$ is not a counterexample!
We have here a typical distinction between internal and external language:
 The mathematicians inside the topos think and speak locally. They ask not “Is this statement true?” but “Where is this statement true?” If you ask them a yes/no question they will answer by giving you an open subset of $X$. They will conclude that $U \neq \top$ holds on the exterior of $U$, and $U \neq \bot$ on the double exterior of $U$, and that nowhere are they both true.
 The mathematicians outside the topos (that’s us) understand $(U \neq \top) \land (U \neq \bot)$ differently: it is about comparing the open set $U$ to the open sets $X$ and $\emptyset$ as elements of the topology of $X$. For them “yes” and “no” are valid answers, and no other.
By the way, the mathematicians on the inside also think that “yes” and “no” are valid answers, and there is no other – that is precisely the statement “there is no intermediate truth value” – but they think of it as “holding everywhere on $X$”.
There are of course many situations where the difference between internal and external language may create confusion. For example, if $X = \mathbb{N}$ is a countable discrete space then the object of natural numbers is the sheaf of all maps into $\mathbb{N}$, of which there are uncountably many. Thus on the outside we “see” that there are uncountably many natural numbers in $\mathsf{Sh}(X)$. Those on the inside would of course disagree. This is an amusing situation, sort of a reverse of Skolem’s paradox.
Formally verifying the complexity of OCaml programs with CFML - interlude — GaGallium (Armaël Guéneau), Jul 30, 2015
This small interlude post is not about Coq nor CFML: I present a data structure from Okasaki's Purely functional data structures, implemented in OCaml.
It comes as a preliminary explanation for an incoming formalization of it in CFML (see next post), including a complexity analysis!
The structure we decided to study is Okasaki's "binary random access list". It is a functional data structure which features the usual list operations (cons and uncons for adding and removing an element at the head of the list), but also random access lookup and update. That is, you can add or remove an element at the head of the list, but also modify or query the \(i^{th}\) element of the structure.
These four operations perform in worst-case \(O(log(n))\) steps, where \(n\) is the number of elements stored in the structure.
Let us see step by step how it is implemented in OCaml. The source code for the complete implementation can be found here.
Type definitions, implicit invariants
A binary random access list is a list of binary trees: we first define a tree OCaml type.
type 'a tree = Leaf of 'a | Node of int * 'a tree * 'a tree
Notice that only the leaves store elements. Nodes contain an integer corresponding to the number of elements stored (in the leaves) in the tree, which makes a size function trivial to implement:
let size = function
 Leaf x > 1
 Node (w, _, _) > w
Now, a binary random access list is a list of either a tree or nothing. We consequently define a "tree option" type, here dubbed "digit".
type 'a digit = Zero | One of 'a tree
The name "digit" comes of the similarity between a binary random access list and a list of bits, representing an integer  adding an element at the head being similar to incrementing the integer, etc. We'll see more of that later.
Finally, we define the type for the whole structure: the binary random access list.
type 'a rlist = 'a digit list
A valid binary random access list should satisfy some additional invariants:
Trees are complete: a tree of depth \(d\) always has \(2^d\) leaves;
Any rlist contains trees of increasing depth, starting at some depth \(p\): if the \(i^{th}\) cell (indexing from \(0\)) contains a tree (is of the form One t), then this tree has depth \(p+i\).
A complete binary random access list is an rlist with starting depth \(p\) equal to \(0\): its first tree, if present, is only a leaf.
To sum up, a binary random access list is a list which stores in its \(i^{th}\) cell either nothing or a complete binary tree of depth \(i\).
As an example, a binary random access list storing the sequence \(1, 2, 3, 4, 5\) can be represented as:
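As a sketch derived from the invariants above, that representation can be written out as an OCaml value. Five is 101 in binary (least significant bit first: One, Zero, One), so the list holds a leaf, an empty cell, and a complete tree of four elements:

```ocaml
(* Sketch: the rlist holding 1, 2, 3, 4, 5, written out as a value.
   Derived from the definitions in the post. *)
type 'a tree = Leaf of 'a | Node of int * 'a tree * 'a tree
type 'a digit = Zero | One of 'a tree
type 'a rlist = 'a digit list

let example : int rlist =
  [ One (Leaf 1);                                   (* depth 0: 1 element  *)
    Zero;                                           (* depth 1: no tree    *)
    One (Node (4, Node (2, Leaf 2, Leaf 3),
                  Node (2, Leaf 4, Leaf 5))) ]      (* depth 2: 4 elements *)
```

Note how each cell either is Zero or carries a complete tree whose depth equals the cell's index, exactly as the invariants demand.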
Binary random access lists look like integers in binary
As I mentioned before, a binary random access list is in fact quite similar to an integer represented in binary, i.e. as a list of bits.
Actually, if you erase the trees from the implementation (type 'a digit = Zero | One of 'a tree becomes type digit = Zero | One), you obtain an implementation of integers represented as a list of bits (least significant bit at the head of the list); the cons operation is incrementing and uncons is decrementing (lookup and update do not make much sense, though).
How do you increment an integer? You look at the least significant bit; if it's a zero, you turn it into a one; if it's a one, you turn it into zero, and recursively continue to increment, starting with the next bit.
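That bit-only version can be sketched in a few lines (the names incr_bits and to_int are mine, not from the post):

```ocaml
(* A bit list, least significant bit first: [One; Zero; One] is 5. *)
type digit = Zero | One

(* Increment: flip a Zero to One; a One becomes Zero and the carry
   propagates to the next bit. *)
let rec incr_bits = function
  | [] -> [One]
  | Zero :: bs -> One :: bs
  | One :: bs -> Zero :: incr_bits bs

(* Interpret a bit list as an integer, for checking. *)
let to_int bs =
  List.fold_right
    (fun b acc -> 2 * acc + (match b with Zero -> 0 | One -> 1))
    bs 0
```

Incrementing [One; Zero; One] (five) yields [Zero; One; One] (six), mirroring the carry propagation that cons_tree performs with trees.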
"Consing" to a binary random access list is similar, except that we have to handle a bit more information: the elements of the structure, stored in the trees.
Instead of adding \(1\), we add a tree t (more precisely, a digit One t): if the first element of the list is Zero, we turn it into One t. If it is One t', we turn it into Zero and recursively continue, but with a new tree: Node (size t + size t', t, t'). This corresponds to the link operation, which combines two trees of depth \(d\) into one of depth \(d+1\).
let link t1 t2 =
  Node (size t1 + size t2, t1, t2)
The OCaml implementation follows:
let rec cons_tree t = function
  | [] -> [One t]
  | Zero :: ts -> One t :: ts
  | One t' :: ts -> Zero :: cons_tree (link t t') ts

let cons x ts =
  cons_tree (Leaf x) ts
The uncons operation follows the same idea: it's just like decrementing, except that you also have to invert the link operation to obtain trees of smaller depth.
let rec uncons_tree = function
  | [] -> raise Empty
  | [One t] -> (t, [])
  | One t :: ts -> (t, Zero :: ts)
  | Zero :: ts ->
    match uncons_tree ts with
    | Node (_, t1, t2), ts' -> (t1, One t2 :: ts')
    | _ -> assert false

let head ts =
  match uncons_tree ts with
  | (Leaf x, _) -> x
  | _ -> assert false

let tail ts =
  let (_, ts') = uncons_tree ts in ts'
Unconsing an rlist of starting depth \(p\) always returns a tree of depth \(p\) (or fails if the list is empty). In particular, as the binary random access list manipulated by the user always starts at depth \(0\), we can assume in the implementation of head that the unconsed tree is a leaf.
Random access
Random access lookup and update are quite easy to implement.
The idea is to walk through the list; faced with a Node (w, l, r) tree, we know how many elements it contains: exactly w. Knowing the index \(i\) of the element we want to visit, we can compare it to w to decide whether we should explore this tree (if \(i < w\)), or look for element \(i - w\) in the rest of the list.
let rec lookup i = function
  | [] -> raise (Invalid_argument "lookup")
  | Zero :: ts -> lookup i ts
  | One t :: ts ->
    if i < size t
    then lookup_tree i t
    else lookup (i - size t) ts
If we have to walk through the tree, we can also do this without performing an exhaustive exploration: by comparing the index to half the size of the tree, we can decide whether we should explore the left or right subtree.
let rec lookup_tree i = function
  | Leaf x -> if i = 0 then x else raise (Invalid_argument "lookup")
  | Node (w, t1, t2) ->
    if i < w/2
    then lookup_tree i t1
    else lookup_tree (i - w/2) t2
The update function works in a similar fashion.
let rec update_tree i y = function
  | Leaf x -> if i = 0 then Leaf y else raise (Invalid_argument "update")
  | Node (w, t1, t2) ->
    if i < w/2
    then Node (w, update_tree i y t1, t2)
    else Node (w, t1, update_tree (i - w/2) y t2)
let rec update i y = function
  | [] -> raise (Invalid_argument "update")
  | Zero :: ts -> Zero :: update i y ts
  | One t :: ts ->
    if i < size t
    then One (update_tree i y t) :: ts
    else One t :: update (i - size t) y ts
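Putting the pieces together, here is a self-contained recap of the structure with a short usage sketch (lightly compressed from the snippets above; the sample values are mine):

```ocaml
(* Self-contained recap of the binary random access list, with a demo. *)
exception Empty

type 'a tree = Leaf of 'a | Node of int * 'a tree * 'a tree
type 'a digit = Zero | One of 'a tree
type 'a rlist = 'a digit list

let size = function Leaf _ -> 1 | Node (w, _, _) -> w
let link t1 t2 = Node (size t1 + size t2, t1, t2)

(* Cons: like incrementing a binary number, carrying with link. *)
let rec cons_tree t = function
  | [] -> [One t]
  | Zero :: ts -> One t :: ts
  | One t' :: ts -> Zero :: cons_tree (link t t') ts

let cons x ts = cons_tree (Leaf x) ts

(* Uncons: like decrementing, borrowing by splitting a Node. *)
let rec uncons_tree = function
  | [] -> raise Empty
  | [One t] -> (t, [])
  | One t :: ts -> (t, Zero :: ts)
  | Zero :: ts ->
    (match uncons_tree ts with
     | Node (_, t1, t2), ts' -> (t1, One t2 :: ts')
     | _ -> assert false)

let head ts = match uncons_tree ts with Leaf x, _ -> x | _ -> assert false
let tail ts = snd (uncons_tree ts)

(* Random access: skip whole trees by size, then descend by halves. *)
let rec lookup_tree i = function
  | Leaf x -> if i = 0 then x else invalid_arg "lookup"
  | Node (w, t1, t2) ->
    if i < w / 2 then lookup_tree i t1 else lookup_tree (i - w / 2) t2

let rec lookup i = function
  | [] -> invalid_arg "lookup"
  | Zero :: ts -> lookup i ts
  | One t :: ts ->
    if i < size t then lookup_tree i t else lookup (i - size t) ts

let rec update_tree i y = function
  | Leaf _ -> if i = 0 then Leaf y else invalid_arg "update"
  | Node (w, t1, t2) ->
    if i < w / 2 then Node (w, update_tree i y t1, t2)
    else Node (w, t1, update_tree (i - w / 2) y t2)

let rec update i y = function
  | [] -> invalid_arg "update"
  | Zero :: ts -> Zero :: update i y ts
  | One t :: ts ->
    if i < size t then One (update_tree i y t) :: ts
    else One t :: update (i - size t) y ts

(* Build the list 1, 2, 3, 4, 5 and exercise the operations. *)
let l = List.fold_right cons [1; 2; 3; 4; 5] []
```

With this, head l is 1, lookup 4 l is 5, and update 2 10 l replaces the third element without disturbing the rest, all in logarithmic time.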
Next time
Final post incoming, where I present a formalization of this exact OCaml implementation, using CFML and the time-credits-related functions I presented in the first post. Stay tuned!
Welcome to the Ocsigen Blog! — Ocsigen blog (The Ocsigen Team), Jul 29, 2015
The Ocsigen project finally has a blog. We are planning to publish news from the wider Ocsigen community, tutorials, and more.
Ocsigen encompasses the building blocks that you need in order to efficiently develop Web applications in OCaml. Among other libraries and tools, Ocsigen provides:
- js_of_ocaml, which is a compiler from OCaml to JavaScript;
- Ocsigen Server, which is a web server implemented in OCaml;
- Lwt, which is a cooperative thread library for OCaml;
- TyXML, which is a library for generating valid HTML and SVG content;
- last but not least, Eliom, which puts everything together to provide a modern language for developing web applications, with tight integration between client and server.
Happy hacking, and stay tuned!
Weekly News — OCaml Weekly News, Jul 28, 2015
Fun With Opam: Advice to My Past Self — Mindy Preston, Jul 27, 2015
Most instructions on how to get started with OCaml packages now advise the user to get started with opam, which is excellent advice. Getting up and running with opam is pretty easy, but I wasn’t sure where to go from there when I wanted to modify other people’s packages and use the modifications in my environment. I wish I’d realized that the documentation for making packages has a lot of applicable advice for that use case, as well as the apparent target (making your own packages from scratch).
Most of the documentation on opam seems to focus on its use as a dependency manager, but it’s also capable of providing many virtual environments via opam switch. Often buried in the documentation is the very useful opam switch --alias-of, which one can use to make new virtual environments using a standard OCaml compiler. Most of my current switches are based off the 4.02.1 compiler (just a bit short of up-to-date), so I make new switches for projects with opam switch new_project --alias-of 4.02.1.
I wish, also, that I’d started writing opam files for my own projects and installing them locally via opam pin sooner. The earlier I do this, the less likely I am to publicize a repository with missing or broken dependencies, and it reduces the friction of sharing early-stage code with other people. Combined with the switch-per-project approach, having opam files ready means you can easily have fresh switches with the minimal amount of dependencies installed; you can also install a minimal set of reverse-dependencies (if any!), to limit the amount of time you spend waiting for downstream packages to recompile when you want to test a change. I have great sorrow when I think back on the number of times I waited through a complete recompile of all zillion packages that depend on mirage-types when I really only wanted to test tcpip; a switch that doesn’t have mirage-console-unix and friends installed won’t try to recompile them, and I won’t have to spend any minutes of my only life waiting for that compilation to finish.
Having a lot of switches does lead to some confusion, since executing opam switch results in a filesystem change in the user’s .opam directory but can’t enforce the change in all open terminals. The problem is compounded by the requirement to eval $(opam config env) after running opam switch — if the user forgets to do this, opam switch will report the expected branch, but any commands that try to use the configuration will use the values for the previous switch. After having been caught out by this a few times, I caved and customized my command prompt to report on the current actual switch (along with the current git branch):
function gitcurrentbranch {
  git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1) /'
}
function opamsw {
  # this is pretty brittle but good enough
  # have to adjust last cut if .opam is not /home/username/.opam or similar depth
  echo $CAML_LD_LIBRARY_PATH | cut -d: -f 1 | cut -d'/' -f 5
}
export PS1="[\$(opamsw)] \$(gitcurrentbranch)$PS1"
It’s embarrassing to admit how many times I was confused by the apparently nonsensical output of make and opam install before I made this change, but at least that’s decreased since I started showing myself this information by default.
Something that still trips me up, though, is the setup.data file created by projects that use oasis to generate build files. setup.data contains cached information on how to build projects, including full paths to compilers and libraries, which isn’t updated if you use opam switch. When I began writing this post, my “solution” for this was compulsively removing setup.data every time there was a weird-looking problem building a project that uses oasis, but writing this has inspired me to do better by making another adjustment:
function setupdata {
  grep -E "^standard_library=" setup.data 2>/dev/null | cut -d'/' -f5
}
function opamsw {
  setup=$(setupdata)
  switch=$(echo $CAML_LD_LIBRARY_PATH | cut -d: -f 1 | cut -d'/' -f 5)
  if [ -n "$setup" ] && [ "$switch" != "$setup" ]
  then
    echo "(!!! setup.data $switch)"
  else
    echo $switch
  fi
}
export PS1="[\$(opamsw)] \$(gitcurrentbranch)$PS1"
Now my prompt will warn me when setup.data and my current switch conflict, as they do in the following example where I last built mirage-tcpip in a different switch:
[4.02.1] user@computer:~$ cd mirage-tcpip
[(!!! setup.data 4.02.1)] (separate_arp) user@computer:~/mirage-tcpip$
This hasn’t saved me any grief yet, but I’m sure it will soon enough.
OCaml 4.02.3 released — Caml INRIA, Jul 27, 2015
Organized chaos: managing randomness — Open Mirage (David Kaloper), Jul 22, 2015
This post gives a bit of background on the Random Number Generator (RNG) in the recent MirageOS v2.5 release.
First we give some background about why RNGs are critical for security. Then we try to clarify the often-confused concepts of "randomness" and "entropy", as used in this context. Finally, we explore the challenges of harvesting good-quality entropy in a unikernel environment.
Playing dice
Security software must play dice.
It must do so to create secrets, for example. Secrets can then serve as the keys that protect communication, like the Diffie-Hellman key exchanged between two TLS endpoints. Proof of the knowledge of a particular secret can be used to verify the identity of someone on the Internet, as in the case of verifying the possession of the secret RSA key associated with an X.509 certificate. As an attacker guessing a secret can have disastrous consequences, it must be chosen in a manner that is realistically unpredictable by anyone else — we need it to be random.
There are other reasons to use randomness. A number of algorithms require a
unique value every time they are invoked and badly malfunction when this
assumption is violated, with random choice being one way to provide a value
likely to be unique. For example, repeating the k parameter
in DSA digital signatures compromises the secret key, while reusing a
GCM nonce negates both confidentiality and authenticity. Other
algorithms are probabilistic, in that they generate random values before
operating on an input, and then store the chosen values in the output, such as
the OAEP padding mode for RSA. This is done in order to confuse the
relationship between the input and the output and defeat a clever attacker who
tries to manipulate the input to gain knowledge about secrets by looking at the
output. Still other algorithms pick random numbers to internally change their
operation and hide the physical amount of time they need to execute, to avoid
revealing information about the secrets they operate on. This is known as
blinding, and is one way to counter timing side-channel
attacks.
Randomness is therefore quite pervasive in a security context. In fact, many cryptographic algorithms are designed under the assumption of a readily available source of randomness, termed a random oracle. The security analysis of those algorithms is conditional on the oracle; we know that they have certain security characteristics, like the difficulty of guessing the correct message or impersonating somebody, only given an ideal random oracle.
And so security software has a problem here. Computers are inherently deterministic, made to behave reproducibly given a known program and starting state. How to go about solving this?
Random failures
Before taking a look at how we try to solve this problem, let's instead consider what happens if we fail to do so. There is even a Wikipedia page about this, which is a nice starting point. Some of the highlights:
The first public release of Netscape's original SSL, version 2.0, was broken several months after its release. The weakness was in initializing the RNG with the current time, the process ID and the parent process ID of the browser. The time stamp can be guessed to a certain precision, leaving only its subsecond part and the two PIDs unknown. This relatively small unknown space of initial values can be brute-forced.
About a decade later, Debian patched their version of OpenSSL and reduced RNG initialization to the current PID. As a result, only 32767 random sequences were possible. This flaw went undetected for two years and became known as the Debian fiasco. Personal reports indicate that some of the 32767 distinct secret keys that could be generated with OpenSSL on a Debian system during that time are still in circulation.
Computing the greatest common divisor of a pair of numbers is much faster than discovering all the prime divisors of a particular number. RSA public keys contain a number, and secret keys contain its factors. An RSA key is usually generated by randomly picking the factors. If a pool of keys was generated with a heavily biased random number generator, such that factors are likely to repeat, it is possible to search for common factors in all pairs and crack the affected keys, a technique which produces spectacular results.
Recently, a bitcoin application for Android was discovered to be downloading its random initial value from a website. It wasn't even necessary to intercept this unencrypted traffic, because the website started serving a redirect page and the Android application was left initializing its RNG with the text of the redirection message. It therefore started generating the same private ECDSA key and the associated bitcoin address for every affected user, an issue which reportedly cost some users their bitcoins.
Playstation 3 game signatures can be forged. Sony reused a single k parameter, which is supposed to be "unique, unpredictable and secret", for every ECDSA signature they made. This led to complete compromise of the signing keys. Admittedly, this is not really an RNG problem in itself, but it shows where such a malfunction can lead.
These are only some of the most spectacular failures related to random numbers. For example, it is widely known in security circles that RNGs of embedded devices tend to be predictable, leading to widespread use of weak keys on routers and similar equipment, amongst other things. So when implementing a unikernel operating system, you don't want to end up on that Wikipedia page either.
Random sequences and stuff
But what are random numbers, really? Intuitively, we tend to think about them as somehow "dancing around", or being "jiggly" in a sense. If we have a software component that keeps producing random outputs, these outputs form a sequence, and we hope this to be a random sequence.
But such a thing is notoriously difficult to define. The above linked page opens with the following quote:
A random sequence is a vague notion... in which each term is unpredictable to the uninitiated and whose digits pass a certain number of tests traditional with statisticians.
The intuitive jiggliness is captured by statistical
randomness. We require each output, taken
independently, to come from the same distribution (and in fact we want it to be
the uniform distribution). That is, when we take a long sequence of outputs, we
want them to cover the entire range, we want them to cover it evenly, and we
want the evenness to increase as the number of outputs increases — which
constitutes a purely frequentist definition of randomness. In addition, we want
the absence of clear patterns between outputs. We don't want the sequence to look like 7, 8, 9, 10, ..., even with a bit of noise, and we don't want correlation between outputs. The problem here is that no one really knows what "having patterns" means; it is entirely possible that we have only searched for patterns that are too simple, and that in fact there is a pattern that fully explains the sequence lurking just around the complexity corner.
Nonetheless, there is a well-established battery of tests to check the statistical randomness of RNG outputs, called the Diehard tests, which serves as the de facto standard for testing random number generators. Here's the beginning of a certain sequence that passes the tests with flying colors:
3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6, 4, 3, 3, 8, 3, ...
We still would not recommend using digits of π as a secret key. Neither would we recommend releasing software for everyone to study which uses that sequence to generate its secrets. But what went wrong?
This is the other concept of randomness: roughly, a random sequence should not be predictable to anyone with any knowledge other than the sequence itself. In other words, it cannot be compressed no matter how hard we try, and in the extreme, this means that it cannot be generated by a program. While the latter restriction is obviously a little too strong for our purpose, it highlights a deep distinction in what people mean by "random". Jumping around is one thing. Being actually unpredictable is a wholly different matter.
There are many other simple mathematical processes which generate sequences with high statistical randomness. Many of these are used to produce "random" sequences for various purposes. But they are still completely deterministic processes that exhibit random behaviour only in the statistical sense. Instead of being random, they are pseudo-random, and we call such generators Pseudo-Random Number Generators (PRNGs).
We can look for something approaching a "true" random sequence in nature. The current agreement is that the nature of quantum processes is random in this sense, and random sequences based on this idea are readily available for download. Or we can use the microphone and keep recording; the lowestorder bits of the signal are pretty unpredictable. But we cannot write a program to generate an actually random sequence.
Still, we need to compromise. The real problem with common PRNGs is that knowing the rule and observing some of the outputs is enough to predict the rest of the sequence. The entire future behavior of the Mersenne twister, one of the most commonly used generators in various programming packages, can be predicted after observing only 624 outputs in a row. A step up from such a process is a Cryptographically Secure Pseudo-Random Number Generator (CSPRNG). Their key property is that it is computationally prohibitively expensive to distinguish their outputs from a "true" random sequence. This also means that it is computationally prohibitively expensive to reconstruct their internal state just by looking at their outputs. In a sense, someone trying to predict the outputs cannot take shortcuts, and is instead forced to perform the laborious task of starting the generator with all the possible states and checking if the output matches the observed sequence. This is how we can quantify a CSPRNG's unpredictability: it takes trying about half of all the possibilities to guess the state.
MirageOS' security stack contains a CSPRNG, a design called Fortuna.
What it really does is encrypt the simple sequence 0, 1, 2, 3, ... with AES (AES-CTR) using a secret key. This makes it as resistant to prediction as AES is to known-plaintext attacks. After each output, it generates a bit more, hashes that, and uses the result as the next key. This is not to improve the statistical randomness, as that is already guaranteed by AES. Rather, it's a form of forward secrecy: an attacker who learns the secret key at some point would need to perform a preimage attack on the hash function to figure out the earlier key and reconstruct the earlier outputs.
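As a toy illustration only (not MirageOS code, and with OCaml's Hashtbl.hash standing in for both AES and the hash function), the generate-then-rekey flow looks like this:

```ocaml
(* Toy illustration of a counter-mode generator with rekeying.
   Hashtbl.hash stands in for the block cipher and the hash function;
   this is NOT cryptographically secure, it only shows the state flow. *)
type gen = { key : int; ctr : int }

(* "Encrypt" the counter under the key: one pseudo-random output. *)
let block g = Hashtbl.hash (g.key, g.ctr)

(* Produce one output, then generate one more block, hash it, and use
   the result as the next key: an attacker who later learns the key
   would have to invert the hash to recover earlier outputs. *)
let next g =
  let out = block g in
  let g = { g with ctr = g.ctr + 1 } in
  let key' = Hashtbl.hash (block g) in
  (out, { key = key'; ctr = g.ctr + 1 })
```

The generator is deterministic for a given seed key, but each call discards the old key, which is the forward-secrecy property the paragraph above describes.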
Entropy
Although resistant to prediction based solely on the outputs, just like any other software RNG, Fortuna is still just a deterministic PRNG. Its entire output is as unpredictable as its initial value, which we call the seed. From the information perspective, a PRNG can only transform what was unpredictable about its initial seed into an equally unpredictable sequence. In other words, we typically use PRNGs to stretch the unpredictability inherent in the initial seed into an infinite stream. The best PRNGs do not give out more hints about their starting position, but they can never outrace the amount of unpredictability that they started with.
We often call this quality of unpredictability entropy. In a sense, by employing an algorithmic generator, we have just shifted the burden of being unpredictable to the beginning. But now we're cornered and have to search for entropy in the only place where a computer can find it: in the physical world.
A typical (kernel-level) RNG system reaches out into the world around it through hardware interaction: as hardware events happen, various drivers tend to emit small packets of data, such as the time, or hardware-specific state. These events are a product of user interactions with the keyboard and mouse, of network packets arriving at an interface, of the hard drive asserting interrupts to signal the end of a DMA transfer, and the like. They are combined together and used to seed the internal (CS)PRNG.
In fact, describing them as a seed from which the entire sequence is unfolded is a deliberate oversimplification: what really happens is that the PRNG is continuously fed with random events, which change its state as they arrive, and the requests for random bytes are served from the PRNG. The PRNG is used to "mix" the unpredictability inherent in its input, that is, to smooth out various timestamps and similar values into a statistically wellbehaved sequence.
Do Virtual Machines Dream of Electric Sheep?
Our problem here is that a virtual machine (VM) in a typical configuration barely sees any physical hardware. Users do not interact with VMs in server scenarios using a directly-connected keyboard and mouse. VMs make use of a virtualized network interface and virtualized disks. Even the CPU features can be intercepted and virtualized. Virtual environments are entropy-starved.
This is a known problem and various analyses of the weakness of random outputs in virtual environments have been published. The problem is especially severe right after boot. The gradual trickle of unpredictability from hardware events slowly moves the pseudorandom stream into an increasingly unpredictable state, but at the very start, it still tends to be fairly predictable. Typically, operating systems store some of their PRNG output on shutdown and use it to quickly reseed their PRNG on the next boot, in order to reuse whatever entropy was contained in its state. Unfortunately, it is common to boot several machines from the same system image, or from a pristine image lacking a seed, making random outputs in a virtual machine vulnerable to prediction close to the startup phase.
To help solve these problems, we employ several sources of entropy in MirageOS
unikernels. The case of a Unix executable is simple, as we reuse the system's
own RNG, as exposed via /dev/urandom, as the source of our entropy. This is
because the kernel is in a much better position to enter an unpredictable state
than any single process running under its supervision. The case of Xen
unikernels is harder. Here, we group the entropy sources into those that
originate within the unikernel itself, and those that originate externally.
In the external case, we again rely on the kernel interacting with the hardware, but this time it's the dom0 kernel. We have a background service, Xentropyd, which runs in dom0, reads the RNG and serves its output to other domains through the Xen Console. The problem is that in many scenarios, like hosting on popular cloud providers, we cannot expect this degree of cooperation from dom0. A bigger problem is that although most of the code is present, we haven't fully fleshed out this design, and it remains disabled in MirageOS 2.5.0.
So we need to be able to achieve unpredictability relying purely on what is available inside a unikernel. A unikernel has no direct exposure to the hardware, but it is of course interacting with the outside world. To tap into this ambient entropy, we have to continuously sample all inter-event timings in its event loop. This process is analogous to what happens in a full-blown OS kernel, except our events lack the extra hardware context, and our timers are potentially less granular (for example, on ARM). This makes our interaction-based events somewhat more predictable, or in other words, they have a little less entropy.
Recent Intel chips come with an on-die random generator, which ultimately derives from thermal readings, and is available through the RDRAND and (more directly) RDSEED instructions. The community has expressed concern that
relying exclusively on this generator might not be a wise choice: it could
silently malfunction, and its design is hidden in the hardware, which raises
concerns about potential intentional biases in the output — a scheme not
unheard of. However, since entropy is additive, its output can never
reduce whatever unpredictability the system already has. Therefore, if
available, we continuously sample this on-die RNG, and inject its outputs into
our PRNG.
The combination of event timings and a built-in RNG does have good unpredictability in the long run, especially if our unikernel is running on a multi-tenant host and competing for CPU with other instances. But the entropy in each individual event is still relatively low: we can assume that a determined attacker can guess each individual timestamp up to a certain precision that we don't know, but which is potentially quite high. This creates the following problem: imagine that an attacker knows the current PRNG state, and can measure the time of the next event, but not with sufficient precision to know the last two bits of the timestamp. To this attacker, our event contains two bits of entropy. If we immediately update the PRNG, the attacker only has to observe some of the output and check four candidate states against it to fully recover knowledge of the state and negate our entropy addition. On the other hand, if we decide to wait and accumulate many more events before updating the PRNG, we keep generating a fully predictable sequence in the meantime.
And here is where Fortuna really shines. It keeps accumulating events in a number of internal pools, in a round-robin fashion. These pools are constantly being activated, but with exponentially decreasing frequency. The pools activated too frequently are wasted, but one of them is activated with just the right frequency to contain enough entropy to make it prohibitively expensive for an attacker to enumerate all the possibilities. This design was shown to be within a constant factor of optimal entropy use and, in particular, scales robustly with the actual amount of entropy inherent in the input events.
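That activation schedule can be paraphrased as: at the k-th reseed, pool i contributes exactly when 2^i divides k, so pool 0 is drained every time, pool 1 every other time, pool 2 every fourth time, and so on. A sketch of the schedule only (my paraphrase of Fortuna's design, not the MirageOS implementation, which also hashes the pool contents):

```ocaml
(* Which pools feed the k-th reseed in a Fortuna-style accumulator:
   pool i is used when 2^i divides k. Because divisibility by 2^i
   implies divisibility by every smaller power, the result is always
   a prefix [0; 1; ...; v] where v counts the trailing zero bits of k. *)
let pools_for_reseed ~num_pools k =
  let rec go i acc =
    if i >= num_pools then List.rev acc
    else if k mod (1 lsl i) = 0 then go (i + 1) (i :: acc)
    else List.rev acc
  in
  go 0 []
```

Higher-numbered pools are therefore drained rarely, giving them time to accumulate enough entropy between uses to defeat the state-guessing attack described above.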
This leaves us with the problem of boot-time entropy. Not only can the saved random seed be reused by cloning the disk image, but in many cases a MirageOS unikernel runs without such storage at all!
Following the design of the Whirlwind RNG, we employ an entropy bootstrapping loop. It is an iterated computation which measures the time it took to perform the previous iteration, and then performs an amount of work that depends on that time, many times over. In this way, it creates a feedback loop with a fragile dependency on any non-determinism in the physical execution on the CPU, such as contention or races in the CPU state. Even on ARM, which currently uses a less fine-grained timer and whose design is not as parallel as Intel's, this yields an initial value which varies wildly between boots. We use this value to kickstart the PRNG, giving it quick divergence and ensuring that the state is unpredictable from the very start.
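A toy version of such a bootstrapping loop might look as follows (my sketch, not the MirageOS code; Sys.time and Hashtbl.hash are coarse stand-ins for the cycle counter and the mixing function a real implementation would use):

```ocaml
(* Toy sketch of a Whirlwind-style bootstrap loop: each iteration
   times the previous one and does a timing-dependent amount of work,
   so physical execution jitter feeds back into the accumulator.
   Sys.time is a coarse stand-in for a fine-grained cycle counter. *)
let bootstrap rounds =
  let acc = ref 0 in
  let last = ref (Sys.time ()) in
  for _round = 1 to rounds do
    let now = Sys.time () in
    let dt = int_of_float ((now -. !last) *. 1e9) in
    last := now;
    (* the amount of work depends on the observed timing ... *)
    let work = 1 + abs dt mod 1000 in
    for i = 1 to work do
      (* ... and the work itself perturbs the accumulated value *)
      acc := Hashtbl.hash (!acc, i, dt)
    done
  done;
  !acc
```

The point is the feedback shape, not the mixing quality: how long each round takes determines how much work the next round does, which in turn changes the next timing.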
Parting words
While some of our techniques (in particular bootstrapping on ARM) need a little more exposure before we place our full confidence in them — and users should probably avoid generating long-term private keys in unikernels running on bare Xen just yet — the combination of bootstrapping, continuous reseeding, and robust accumulation gives us a hopefully comprehensive solution to generating randomness in a unikernel environment.
We intend to re-evaluate the effectiveness of this design after getting some experience with how it works in the wild. To this end, we would particularly appreciate community feedback; you can reach us through our mailing list, or hop onto freenode and join #mirage.
Thanks to Daniel, Mort and Amir for their comments on earlier drafts.
Weekly News — OCaml Weekly News, Jul 21, 2015
Building the LLVM Fuzzer on Debian. — Erik de Castro Lopo, Jul 21, 2015
I've been using the awesome American Fuzzy Lop fuzzer since late last year but had also heard good things about the LLVM Fuzzer. Getting the code for the LLVM Fuzzer is trivial, but when I tried to use it, I ran into all sorts of road blocks.
Firstly, the LLVM Fuzzer needs to be compiled with and used with Clang (GNU GCC won't work), and it needs to be Clang >= 3.7. Now, Debian does ship a clang-3.7 in the Testing and Unstable releases, but that package has a bug (#779785) which means the Debian package is missing the static libraries required by the Address Sanitizer options. Use of the Address Sanitizer (and the other sanitizers) increases the effectiveness of fuzzing tremendously.
This bug meant I had to build Clang from source, which, unfortunately, is rather poorly documented (I intend to submit a patch to improve this), and I only managed it with help from the #llvm IRC channel.
Building Clang from the git mirror can be done as follows:
mkdir LLVM
cd LLVM/
git clone http://llvm.org/git/llvm.git
(cd llvm/tools/ && git clone http://llvm.org/git/clang.git)
(cd llvm/projects/ && git clone http://llvm.org/git/compiler-rt.git)
(cd llvm/projects/ && git clone http://llvm.org/git/libcxx.git)
(cd llvm/projects/ && git clone http://llvm.org/git/libcxxabi)
mkdir -p llvm-build
(cd llvm-build/ && cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=$HOME/Clang/3.8 ../llvm)
(cd llvm-build/ && make install)
If all the above works, you will now have working clang and clang++ compilers installed in $HOME/Clang/3.8/bin and you can then follow the examples in the LLVM Fuzzer documentation.
OCaml server-side developer at Ahrefs Research (Full-time) — Functional Jobs (FunctionalJobs.com), Jul 20, 2015
Who we are
Ahrefs Research is the San Francisco branch of Ahrefs Pte Ltd (Singapore), which runs an internet-scale bot that crawls the whole Web 24/7, storing huge volumes of information to be indexed and structured in a timely fashion. On top of that, Ahrefs is building analytical services for end-users.
Ahrefs Research develops a custom petabyte-scale distributed storage system to accommodate all that data coming in at high speed, focusing on performance, robustness and ease of use. The performance-critical low-level part is implemented in C++ on top of a distributed filesystem, while all the coordination logic and the communication layer, along with the API library exposed to developers, are in OCaml.
We are a small team and strongly believe that better technology leads to better solutions for real-world problems. We worship functional languages and static typing, extensively employ code generation and metaprogramming, value code clarity and predictability, and constantly seek to automate repetitive tasks and eliminate boilerplate, guided by DRY and following KISS. If there is any new technology that will make our life easier, no doubt we'll give it a try. We rely heavily on open-source code (as the only viable way to build maintainable systems) and contribute back; see e.g. https://github.com/ahrefs . It goes without saying that our team is all passionate and experienced OCaml programmers, ready to lend a hand or explain that intricate ocamlbuild rule.
Our motto is "first do it, then do it right, then do it better".
What we need
Ahrefs Research is looking for a backend developer with a deep understanding of operating systems and networks, and a taste for simple and efficient architectural designs. Our backend is implemented mostly in OCaml with some C++, so proficiency in OCaml is very much appreciated; otherwise, a strong inclination to learn OCaml intensively in a short time will be required. An understanding of functional programming in general and/or experience with other FP languages (F#, Haskell, Scala, Scheme, etc.) will help a lot. Knowledge of C++ is a plus.
The candidate will have to deal with the following technologies on a daily basis:
 networks & distributed systems
 4+ petabytes of live data
 OCaml
 C++
 linux
 git
The ideal candidate is expected to:
 Independently deal with and investigate bugs, schedule tasks and dig into code
 Make well-argued technical choices and take responsibility for them
 Understand the whole technology stack at all levels: from network and userspace code to OS internals and hardware
 Handle the full development cycle of a single component, i.e. formalize the task, write code and tests, set up and support production (devops)
 Approach problems with a practical mindset and suppress perfectionism when time is a priority
These requirements stem naturally from our approach to development, with its fast feedback cycle, highly-focused personal areas of responsibility and strong tendency towards vertical component splitting.
What you get
We provide:
 Competitive salary
 Modern office in San Francisco SOMA (Embarcadero)
 Informal and thriving atmosphere
 First-class workplace equipment (hardware, tools)
 No dress code
Get information on how to apply for this position.
Introducing Incremental — Jane Street (Yaron Minsky), Jul 18, 2015
I'm pleased to announce the release of Incremental (well-commented mli here), a powerful library for building self-adjusting computations, i.e., computations that can be updated efficiently when their inputs change.
At its simplest, you can think of a self-adjusting computation as a fancy spreadsheet. In a spreadsheet, each cell contains either simple data or an equation that describes how the value in that cell should be derived from the values in other cells. Collectively, this amounts to a graph-structured computation, and one of the critical optimizations in Excel is that when some of the cells change, Excel only recomputes the parts of the graph that depend on those changed cells.
What makes self-adjusting computation (or SAC) different from a spreadsheet is its dynamism. The structure of the computational graph in an SAC can change at runtime, in response to the changing input data.
This dynamism gives you quite a bit of flexibility, which can be used in different ways. Here are a few.
Online combinatorial algorithms. Incremental is based on work by Umut Acar et al. on self-adjusting computations (that's where the term comes from); there, they were mostly interested in building efficient online versions of various combinatorial algorithms. In many cases, they could match the asymptotic complexity of custom online algorithms by fairly simple incrementalizations of all-at-once algorithms.
Incremental GUI construction. One simple and natural way to model a GUI application is to structure your display as a function that generates a view from some more abstract data model.
Having a function that constructs a view from scratch at every iteration is simple, but prohibitively expensive. But if you can write this function so that it produces an incrementalized computation, then you have a solution that is both simple and efficient. We've used this technique in a number of our UIs, to good effect.
This might remind you of how functional reactive programming (FRP) is used for the construction of GUIs in languages like Elm. SAC and FRP have different semantics: FRP is mostly concerned with time-like computations, while SAC is mostly about optimizing DAG-structured computations. But they are nonetheless closely related, especially at the implementation level. You can see my post here for a description of the broader conceptual landscape that includes both FRP and SAC.
Configurable computations. An example that comes from our own work is risk calculations. Calculating measures of risk of a portfolio involves combining data from a complex collection of interdependent models. Each of these models is dependent both on live data from the markets, and on configurations determined by users.
A config change could merely tweak a coefficient, or it could change the overall structure of the computation, say by changing the list of factors used by a given model. Incremental allows you to build a computation that can update efficiently in response to both simple data changes as well as more structural config changes, in one unified framework.
A taste of Incremental
It's hard to give a compelling example of Incremental in action in just a few lines of code, because what makes Incremental really useful is how it helps you build large and complex computations. Nonetheless, small examples can give you a sense of how the library works.
To that end, let's walk through a few small examples. To begin, we need to instantiate the Incremental functor.
 open Core.Std
 module Inc = Incremental_lib.Incremental.Make ()
Each instance thus generated is its own independent computational world. The Incremental functor is generative, meaning it mints fresh types each time it's applied, which prevents values from different incremental worlds from being mixed accidentally.
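Generativity can be demonstrated with a minimal stand-alone functor. The sketch below is a toy model, not Incremental's actual interface: each application of Make mints a fresh abstract type t, so values from the two instances cannot be mixed.

```ocaml
(* A minimal generative functor: each application [Make ()] creates a
   fresh abstract type [t], even though the body is identical. *)
module Make () : sig
  type t
  val create : int -> t
  val get : t -> int
end = struct
  type t = int
  let create x = x
  let get x = x
end

module M1 = Make ()
module M2 = Make ()

let a = M1.create 1
let b = M2.create 2

(* [M1.get b] would be a compile-time type error: M1.t and M2.t are
   distinct types.  Within each instance, values work as expected. *)
let () = assert (M1.get a + M2.get b = 3)
```

The `()` argument is what makes the functor generative: applying it twice yields two incompatible worlds, which is exactly the property Incremental relies on to keep separate computational worlds apart.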
An Incremental computation always starts at its variables. Modifications to variables are how updates to input data are communicated to Incremental.
Let's write down a few variables corresponding to the dimensions of a rectangular prism.
 module Var = Inc.Var
 (* dimensions of a rectangular prism *)
 let width_v = Var.create 3.
 let depth_v = Var.create 5.
 let height_v = Var.create 4.
We can use Var.watch
to get the (trivial) incremental computations associated with each variable.
 let width = Var.watch width_v
 let depth = Var.watch depth_v
 let height = Var.watch height_v
The following is an incremental computation of the base area of the prism, and of the volume.
 let base_area =
 Inc.map2 width depth ~f:( *. )
 let volume =
 Inc.map2 base_area height ~f:( *. )
In order to get information out of an incremental computation, we need to explicitly mark which nodes we want data from by creating observer nodes. Because it knows which nodes are observed, the framework can track what parts of the computation are still necessary to the results.
 let base_area_obs = Inc.observe base_area
 let volume_obs = Inc.observe volume
In order to force the computation to run, we need to explicitly call Inc.stabilize
. Here's some code that uses stabilize to run the computation and then gets the information from the observers.
 let () =
 let v = Inc.Observer.value_exn in
 let display s =
 printf "%20s: base area: %F; volume: %F\n"
 s (v base_area_obs) (v volume_obs)
 in
 Inc.stabilize ();
 display "1st stabilize";
 Var.set height_v 10.;
 display "after set height";
 Inc.stabilize ();
 display "2nd stabilize"
If we run this, we'll see the following output:
       1st stabilize: base area: 15.; volume: 60.
    after set height: base area: 15.; volume: 60.
       2nd stabilize: base area: 15.; volume: 150.
Note that setting the height isn't enough to change the observed values; we need a stabilization to make that happen.
That's a fairly trivial computation, and there certainly isn't much to incrementalize. Let's try something a little more complicated: a function for merging together an array of incrementals, using some commutative and associative operator like addition or max.
 let rec merge ar ~f =
 if Array.length ar <= 1 then ar.(0)
 else
 let len = Array.length ar in
 let len' = len / 2 + len % 2 in
 let ar' =
 Array.init len' ~f:(fun i ->
 if i * 2 + 1 >= len then ar.(i*2)
 else Inc.map2 ar.(i*2) ar.(i*2+1) ~f)
 in
 merge ar' ~f;;
Because this is done using a binary tree as the dependency graph, the complexity of updating an element is log(n), where n is the size of the array. We can use this, for example, for computing an average:
 let average ar =
 let sum = merge ar ~f:(+.) in
 Inc.map sum ~f:(fun s -> s /. float (Array.length ar))
This works, but we can do better performance-wise, at least if our merge operation has an inverse. In that case, maintaining the sum can in principle be done in constant time, by first removing the old value before adding in the new one. Incremental has a function for taking advantage of this structure.
 let sum ar =
 Inc.unordered_array_fold ~f:(+.) ~f_inverse:(-.) ar;;
Now, let's say we want to do something a little more dynamic: in particular, what if we want to compute the average of a prefix of the given array? For that, we need to use the bind function, which allows us to produce new incremental nodes within an incremental computation.
 let average_of_prefix ar length =
 Inc.bind length (fun length ->
 average (Array.init length ~f:(fun i > ar.(i))))
The type of this function is float Inc.t array -> int Inc.t -> float Inc.t
, so the length of the prefix is a fully fledged part of the incremental computation. As a result, the dependency structure of this computation changes dynamically, e.g., if the value of length
is 7
, then the computation only depends on length
and the first 7
elements of the array.
Hopefully this gives you enough of a sense of what Incremental is about to start thinking about where it might be useful for you. Note that the overhead of Incremental is not inconsiderable: on my laptop, firing a single node takes on the order of 30ns, which is far more than, say, summing two numbers together. Incremental tends to be useful when the computation that is put into a single node is large relative to that overhead, or when the computational graph is large relative to the subgraph that needs to be recomputed. Our experience has been that there are plenty of applications in this space that can benefit from Incremental.
Formally verifying the complexity of OCaml programs with CFML, part 1 — GaGallium (Armaël Guéneau), Jul 18, 2015
In a recent paper, Arthur Charguéraud and François Pottier present a formal proof of an OCaml implementation of Tarjan's union-find algorithm. The proof covers two aspects: first, functional correctness ("the algorithm is correct"), but also asymptotic complexity. For example, one of the results of the paper is that the link
function runs in \(O(α(n))\) elementary steps, \(α\) being the inverse of the Ackermann function.
Actually, the complexity results of the paper are not presented using the "big-O" notation commonly used in asymptotic complexity proofs: the various constants are explicit and manipulated throughout the proof.
In these blog posts (current, interlude and final), I'll describe a tentative extension of CFML, the Coq library François and Arthur used for their proof. Its goal is to enable asymptotic reasoning, big-O notations and the like to be used to prove the complexity of programs.
CFML: verify OCaml programs in Coq
CFML is a Coq library for proving properties of OCaml programs. Its typical usage is as follows: (1) write some OCaml code; (2) use CFML's generator to parse the OCaml code and produce a "characteristic formula", expressed as a Coq axiom, that describes the semantics of the program; (3) import this formula and write a specification in Coq for the program; (4) prove the specification using CFML's tactics. The specifications are expressed using separation logic, a logical framework convenient for talking about mutable data.
Let's illustrate this by proving correctness of the incr
function (a function that increments a reference).
OCaml code:
let incr r =
r := !r + 1
After calling CFML's generator, step (2) produces a Coq source file containing the characteristic formula for incr
. This formula has the same shape and structure as the OCaml program, which will be useful to read the goals produced by the CFML tactics.
Parameter incr_cf :
tag tag_top_fun Label_default
Body incr r =>
(LetApp _x4 := ml_get r in
App ml_set r (_x4 + 1);)
The user doesn't need to read the produced file nor the characteristic formula, though. We'll just need to Require
it at the beginning of our proof.
Step (3) is writing a specification: a pair of a precondition, that must hold for the function to run, and a postcondition, describing the returned value and the state of the memory after executing the function. In our case, before running incr
, r
must be a reference containing some integer value i
; after running incr
, r
must contain i+1
.
Lemma incr_spec :
Spec incr r R>> forall (i: int),
R (r ~~> i) (# r ~~> (i+1)).
Some hairy notations are involved, but basically, (r ~~> i)
is the precondition, and (# r ~~> (i+1))
the postcondition. The ~~>
notation denotes a "pointsto" relation: here, r
is a pointer to a memory cell containing the value i. The #
means that incr
returns unit
.
Let us prove this specification interactively. To prove a goal of the form Spec foo ...
, we first use the xcf
tactic to introduce the characteristic formula of foo
.
Lemma incr_spec :
Spec incr r R>> forall (i: int),
R (r ~~> i) (# r ~~> (i+1)).
Proof.
xcf. intros.
The resulting goal is of the form formula pre post
, formula
being a part of the characteristic formula, pre
and post
the precondition and postcondition for the corresponding piece of code.
r : loc
i : int
============================
(LetApp _x4 := ml_get r in
App ml_set r (_x4 + 1);) (r ~~> i \* \[]) (# r ~~> (i + 1))
We make the proof progress step by step, traversing the program shape, by applying tactics that match the head constructor of formula
: here, it's a LetApp
; we use the xapps
tactic. If the goal started with an If
, we would have used xif
, etc. We need another xapp
after that, for the App
constructor. Finally, what remains to be proven is a "heap implication", stating that the final memory matches the postcondition.
============================
# r ~~> (i + 1) \* \[] ===> # r ~~> (i + 1) \* Hexists H', H'
The hsimpl
tactic does the job.
Lemma incr_spec :
Spec incr r R>> forall (i: int),
R (r ~~> i) (# r ~~> (i+1)).
Proof.
xcf. intros.
xapps. xapp. hsimpl.
Qed.
CFML + time credits: prove algorithmic complexity
We rely on (and admit) the following property of the OCaml compiler: if one ignores the cost of garbage collection, counting the number of function calls and for/while loop iterations performed by the source program is an accurate measure, up to a constant factor, of the number of machine instructions executed by the compiled program.
We thus consider a new kind of heap resource (something that can be put in pre/postconditions): time credits. The program is now required to consume a time credit at each step of computation.
If we produce the characteristic formula for incr
with the "credits" option, its specification and proof becomes:
Lemma incr_spec :
Spec incr r R>> forall (i: int),
R (\$ 1 \* r ~~> i) (# r ~~> (i+1)).
Proof.
xcf. intros.
xpay.
xapps. xapp. hsimpl.
Qed.
What changed? We now have to give one credit (\$ 1
) in the precondition of incr
, as it performs one step of computation. The credit does not appear in the postcondition: it is consumed by the function. In the proof, we use the xpay
tactic to justify that we are indeed able to pay for the computation step.
The complexity of a function is then represented as the number of credits contained in the precondition: that is, how much you have to pay for it to run. Interestingly, credits can also be stored in the heap for later consumption: this enables amortization (a standard kind of reasoning in the literature, e.g. in Introduction to Algorithms by Cormen et al.).
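As a textbook illustration of amortization with banked credits (not taken from the paper): for a dynamic array that doubles its capacity when full, paying a constant number of credits per push and storing the surplus is enough to cover each occasional linear-time resize. Viewing the banked credits as a potential function:

```latex
% Banked credits as a potential for a doubling dynamic array:
%   \Phi(A) = 2\,\mathrm{size}(A) - \mathrm{capacity}(A)
% Amortized cost of the i-th operation:
\hat{c}_i = c_i + \Phi_i - \Phi_{i-1}
% A cheap push costs 1 and banks 2 credits; a resizing push at size n
% costs n+1 but spends the n banked credits.  In both cases:
\hat{c}_{\mathrm{push}} \le 3 = O(1)
```

Checking the resizing case: just before a resize, size = capacity = n, so \(\Phi = n\); afterwards, size = n+1 and capacity = 2n, so \(\Phi = 2\), giving \(\hat{c} = (n+1) + 2 - n = 3\).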
Asymptotic reasoning on time credits
The proof of Union-Find states and tracks explicitly how many credits are needed at every step of the proof. For example, the precondition of find
requires "\(α(N) + 2\) credits". We would like to be able to write instead "\(O(α(N))\) credits": first, to match usual informal algorithmic reasoning, but also to enable asymptotic reasoning, that is, proofs of properties true "for a big enough \(N\)".
Reasoning with bigOs is also more modular: the exact number of credits an auxiliary function needs may change without impacting proofs using it, as long as the asymptotic bound stays the same.
The prototype extension of CFML I developed tries to address this. Its features are as follows:
 Given an explicit cost function (a "number of credits"), allow proving an asymptotic "big-O" bound for it. The cost function may be defined in terms of other functions that are themselves only known up to a big-O bound.
 Allow various manipulations of big-O bounds: composition, parameter transformation, etc.
This does not yet include proofs of Swiss-army-knife theorems like the master theorem, but they can now at least be stated, using the new asymptotic notions.
This may seem to be only a matter of writing down the right definitions and lemmas. This is partly true, but it also appears that the "big-O" notation is often used in a quite informal way, so formalizing it requires extra attention on some aspects.
For example, the "\(O\)" notation implicitly requires some notion of "going towards infinity". When manipulating cost functions with multiple parameters (e.g., "\(f(m,n)\) is \(O(g(m,n))\)"), which notion of "going to infinity" to choose is not obvious, and the candidates are not all equivalent: the right choice depends on later use.
In our example, we could say both \(m\) and \(n\) have to go to infinity. But maybe we will want to fix \(m\) afterwards, and still have \(f(m,n)\) be a \(O(g(m,n))\); in this case we need another notion of "going to infinity": \(m \times n + n\) is indeed a \(O(m \times n)\) when both \(m\) and \(n\) grow to infinity, but this is clearly not the case when fixing \(m = 0\)...
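The filter-based definition of domination can be written compactly as follows. This is my rendering of the standard notion, not necessarily CFML's exact Coq statement:

```latex
% "f is O(g) along the filter U" (U = "ultimately"): there is a
% constant k such that the set of points where f is bounded by k.g
% is a U-neighborhood of infinity.
f \in O_{\mathcal{U}}(g)
  \iff
  \exists k > 0,\;
  \mathcal{U}\bigl(\{\, x \mid |f(x)| \le k\,|g(x)| \,\}\bigr)
```

Choosing \(\mathcal{U}\) to be, say, the filter on \(\mathbb{Z}^2\) where both coordinates grow, or the one where only \(n\) grows uniformly in \(m\), yields precisely the two inequivalent readings discussed above.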
New notions introduced in CFML + asymptotic credits
To formalize these notions, and be able to talk more clearly about this little dilemma, we reuse the notion of filter from the literature.
The generic notion of filter describes a way of "going to infinity" in a set (\(\mathbb{Z}, \mathbb{Z}^2, \ldots\)). On \(\mathbb{Z}\) there is one obvious filter that works in every situation, so the benefits of the method are not apparent there. However, as illustrated above, on \(\mathbb{Z}^2\) (i.e. for functions with two parameters) it is not clear which notion of "going to infinity" should be used, and the choice moreover depends on later usage of the specification. Filters allow us to describe this situation formally: one can define various filters on \(\mathbb{Z}^2\), and state clearly which one is used in a \(O()\).
Following are the new predicates introduced in CFML: filter
, and more. They are all needed to understand what's going on during the proofs; however, in simple scripts most of them needn't be manipulated directly by the user.
Definition filter A := (A -> Prop) -> Prop.
Class Filter {A} (ultimately: filter A).
A filter is a set of sets (a set of elements of type A
being of type A -> Prop
), that represents a "set of neighborhoods"; in our case, neighborhoods of infinity. The Filter class bundles additional properties that must be satisfied, e.g. stability under intersection. This definition allows us to write ultimately P
, when ultimately
is a filter
, meaning that P
is true at some point "when going towards infinity".
Definition dominated (ultimately: filter A) (f g: A -> B).
The textbook definition of "\(f\) is \(O(g)\)", using filters: more or less, exists k, ultimately (fun x => f x ≤ k g x)
.
Definition monotonic (leA: A -> A -> Prop) (leB: B -> B -> Prop) (f: A -> B).
Definition monotonic_after leA leB (f: A -> B) (a0: A).
Monotonicity of f wrt. the relations leA
on the domain and leB
on the codomain. monotonic_after
means "monotonic after a0"; usually combined with a filter to have a notion of asymptotic monotonicity.
Definition idominated (ultimately: filter A) leA (f g: A -> B).
A variant of dominated
, that we'll use in practice: it allows more lemmas to be automatic (without side-conditions to prove). idominated _ _ f g basically means that either f and g are \(O(1)\), or dominated _ f g holds (\(f\) is \(O(g)\)); and that g is asymptotically monotonic.
Definition SpecO (ultimately: filter A) leA (g: A -> Z)
(spec: (A -> Z) -> Prop) :=
exists (f: A -> Z),
(forall x, 0 <= f x) /\
monotonic _ _ f /\
idominated _ _ f g /\
spec f.
A wrapper useful to state specifications using bigOs: it basically means "there exists some f
function, which is a \(O(g)\), and spec f
is true" (plus some necessary additional facts). Usually, spec
is of the form Spec .. R>> ...
, i.e. a standard CFML specification.
As an appetizer for the second part, I'll leave you with two specifications that use bigOs. The first is for our incr
function, the second is for a mktree
recursive function, that builds a complete tree of depth \(n\) in \(O(2^n)\) time.
Lemma incr_spec :
SpecO1 (fun F =>
Spec incr r R>> forall (i: int),
R (\$ F tt \* r ~~> i) (# r ~~> (i+1))).
Lemma mktree_spec :
SpecO (fun n => 2 ^ n) (fun F =>
Spec mktree (depth: int) (x: a) R>>
0 <= depth ->
R (\$ F depth) (fun (t: tree a) => \[])).
Weekly News — OCaml Weekly News, Jul 14, 2015
ocaml-fileutils 0.5.0 released — OCamlCore Forge News ( (Sylvain Le Gall), Jul 13, 2015
Converting a code base from camlp4 to ppx — Jane Street (Jeremie Dimino), Jul 08, 2015
As with many projects in the OCaml world, at Jane Street we have been working on migrating from camlp4 to ppx. After having developed equivalent ppx rewriters for our camlp4 syntax extensions, the last step is to actually translate the source code of all our libraries and applications from the camlp4 syntax to the standard OCaml syntax with extension points and attributes.
For instance, to translate code using pa_ounit
and pa_test
, we have to rewrite:
TEST = <:test_result< int >> ~expect:42 (f x)
to:
let%test _ = [%test_result: int] ~expect:42 (f x)
For small to medium projects, it is enough to just take a couple of hours and translate the source code by hand. But at Jane Street, where we have a huge OCaml code base making extensive use of camlp4, that is simply not realistic. So we needed a tool to do it for us.
Writing a tool to automatically convert the syntax
Since the output of such a tool has to be accepted as the new source code committed in our repository, it must preserve the layout of the original file as much as possible and, of course, keep the comments. This means that any approach using an AST pretty-printer would be extremely complex.
The path we chose is to textually substitute the foreign syntaxes in the original file with the new ones. One could imagine doing that with a tool such as sed, awk, perl, etc.; however, doing it properly would be tedious, and it would be pretty much impossible to be 100% sure it would never translate things it is not supposed to. Plus, writing perl is not as fun as writing OCaml programs.
Instead, there is an easy way to find the foreign syntaxes: using camlp4 itself. To substitute the text of foreign syntaxes, the only thing we need to know is their location in the original file, and camlp4 can help us with that.
Writing dummy camlp4 syntax extensions
The idea is to write, for each camlp4 syntax extension, a dummy one that defines the same grammar productions as the real one, but instead of generating code simply records substitutions at certain locations.
Then we do the following:
 parse a file with camlp4 and our dummy syntax extensions
 apply all the recorded substitutions to the original file
This approach has the advantage of interpreting the original file in the exact same way as our regular syntax extensions, giving us good confidence that we did not change the syntactic constructions by mistake.
To do so we define this API:
(** [replace loc repl] records a text substitution that replaces the
portion of text pointed by [loc] by [repl]. *)
val replace : Loc.t -> string -> unit
Then writing a dummy camlp4 syntax extension is pretty easy. For instance for a subset of pa_ounit
:
EXTEND Gram
GLOBAL: str_item;
test: [[ "TEST" -> replace _loc "let%test" ]];
name_equal:
[[ `STRING _; "=" -> ()
| "=" -> replace _loc "_ ="
]];
str_item:
[[ test; name_equal; expr -> <:str_item< >>
]];
END
On the fly conversion and diffing the generated code
Since this tool was convenient to use, we used it to check that our newly written ppx rewriters did the same thing as the old camlp4 syntax extensions:
 for a given OCaml source file of a library or application, we converted it using camlp4-to-ppx and saved the result
 we processed the original file using camlp4 and the translated one using our ppx rewriters
 in both cases we saved the output of -dparsetree (a human-readable rendering of the internal OCaml AST) and of -dsource (pretty-printed code)
 we diffed the camlp4 and ppx outputs of -dparsetree, as well as the outputs of -dsource
This was all quite easy to do with jenga. We kept looking at the generated diffs until they were all empty.
We have quite a lot of code in our camlp4 syntax extensions, and converting them to ppx was a long mechanical job and thus quite error-prone. Given that, this diffing turned out to be really helpful for finding errors.
The Camlp4 Syntax is not quite the OCaml syntax
While using this we noticed that quite a few syntaxes accepted by camlp4 are not accepted by OCaml, for instance:
let _ x = x
let f l = List.map l ~f:fun x -> x + 1
These were quite easy to fix automatically as well, using camlp4-to-ppx
.
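For reference, the standard-OCaml equivalents of those two snippets look like this (using ListLabels from the stdlib instead of Core's List so the snippet stands alone):

```ocaml
(* Standard OCaml needs a named function (a wildcard pattern cannot
   take parameters), and an anonymous function passed as a labelled
   argument must be parenthesized. *)
let id x = x                                    (* not: let _ x = x *)
let f l = ListLabels.map l ~f:(fun x -> x + 1)  (* parenthesized fun *)

let () = assert (id 42 = 42)
let () = assert (f [1; 2; 3] = [2; 3; 4])
```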
Github repo and extension
We published a slightly modified version of this tool on github.
The method we used doesn't work out of the box with all syntax extensions. For instance, to convert code using lwt.syntax
some more work needs to be done on camlp4-to-ppx
. But it is a good starting point.
Weekly News — OCaml Weekly News, Jul 07, 2015
View older blog posts.
Syndications
 Alex Leighton
 Amir Chaudhry
 Andrei Formiga
 Andrej Bauer
 Andy Ray
 Anil Madhavapeddy
 Ashish Agarwal
 CUFP
 Cameleon news
 Caml INRIA
 Caml Spotting
 Cedeela
 Coherent Graphics
 Coq
 Cranial Burnout
 Daniel Bünzli
 Daniel Bünzli (log)
 Dario Teixeira
 David Baelde
 David Mentré
 David Teller
 Erik de Castro Lopo
 Etienne Millon
 FramaC
 Functional Jobs
 GaGallium
 Gaius Hammond
 Gerd Stolpmann
 GitHub Jobs
 Grant Rettke
 Hong bo Zhang
 Incubaid Research
 Jake Donham
 Jamie Brandon
 Jane Street
 KC Sivaramakrishnan
 Leo White
 LexiFi
 Mads Hartmann
 Magnus Skjegstad
 Marc Simpson
 Matthias Puech
 Matías Giovannini
 Mike Lin
 Mike McClurg
 Mindy Preston
 OCaml Book
 OCaml Labs compiler hacking
 OCaml Platform
 OCaml Weekly News
 OCamlJava
 OCamlCore Forge News
 OCamlCore Forge Projects
 OCamlCore.com
 OCamlPro
 ODNS project
 Ocaml XMPP project
 Ocsigen blog
 Ocsigen project
 Opa
 Open Mirage
 Orbitz
 Paolo Donadeo
 Perpetually Curious
 Peter Zotov
 Phil Tomson
 Pietro Abate
 Psellos
 Richard Jones
 Rudi Grinberg
 Sebastien Mondet
 Shayne Fletcher
 Simon Grondin
 Stefano Zacchiroli
 Sylvain Le Gall
 The Caml Humps
 Thomas Leonard
 Till Varoquaux
 WODI
 Xinuo Chen
 Yan Shvartzshnaider