Library
Module
Module type
Parameter
Class
Class type
In addition to the main message, the API should allow an implementer to easily specify the following five factors of a diagnostic, and it should be possible to specify them independently.
V0003
. A succinct representation is useful for an end user to report a bug or ask for help.extra_remarks
.We realized there are two distinct use modes when coming to message reporting:
We should support at least the free-form use mode, and ideally support both. The free-form reporting is implemented as Reporter and the structured one is implemented as StructuredReporter.
asai
It should be easy for an application to use other libraries who themselves use asai
, even if the application and the library made different choices between free-form and structured reporting. Our current implementation uses the same diagnostic type for both reporting styles and allows an application to adopt diagnostics from a library.
The original design of asai did not retain any location information across API calls. That is, the location used to create a backtrace frame will not be inherited by inner API calls:
Reporter.trace ~loc "outer trace with a location" @@ fun () ->
Reporter.emit Greeting "hello"
(* the message ["hello"] originally would not have a location *)
The idea was to prevent an excessive amount of location information. However, all early adopters of this library implemented their own mechanisms to retain the location information, which indicated that the original design was cumbersome in practice.
The new design (partially introduced in 0.1 and then fully implemented in 0.3) is to remember the locations of backtrace frames (the argument loc
when calling trace) and use the innermost one as the default location for emit or fatal. An inner backtrace frame, however, will not use the inherited location of another frame to avoid duplicated locations in a backtrace (unless an implementer explicitly specifies the same location). We believe this strikes a good balance between convenience and succinctness. For example,
Reporter.trace ~loc "outer trace with a location" @@ fun () ->
Reporter.trace "inner trace" @@ fun () ->
(* the frame ["inner trace"] will not have a location *)
Reporter.emit Greeting "hello"
(* the message ["hello"] will have [loc] as its location *)
Note: inherited locations can be overwritten by the optional argument loc
at any time.
There is a long history of using ASCII printable characters and ANSI escape sequences, and recently also non-ASCII Unicode characters, to draw pictures on terminals. To display compiler diagnostics, this technique has been used to assemble line numbers, code from the end user, code highlighting, and other pieces of information in a visually pleasing way. Non-ASCII Unicode characters (from the implementer or from the end user) greatly expand the vocabulary of ASCII art, and we will call the new art form Unicode art to signify the use of non-ASCII characters.
In asai, we made the unusual choice to abandon column numbers (and any Unicode art that depends on them) so that our Unicode art remains flawless in the presence of tricky Unicode character sequences. In particular, the following highlighting method goes against our design principle:
let foo = f x ^^^ variable name too ugly
The highlighting assumes that the visual location of foo
is at the column 4 (if you count from 0), an assumption that is forbidden to make in asai. The following will explain why a flawless support of Unicode necessarily leads to tossing out the concept of column numbers completely.
Note: "Unicode characters" are not really defined in the Unicode standard, and here they mean Unicode scalar values, that is, all Unicode code points except the surrogate code points (special code points for UTF-16 to represent all scalar values). Although the word "character" has many incompatible meanings and usages, we decided to call scalar values "Unicode characters" anyway because (1) most people are not familiar with the official term "scalar values" and (2) scalar values are the only context-independent unit one can work with in a programming language.
The arrival of non-ASCII Unicode characters imposes new challenges as their visual widths are unpredictable without knowing the exact terminal (or terminal emulator), the exact font, etc. Unicode emoji sequences might be one of the most challenging cases: a pirate flag (🏴☠️) may be shown as a single emoji flag on supported platforms but as a sequence with a black flag (🏴) and a skull (☠️) on other platforms. This means the visual width of the pirate flag is unpredictable. (See Unicode Emoji Section 2.2.) The rainbow flag (🏳️🌈), skin tones, and many other emoji sequences have the same issue. Other less chaotic but still challenging cases include characters whose East Asian width is "Ambiguous". (See UAX #11 for more information about East Asian width.) These challenges bear some similarity with the unpredictability of the visual width of horizontal tabulations, but in a much wilder way.
Due to the unpredictability of visual widths, it is wise to think twice before using emoji sequences or other tricky characters in Unicode art. However, it is difficult to make visually pleasing Unicode art without any assumption. To quantify the degree to which a Unicode art can remain visually pleasing on different platforms, we specify the following four levels of display stability. The levels go from 0 (the most unstable) to 3 (the most stable), where Level 0 (the most unstable) makes the most assumptions, and Level 3 (the most stable) makes almost none. Note that if an implementer decide to integrate content from the end user into their Unicode art, the end user should have the freedom to include arbitrary emoji sequences and tricky characters in their content. The final Unicode art must remain visually pleasing (under the assumptions allowed by the display stability levels) for any user content.
wcwidth
and wcswidth
). These heuristics are created to help programmers handle more characters, in particular CJK characters, without dramatically changing the code. They however do not solve the core problem (that is, visual width is fundamentally ill-defined) and they often could not handle tricky cases such as emoji sequences. Many compilers are at this level.Level 2b: Stability under only theses assumptions:
Level 2b is the explicit version of Level 2a; we might update the details of Level 2b later to better match our understanding of Level 2a. Collectively, Levels 2a and 2b are called "Level 2". This is where asai is.
Unlike most implementations, which are only at Level 1, our terminal handler strives to achieve Level 2. That means we must not make any assumption about the visual width of the end user's code. The reason is that without (incorrect) strong assumptions about how Unicode characters are rendered, the visual column numbers are ill-defined. On the other hand, Level 3 seems to be too restricted for compiler diagnostics because we cannot show line numbers along with the end user's code. (We cannot assume the numbers "10" and "99" will have the same visual width at Level 3.)
Note: a fixed-width font with enough glyphs that covers many Unicode characters is often technically duospaced, not monospaced, because many CJK characters would occupy a double character visual width. Thus, we do not use the terminology "monospaced".
Proper support of bidirectional text will benefit many potential end users, but unfortunately, we currently do not have the capacity to implement it. The general support of bidirectional text in most system libraries and tools is lacking, and without dedicated effort, it is hard to display bidirectional text properly. This is the area where our current implementation falls short.
On a related note, Unicode Source Code Handling suggests that source code should be segmented into atoms and their display order should remain the same throughout the document to maintain the lexical structure. Each atom should then be displayed via the usual Unicode Bidirectional Algorithm with a few exceptions. Our current implementation cannot follow this advice because it does not know the lexical structure of the end user content.
All positions should be byte-oriented. We believe other popular alternatives proposals are worse:
U+FFFF
(such as 😎
) will require two code units to form a surrogate pair. Therefore, it is arguably worse than just using Unicode characters. This scheme was unfortunately chosen by the Language Service Protocol (LSP) as the default unit, and until LSP version 3.17 was the only choice. The developers of the protocol made this decision probably because Visual Studio Code was written in JavaScript (and TypeScript), whose strings use UTF-16 encoding. It still takes linear time to count characters from other encodings (such as UTF-8), and the count still does not match the visual perception; even worse, UTF-16 usually takes more space than UTF-8 when ASCII is dominant.