package quickjs


Module Quickjs.Unicode

Unicode utilities from QuickJS's libunicode

This module provides Unicode character classification, case conversion, and normalization functions. It uses the same battle-tested Unicode tables as QuickJS's ES2023-compliant JavaScript engine.

Normalization

type normalization =
  | NFC   (* Canonical Decomposition, followed by Canonical Composition *)
  | NFD   (* Canonical Decomposition *)
  | NFKC  (* Compatibility Decomposition, followed by Canonical Composition *)
  | NFKD  (* Compatibility Decomposition *)

Unicode normalization forms

val normalize : normalization -> string -> string option

normalize form str normalizes a UTF-8 string to the specified form. Returns None on memory allocation failure or invalid input.

Example:

  normalize NFC "café" (* composed form *)
  normalize NFD "café" (* decomposed form *)
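Because normalize returns an option, callers have to decide what to do with None. The following is a minimal sketch, assuming the quickjs package is linked and exposes this module as Quickjs.Unicode; the helper name same_text is hypothetical and not part of the library. It treats two strings as equal when their NFC forms match, and treats a failed normalization as "not equal".

```ocaml
(* Hypothetical helper: compare two UTF-8 strings after NFC normalization.
   Assumes the module is reachable as Quickjs.Unicode. *)
let same_text (a : string) (b : string) : bool =
  match
    ( Quickjs.Unicode.normalize Quickjs.Unicode.NFC a,
      Quickjs.Unicode.normalize Quickjs.Unicode.NFC b )
  with
  | Some x, Some y -> String.equal x y
  | _ -> false (* normalization failed for at least one input *)

(* "é" as one code point (U+00E9) versus "e" followed by U+0301. *)
let precomposed = "caf\u{00E9}"
let decomposed = "cafe\u{0301}"
let equal_after_nfc = same_text precomposed decomposed
```

Comparing the raw byte strings would report a difference here, since the two spellings use different code point sequences.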

Case Conversion

val lowercase : string -> string

lowercase str converts a UTF-8 string to lowercase, including non-ASCII characters, e.g., "ÉCOLE" → "école".

val uppercase : string -> string

uppercase str converts a UTF-8 string to uppercase. Handles special cases like "ß" → "SS".
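As a short illustration of the two string-level functions, the sketch below applies them to the examples mentioned above. It assumes the module is reachable as Quickjs.Unicode and that the source file is saved as UTF-8; the binding names are only for illustration.

```ocaml
(* Non-ASCII letters are converted along with ASCII ones. *)
let lower = Quickjs.Unicode.lowercase "ÉCOLE"

(* Uppercasing may change the length: "ß" expands to "SS". *)
let upper = Quickjs.Unicode.uppercase "straße"

let () =
  print_endline lower;
  print_endline upper
```

Since the result of uppercase can be longer than its input, code that assumes a one-to-one mapping between input and output positions may not hold for such strings.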

Single Character Operations

val lowercase_char : Uchar.t -> Uchar.t list

lowercase_char c returns the lowercase form of a code point. Returns a list because some characters expand (though lowercase rarely does).

val uppercase_char : Uchar.t -> Uchar.t list

uppercase_char c returns the uppercase form of a code point. Returns a list because some characters expand, e.g., 'ß' → ['S'; 'S'].
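The list result can be turned back into a UTF-8 string with the standard library's Buffer module. This is a minimal sketch, assuming the module is reachable as Quickjs.Unicode; utf8_of_uchars is a hypothetical helper, not part of the library.

```ocaml
(* Hypothetical helper: encode a list of code points as a UTF-8 string. *)
let utf8_of_uchars (cs : Uchar.t list) : string =
  let buf = Buffer.create 8 in
  List.iter (Buffer.add_utf_8_uchar buf) cs;
  Buffer.contents buf

(* U+00DF LATIN SMALL LETTER SHARP S expands to two code points. *)
let sharp_s = Uchar.of_int 0x00DF
let upper_points = Quickjs.Unicode.uppercase_char sharp_s
let upper_string = utf8_of_uchars upper_points
```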

Character Classification

val is_cased : Uchar.t -> bool

is_cased c returns true if the character has uppercase/lowercase forms. Examples: 'a', 'A', 'é' are cased; '1', '!' are not.

val is_case_ignorable : Uchar.t -> bool

is_case_ignorable c returns true if the character is ignored during case mapping operations (e.g., combining marks).

val is_id_start : Uchar.t -> bool

is_id_start c returns true if the character can start a JavaScript/Unicode identifier (letters, $, _).

val is_id_continue : Uchar.t -> bool

is_id_continue c returns true if the character can continue a JavaScript/Unicode identifier (letters, digits, $, _, combining marks).
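The two identifier predicates can be combined to check a whole string. The sketch below is one possible approach, assuming the module is reachable as Quickjs.Unicode and OCaml 4.14 or later (for String.get_utf_8_uchar); is_identifier is a hypothetical helper, and it does not check for reserved words.

```ocaml
(* Hypothetical helper: true if [s] is non-empty, valid UTF-8, starts with an
   ID_Start character and continues with ID_Continue characters. *)
let is_identifier (s : string) : bool =
  let len = String.length s in
  let rec go i =
    if i >= len then true
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then false
      else
        let c = Uchar.utf_decode_uchar d in
        let ok =
          if i = 0 then Quickjs.Unicode.is_id_start c
          else Quickjs.Unicode.is_id_continue c
        in
        (* Advance by the byte length of the decoded code point. *)
        ok && go (i + Uchar.utf_decode_length d)
  in
  len > 0 && go 0
```

Decoding with the standard library keeps the classification functions working on code points rather than on individual bytes.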

val is_whitespace : Uchar.t -> bool

is_whitespace c returns true if the character is Unicode whitespace. Includes ASCII space, tab, newline, and Unicode spaces like U+00A0 (NBSP).

Regex Support

val canonicalize : ?unicode:bool -> Uchar.t -> Uchar.t

canonicalize ?unicode c returns the canonical form of a character for case-insensitive regex matching.

  • unicode: if true (default), use full Unicode case folding; if false, only ASCII case folding.
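A case-insensitive comparison of two code points can be sketched by canonicalizing both sides and comparing the results. This assumes the module is reachable as Quickjs.Unicode; chars_match_ci is a hypothetical helper, and the example deliberately avoids relying on which concrete code point canonicalize returns.

```ocaml
(* Hypothetical helper: two code points match if they canonicalize to the
   same code point. [unicode] is passed through to canonicalize. *)
let chars_match_ci ?(unicode = true) (a : Uchar.t) (b : Uchar.t) : bool =
  Uchar.equal
    (Quickjs.Unicode.canonicalize ~unicode a)
    (Quickjs.Unicode.canonicalize ~unicode b)

let upper_a = Uchar.of_char 'A'
let lower_a = Uchar.of_char 'a'
let lower_b = Uchar.of_char 'b'

let a_matches_a = chars_match_ci upper_a lower_a
let a_matches_b = chars_match_ci lower_a lower_b
```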