package matrix

Module Glyph.String

Measuring

val measure : width_method:width_method -> tab_width:int -> string -> int

measure ~width_method ~tab_width s is the total display width of s. Control characters contribute 0.

Note. Invalid UTF-8 byte sequences are replaced with U+FFFD, each contributing width 1.

See also measure_sub.

val measure_sub : width_method:width_method -> tab_width:int -> string -> pos:int -> len:int -> int

measure_sub ~width_method ~tab_width s ~pos ~len is like measure but operates on the substring s.[pos] .. s.[pos + len - 1] without allocating. The result is 0 when len <= 0.

Counting

val grapheme_count : string -> int

grapheme_count s is the number of user-perceived characters (grapheme clusters) in s. Uses full UAX #29 segmentation.

Iterating

val iter_graphemes : ?ignore_zwj:bool -> (offset:int -> len:int -> unit) -> string -> unit

iter_graphemes f s calls f ~offset ~len for each grapheme cluster in s.

ignore_zwj defaults to false. When true, ZWJ does not join emoji sequences (same boundary behaviour as `No_zwj).

Note. Invalid UTF-8 byte sequences are treated as individual replacement characters (U+FFFD).

See also iter_grapheme_info.
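
The callback receives byte offsets rather than substrings, so callers decide whether to copy. The following is a minimal sketch of one way to collect each cluster as a string. The type alias grapheme_iter, the helper grapheme_slices, and the stand-in ascii_graphemes are hypothetical names introduced for illustration. The stand-in only handles printable ASCII (one byte per cluster) so that the example runs without the library. In real use, iter_graphemes would presumably be passed in its place.

```ocaml
(* Shape of the documented callback iterator (alias is illustrative). *)
type grapheme_iter = (offset:int -> len:int -> unit) -> string -> unit

(* Hypothetical stand-in for printable ASCII input, where every byte is
   its own cluster. It is NOT the library's segmentation. *)
let ascii_graphemes : grapheme_iter =
 fun f s -> String.iteri (fun i _ -> f ~offset:i ~len:1) s

(* Copy each reported (offset, len) span out of [s], in order. *)
let grapheme_slices (iter : grapheme_iter) (s : string) : string list =
  let acc = ref [] in
  iter (fun ~offset ~len -> acc := String.sub s offset len :: !acc) s;
  List.rev !acc
```

Copying with String.sub is the caller's choice here. A caller that only needs positions can record the offsets and avoid allocation entirely.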

val iter_grapheme_info : width_method:width_method -> tab_width:int -> (offset:int -> len:int -> width:int -> unit) -> string -> unit

iter_grapheme_info ~width_method ~tab_width f s calls f ~offset ~len ~width for each grapheme cluster in s. Uses the same width calculation and ZWJ handling as Pool.encode. Graphemes whose width resolves to 0 (control and zero-width sequences) are skipped.

Note. Invalid UTF-8 byte sequences are treated as individual replacement characters (U+FFFD).

See also iter_graphemes.
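
A common use of per-grapheme widths is cutting a string to a column budget without splitting a cluster. The sketch below assumes an iterator of the documented callback shape. The names info_iter, truncate_to_width, and ascii_info are hypothetical, and ascii_info is an ASCII-only stand-in (width 1 per printable byte, control bytes skipped) used only so the example is self-contained. With the library, a partial application such as iter_grapheme_info ~width_method ~tab_width would presumably fill that role.

```ocaml
(* Shape of the documented callback iterator (alias is illustrative). *)
type info_iter =
  (offset:int -> len:int -> width:int -> unit) -> string -> unit

(* Hypothetical ASCII-only stand-in: printable bytes have width 1;
   control bytes resolve to width 0 and are skipped. *)
let ascii_info : info_iter =
 fun f s ->
  String.iteri
    (fun i c -> if c >= ' ' && c <> '\x7f' then f ~offset:i ~len:1 ~width:1)
    s

(* Longest prefix of [s], ending on a cluster boundary, whose summed
   width does not exceed [max_width]. *)
let truncate_to_width (iter : info_iter) ~(max_width : int) (s : string) :
    string =
  let used = ref 0 and stop = ref 0 and full = ref false in
  iter
    (fun ~offset ~len ~width ->
      if (not !full) && !used + width <= max_width then begin
        used := !used + width;
        stop := offset + len
      end
      else full := true)
    s;
  String.sub s 0 !stop
```

The iterator has no early-exit mechanism in its signature, so the sketch uses a flag to ignore clusters once the budget is exhausted.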

val iter_wrap_breaks : ?width_method:width_method -> (break_byte_offset:int -> next_byte_offset:int -> grapheme_offset:int -> unit) -> string -> unit

iter_wrap_breaks f s calls f ~break_byte_offset ~next_byte_offset ~grapheme_offset for each word-wrap break opportunity in s, in order from start to end, with:

  • break_byte_offset — zero-based byte position of the grapheme containing the wrap-break character.
  • next_byte_offset — zero-based byte position of the next grapheme after the break (the resume position).
  • grapheme_offset — zero-based grapheme index of the grapheme containing the wrap-break character.

Breaks occur after graphemes containing ASCII space, tab, hyphen, path separators, punctuation, brackets, and Unicode NBSP, ZWSP, soft hyphen, and typographic spaces.

width_method controls grapheme boundary detection: `Unicode (the default) treats ZWJ sequences as single graphemes, `No_zwj breaks them apart.

See also iter_line_breaks.
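
Because next_byte_offset is the resume position, the reported breaks can be used to cut a string into unbreakable segments for a wrapping pass. The following is a sketch under stated assumptions. The names wrap_iter, segments, and ascii_wrap_breaks are hypothetical, and ascii_wrap_breaks is an ASCII-only stand-in that recognises just space, tab, hyphen, and '/', which is a subset of the break characters listed above. It is not the library's break detection.

```ocaml
(* Shape of the documented callback iterator (alias is illustrative). *)
type wrap_iter =
  (break_byte_offset:int ->
  next_byte_offset:int ->
  grapheme_offset:int ->
  unit) ->
  string ->
  unit

(* Hypothetical ASCII-only stand-in: each byte is one grapheme, and only
   space, tab, '-' and '/' count as break opportunities. *)
let ascii_wrap_breaks : wrap_iter =
 fun f s ->
  String.iteri
    (fun i c ->
      match c with
      | ' ' | '\t' | '-' | '/' ->
        f ~break_byte_offset:i ~next_byte_offset:(i + 1) ~grapheme_offset:i
      | _ -> ())
    s

(* Split [s] after every break opportunity; the break character stays at
   the end of the segment it terminates. *)
let segments (iter : wrap_iter) (s : string) : string list =
  let cuts = ref [] in
  iter
    (fun ~break_byte_offset:_ ~next_byte_offset ~grapheme_offset:_ ->
      cuts := next_byte_offset :: !cuts)
    s;
  let bounds = List.rev (String.length s :: !cuts) in
  let _, out =
    List.fold_left
      (fun (start, acc) stop ->
        if stop > start then (stop, String.sub s start (stop - start) :: acc)
        else (stop, acc))
      (0, []) bounds
  in
  List.rev out
```

A line wrapper could then pack these segments greedily, measuring each one and starting a new line when the next segment would exceed the available width.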

val iter_line_breaks : (pos:int -> kind:line_break_kind -> unit) -> string -> unit

iter_line_breaks f s calls f ~pos ~kind for each line terminator in s, in order from start to end, with:

  • pos — zero-based byte position. For `CRLF this is the position of the LF byte; for `LF and `CR, the respective byte.
  • kind — the line_break_kind.

CRLF sequences are reported once as `CRLF, not as separate `CR and `LF breaks.

See also iter_wrap_breaks.
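
The reporting rules for line terminators can be modelled in a few lines. This sketch is a hypothetical reimplementation of the documented behaviour, not the library's code. The definition of line_break_kind as a closed polymorphic variant is an assumption based on the tags named above, and line_breaks is an illustrative helper.

```ocaml
(* Assumed definition, inferred from the documented tags. *)
type line_break_kind = [ `LF | `CR | `CRLF ]

(* Model of the documented semantics: CRLF is reported once, at the
   position of its LF byte; lone CR and LF are reported at their own
   positions. *)
let iter_line_breaks (f : pos:int -> kind:line_break_kind -> unit)
    (s : string) : unit =
  let n = String.length s in
  let rec go i =
    if i < n then
      match s.[i] with
      | '\r' when i + 1 < n && s.[i + 1] = '\n' ->
        f ~pos:(i + 1) ~kind:`CRLF;
        go (i + 2)
      | '\r' ->
        f ~pos:i ~kind:`CR;
        go (i + 1)
      | '\n' ->
        f ~pos:i ~kind:`LF;
        go (i + 1)
      | _ -> go (i + 1)
  in
  go 0

(* Collect every reported terminator in order. *)
let line_breaks (s : string) : (int * line_break_kind) list =
  let acc = ref [] in
  iter_line_breaks (fun ~pos ~kind -> acc := (pos, kind) :: !acc) s;
  List.rev !acc
```

For the input "a\r\nb\nc\rd" this model reports `CRLF at byte 2, `LF at byte 4, and `CR at byte 6.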