Module `Uucd`

Unicode character database decoder.

Uucd decodes the data of the Unicode character database from its XML representation. It provides high-level (but not necessarily efficient) access to the data so that efficient representations can be extracted.

Uucd decodes the representation described in Annex #42 of Unicode 13.0.0. Later versions can be decoded as long as they introduce no new cases in the parsed enumerated properties.

Consult the basics.

Note. All strings returned by the module are UTF-8 encoded.

Unicode version 13.0.0

References

The Unicode Consortium. The Unicode Standard. (latest version)
Mark Davis, Ken Whistler. UAX #44 Unicode Character Database. (latest version)
Eric Muller. UAX #42 Unicode Character Database in XML. (latest version)

Code points

type cp = int

The type for Unicode code points. Values range from 0x0000 to 0x10_FFFF.

val is_cp : int -> bool

is_cp n is true iff n is a Unicode code point.

val is_scalar_value : int -> bool

is_scalar_value n is true iff n is a Unicode scalar value.
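To illustrate the two predicates, the sketch below re-implements them locally with the standard library only. It assumes the usual definitions: a code point is an integer from 0x0000 to 0x10_FFFF, and a scalar value is a code point outside the surrogate range 0xD800 to 0xDFFF. The names my_is_cp and my_is_scalar_value are hypothetical, and this is not necessarily how Uucd implements them.

```ocaml
(* Hypothetical local re-implementations, for illustration only. *)
let my_is_cp n = 0x0000 <= n && n <= 0x10_FFFF

(* Scalar values are the code points that are not surrogates. *)
let my_is_scalar_value n =
  (0x0000 <= n && n <= 0xD7FF) || (0xE000 <= n && n <= 0x10_FFFF)
```

Surrogates such as 0xD800 are code points but not scalar values, which is why the two predicates differ.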

module Cpmap : Map.S with type key = cp

Code point maps.

Properties

Properties are referenced by their name and property values by their abbreviated name. To understand their semantics refer to the standard.

type props

The type for sets of properties.

type 'a prop

The type for properties with property value of type 'a.

val find : props -> 'a prop -> 'a option

find ps p is the value of property p in ps, if any.

val unknown_prop : (string * string) -> string prop

unknown_prop (ns, n) is a property read from an XML attribute whose expanded name is (ns, n). This can be used to access a property unknown to the module.

Non Unihan properties

In alphabetical order.

val age : [ `Version of int * int | `Unassigned ] prop

val alphabetic : bool prop

val ascii_hex_digit : bool prop

val bidi_class : 
  [ `AL
  | `AN
  | `B
  | `BN
  | `CS
  | `EN
  | `ES
  | `ET
  | `FSI
  | `L
  | `LRE
  | `LRI
  | `LRO
  | `NSM
  | `ON
  | `PDF
  | `PDI
  | `R
  | `RLE
  | `RLI
  | `RLO
  | `S
  | `WS ]
    prop

val bidi_control : bool prop

val bidi_mirrored : bool prop

val bidi_mirroring_glyph : cp option prop

val bidi_paired_bracket : [ `Self | `Cp of cp ] prop

val bidi_paired_bracket_type : [ `O | `C | `N ] prop

val block : 
  [ `ASCII
  | `Adlam
  | `Aegean_Numbers
  | `Ahom
  | `Alchemical
  | `Alphabetic_PF
  | `Anatolian_Hieroglyphs
  | `Ancient_Greek_Music
  | `Ancient_Greek_Numbers
  | `Ancient_Symbols
  | `Arabic
  | `Arabic_Ext_A
  | `Arabic_Math
  | `Arabic_PF_A
  | `Arabic_PF_B
  | `Arabic_Sup
  | `Armenian
  | `Arrows
  | `Avestan
  | `Balinese
  | `Bamum
  | `Bamum_Sup
  | `Bassa_Vah
  | `Batak
  | `Bengali
  | `Bhaiksuki
  | `Block_Elements
  | `Bopomofo
  | `Bopomofo_Ext
  | `Box_Drawing
  | `Brahmi
  | `Braille
  | `Buginese
  | `Buhid
  | `Byzantine_Music
  | `CJK
  | `CJK_Compat
  | `CJK_Compat_Forms
  | `CJK_Compat_Ideographs
  | `CJK_Compat_Ideographs_Sup
  | `CJK_Ext_A
  | `CJK_Ext_B
  | `CJK_Ext_C
  | `CJK_Ext_D
  | `CJK_Ext_E
  | `CJK_Ext_F
  | `CJK_Ext_G
  | `CJK_Radicals_Sup
  | `CJK_Strokes
  | `CJK_Symbols
  | `Carian
  | `Caucasian_Albanian
  | `Chakma
  | `Cham
  | `Cherokee
  | `Cherokee_Sup
  | `Chess_Symbols
  | `Chorasmian
  | `Compat_Jamo
  | `Control_Pictures
  | `Coptic
  | `Coptic_Epact_Numbers
  | `Counting_Rod
  | `Cuneiform
  | `Cuneiform_Numbers
  | `Currency_Symbols
  | `Cypriot_Syllabary
  | `Cyrillic
  | `Cyrillic_Ext_A
  | `Cyrillic_Ext_B
  | `Cyrillic_Ext_C
  | `Cyrillic_Sup
  | `Deseret
  | `Devanagari
  | `Devanagari_Ext
  | `Diacriticals
  | `Diacriticals_Ext
  | `Diacriticals_For_Symbols
  | `Diacriticals_Sup
  | `Dingbats
  | `Dives_Akuru
  | `Dogra
  | `Domino
  | `Duployan
  | `Early_Dynastic_Cuneiform
  | `Egyptian_Hieroglyph_Format_Controls
  | `Egyptian_Hieroglyphs
  | `Elbasan
  | `Elymaic
  | `Emoticons
  | `Enclosed_Alphanum
  | `Enclosed_Alphanum_Sup
  | `Enclosed_CJK
  | `Enclosed_Ideographic_Sup
  | `Ethiopic
  | `Ethiopic_Ext
  | `Ethiopic_Ext_A
  | `Ethiopic_Sup
  | `Geometric_Shapes
  | `Geometric_Shapes_Ext
  | `Georgian
  | `Georgian_Ext
  | `Georgian_Sup
  | `Glagolitic
  | `Glagolitic_Sup
  | `Gothic
  | `Grantha
  | `Greek
  | `Greek_Ext
  | `Gujarati
  | `Gunjala_Gondi
  | `Gurmukhi
  | `Half_And_Full_Forms
  | `Half_Marks
  | `Hangul
  | `Hanifi_Rohingya
  | `Hanunoo
  | `Hatran
  | `Hebrew
  | `High_PU_Surrogates
  | `High_Surrogates
  | `Hiragana
  | `IDC
  | `IPA_Ext
  | `Ideographic_Symbols
  | `Imperial_Aramaic
  | `Indic_Number_Forms
  | `Indic_Siyaq_Numbers
  | `Inscriptional_Pahlavi
  | `Inscriptional_Parthian
  | `Jamo
  | `Jamo_Ext_A
  | `Jamo_Ext_B
  | `Javanese
  | `Kaithi
  | `Kana_Ext_A
  | `Kana_Sup
  | `Kanbun
  | `Kangxi
  | `Kannada
  | `Katakana
  | `Katakana_Ext
  | `Kayah_Li
  | `Kharoshthi
  | `Khitan_Small_Script
  | `Khmer
  | `Khmer_Symbols
  | `Khojki
  | `Khudawadi
  | `Lao
  | `Latin_1_Sup
  | `Latin_Ext_A
  | `Latin_Ext_Additional
  | `Latin_Ext_B
  | `Latin_Ext_C
  | `Latin_Ext_D
  | `Latin_Ext_E
  | `Lepcha
  | `Letterlike_Symbols
  | `Limbu
  | `Linear_A
  | `Linear_B_Ideograms
  | `Linear_B_Syllabary
  | `Lisu
  | `Lisu_Sup
  | `Low_Surrogates
  | `Lycian
  | `Lydian
  | `Mahajani
  | `Mahjong
  | `Makasar
  | `Malayalam
  | `Mandaic
  | `Manichaean
  | `Marchen
  | `Masaram_Gondi
  | `Math_Alphanum
  | `Math_Operators
  | `Mayan_Numerals
  | `Medefaidrin
  | `Meetei_Mayek
  | `Meetei_Mayek_Ext
  | `Mende_Kikakui
  | `Meroitic_Cursive
  | `Meroitic_Hieroglyphs
  | `Miao
  | `Misc_Arrows
  | `Misc_Math_Symbols_A
  | `Misc_Math_Symbols_B
  | `Misc_Pictographs
  | `Misc_Symbols
  | `Misc_Technical
  | `Modi
  | `Modifier_Letters
  | `Modifier_Tone_Letters
  | `Mongolian
  | `Mongolian_Sup
  | `Mro
  | `Multani
  | `Music
  | `Myanmar
  | `Myanmar_Ext_A
  | `Myanmar_Ext_B
  | `NB
  | `NKo
  | `Nabataean
  | `Nandinagari
  | `New_Tai_Lue
  | `Newa
  | `Number_Forms
  | `Nushu
  | `Nyiakeng_Puachue_Hmong
  | `OCR
  | `Ogham
  | `Ol_Chiki
  | `Old_Hungarian
  | `Old_Italic
  | `Old_North_Arabian
  | `Old_Permic
  | `Old_Persian
  | `Old_Sogdian
  | `Old_South_Arabian
  | `Old_Turkic
  | `Oriya
  | `Ornamental_Dingbats
  | `Osage
  | `Osmanya
  | `Ottoman_Siyaq_Numbers
  | `PUA
  | `Pahawh_Hmong
  | `Palmyrene
  | `Pau_Cin_Hau
  | `Phags_Pa
  | `Phaistos
  | `Phoenician
  | `Phonetic_Ext
  | `Phonetic_Ext_Sup
  | `Playing_Cards
  | `Psalter_Pahlavi
  | `Punctuation
  | `Rejang
  | `Rumi
  | `Runic
  | `Samaritan
  | `Saurashtra
  | `Sharada
  | `Shavian
  | `Shorthand_Format_Controls
  | `Siddham
  | `Sinhala
  | `Sinhala_Archaic_Numbers
  | `Small_Forms
  | `Small_Kana_Ext
  | `Sogdian
  | `Sora_Sompeng
  | `Soyombo
  | `Specials
  | `Sundanese
  | `Sundanese_Sup
  | `Sup_Arrows_A
  | `Sup_Arrows_B
  | `Sup_Arrows_C
  | `Sup_Math_Operators
  | `Sup_PUA_A
  | `Sup_PUA_B
  | `Sup_Punctuation
  | `Sup_Symbols_And_Pictographs
  | `Super_And_Sub
  | `Sutton_SignWriting
  | `Syloti_Nagri
  | `Symbols_And_Pictographs_Ext_A
  | `Symbols_For_Legacy_Computing
  | `Syriac
  | `Syriac_Sup
  | `Tagalog
  | `Tagbanwa
  | `Tags
  | `Tai_Le
  | `Tai_Tham
  | `Tai_Viet
  | `Tai_Xuan_Jing
  | `Takri
  | `Tamil
  | `Tamil_Sup
  | `Tangut
  | `Tangut_Components
  | `Tangut_Sup
  | `Telugu
  | `Thaana
  | `Thai
  | `Tibetan
  | `Tifinagh
  | `Tirhuta
  | `Transport_And_Map
  | `UCAS
  | `UCAS_Ext
  | `Ugaritic
  | `VS
  | `VS_Sup
  | `Vai
  | `Vedic_Ext
  | `Vertical_Forms
  | `Wancho
  | `Warang_Citi
  | `Yezidi
  | `Yi_Radicals
  | `Yi_Syllables
  | `Yijing
  | `Zanabazar_Square ]
    prop

val canonical_combining_class : int prop

val cased : bool prop

val case_folding : [ `Self | `Cps of cp list ] prop

val case_ignorable : bool prop

val changes_when_casefolded : bool prop

val changes_when_casemapped : bool prop

val changes_when_lowercased : bool prop

val changes_when_nfkc_casefolded : bool prop

val changes_when_titlecased : bool prop

val changes_when_uppercased : bool prop

val composition_exclusion : bool prop

val dash : bool prop

val decomposition_mapping : [ `Self | `Cps of cp list ] prop

val decomposition_type : 
  [ `Can
  | `Com
  | `Enc
  | `Fin
  | `Font
  | `Fra
  | `Init
  | `Iso
  | `Med
  | `Nar
  | `Nb
  | `Sml
  | `Sqr
  | `Sub
  | `Sup
  | `Vert
  | `Wide
  | `None ]
    prop

val default_ignorable_code_point : bool prop

val deprecated : bool prop

val diacritic : bool prop

val east_asian_width : [ `A | `F | `H | `N | `Na | `W ] prop

val emoji : bool prop

val emoji_presentation : bool prop

val emoji_modifier : bool prop

val emoji_modifier_base : bool prop

val emoji_component : bool prop

val equivalent_unified_ideograph : cp option prop

val expands_on_nfc : bool prop

val expands_on_nfd : bool prop

val expands_on_nfkc : bool prop

val expands_on_nfkd : bool prop

val extended_pictographic : bool prop

val extender : bool prop

val fc_nfkc_closure : [ `Self | `Cps of cp list ] prop

val full_composition_exclusion : bool prop

val general_category : 
  [ `Lu
  | `Ll
  | `Lt
  | `Lm
  | `Lo
  | `Mn
  | `Mc
  | `Me
  | `Nd
  | `Nl
  | `No
  | `Pc
  | `Pd
  | `Ps
  | `Pe
  | `Pi
  | `Pf
  | `Po
  | `Sm
  | `Sc
  | `Sk
  | `So
  | `Zs
  | `Zl
  | `Zp
  | `Cc
  | `Cf
  | `Cs
  | `Co
  | `Cn ]
    prop

val grapheme_base : bool prop

val grapheme_cluster_break : 
  [ `CN
  | `CR
  | `EB
  | `EBG
  | `EM
  | `EX
  | `GAZ
  | `L
  | `LF
  | `LV
  | `LVT
  | `PP
  | `RI
  | `SM
  | `T
  | `V
  | `XX
  | `ZWJ ]
    prop

val grapheme_extend : bool prop

val grapheme_link : bool prop

val hangul_syllable_type : [ `L | `LV | `LVT | `T | `V | `NA ] prop

val hex_digit : bool prop

val hyphen : bool prop

val id_continue : bool prop

val id_start : bool prop

val ideographic : bool prop

val ids_binary_operator : bool prop

val ids_trinary_operator : bool prop

val indic_syllabic_category : 
  [ `Avagraha
  | `Bindu
  | `Brahmi_Joining_Number
  | `Cantillation_Mark
  | `Consonant
  | `Consonant_Dead
  | `Consonant_Final
  | `Consonant_Head_Letter
  | `Consonant_Initial_Postfixed
  | `Consonant_Killer
  | `Consonant_Medial
  | `Consonant_Placeholder
  | `Consonant_Preceding_Repha
  | `Consonant_Prefixed
  | `Consonant_Repha
  | `Consonant_Subjoined
  | `Consonant_Succeeding_Repha
  | `Consonant_With_Stacker
  | `Gemination_Mark
  | `Invisible_Stacker
  | `Joiner
  | `Modifying_Letter
  | `Non_Joiner
  | `Nukta
  | `Number
  | `Number_Joiner
  | `Other
  | `Pure_Killer
  | `Register_Shifter
  | `Syllable_Modifier
  | `Tone_Letter
  | `Tone_Mark
  | `Virama
  | `Visarga
  | `Vowel
  | `Vowel_Dependent
  | `Vowel_Independent ]
    prop

val indic_matra_category : 
  [ `Right
  | `Left
  | `Visual_Order_Left
  | `Left_And_Right
  | `Top
  | `Bottom
  | `Top_And_Bottom
  | `Top_And_Right
  | `Top_And_Left
  | `Top_And_Left_And_Right
  | `Bottom_And_Right
  | `Top_And_Bottom_And_Right
  | `Overstruck
  | `Invisible
  | `NA ]
    prop

val indic_positional_category : 
  [ `Bottom
  | `Bottom_And_Left
  | `Bottom_And_Right
  | `Left
  | `Left_And_Right
  | `NA
  | `Overstruck
  | `Right
  | `Top
  | `Top_And_Bottom
  | `Top_And_Bottom_And_Left
  | `Top_And_Bottom_And_Right
  | `Top_And_Left
  | `Top_And_Left_And_Right
  | `Top_And_Right
  | `Visual_Order_Left ]
    prop

val iso_comment : string prop

val jamo_short_name : string prop

val join_control : bool prop

val joining_group : 
  [ `African_Feh
  | `African_Noon
  | `African_Qaf
  | `Ain
  | `Alaph
  | `Alef
  | `Alef_Maqsurah
  | `Beh
  | `Beth
  | `Burushaski_Yeh_Barree
  | `Dal
  | `Dalath_Rish
  | `E
  | `Farsi_Yeh
  | `Fe
  | `Feh
  | `Final_Semkath
  | `Gaf
  | `Gamal
  | `Hah
  | `Hanifi_Rohingya_Kinna_Ya
  | `Hanifi_Rohingya_Pa
  | `Hamza_On_Heh_Goal
  | `He
  | `Heh
  | `Heh_Goal
  | `Heth
  | `Kaf
  | `Kaph
  | `Khaph
  | `Knotted_Heh
  | `Lam
  | `Lamadh
  | `Malayalam_Bha
  | `Malayalam_Ja
  | `Malayalam_Lla
  | `Malayalam_Llla
  | `Malayalam_Nga
  | `Malayalam_Nna
  | `Malayalam_Nnna
  | `Malayalam_Nya
  | `Malayalam_Ra
  | `Malayalam_Ssa
  | `Malayalam_Tta
  | `Manichaean_Aleph
  | `Manichaean_Ayin
  | `Manichaean_Beth
  | `Manichaean_Daleth
  | `Manichaean_Dhamedh
  | `Manichaean_Five
  | `Manichaean_Gimel
  | `Manichaean_Heth
  | `Manichaean_Hundred
  | `Manichaean_Kaph
  | `Manichaean_Lamedh
  | `Manichaean_Mem
  | `Manichaean_Nun
  | `Manichaean_One
  | `Manichaean_Pe
  | `Manichaean_Qoph
  | `Manichaean_Resh
  | `Manichaean_Sadhe
  | `Manichaean_Samekh
  | `Manichaean_Taw
  | `Manichaean_Ten
  | `Manichaean_Teth
  | `Manichaean_Thamedh
  | `Manichaean_Twenty
  | `Manichaean_Waw
  | `Manichaean_Yodh
  | `Manichaean_Zayin
  | `Meem
  | `Mim
  | `No_Joining_Group
  | `Noon
  | `Nun
  | `Nya
  | `Pe
  | `Qaf
  | `Qaph
  | `Reh
  | `Reversed_Pe
  | `Rohingya_Yeh
  | `Sad
  | `Sadhe
  | `Seen
  | `Semkath
  | `Shin
  | `Straight_Waw
  | `Swash_Kaf
  | `Syriac_Waw
  | `Tah
  | `Taw
  | `Teh_Marbuta
  | `Teh_Marbuta_Goal
  | `Teth
  | `Waw
  | `Yeh
  | `Yeh_Barree
  | `Yeh_With_Tail
  | `Yudh
  | `Yudh_He
  | `Zain
  | `Zhain ]
    prop

val joining_type : [ `U | `C | `T | `D | `L | `R ] prop

val line_break : 
  [ `AI
  | `AL
  | `B2
  | `BA
  | `BB
  | `BK
  | `CB
  | `CJ
  | `CL
  | `CM
  | `CP
  | `CR
  | `EX
  | `GL
  | `H2
  | `H3
  | `HL
  | `HY
  | `ID
  | `IN
  | `IS
  | `JL
  | `JT
  | `JV
  | `LF
  | `NL
  | `NS
  | `NU
  | `OP
  | `PO
  | `PR
  | `QU
  | `RI
  | `SA
  | `SG
  | `SP
  | `SY
  | `WJ
  | `XX
  | `ZW
  | `EB
  | `EM
  | `ZWJ ]
    prop

val logical_order_exception : bool prop

val lowercase : bool prop

val lowercase_mapping : [ `Self | `Cps of cp list ] prop

val math : bool prop

val name : [ `Pattern of string | `Name of string ] prop

In the `Pattern case occurrences of the character '#' (U+0023) in the string must be replaced by the value of the code point as four to six uppercase hexadecimal digits (the minimal needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#" associated to code point U+3400 gives the name "CJK UNIFIED IDEOGRAPH-3400".
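The substitution rule above can be sketched as a small helper. expand_name is a hypothetical function, not part of Uucd; it takes a code point and a value of the name property and returns the final character name, using only the standard library.

```ocaml
(* [expand_name cp n] is the character name for code point [cp].
   [expand_name] is a hypothetical helper, not part of Uucd. *)
let expand_name cp = function
  | `Name n -> n
  | `Pattern p ->
      (* "%04X" pads to at least four uppercase hexadecimal digits and
         uses up to six for code points up to 0x10_FFFF. *)
      let hex = Printf.sprintf "%04X" cp in
      (* Replace every '#' in the pattern by the hexadecimal value. *)
      String.concat hex (String.split_on_char '#' p)
```

With a decoded database ucd, one possible use is Option.map (expand_name cp) (Uucd.cp_prop ucd cp Uucd.name).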

val name_alias : 
  (string * [ `Abbreviation | `Alternate | `Control | `Correction | `Figment ])
    list
    prop

val nfc_quick_check : [ `True | `False | `Maybe ] prop

val nfd_quick_check : [ `True | `False | `Maybe ] prop

val nfkc_quick_check : [ `True | `False | `Maybe ] prop

val nfkc_casefold : [ `Self | `Cps of cp list ] prop

val nfkd_quick_check : [ `True | `False | `Maybe ] prop

val noncharacter_code_point : bool prop

val numeric_type : [ `None | `De | `Di | `Nu ] prop

val numeric_value : [ `NaN | `Frac of int * int | `Num of int64 ] prop
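As a rough illustration of consuming numeric_value, the sketch below converts a value to a float. It assumes, without confirmation from this page, that `Frac (n, d) stands for the fraction n/d. numeric_value_to_float is a hypothetical helper, not part of Uucd.

```ocaml
(* Hypothetical helper; assumes `Frac (n, d) stands for n/d. *)
let numeric_value_to_float = function
  | `NaN -> nan
  | `Frac (n, d) -> float_of_int n /. float_of_int d
  | `Num n -> Int64.to_float n
```

Converting to float loses precision for large `Num values, so exact consumers may prefer to match on the variant directly.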

val other_alphabetic : bool prop

val other_default_ignorable_code_point : bool prop

val other_grapheme_extend : bool prop

val other_id_continue : bool prop

val other_id_start : bool prop

val other_lowercase : bool prop

val other_math : bool prop

val other_uppercase : bool prop

val pattern_syntax : bool prop

val pattern_white_space : bool prop

val prepended_concatenation_mark : bool prop

val quotation_mark : bool prop

val radical : bool prop

val regional_indicator : bool prop

type script = [

| `Adlm
| `Aghb
| `Ahom
| `Arab
| `Armi
| `Armn
| `Avst
| `Bali
| `Bamu
| `Bass
| `Batk
| `Beng
| `Bhks
| `Bopo
| `Brah
| `Brai
| `Bugi
| `Buhd
| `Cakm
| `Cans
| `Cari
| `Cham
| `Cher
| `Chrs
| `Copt
| `Cprt
| `Cyrl
| `Deva
| `Diak
| `Dogr
| `Dsrt
| `Dupl
| `Egyp
| `Elba
| `Elym
| `Ethi
| `Geor
| `Glag
| `Gong
| `Gonm
| `Goth
| `Gran
| `Grek
| `Gujr
| `Guru
| `Hang
| `Hani
| `Hano
| `Hatr
| `Hebr
| `Hira
| `Hluw
| `Hmng
| `Hmnp
| `Hrkt
| `Hung
| `Ital
| `Java
| `Kali
| `Kana
| `Khar
| `Khmr
| `Khoj
| `Knda
| `Kthi
| `Kits
| `Lana
| `Laoo
| `Latn
| `Lepc
| `Limb
| `Lina
| `Linb
| `Lisu
| `Lyci
| `Lydi
| `Mahj
| `Maka
| `Mand
| `Mani
| `Marc
| `Medf
| `Mend
| `Merc
| `Mero
| `Mlym
| `Modi
| `Mong
| `Mroo
| `Mtei
| `Mult
| `Mymr
| `Nand
| `Narb
| `Nbat
| `Newa
| `Nkoo
| `Nshu
| `Ogam
| `Olck
| `Orkh
| `Orya
| `Osge
| `Osma
| `Palm
| `Pauc
| `Perm
| `Phag
| `Phli
| `Phlp
| `Phnx
| `Plrd
| `Prti
| `Qaai
| `Rjng
| `Rohg
| `Runr
| `Samr
| `Sarb
| `Saur
| `Sgnw
| `Shaw
| `Shrd
| `Sidd
| `Sind
| `Sinh
| `Sogd
| `Sogo
| `Sora
| `Soyo
| `Sund
| `Sylo
| `Syrc
| `Tagb
| `Takr
| `Tale
| `Talu
| `Taml
| `Tang
| `Tavt
| `Telu
| `Tfng
| `Tglg
| `Thaa
| `Thai
| `Tibt
| `Tirh
| `Ugar
| `Vaii
| `Wara
| `Wcho
| `Xpeo
| `Xsux
| `Yezi
| `Yiii
| `Zanb
| `Zinh
| `Zyyy
| `Zzzz

]

val script : script prop

val script_extensions : script list prop

val sentence_break : 
  [ `AT
  | `CL
  | `CR
  | `EX
  | `FO
  | `LE
  | `LF
  | `LO
  | `NU
  | `SC
  | `SE
  | `SP
  | `ST
  | `UP
  | `XX ]
    prop

val simple_case_folding : [ `Self | `Cp of cp ] prop

val simple_lowercase_mapping : [ `Self | `Cp of cp ] prop

val simple_titlecase_mapping : [ `Self | `Cp of cp ] prop

val simple_uppercase_mapping : [ `Self | `Cp of cp ] prop

val soft_dotted : bool prop

val sterm : bool prop

val terminal_punctuation : bool prop

val titlecase_mapping : [ `Self | `Cps of cp list ] prop

val uax_42_element : [ `Reserved | `Noncharacter | `Surrogate | `Char ] prop

Not normative, artefact of Uucd. Corresponds to the XML element name that describes the code point.

val unicode_1_name : string prop

val unified_ideograph : bool prop

val uppercase : bool prop

val uppercase_mapping : [ `Self | `Cps of cp list ] prop

val variation_selector : bool prop

val vertical_orientation : [ `U | `R | `Tu | `Tr ] prop

val white_space : bool prop

val word_break : 
  [ `CR
  | `DQ
  | `EB
  | `EBG
  | `EM
  | `EX
  | `Extend
  | `FO
  | `GAZ
  | `HL
  | `KA
  | `LE
  | `LF
  | `MB
  | `ML
  | `MN
  | `NL
  | `NU
  | `RI
  | `SQ
  | `WSegSpace
  | `XX
  | `ZWJ ]
    prop

val xid_continue : bool prop

val xid_start : bool prop

Unihan properties

In alphabetical order. For now Unihan properties are always represented as strings.

val kAccountingNumeric : string prop

val kAlternateHanYu : string prop

val kAlternateJEF : string prop

val kAlternateKangXi : string prop

val kAlternateMorohashi : string prop

val kBigFive : string prop

val kCCCII : string prop

val kCNS1986 : string prop

val kCNS1992 : string prop

val kCangjie : string prop

val kCantonese : string prop

val kCheungBauer : string prop

val kCheungBauerIndex : string prop

val kCihaiT : string prop

val kCompatibilityVariant : string prop

val kCowles : string prop

val kDaeJaweon : string prop

val kDefinition : string prop

val kEACC : string prop

val kFenn : string prop

val kFennIndex : string prop

val kFourCornerCode : string prop

val kFrequency : string prop

val kGB0 : string prop

val kGB1 : string prop

val kGB3 : string prop

val kGB5 : string prop

val kGB7 : string prop

val kGB8 : string prop

val kGSR : string prop

val kGradeLevel : string prop

val kHDZRadBreak : string prop

val kHKGlyph : string prop

val kHKSCS : string prop

val kHanYu : string prop

val kHangul : string prop

val kHanyuPinlu : string prop

val kHanyuPinyin : string prop

val kIBMJapan : string prop

val kIICore : string prop

val kIRGDaeJaweon : string prop

val kIRGDaiKanwaZiten : string prop

val kIRGHanyuDaZidian : string prop

val kIRGKangXi : string prop

val kIRG_GSource : string prop

val kIRG_HSource : string prop

val kIRG_JSource : string prop

val kIRG_KPSource : string prop

val kIRG_KSource : string prop

val kIRG_MSource : string prop

val kIRG_SSource : string prop

val kIRG_TSource : string prop

val kIRG_USource : string prop

val kIRG_UKSource : string prop

val kIRG_VSource : string prop

val kJHJ : string prop

val kJIS0213 : string prop

val kJa : string prop

val kJapaneseKun : string prop

val kJapaneseOn : string prop

val kJinmeiyoKanji : string prop

val kJis0 : string prop

val kJis1 : string prop

val kJoyoKanji : string prop

val kKPS0 : string prop

val kKPS1 : string prop

val kKSC0 : string prop

val kKSC1 : string prop

val kKangXi : string prop

val kKarlgren : string prop

val kKorean : string prop

val kKoreanEducationHanja : string prop

val kKoreanName : string prop

val kLau : string prop

val kMainlandTelegraph : string prop

val kMandarin : string prop

val kMatthews : string prop

val kMeyerWempe : string prop

val kMorohashi : string prop

val kNelson : string prop

val kOtherNumeric : string prop

val kPhonetic : string prop

val kPrimaryNumeric : string prop

val kPseudoGB1 : string prop

val kRSAdobe_Japan1_6 : string prop

val kRSJapanese : string prop

val kRSKanWa : string prop

val kRSKangXi : string prop

val kRSKorean : string prop

val kRSMerged : string prop

val kRSTUnicode : string prop

val kRSUnicode : string prop

val kReading : string prop

val kSBGY : string prop

val kSemanticVariant : string prop

val kSimplifiedVariant : string prop

val kSpecializedSemanticVariant : string prop

val kSpoofingVariant : string prop

val kSrc_NushuDuben : string prop

val kUnihanCore2020 : string prop

val kTGH : string prop

val kTGHZ2013 : string prop

val kTGT_MergedSrc : string prop

val kTaiwanTelegraph : string prop

val kTang : string prop

val kTotalStrokes : string prop

val kTraditionalVariant : string prop

val kVietnamese : string prop

val kWubi : string prop

val kXHC1983 : string prop

val kXerox : string prop

val kZVariant : string prop

Unicode character databases

type block = (cp * cp) * string

The type for blocks. Code point range, name of the block.
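A block list in this shape can be searched for the block that contains a given code point. The sketch below is self-contained: block_name is a hypothetical helper, the sample list is hand-written rather than decoded, and the code assumes that both ends of a block's code point range are inclusive.

```ocaml
(* [block_name blocks cp] is the name of the first block in [blocks]
   whose range contains [cp], if any. Hypothetical helper that assumes
   both range bounds are inclusive. *)
let block_name blocks cp =
  let contains ((first, last), _name) = first <= cp && cp <= last in
  Option.map snd (List.find_opt contains blocks)

(* Hand-written sample data; a real list would come from the [blocks]
   field of a decoded database. *)
let sample_blocks =
  [ (0x0000, 0x007F), "Basic Latin"; (0x0080, 0x00FF), "Latin-1 Supplement" ]
```

The linear scan is adequate for occasional lookups; repeated queries would be better served by an array sorted on the range start.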

type named_sequence = string * cp list

The type for named sequences. Sequence name, code point sequence.

type normalization_correction = cp * cp list * cp list * (int * int * int)

The type for normalization corrections. Code point, old normalization, new normalization, version.

type standardized_variant =
  cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list

The type for standardized variants. Code point sequence, description, when.

type cjk_radical = string * cp * cp

The type for CJK radicals. Radical number, CJK radical character, CJK unified ideograph.

type emoji_source = cp list * int option * int option * int option

The type for emoji sources. Unicode, docomo, kddi, softbank.

type t = {
  description : string;
  repertoire : props Cpmap.t;
  blocks : block list;
  named_sequences : named_sequence list;
  provisional_named_sequences : named_sequence list;
  normalization_corrections : normalization_correction list;
  standardized_variants : standardized_variant list;
  cjk_radicals : cjk_radical list;
  emoji_sources : emoji_source list;
}

The type for Unicode character databases.

Note. The absence of an optional top-level field in the database is denoted by the neutral element of its type (empty string, empty list, Cpmap.empty). The module therefore does not distinguish between an absent field and a field present with empty data, but this causes no problems in this context.

val cp_prop : t -> cp -> 'a prop -> 'a option

cp_prop ucd cp p is the property p of the code point cp in ucd's repertoire, if cp is in the repertoire and the property exists for cp.

Decode

type src = [
  | `Channel of in_channel
  | `String of string
]

The type for input sources.

type decoder

The type for Unicode character database decoders.

val decoder : [< src ] -> decoder

decoder src is a decoder that inputs from src.

val decode : decoder -> [ `Ok of t | `Error of string ]

decode d decodes a database from d or returns an error.

val decoded_range : decoder -> (int * int) * (int * int)

decoded_range d is the range of characters spanning the `Error decoded by d. The range is a pair of line and column positions; line numbers are one-based and column numbers are zero-based.

Basics

The database and subsets of it for Unicode 13.0.0 are available here. Databases with groups should be preferred: they maximize value sharing and improve parsing performance.

A database is decoded as follows:

let ucd_or_die inf = try
  let ic = if inf = "-" then stdin else open_in inf in
  let d = Uucd.decoder (`Channel ic) in
  match Uucd.decode d with
  | `Ok db -> db
  | `Error e ->
    let (l0, c0), (l1, c1) = Uucd.decoded_range d in
    Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
    exit 1
with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1

let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"

The convenience function cp_prop can be used to query the property of a given code point. For example the general category of U+1F42B is given by:

let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category
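Since cp_prop returns an option, the result still has to be matched before use. The following sketch shows one way to do so for a general category value; describe_gc is a hypothetical helper, not part of Uucd, and it spells out only a few of the categories.

```ocaml
(* [describe_gc v] renders the result of a general category lookup.
   Hypothetical helper; only a few categories are spelled out. *)
let describe_gc = function
  | None -> "no general category available"
  | Some `So -> "So (other symbol)"
  | Some `Lu -> "Lu (uppercase letter)"
  | Some `Ll -> "Ll (lowercase letter)"
  | Some `Cn -> "Cn (unassigned)"
  | Some _ -> "another general category"
```

It could be applied to the value above as describe_gc u_1F42B_gc. The None case covers both a code point missing from the repertoire and a property missing for that code point.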
