package uucd


Module Uucd

Unicode character database decoder.

Uucd decodes the data of the Unicode character database from its XML representation. It provides high-level (but not necessarily efficient) access to the data so that efficient representations can be extracted.

Uucd decodes the representation described in Annex #42 of Unicode 16.0.0. Subsequent versions may be decoded as long as no new cases are introduced in parsed enumerated properties.

Consult the basics.

Note. All strings returned by the module are UTF-8 encoded.

Unicode version 16.0.0


Code points

type cp = int

The type for Unicode code points, ranges from 0x0000 to 0x10_FFFF.

val is_cp : int -> bool

is_cp n is true iff n is a Unicode code point.

val is_scalar_value : int -> bool

is_scalar_value n is true iff n is a Unicode scalar value.
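
As a rough illustration of what the two predicates test, here is a stand-alone sketch written from the definitions in the Unicode standard. It is not taken from Uucd's sources; the functions are local re-definitions that only mirror the documented behaviour.

```ocaml
(* Local sketch, not Uucd's own code: a code point is an integer in
   the range 0x0000 to 0x10_FFFF. *)
let is_cp n = 0x0000 <= n && n <= 0x10_FFFF

(* A scalar value is a code point outside the surrogate range
   U+D800 to U+DFFF. *)
let is_scalar_value n =
  (0x0000 <= n && n <= 0xD7FF) || (0xE000 <= n && n <= 0x10_FFFF)
```

With these definitions a surrogate such as 0xD800 is a code point but not a scalar value.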

module Cpmap : Map.S with type key = cp

Code point maps.

Properties

Properties are referenced by their name and property values by their abbreviated name. To understand their semantics refer to the standard.

type props

The type for sets of properties.

type 'a prop

The type for properties with property value of type 'a.

val find : props -> 'a prop -> 'a option

find ps p is the value of property p in ps, if any.

val unknown_prop : (string * string) -> string prop

unknown_prop (ns, n) is a property read from an XML attribute whose expanded name is (ns, n). This can be used to access a property unknown to the module.

Non Unihan properties

In alphabetical order.

val age : [ `Version of int * int | `Unassigned ] prop
val alphabetic : bool prop
val ascii_hex_digit : bool prop
val bidi_class : [ `AL | `AN | `B | `BN | `CS | `EN | `ES | `ET | `FSI | `L | `LRE | `LRI | `LRO | `NSM | `ON | `PDF | `PDI | `R | `RLE | `RLI | `RLO | `S | `WS ] prop
val bidi_control : bool prop
val bidi_mirrored : bool prop
val bidi_mirroring_glyph : cp option prop
val bidi_paired_bracket : [ `Self | `Cp of cp ] prop
val bidi_paired_bracket_type : [ `O | `C | `N ] prop
val block : [ `ASCII | `Adlam | `Aegean_Numbers | `Ahom | `Alchemical | `Alphabetic_PF | `Anatolian_Hieroglyphs | `Ancient_Greek_Music | `Ancient_Greek_Numbers | `Ancient_Symbols | `Arabic | `Arabic_Ext_A | `Arabic_Ext_B | `Arabic_Ext_C | `Arabic_Math | `Arabic_PF_A | `Arabic_PF_B | `Arabic_Sup | `Armenian | `Arrows | `Avestan | `Balinese | `Bamum | `Bamum_Sup | `Bassa_Vah | `Batak | `Bengali | `Beria_Erfe | `Bhaiksuki | `Block_Elements | `Bopomofo | `Bopomofo_Ext | `Box_Drawing | `Brahmi | `Braille | `Buginese | `Buhid | `Byzantine_Music | `CJK | `CJK_Compat | `CJK_Compat_Forms | `CJK_Compat_Ideographs | `CJK_Compat_Ideographs_Sup | `CJK_Ext_A | `CJK_Ext_B | `CJK_Ext_C | `CJK_Ext_D | `CJK_Ext_E | `CJK_Ext_F | `CJK_Ext_G | `CJK_Ext_H | `CJK_Ext_I | `CJK_Ext_J | `CJK_Radicals_Sup | `CJK_Strokes | `CJK_Symbols | `Carian | `Caucasian_Albanian | `Chakma | `Cham | `Cherokee | `Cherokee_Sup | `Chess_Symbols | `Chorasmian | `Compat_Jamo | `Control_Pictures | `Coptic | `Coptic_Epact_Numbers | `Counting_Rod | `Cuneiform | `Cuneiform_Numbers | `Currency_Symbols | `Cypriot_Syllabary | `Cypro_Minoan | `Cyrillic | `Cyrillic_Ext_A | `Cyrillic_Ext_B | `Cyrillic_Ext_C | `Cyrillic_Ext_D | `Cyrillic_Sup | `Deseret | `Devanagari | `Devanagari_Ext | `Devanagari_Ext_A | `Diacriticals | `Diacriticals_Ext | `Diacriticals_For_Symbols | `Diacriticals_Sup | `Dingbats | `Dives_Akuru | `Dogra | `Domino | `Duployan | `Early_Dynastic_Cuneiform | `Egyptian_Hieroglyph_Format_Controls | `Egyptian_Hieroglyphs | `Egyptian_Hieroglyphs_Ext_A | `Elbasan | `Elymaic | `Emoticons | `Enclosed_Alphanum | `Enclosed_Alphanum_Sup | `Enclosed_CJK | `Enclosed_Ideographic_Sup | `Ethiopic | `Ethiopic_Ext | `Ethiopic_Ext_A | `Ethiopic_Ext_B | `Ethiopic_Sup | `Garay | `Geometric_Shapes | `Geometric_Shapes_Ext | `Georgian | `Georgian_Ext | `Georgian_Sup | `Glagolitic | `Glagolitic_Sup | `Gothic | `Grantha | `Greek | `Greek_Ext | `Gujarati | `Gunjala_Gondi | `Gurmukhi | `Gurung_Khema | `Half_And_Full_Forms |
`Half_Marks | `Hangul | `Hanifi_Rohingya | `Hanunoo | `Hatran | `Hebrew | `High_PU_Surrogates | `High_Surrogates | `Hiragana | `IDC | `IPA_Ext | `Ideographic_Symbols | `Imperial_Aramaic | `Indic_Number_Forms | `Indic_Siyaq_Numbers | `Inscriptional_Pahlavi | `Inscriptional_Parthian | `Jamo | `Jamo_Ext_A | `Jamo_Ext_B | `Javanese | `Kaithi | `Kaktovik_Numerals | `Kana_Ext_A | `Kana_Ext_B | `Kana_Sup | `Kanbun | `Kangxi | `Kannada | `Katakana | `Katakana_Ext | `Kawi | `Kayah_Li | `Kharoshthi | `Khitan_Small_Script | `Khmer | `Khmer_Symbols | `Khojki | `Khudawadi | `Kirat_Rai | `Lao | `Latin_1_Sup | `Latin_Ext_A | `Latin_Ext_Additional | `Latin_Ext_B | `Latin_Ext_C | `Latin_Ext_D | `Latin_Ext_E | `Latin_Ext_F | `Latin_Ext_G | `Lepcha | `Letterlike_Symbols | `Limbu | `Linear_A | `Linear_B_Ideograms | `Linear_B_Syllabary | `Lisu | `Lisu_Sup | `Low_Surrogates | `Lycian | `Lydian | `Mahajani | `Mahjong | `Makasar | `Malayalam | `Mandaic | `Manichaean | `Marchen | `Masaram_Gondi | `Math_Alphanum | `Math_Operators | `Mayan_Numerals | `Medefaidrin | `Meetei_Mayek | `Meetei_Mayek_Ext | `Mende_Kikakui | `Meroitic_Cursive | `Meroitic_Hieroglyphs | `Miao | `Misc_Arrows | `Misc_Math_Symbols_A | `Misc_Math_Symbols_B | `Misc_Pictographs | `Misc_Symbols | `Misc_Symbols_Sup | `Misc_Technical | `Modi | `Modifier_Letters | `Modifier_Tone_Letters | `Mongolian | `Mongolian_Sup | `Mro | `Multani | `Music | `Myanmar | `Myanmar_Ext_A | `Myanmar_Ext_B | `Myanmar_Ext_C | `NB | `NKo | `Nabataean | `Nag_Mundari | `Nandinagari | `New_Tai_Lue | `Newa | `Number_Forms | `Nushu | `Nyiakeng_Puachue_Hmong | `OCR | `Ogham | `Ol_Onal | `Ol_Chiki | `Old_Hungarian | `Old_Italic | `Old_North_Arabian | `Old_Permic | `Old_Persian | `Old_Sogdian | `Old_South_Arabian | `Old_Turkic | `Old_Uyghur | `Oriya | `Ornamental_Dingbats | `Osage | `Osmanya | `Ottoman_Siyaq_Numbers | `PUA | `Pahawh_Hmong | `Palmyrene | `Pau_Cin_Hau | `Phags_Pa | `Phaistos | `Phoenician | `Phonetic_Ext | `Phonetic_Ext_Sup | `Playing_Cards | 
`Psalter_Pahlavi | `Punctuation | `Rejang | `Rumi | `Runic | `Samaritan | `Saurashtra | `Sharada | `Sharada_Sup | `Shavian | `Shorthand_Format_Controls | `Siddham | `Sidetic | `Sinhala | `Sinhala_Archaic_Numbers | `Small_Forms | `Small_Kana_Ext | `Sogdian | `Sora_Sompeng | `Soyombo | `Specials | `Sundanese | `Sundanese_Sup | `Sunuwar | `Sup_Arrows_A | `Sup_Arrows_B | `Sup_Arrows_C | `Sup_Math_Operators | `Sup_PUA_A | `Sup_PUA_B | `Sup_Punctuation | `Sup_Symbols_And_Pictographs | `Super_And_Sub | `Sutton_SignWriting | `Syloti_Nagri | `Symbols_And_Pictographs_Ext_A | `Symbols_For_Legacy_Computing | `Symbols_For_Legacy_Computing_Sup | `Syriac | `Syriac_Sup | `Tagalog | `Tagbanwa | `Tags | `Tai_Le | `Tai_Tham | `Tai_Viet | `Tai_Xuan_Jing | `Tai_Yo | `Takri | `Tamil | `Tamil_Sup | `Tangsa | `Tangut | `Tangut_Components | `Tangut_Components_Sup | `Tangut_Sup | `Telugu | `Thaana | `Thai | `Tibetan | `Tifinagh | `Tirhuta | `Todhri | `Tolong_Siki | `Toto | `Transport_And_Map | `Tulu_Tigalari | `UCAS | `UCAS_Ext | `UCAS_Ext_A | `Ugaritic | `VS | `VS_Sup | `Vai | `Vedic_Ext | `Vertical_Forms | `Vithkuqi | `Wancho | `Warang_Citi | `Yezidi | `Yi_Radicals | `Yi_Syllables | `Yijing | `Zanabazar_Square | `Znamenny_Music ] prop
val canonical_combining_class : int prop
val cased : bool prop
val case_folding : [ `Self | `Cps of cp list ] prop
val case_ignorable : bool prop
val changes_when_casefolded : bool prop
val changes_when_casemapped : bool prop
val changes_when_lowercased : bool prop
val changes_when_nfkc_casefolded : bool prop
val changes_when_titlecased : bool prop
val changes_when_uppercased : bool prop
val composition_exclusion : bool prop
val dash : bool prop
val decomposition_mapping : [ `Self | `Cps of cp list ] prop
val decomposition_type : [ `Can | `Com | `Enc | `Fin | `Font | `Fra | `Init | `Iso | `Med | `Nar | `Nb | `Sml | `Sqr | `Sub | `Sup | `Vert | `Wide | `None ] prop
val default_ignorable_code_point : bool prop
val deprecated : bool prop
val diacritic : bool prop
val east_asian_width : [ `A | `F | `H | `N | `Na | `W ] prop
val emoji : bool prop
val emoji_presentation : bool prop
val emoji_modifier : bool prop
val emoji_modifier_base : bool prop
val emoji_component : bool prop
val equivalent_unified_ideograph : cp option prop
val extended_pictographic : bool prop
val extender : bool prop
val full_composition_exclusion : bool prop
val general_category : [ `Lu | `Ll | `Lt | `Lm | `Lo | `Mn | `Mc | `Me | `Nd | `Nl | `No | `Pc | `Pd | `Ps | `Pe | `Pi | `Pf | `Po | `Sm | `Sc | `Sk | `So | `Zs | `Zl | `Zp | `Cc | `Cf | `Cs | `Co | `Cn ] prop
val grapheme_base : bool prop
val grapheme_cluster_break : [ `CN | `CR | `EB | `EBG | `EM | `EX | `GAZ | `L | `LF | `LV | `LVT | `PP | `RI | `SM | `T | `V | `XX | `ZWJ ] prop
val grapheme_extend : bool prop
val hangul_syllable_type : [ `L | `LV | `LVT | `T | `V | `NA ] prop
val hex_digit : bool prop
val id_continue : bool prop
val id_compat_math_continue : bool prop
val id_compat_math_start : bool prop
val id_start : bool prop
val ideographic : bool prop
val ids_binary_operator : bool prop
val ids_trinary_operator : bool prop
val ids_unary_operator : bool prop
val indic_conjunct_break : [ `Consonant | `Extend | `Linker | `None ] prop
val indic_syllabic_category : [ `Avagraha | `Bindu | `Brahmi_Joining_Number | `Cantillation_Mark | `Consonant | `Consonant_Dead | `Consonant_Final | `Consonant_Head_Letter | `Consonant_Initial_Postfixed | `Consonant_Killer | `Consonant_Medial | `Consonant_Placeholder | `Consonant_Preceding_Repha | `Consonant_Prefixed | `Consonant_Repha | `Consonant_Subjoined | `Consonant_Succeeding_Repha | `Consonant_With_Stacker | `Gemination_Mark | `Invisible_Stacker | `Joiner | `Modifying_Letter | `Non_Joiner | `Nukta | `Number | `Number_Joiner | `Other | `Pure_Killer | `Reordering_Killer | `Register_Shifter | `Syllable_Modifier | `Tone_Letter | `Tone_Mark | `Virama | `Visarga | `Vowel | `Vowel_Dependent | `Vowel_Independent ] prop
val indic_matra_category : [ `Right | `Left | `Visual_Order_Left | `Left_And_Right | `Top | `Bottom | `Top_And_Bottom | `Top_And_Right | `Top_And_Left | `Top_And_Left_And_Right | `Bottom_And_Right | `Top_And_Bottom_And_Right | `Overstruck | `Invisible | `NA ] prop
val indic_positional_category : [ `Bottom | `Bottom_And_Left | `Bottom_And_Right | `Invisible | `Left | `Left_And_Right | `NA | `Overstruck | `Right | `Top | `Top_And_Bottom | `Top_And_Bottom_And_Left | `Top_And_Bottom_And_Right | `Top_And_Left | `Top_And_Left_And_Right | `Top_And_Right | `Visual_Order_Left ] prop
val jamo_short_name : string prop
val join_control : bool prop
val joining_group : [ `African_Feh | `African_Noon | `African_Qaf | `Ain | `Alaph | `Alef | `Alef_Maqsurah | `Beh | `Beth | `Burushaski_Yeh_Barree | `Dal | `Dalath_Rish | `E | `Farsi_Yeh | `Fe | `Feh | `Final_Semkath | `Gaf | `Gamal | `Hah | `Hanifi_Rohingya_Kinna_Ya | `Hanifi_Rohingya_Pa | `Hamza_On_Heh_Goal | `He | `Heh | `Heh_Goal | `Heth | `Kaf | `Kaph | `Kashmiri_Yeh | `Khaph | `Knotted_Heh | `Lam | `Lamadh | `Malayalam_Bha | `Malayalam_Ja | `Malayalam_Lla | `Malayalam_Llla | `Malayalam_Nga | `Malayalam_Nna | `Malayalam_Nnna | `Malayalam_Nya | `Malayalam_Ra | `Malayalam_Ssa | `Malayalam_Tta | `Manichaean_Aleph | `Manichaean_Ayin | `Manichaean_Beth | `Manichaean_Daleth | `Manichaean_Dhamedh | `Manichaean_Five | `Manichaean_Gimel | `Manichaean_Heth | `Manichaean_Hundred | `Manichaean_Kaph | `Manichaean_Lamedh | `Manichaean_Mem | `Manichaean_Nun | `Manichaean_One | `Manichaean_Pe | `Manichaean_Qoph | `Manichaean_Resh | `Manichaean_Sadhe | `Manichaean_Samekh | `Manichaean_Taw | `Manichaean_Ten | `Manichaean_Teth | `Manichaean_Thamedh | `Manichaean_Twenty | `Manichaean_Waw | `Manichaean_Yodh | `Manichaean_Zayin | `Meem | `Mim | `No_Joining_Group | `Noon | `Nun | `Nya | `Pe | `Qaf | `Qaph | `Reh | `Reversed_Pe | `Rohingya_Yeh | `Sad | `Sadhe | `Seen | `Semkath | `Shin | `Straight_Waw | `Swash_Kaf | `Syriac_Waw | `Tah | `Taw | `Teh_Marbuta | `Teh_Marbuta_Goal | `Teth | `Thin_Noon | `Thin_Yeh | `Vertical_Tail | `Waw | `Yeh | `Yeh_Barree | `Yeh_With_Tail | `Yudh | `Yudh_He | `Zain | `Zhain | `BAA | `FA | `HAA | `HA_GOAL | `HA | `CAF | `KNOTTED_HA | `RA | `SWASH_CAF | `HAMZAH_ON_HA_GOAL | `TAA_MARBUTAH | `YA_BARREE | `YA | `ALEF_MAQSURAH ] prop
val joining_type : [ `U | `C | `T | `D | `L | `R ] prop
val line_break : [ `AI | `AK | `AL | `AP | `AS | `B2 | `BA | `BB | `BK | `CB | `CJ | `CL | `CM | `CP | `CR | `EX | `GL | `H2 | `H3 | `HH | `HL | `HY | `ID | `IN | `IS | `JL | `JT | `JV | `LF | `NL | `NS | `NU | `OP | `PO | `PR | `QU | `RI | `SA | `SG | `SP | `SY | `VF | `VI | `WJ | `XX | `ZW | `EB | `EM | `ZWJ ] prop
val logical_order_exception : bool prop
val lowercase : bool prop
val lowercase_mapping : [ `Self | `Cps of cp list ] prop
val math : bool prop
val name : [ `Pattern of string | `Name of string ] prop

In the `Pattern case occurrences of the character '#' (U+0023) in the string must be replaced by the value of the code point as four to six uppercase hexadecimal digits (the minimal needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#" associated to code point U+3400 gives the name "CJK UNIFIED IDEOGRAPH-3400".
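
The substitution rule above can be sketched as a small stand-alone function. The type name and the function expand_name are hypothetical helpers introduced for illustration; the type is a local copy of the value type of the name property so that the snippet does not depend on the library.

```ocaml
(* Local copy of the value type of the name property. *)
type name = [ `Pattern of string | `Name of string ]

(* expand_name cp n is the character name denoted by n for code point cp.
   "%04X" prints at least four uppercase hexadecimal digits and more
   only when the value needs them. *)
let expand_name cp (n : name) = match n with
  | `Name s -> s
  | `Pattern p ->
    let hex = Printf.sprintf "%04X" cp in
    (* Every '#' in the pattern is replaced by the hexadecimal value. *)
    String.concat hex (String.split_on_char '#' p)
```

For instance expand_name 0x3400 (`Pattern "CJK UNIFIED IDEOGRAPH-#") evaluates to "CJK UNIFIED IDEOGRAPH-3400".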

val modifier_combining_mark : bool prop
val name_alias : (string * [ `Abbreviation | `Alternate | `Control | `Correction | `Figment ]) list prop
val nfc_quick_check : [ `True | `False | `Maybe ] prop
val nfd_quick_check : [ `True | `False | `Maybe ] prop
val nfkc_quick_check : [ `True | `False | `Maybe ] prop
val nfkc_casefold : [ `Self | `Cps of cp list ] prop
val nfkc_simple_casefold : [ `Self | `Cps of cp list ] prop
val nfkd_quick_check : [ `True | `False | `Maybe ] prop
val noncharacter_code_point : bool prop
val numeric_type : [ `None | `De | `Di | `Nu ] prop
val numeric_value : [ `NaN | `Nums of [ `Frac of int * int | `Num of int64 ] list ] prop
val other_alphabetic : bool prop
val other_default_ignorable_code_point : bool prop
val other_grapheme_extend : bool prop
val other_id_continue : bool prop
val other_id_start : bool prop
val other_lowercase : bool prop
val other_math : bool prop
val other_uppercase : bool prop
val pattern_syntax : bool prop
val pattern_white_space : bool prop
val prepended_concatenation_mark : bool prop
val quotation_mark : bool prop
val radical : bool prop
val regional_indicator : bool prop
type script = [
  | `Adlm
  | `Aghb
  | `Ahom
  | `Arab
  | `Armi
  | `Armn
  | `Avst
  | `Bali
  | `Bamu
  | `Bass
  | `Batk
  | `Beng
  | `Berf
  | `Bhks
  | `Bopo
  | `Brah
  | `Brai
  | `Bugi
  | `Buhd
  | `Cakm
  | `Cans
  | `Cari
  | `Cham
  | `Cher
  | `Chrs
  | `Copt
  | `Cpmn
  | `Cprt
  | `Cyrl
  | `Deva
  | `Diak
  | `Dogr
  | `Dsrt
  | `Dupl
  | `Egyp
  | `Elba
  | `Elym
  | `Ethi
  | `Gara
  | `Geor
  | `Glag
  | `Gong
  | `Gonm
  | `Goth
  | `Gran
  | `Grek
  | `Gujr
  | `Gukh
  | `Guru
  | `Hang
  | `Hani
  | `Hano
  | `Hatr
  | `Hebr
  | `Hira
  | `Hluw
  | `Hmng
  | `Hmnp
  | `Hrkt
  | `Hung
  | `Ital
  | `Java
  | `Kali
  | `Kana
  | `Kawi
  | `Khar
  | `Khmr
  | `Khoj
  | `Knda
  | `Krai
  | `Kthi
  | `Kits
  | `Lana
  | `Laoo
  | `Latn
  | `Lepc
  | `Limb
  | `Lina
  | `Linb
  | `Lisu
  | `Lyci
  | `Lydi
  | `Mahj
  | `Maka
  | `Mand
  | `Mani
  | `Marc
  | `Medf
  | `Mend
  | `Merc
  | `Mero
  | `Mlym
  | `Modi
  | `Mong
  | `Mroo
  | `Mtei
  | `Mult
  | `Mymr
  | `Nagm
  | `Nand
  | `Narb
  | `Nbat
  | `Newa
  | `Nkoo
  | `Nshu
  | `Ogam
  | `Olck
  | `Onao
  | `Orkh
  | `Orya
  | `Osge
  | `Osma
  | `Ougr
  | `Palm
  | `Pauc
  | `Perm
  | `Phag
  | `Phli
  | `Phlp
  | `Phnx
  | `Plrd
  | `Prti
  | `Qaai
  | `Rjng
  | `Rohg
  | `Runr
  | `Samr
  | `Sarb
  | `Saur
  | `Sgnw
  | `Shaw
  | `Shrd
  | `Sidd
  | `Sidt
  | `Sind
  | `Sinh
  | `Sogd
  | `Sogo
  | `Sora
  | `Soyo
  | `Sund
  | `Sunu
  | `Sylo
  | `Syrc
  | `Tagb
  | `Takr
  | `Tale
  | `Talu
  | `Taml
  | `Tang
  | `Tavt
  | `Tayo
  | `Telu
  | `Tfng
  | `Tglg
  | `Thaa
  | `Thai
  | `Tibt
  | `Tirh
  | `Tnsa
  | `Todr
  | `Tols
  | `Toto
  | `Tutg
  | `Ugar
  | `Vaii
  | `Vith
  | `Wara
  | `Wcho
  | `Xpeo
  | `Xsux
  | `Yezi
  | `Yiii
  | `Zanb
  | `Zinh
  | `Zyyy
  | `Zzzz
]
val script : script prop
val script_extensions : script list prop
val sentence_break : [ `AT | `CL | `CR | `EX | `FO | `LE | `LF | `LO | `NU | `SC | `SE | `SP | `ST | `UP | `XX ] prop
val simple_case_folding : [ `Self | `Cp of cp ] prop
val simple_lowercase_mapping : [ `Self | `Cp of cp ] prop
val simple_titlecase_mapping : [ `Self | `Cp of cp ] prop
val simple_uppercase_mapping : [ `Self | `Cp of cp ] prop
val soft_dotted : bool prop
val sterm : bool prop
val terminal_punctuation : bool prop
val titlecase_mapping : [ `Self | `Cps of cp list ] prop
val uax_42_element : [ `Reserved | `Noncharacter | `Surrogate | `Char ] prop

Not normative, artefact of Uucd. Corresponds to the XML element name that describes the code point.

val unicode_1_name : string prop
val unified_ideograph : bool prop
val uppercase : bool prop
val uppercase_mapping : [ `Self | `Cps of cp list ] prop
val variation_selector : bool prop
val vertical_orientation : [ `U | `R | `Tu | `Tr ] prop
val white_space : bool prop
val word_break : [ `CR | `DQ | `EB | `EBG | `EM | `EX | `Extend | `FO | `GAZ | `HL | `KA | `LE | `LF | `MB | `ML | `MN | `NL | `NU | `RI | `SQ | `WSegSpace | `XX | `ZWJ ] prop
val xid_continue : bool prop
val xid_start : bool prop
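
Several mapping properties in this list (case_folding, lowercase_mapping, uppercase_mapping, simple_uppercase_mapping and others) have a `Self case. The sketch below shows one way to turn such values into concrete code points. It assumes that `Self means the code point maps to itself, which is how the case reads but is not spelled out on this page; resolve_mapping and resolve_simple_mapping are hypothetical helper names and the cp type is a local copy.

```ocaml
(* Local copy of the code point type. *)
type cp = int

(* Assumption: `Self stands for "maps to the code point itself". *)
let resolve_mapping (cp : cp) = function
  | `Self -> [cp]
  | `Cps cps -> cps

(* Same idea for the simple mappings, which yield a single code point. *)
let resolve_simple_mapping (cp : cp) = function
  | `Self -> cp
  | `Cp c -> c
```

Resolving eagerly like this gives up the sharing that `Self provides, so it is probably best done only at the point where the final table is built.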

Unihan properties

In alphabetical order. For now Unihan properties are always represented as strings.

val kAccountingNumeric : string prop
val kAlternateHanYu : string prop
val kAlternateJEF : string prop
val kAlternateKangXi : string prop
val kAlternateMorohashi : string prop
val kAlternateTotalStrokes : string prop
val kBigFive : string prop
val kCCCII : string prop
val kCNS1986 : string prop
val kCNS1992 : string prop
val kCangjie : string prop
val kCantonese : string prop
val kCheungBauer : string prop
val kCheungBauerIndex : string prop
val kCihaiT : string prop
val kCompatibilityVariant : string prop
val kCowles : string prop
val kDaeJaweon : string prop
val kDefinition : string prop
val kEACC : string prop
val kFanqie : string prop
val kFenn : string prop
val kFennIndex : string prop
val kFourCornerCode : string prop
val kFrequency : string prop
val kGB0 : string prop
val kGB1 : string prop
val kGB3 : string prop
val kGB5 : string prop
val kGB8 : string prop
val kGSR : string prop
val kGradeLevel : string prop
val kHDZRadBreak : string prop
val kHKGlyph : string prop
val kHKSCS : string prop
val kHanYu : string prop
val kHangul : string prop
val kHanyuPinlu : string prop
val kHanyuPinyin : string prop
val kIBMJapan : string prop
val kIICore : string prop
val kIRGDaeJaweon : string prop
val kIRGDaiKanwaZiten : string prop
val kIRGHanyuDaZidian : string prop
val kIRGKangXi : string prop
val kIRG_GSource : string prop
val kIRG_HSource : string prop
val kIRG_JSource : string prop
val kIRG_KPSource : string prop
val kIRG_KSource : string prop
val kIRG_MSource : string prop
val kIRG_SSource : string prop
val kIRG_TSource : string prop
val kIRG_USource : string prop
val kIRG_UKSource : string prop
val kIRG_VSource : string prop
val kJapanese : string prop
val kJapaneseKun : string prop
val kJapaneseOn : string prop
val kJHJ : string prop
val kJIS0213 : string prop
val kJinmeiyoKanji : string prop
val kJis0 : string prop
val kJis1 : string prop
val kJoyoKanji : string prop
val kKPS0 : string prop
val kKPS1 : string prop
val kKSC0 : string prop
val kKSC1 : string prop
val kKangXi : string prop
val kKarlgren : string prop
val kKorean : string prop
val kKoreanEducationHanja : string prop
val kKoreanName : string prop
val kLau : string prop
val kMainlandTelegraph : string prop
val kMandarin : string prop
val kMatthews : string prop
val kMeyerWempe : string prop
val kMojiJoho : string prop
val kMorohashi : string prop
val kNelson : string prop
val kNSHU_DubenSrc : string prop
val kNSHU_Reading : string prop
val kOtherNumeric : string prop
val kPhonetic : string prop
val kPrimaryNumeric : string prop
val kPseudoGB1 : string prop
val kRSAdobe_Japan1_6 : string prop
val kRSJapanese : string prop
val kRSKanWa : string prop
val kRSKangXi : string prop
val kRSKorean : string prop
val kRSMerged : string prop
val kRSUnicode : string prop
val kSBGY : string prop
val kSemanticVariant : string prop
val kSimplifiedVariant : string prop
val kSMSZD2003Index : string prop
val kSMSZD2003Readings : string prop
val kSpecializedSemanticVariant : string prop
val kSpoofingVariant : string prop
val kStrange : string prop
val kUnihanCore2020 : string prop
val kTGH : string prop
val kTGHZ2013 : string prop
val kTGT_MergedSrc : string prop
val kTGT_RSUnicode : string prop
val kTaiwanTelegraph : string prop
val kTang : string prop
val kTayNumeric : string prop
val kTotalStrokes : string prop
val kTraditionalVariant : string prop
val kVietnamese : string prop
val kVietnameseNumeric : string prop
val kWubi : string prop
val kXHC1983 : string prop
val kZhuang : string prop
val kXerox : string prop
val kZhuangNumeric : string prop
val kZVariant : string prop

Unikemet properties

val kEH_Cat : string prop
val kEH_Core : string prop
val kEH_Desc : string prop
val kEH_Func : string prop
val kEH_FVal : string prop
val kEH_UniK : string prop
val kEH_JSesh : string prop
val kEH_HG : string prop
val kEH_IFAO : string prop
val kEH_NoMirror : bool prop
val kEH_NoRotate : bool prop
val kEH_AltSeq : string prop

Unicode character databases

type block = (cp * cp) * string

The type for blocks. Code point range, name of the block.
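
A block list of this shape can be searched linearly to find the block containing a code point. The sketch below is a stand-alone illustration: the types are local copies, block_of_cp is a hypothetical helper, the ranges are assumed to be inclusive, and sample_blocks is hand-written sample data rather than something read from a database.

```ocaml
(* Local copies of the types so that the sketch stands alone. *)
type cp = int
type block = (cp * cp) * string

(* block_of_cp blocks cp is the name of the first block whose range
   contains cp, if any. Ranges are assumed to be inclusive. *)
let block_of_cp (blocks : block list) (cp : cp) =
  List.find_opt (fun ((first, last), _) -> first <= cp && cp <= last) blocks
  |> Option.map snd

(* Hand-written sample data for illustration only. *)
let sample_blocks : block list =
  [ (0x0000, 0x007F), "Basic Latin";
    (0x0080, 0x00FF), "Latin-1 Supplement" ]
```

A linear scan is enough for a one-off lookup; for repeated queries a sorted array with binary search would likely be a better fit.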

type named_sequence = string * cp list

The type for named sequences. Sequence name, code point sequence.

type standardized_variant = cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list

The type for standardized variants. Code point sequence, description, and when (the shaping environments in which the variant applies).

type cjk_radical = string * cp * cp

The type for CJK radicals. Radical number, CJK radical character, CJK unified ideograph.

type do_not_emit = {
  instead_of : cp list;
  use : cp list;
  because : string;
}

The type for do not emit character sequences.
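
One possible use of these records is to check whether a code point sequence contains a discouraged sequence and to report the preferred replacement. The following stand-alone sketch assumes that instead_of is the discouraged sequence and use the preferred one, as the field names suggest. The helper names is_prefix, occurs and find_discouraged are hypothetical, and sample_entries is hand-written data for illustration.

```ocaml
(* Local copies of the types so that the sketch stands alone. *)
type cp = int
type do_not_emit = { instead_of : cp list; use : cp list; because : string }

(* is_prefix p l is true iff p is a prefix of l. *)
let rec is_prefix p l = match p, l with
  | [], _ -> true
  | x :: p', y :: l' -> x = y && is_prefix p' l'
  | _ :: _, [] -> false

(* occurs sub l is true iff sub occurs contiguously in l. *)
let rec occurs sub l =
  is_prefix sub l || (match l with [] -> false | _ :: l' -> occurs sub l')

(* First entry whose discouraged sequence occurs in cps, if any.
   Entries with an empty instead_of are skipped. *)
let find_discouraged (entries : do_not_emit list) (cps : cp list) =
  List.find_opt (fun e -> e.instead_of <> [] && occurs e.instead_of cps) entries

(* Hand-written sample entry for illustration only. *)
let sample_entries =
  [ { instead_of = [0x0149]; use = [0x02BC; 0x006E]; because = "Deprecated" } ]
```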

type t = {
  description : string;
  repertoire : props Cpmap.t;
  blocks : block list;
  named_sequences : named_sequence list;
  provisional_named_sequences : named_sequence list;
  standardized_variants : standardized_variant list;
  cjk_radicals : cjk_radical list;
  do_not_emit : do_not_emit list;
}

The type for Unicode character databases.

Note. Absence of an optional top-level field in the database is denoted by the neutral element of its type (empty string, empty list, Cpmap.empty). This means that the module doesn't distinguish between absence of a field and presence of the field with empty data, which is harmless in this context.

val cp_prop : t -> cp -> 'a prop -> 'a option

cp_prop ucd cp p is the property p of the code point cp in ucd's repertoire, if cp is in the repertoire and the property exists for cp.

Decode

type src = [
  | `Channel of in_channel
  | `String of string
]

The type for input sources.

type decoder

The type for Unicode character database decoders.

val decoder : [< src ] -> decoder

decoder src is a decoder that inputs from src.

val decode : decoder -> [ `Ok of t | `Error of string ]

decode d decodes a database from d or returns an error.

val decoded_range : decoder -> (int * int) * (int * int)

decoded_range d is the range of characters spanning the `Error decoded by d. The range is a pair of start and end positions, each a pair of line and column numbers, respectively one-based and zero-based.

Basics

The database and subsets of it for Unicode 16.0.0 are available here. Databases with groups should be preferred: they maximize value sharing and improve parsing performance.

A database is decoded as follows:

let ucd_or_die inf = try
  let ic = if inf = "-" then stdin else open_in inf in
  let d = Uucd.decoder (`Channel ic) in
  match Uucd.decode d with
  | `Ok db -> db
  | `Error e ->
    let (l0, c0), (l1, c1) = Uucd.decoded_range d in
    Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
    exit 1
with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1

let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"

The convenience function cp_prop can be used to query the property of a given code point. For example the general category of U+1F42B is given by:

let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category