Uucd
Unicode character database decoder.
Uucd decodes the data of the Unicode character database from its XML representation. It provides high-level (but not necessarily efficient) access to the data so that efficient representations can be extracted.
Uucd decodes the representation described in Annex #42 of Unicode 16.0.0. Subsequent versions may be decoded as long as no new cases are introduced in parsed enumerated properties.
Consult the basics.
Note. All strings returned by the module are UTF-8 encoded.
Unicode version 16.0.0
The type for Unicode code points, ranges from 0x0000 to 0x10_FFFF.
is_cp n is true iff n is a Unicode code point.
is_scalar_value n is true iff n is a Unicode scalar value.
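As a rough illustration of the two predicates above, the following sketch re-implements them locally instead of calling Uucd. It assumes the usual definitions from the Unicode standard: a code point is an integer in the range 0x0000 to 0x10_FFFF, and a scalar value is a code point outside the surrogate range 0xD800 to 0xDFFF. The names is_cp_sketch and is_scalar_value_sketch are hypothetical and are not part of the module.

```ocaml
(* Local stand-ins for illustration; these are not Uucd's own definitions. *)

(* A code point is any integer in 0x0000..0x10_FFFF. *)
let is_cp_sketch n = 0x0000 <= n && n <= 0x10_FFFF

(* A scalar value is a code point outside the surrogate range
   0xD800..0xDFFF. *)
let is_scalar_value_sketch n =
  (0x0000 <= n && n <= 0xD7FF) || (0xE000 <= n && n <= 0x10_FFFF)
```

Under these assumptions a surrogate such as 0xD800 is a code point but not a scalar value.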
Properties are referenced by their name and property values by their abbreviated name. To understand their semantics refer to the standard.
The type for sets of properties.
The type for properties with property value of type 'a.
unknown_prop (ns, n) is a property read from an XML attribute whose expanded name is (ns, n). This can be used to access a property unknown to the module.
In alphabetical order.
val block :
[ `ASCII
| `Adlam
| `Aegean_Numbers
| `Ahom
| `Alchemical
| `Alphabetic_PF
| `Anatolian_Hieroglyphs
| `Ancient_Greek_Music
| `Ancient_Greek_Numbers
| `Ancient_Symbols
| `Arabic
| `Arabic_Ext_A
| `Arabic_Ext_B
| `Arabic_Ext_C
| `Arabic_Math
| `Arabic_PF_A
| `Arabic_PF_B
| `Arabic_Sup
| `Armenian
| `Arrows
| `Avestan
| `Balinese
| `Bamum
| `Bamum_Sup
| `Bassa_Vah
| `Batak
| `Bengali
| `Beria_Erfe
| `Bhaiksuki
| `Block_Elements
| `Bopomofo
| `Bopomofo_Ext
| `Box_Drawing
| `Brahmi
| `Braille
| `Buginese
| `Buhid
| `Byzantine_Music
| `CJK
| `CJK_Compat
| `CJK_Compat_Forms
| `CJK_Compat_Ideographs
| `CJK_Compat_Ideographs_Sup
| `CJK_Ext_A
| `CJK_Ext_B
| `CJK_Ext_C
| `CJK_Ext_D
| `CJK_Ext_E
| `CJK_Ext_F
| `CJK_Ext_G
| `CJK_Ext_H
| `CJK_Ext_I
| `CJK_Ext_J
| `CJK_Radicals_Sup
| `CJK_Strokes
| `CJK_Symbols
| `Carian
| `Caucasian_Albanian
| `Chakma
| `Cham
| `Cherokee
| `Cherokee_Sup
| `Chess_Symbols
| `Chorasmian
| `Compat_Jamo
| `Control_Pictures
| `Coptic
| `Coptic_Epact_Numbers
| `Counting_Rod
| `Cuneiform
| `Cuneiform_Numbers
| `Currency_Symbols
| `Cypriot_Syllabary
| `Cypro_Minoan
| `Cyrillic
| `Cyrillic_Ext_A
| `Cyrillic_Ext_B
| `Cyrillic_Ext_C
| `Cyrillic_Ext_D
| `Cyrillic_Sup
| `Deseret
| `Devanagari
| `Devanagari_Ext
| `Devanagari_Ext_A
| `Diacriticals
| `Diacriticals_Ext
| `Diacriticals_For_Symbols
| `Diacriticals_Sup
| `Dingbats
| `Dives_Akuru
| `Dogra
| `Domino
| `Duployan
| `Early_Dynastic_Cuneiform
| `Egyptian_Hieroglyph_Format_Controls
| `Egyptian_Hieroglyphs
| `Egyptian_Hieroglyphs_Ext_A
| `Elbasan
| `Elymaic
| `Emoticons
| `Enclosed_Alphanum
| `Enclosed_Alphanum_Sup
| `Enclosed_CJK
| `Enclosed_Ideographic_Sup
| `Ethiopic
| `Ethiopic_Ext
| `Ethiopic_Ext_A
| `Ethiopic_Ext_B
| `Ethiopic_Sup
| `Garay
| `Geometric_Shapes
| `Geometric_Shapes_Ext
| `Georgian
| `Georgian_Ext
| `Georgian_Sup
| `Glagolitic
| `Glagolitic_Sup
| `Gothic
| `Grantha
| `Greek
| `Greek_Ext
| `Gujarati
| `Gunjala_Gondi
| `Gurmukhi
| `Gurung_Khema
| `Half_And_Full_Forms
| `Half_Marks
| `Hangul
| `Hanifi_Rohingya
| `Hanunoo
| `Hatran
| `Hebrew
| `High_PU_Surrogates
| `High_Surrogates
| `Hiragana
| `IDC
| `IPA_Ext
| `Ideographic_Symbols
| `Imperial_Aramaic
| `Indic_Number_Forms
| `Indic_Siyaq_Numbers
| `Inscriptional_Pahlavi
| `Inscriptional_Parthian
| `Jamo
| `Jamo_Ext_A
| `Jamo_Ext_B
| `Javanese
| `Kaithi
| `Kaktovik_Numerals
| `Kana_Ext_A
| `Kana_Ext_B
| `Kana_Sup
| `Kanbun
| `Kangxi
| `Kannada
| `Katakana
| `Katakana_Ext
| `Kawi
| `Kayah_Li
| `Kharoshthi
| `Khitan_Small_Script
| `Khmer
| `Khmer_Symbols
| `Khojki
| `Khudawadi
| `Kirat_Rai
| `Lao
| `Latin_1_Sup
| `Latin_Ext_A
| `Latin_Ext_Additional
| `Latin_Ext_B
| `Latin_Ext_C
| `Latin_Ext_D
| `Latin_Ext_E
| `Latin_Ext_F
| `Latin_Ext_G
| `Lepcha
| `Letterlike_Symbols
| `Limbu
| `Linear_A
| `Linear_B_Ideograms
| `Linear_B_Syllabary
| `Lisu
| `Lisu_Sup
| `Low_Surrogates
| `Lycian
| `Lydian
| `Mahajani
| `Mahjong
| `Makasar
| `Malayalam
| `Mandaic
| `Manichaean
| `Marchen
| `Masaram_Gondi
| `Math_Alphanum
| `Math_Operators
| `Mayan_Numerals
| `Medefaidrin
| `Meetei_Mayek
| `Meetei_Mayek_Ext
| `Mende_Kikakui
| `Meroitic_Cursive
| `Meroitic_Hieroglyphs
| `Miao
| `Misc_Arrows
| `Misc_Math_Symbols_A
| `Misc_Math_Symbols_B
| `Misc_Pictographs
| `Misc_Symbols
| `Misc_Symbols_Sup
| `Misc_Technical
| `Modi
| `Modifier_Letters
| `Modifier_Tone_Letters
| `Mongolian
| `Mongolian_Sup
| `Mro
| `Multani
| `Music
| `Myanmar
| `Myanmar_Ext_A
| `Myanmar_Ext_B
| `Myanmar_Ext_C
| `NB
| `NKo
| `Nabataean
| `Nag_Mundari
| `Nandinagari
| `New_Tai_Lue
| `Newa
| `Number_Forms
| `Nushu
| `Nyiakeng_Puachue_Hmong
| `OCR
| `Ogham
| `Ol_Onal
| `Ol_Chiki
| `Old_Hungarian
| `Old_Italic
| `Old_North_Arabian
| `Old_Permic
| `Old_Persian
| `Old_Sogdian
| `Old_South_Arabian
| `Old_Turkic
| `Old_Uyghur
| `Oriya
| `Ornamental_Dingbats
| `Osage
| `Osmanya
| `Ottoman_Siyaq_Numbers
| `PUA
| `Pahawh_Hmong
| `Palmyrene
| `Pau_Cin_Hau
| `Phags_Pa
| `Phaistos
| `Phoenician
| `Phonetic_Ext
| `Phonetic_Ext_Sup
| `Playing_Cards
| `Psalter_Pahlavi
| `Punctuation
| `Rejang
| `Rumi
| `Runic
| `Samaritan
| `Saurashtra
| `Sharada
| `Sharada_Sup
| `Shavian
| `Shorthand_Format_Controls
| `Siddham
| `Sidetic
| `Sinhala
| `Sinhala_Archaic_Numbers
| `Small_Forms
| `Small_Kana_Ext
| `Sogdian
| `Sora_Sompeng
| `Soyombo
| `Specials
| `Sundanese
| `Sundanese_Sup
| `Sunuwar
| `Sup_Arrows_A
| `Sup_Arrows_B
| `Sup_Arrows_C
| `Sup_Math_Operators
| `Sup_PUA_A
| `Sup_PUA_B
| `Sup_Punctuation
| `Sup_Symbols_And_Pictographs
| `Super_And_Sub
| `Sutton_SignWriting
| `Syloti_Nagri
| `Symbols_And_Pictographs_Ext_A
| `Symbols_For_Legacy_Computing
| `Symbols_For_Legacy_Computing_Sup
| `Syriac
| `Syriac_Sup
| `Tagalog
| `Tagbanwa
| `Tags
| `Tai_Le
| `Tai_Tham
| `Tai_Viet
| `Tai_Xuan_Jing
| `Tai_Yo
| `Takri
| `Tamil
| `Tamil_Sup
| `Tangsa
| `Tangut
| `Tangut_Components
| `Tangut_Components_Sup
| `Tangut_Sup
| `Telugu
| `Thaana
| `Thai
| `Tibetan
| `Tifinagh
| `Tirhuta
| `Todhri
| `Tolong_Siki
| `Toto
| `Transport_And_Map
| `Tulu_Tigalari
| `UCAS
| `UCAS_Ext
| `UCAS_Ext_A
| `Ugaritic
| `VS
| `VS_Sup
| `Vai
| `Vedic_Ext
| `Vertical_Forms
| `Vithkuqi
| `Wancho
| `Warang_Citi
| `Yezidi
| `Yi_Radicals
| `Yi_Syllables
| `Yijing
| `Zanabazar_Square
| `Znamenny_Music ]
prop
val indic_syllabic_category :
[ `Avagraha
| `Bindu
| `Brahmi_Joining_Number
| `Cantillation_Mark
| `Consonant
| `Consonant_Dead
| `Consonant_Final
| `Consonant_Head_Letter
| `Consonant_Initial_Postfixed
| `Consonant_Killer
| `Consonant_Medial
| `Consonant_Placeholder
| `Consonant_Preceding_Repha
| `Consonant_Prefixed
| `Consonant_Repha
| `Consonant_Subjoined
| `Consonant_Succeeding_Repha
| `Consonant_With_Stacker
| `Gemination_Mark
| `Invisible_Stacker
| `Joiner
| `Modifying_Letter
| `Non_Joiner
| `Nukta
| `Number
| `Number_Joiner
| `Other
| `Pure_Killer
| `Reordering_Killer
| `Register_Shifter
| `Syllable_Modifier
| `Tone_Letter
| `Tone_Mark
| `Virama
| `Visarga
| `Vowel
| `Vowel_Dependent
| `Vowel_Independent ]
prop
val indic_positional_category :
[ `Bottom
| `Bottom_And_Left
| `Bottom_And_Right
| `Invisible
| `Left
| `Left_And_Right
| `NA
| `Overstruck
| `Right
| `Top
| `Top_And_Bottom
| `Top_And_Bottom_And_Left
| `Top_And_Bottom_And_Right
| `Top_And_Left
| `Top_And_Left_And_Right
| `Top_And_Right
| `Visual_Order_Left ]
prop
val joining_group :
[ `African_Feh
| `African_Noon
| `African_Qaf
| `Ain
| `Alaph
| `Alef
| `Alef_Maqsurah
| `Beh
| `Beth
| `Burushaski_Yeh_Barree
| `Dal
| `Dalath_Rish
| `E
| `Farsi_Yeh
| `Fe
| `Feh
| `Final_Semkath
| `Gaf
| `Gamal
| `Hah
| `Hanifi_Rohingya_Kinna_Ya
| `Hanifi_Rohingya_Pa
| `Hamza_On_Heh_Goal
| `He
| `Heh
| `Heh_Goal
| `Heth
| `Kaf
| `Kaph
| `Kashmiri_Yeh
| `Khaph
| `Knotted_Heh
| `Lam
| `Lamadh
| `Malayalam_Bha
| `Malayalam_Ja
| `Malayalam_Lla
| `Malayalam_Llla
| `Malayalam_Nga
| `Malayalam_Nna
| `Malayalam_Nnna
| `Malayalam_Nya
| `Malayalam_Ra
| `Malayalam_Ssa
| `Malayalam_Tta
| `Manichaean_Aleph
| `Manichaean_Ayin
| `Manichaean_Beth
| `Manichaean_Daleth
| `Manichaean_Dhamedh
| `Manichaean_Five
| `Manichaean_Gimel
| `Manichaean_Heth
| `Manichaean_Hundred
| `Manichaean_Kaph
| `Manichaean_Lamedh
| `Manichaean_Mem
| `Manichaean_Nun
| `Manichaean_One
| `Manichaean_Pe
| `Manichaean_Qoph
| `Manichaean_Resh
| `Manichaean_Sadhe
| `Manichaean_Samekh
| `Manichaean_Taw
| `Manichaean_Ten
| `Manichaean_Teth
| `Manichaean_Thamedh
| `Manichaean_Twenty
| `Manichaean_Waw
| `Manichaean_Yodh
| `Manichaean_Zayin
| `Meem
| `Mim
| `No_Joining_Group
| `Noon
| `Nun
| `Nya
| `Pe
| `Qaf
| `Qaph
| `Reh
| `Reversed_Pe
| `Rohingya_Yeh
| `Sad
| `Sadhe
| `Seen
| `Semkath
| `Shin
| `Straight_Waw
| `Swash_Kaf
| `Syriac_Waw
| `Tah
| `Taw
| `Teh_Marbuta
| `Teh_Marbuta_Goal
| `Teth
| `Thin_Noon
| `Thin_Yeh
| `Vertical_Tail
| `Waw
| `Yeh
| `Yeh_Barree
| `Yeh_With_Tail
| `Yudh
| `Yudh_He
| `Zain
| `Zhain
| `BAA
| `FA
| `HAA
| `HA_GOAL
| `HA
| `CAF
| `KNOTTED_HA
| `RA
| `SWASH_CAF
| `HAMZAH_ON_HA_GOAL
| `TAA_MARBUTAH
| `YA_BARREE
| `YA
| `ALEF_MAQSURAH ]
prop
val line_break :
[ `AI
| `AK
| `AL
| `AP
| `AS
| `B2
| `BA
| `BB
| `BK
| `CB
| `CJ
| `CL
| `CM
| `CP
| `CR
| `EX
| `GL
| `H2
| `H3
| `HH
| `HL
| `HY
| `ID
| `IN
| `IS
| `JL
| `JT
| `JV
| `LF
| `NL
| `NS
| `NU
| `OP
| `PO
| `PR
| `QU
| `RI
| `SA
| `SG
| `SP
| `SY
| `VF
| `VI
| `WJ
| `XX
| `ZW
| `EB
| `EM
| `ZWJ ]
prop
In the `Pattern case occurrences of the character '#' (U+0023) in the string must be replaced by the value of the code point as four to six uppercase hexadecimal digits (the minimal needed). E.g. the pattern "CJK UNIFIED IDEOGRAPH-#" associated to code point U+3400 gives the name "CJK UNIFIED IDEOGRAPH-3400".
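The substitution described above could be sketched as follows. This is a minimal, standalone helper and not a function provided by Uucd; the name expand_pattern is hypothetical. It relies on the "%04X" format, which pads to four uppercase hexadecimal digits and grows to five or six when the code point needs them.

```ocaml
(* Hypothetical helper, not part of Uucd: replaces every '#' in [pattern]
   by [cp] written with at least four uppercase hexadecimal digits. *)
let expand_pattern pattern cp =
  let hex = Printf.sprintf "%04X" cp in
  (* Splitting on '#' and joining with the digits replaces each occurrence. *)
  String.concat hex (String.split_on_char '#' pattern)

let name_3400 = expand_pattern "CJK UNIFIED IDEOGRAPH-#" 0x3400
let name_20000 = expand_pattern "CJK UNIFIED IDEOGRAPH-#" 0x20000
```

Here name_3400 is the name given in the example above, and name_20000 shows the five-digit case.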
type script = [
| `Adlm
| `Aghb
| `Ahom
| `Arab
| `Armi
| `Armn
| `Avst
| `Bali
| `Bamu
| `Bass
| `Batk
| `Beng
| `Berf
| `Bhks
| `Bopo
| `Brah
| `Brai
| `Bugi
| `Buhd
| `Cakm
| `Cans
| `Cari
| `Cham
| `Cher
| `Chrs
| `Copt
| `Cpmn
| `Cprt
| `Cyrl
| `Deva
| `Diak
| `Dogr
| `Dsrt
| `Dupl
| `Egyp
| `Elba
| `Elym
| `Ethi
| `Gara
| `Geor
| `Glag
| `Gong
| `Gonm
| `Goth
| `Gran
| `Grek
| `Gujr
| `Gukh
| `Guru
| `Hang
| `Hani
| `Hano
| `Hatr
| `Hebr
| `Hira
| `Hluw
| `Hmng
| `Hmnp
| `Hrkt
| `Hung
| `Ital
| `Java
| `Kali
| `Kana
| `Kawi
| `Khar
| `Khmr
| `Khoj
| `Knda
| `Krai
| `Kthi
| `Kits
| `Lana
| `Laoo
| `Latn
| `Lepc
| `Limb
| `Lina
| `Linb
| `Lisu
| `Lyci
| `Lydi
| `Mahj
| `Maka
| `Mand
| `Mani
| `Marc
| `Medf
| `Mend
| `Merc
| `Mero
| `Mlym
| `Modi
| `Mong
| `Mroo
| `Mtei
| `Mult
| `Mymr
| `Nagm
| `Nand
| `Narb
| `Nbat
| `Newa
| `Nkoo
| `Nshu
| `Ogam
| `Olck
| `Onao
| `Orkh
| `Orya
| `Osge
| `Osma
| `Ougr
| `Palm
| `Pauc
| `Perm
| `Phag
| `Phli
| `Phlp
| `Phnx
| `Plrd
| `Prti
| `Qaai
| `Rjng
| `Rohg
| `Runr
| `Samr
| `Sarb
| `Saur
| `Sgnw
| `Shaw
| `Shrd
| `Sidd
| `Sidt
| `Sind
| `Sinh
| `Sogd
| `Sogo
| `Sora
| `Soyo
| `Sund
| `Sunu
| `Sylo
| `Syrc
| `Tagb
| `Takr
| `Tale
| `Talu
| `Taml
| `Tang
| `Tavt
| `Tayo
| `Telu
| `Tfng
| `Tglg
| `Thaa
| `Thai
| `Tibt
| `Tirh
| `Tnsa
| `Todr
| `Tols
| `Toto
| `Tutg
| `Ugar
| `Vaii
| `Vith
| `Wara
| `Wcho
| `Xpeo
| `Xsux
| `Yezi
| `Yiii
| `Zanb
| `Zinh
| `Zyyy
| `Zzzz
]
Not normative, artefact of Uucd. Corresponds to the XML element name that describes the code point.
In alphabetical order. For now Unihan properties are always represented as strings.
The type for named sequences. Sequence name, code point sequence.
type standardized_variant =
cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list
The type for standardized variants. Code point sequence, description, when.
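To make the shape of this triple concrete, here is a small standalone sketch. It copies the type locally so that it does not depend on Uucd, and the helpers shaping_to_string and describe as well as the value example are hypothetical names introduced only for illustration. The example value is made up and is not read from a database.

```ocaml
(* Local copies of the types so the sketch stands alone. *)
type cp = int
type standardized_variant =
  cp list * string * [ `Isolate | `Initial | `Medial | `Final ] list

(* Hypothetical helper: names a shaping context. *)
let shaping_to_string = function
  | `Isolate -> "isolate"
  | `Initial -> "initial"
  | `Medial -> "medial"
  | `Final -> "final"

(* Hypothetical helper: renders a variant as
   "U+XXXX U+XXXX; description; contexts". *)
let describe ((cps, desc, contexts) : standardized_variant) =
  let cps = String.concat " " (List.map (Printf.sprintf "U+%04X") cps) in
  let contexts = String.concat " " (List.map shaping_to_string contexts) in
  Printf.sprintf "%s; %s; %s" cps desc contexts

(* Illustrative value: a two code point sequence, a description and an
   empty list of shaping contexts. *)
let example : standardized_variant =
  ([ 0x0030; 0xFE00 ], "short diagonal stroke form", [])
```

An empty list in the third component means the variant is not restricted to particular shaping contexts in this sketch.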
The type for CJK radicals. Radical number, CJK radical character, CJK unified ideograph.
The type for do not emit character sequences.
type t = {
description : string;
repertoire : props Cpmap.t;
blocks : block list;
named_sequences : named_sequence list;
provisional_named_sequences : named_sequence list;
standardized_variants : standardized_variant list;
cjk_radicals : cjk_radical list;
do_not_emit : do_not_emit list;
}
The type for Unicode character databases.
Note. Absence of an optional top-level field in the database is denoted by the neutral element of its type (empty string, empty list, Cpmap.empty). This means that the module doesn't distinguish between absence of a field and presence of the field with empty data, which causes no problems in this context.
cp_prop ucd cp p is the property p of the code point cp in ucd's repertoire, if cp is in the repertoire and the property exists for cp.
The type for input sources.
The type for Unicode character database decoders.
decode d decodes a database from d or returns an error.
decoded_range d is the range of characters spanning the `Error decoded by d. Each end of the range is a pair of line and column numbers, respectively one-based and zero-based.
The database and subsets of it for Unicode 16.0.0 are available here. Databases with groups should be preferred: they maximize value sharing and improve parsing performance.
A database is decoded as follows:
let ucd_or_die inf = try
  let ic = if inf = "-" then stdin else open_in inf in
  let d = Uucd.decoder (`Channel ic) in
  match Uucd.decode d with
  | `Ok db -> db
  | `Error e ->
    let (l0, c0), (l1, c1) = Uucd.decoded_range d in
    Printf.eprintf "%s:%d.%d-%d.%d: %s\n%!" inf l0 c0 l1 c1 e;
    exit 1
with Sys_error e -> Printf.eprintf "%s\n%!" e; exit 1

let ucd = ucd_or_die "/tmp/ucd.all.grouped.xml"
The convenience function cp_prop can be used to query the property of a given code point. For example the general category of U+1F42B is given by:
let u_1F42B_gc = Uucd.cp_prop ucd 0x1F42B Uucd.general_category