package fuzzy_compare
Install
Dune Dependency
Authors
Maintainers
Sources
md5=90ba5dd28b12af0bb795f6bdbca34deb
sha512=7550a654b200d82eac9ccdc58bb21627e29ff78bb9f0b06dd75b7aeb3960f58b42632b6836ce09103b29f254fb628bab5418e0e34f08a79833c7e01c3709b094
README.md.html
fuzzy_compare
Extremely fast Levenshtein comparator.
Checks whether 2 values are within a predetermined number of edits of one another.
most comparisons take under 5 ยตs, depending on the length of the values
a Functor is provided to enable comparisons across any arbitrary types
string comparisons are provided (functorized) out of the box
reuse the same automaton across all comparisons with the same
max_edits
, regardless of the type of the values being comparedmax_edits
must be between0
and3
(inclusively) due to the astronomical scaling factor during graph building
Example
(* Create an automaton *)
let within3 = Fuzzy_compare.create ~max_edits:3 in
(* true *)
let b1 = Fuzzy_compare.String.eval within3 "kitten" "kitsch" in
(* false *)
let b2 = Fuzzy_compare.String.eval within3 "kittens" "kitsch" in
Unicode / arbitrary types
This example demonstrates how to compare unicode string as well as how to compare your own types.
For unicode we'll need uunf
and uuseg
:
opam install uunf uuseg
First apply the Make
functor:
(* Fuzzy_compare.Make is documented in [fuzzy_compare.mli] *)
module Utf8 = Fuzzy_compare.Make (struct
type t = string
type cmp = string [@@deriving equal]
type index = string array
let index x =
(* Normalize to Unicode NFC,
then break the string into an array of Grapheme Clusters *)
Uunf_string.normalize_utf_8 `NFC x
|> Uuseg_string.fold_utf_8 `Grapheme_cluster (fun acc s -> s :: acc) []
|> Array.of_list_rev
let length = Array.length
let get = Array.get
let fold = Array.fold
end)
Then proceed as usual:
let within2 = Fuzzy_compare.create ~max_edits:2 in
(* false because each emoji is made of multiple bytes *)
let b1 = Fuzzy_compare.String.eval within2 "๐๐ด๐ฉ" "๐ฉ๐ด๐" in
(* true thanks to Unicode awareness *)
let b1 = Utf8.eval within2 "๐๐ด๐ฉ" "๐ฉ๐ด๐" in
(* false *)
let b2 = Utf8.eval within2 "๐๐ด๐ฉ" "๐ฉ๐ณ๏ธ" in
Thanks
Thanks to Paul Masurel and Jules Jacob for explaining clearly the beauty of the algorithms contained within the "Fast String Correction with Levenshtein-Automata" paper, and thanks to Klaus Schulz and Stoyan Mihov for writing the paper.
This library implements the most sophisticated (and most efficient) version described by Paul Masurel, with several additional low level optimizations.
Recommended reading: https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html https://fulmicoton.com/posts/levenshtein/