Legend:
Library
Module
Module type
Parameter
Class
Class type
Levenshtein distance
We take inspiration from http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata for the main algorithm and ideas. However some parts are adapted
type'a sequence = ('a-> unit)-> unit
type'a gen = unit ->'a option
Abstraction over Strings
Due to the existence of several encodings and string representations we abstract over the type of strings. A string is a finite array of characters (8-bits char, unicode runes, etc.) which provides a length operation and a function to access the n-th character.
This data structure is used to represent a list of result that is evaluated only as far as the user wants. If the user only wants a few elements, she doesn't pay for the remaining ones.
In particular, when matching a string against a (big) set of indexed strings, we return a continuation list so that, even if there are many results, only those actually asked for are evaluated.
type'a klist = unit ->[ `Nil | `Cons of 'a * 'aklist ]
The signature for a given string representation provides 3 main things:
a edit_distance function to compute the edit distance between strings
an automaton type that is built from a string s and a maximum distance n, and only accepts the strings s' such that edit_distance s s' <= n.
an Index module that can be used to map many strings to values, like a regular string map, but for which retrieval is fuzzy (for a given maximal distance).
A possible use of the index could be:
let words = CCIO.with_in "/usr/share/dict/words"
(fun i -> CCIO.read_all i |> CCString.Split.list_cpy ~by:"\n");;
let words = List.map (fun s->s,s) words;;
let idx = CCLevenshtein.Index.of_list words;;
CCLevenshtein.Index.retrieve ~limit:1 idx "hell" |> CCLevenshtein.klist_to_list;;
Edition distance between two strings. This satisfies the classical distance axioms: it is always positive, symmetric, and satisfies the formula distance a b + distance b c >= distance a c
Automaton
An automaton, built from a string s and a limit n, that accepts every string that is at distance at most n from s.
Build an automaton from a string, with a maximal distance limit. The automaton will accept strings whose edit_distance to the parameter is at most limit.
match_with a s matches the string s against a, and returns true if the distance from s to the word represented by a is smaller than the limit used to build a