Library
Module
Module type
Parameter
Class
Class type
We take inspiration from this blog for the main algorithm and ideas. However some parts are adapted
Due to the existence of several encodings and string representations we abstract over the type of strings. A string is a finite array of characters (8-bits char, unicode runes, etc.) which provides a length operation and a function to access the n-th character.
module type STRING = sig ... end
We use Seq.t
to provide universal iterators. This data structure is used to represent a list of result that is evaluated only as far as the user wants. If the user only wants a few elements, they don't pay for the remaining ones.
In particular, when matching a string against a (big) set of indexed strings, we return a continuation list so that, even if there are many results, only those actually asked for are evaluated.
The signature for a given string representation provides 3 main things:
edit_distance
function to compute the edit distance between stringsautomaton
type that is built from a string s
and a maximum distance n
, and only accepts the strings s'
such that edit_distance s s' <= n
.Index
module that can be used to map many strings to values, like a regular string map, but for which retrieval is fuzzy (for a given maximal distance).A possible use of the index could be:
let words = CCIO.with_in "/usr/share/dict/english" CCIO.read_lines_l ;;
let words = List.map (fun s->s,s) words;;
let idx = Spelll.Index.of_list words;;
Spelll.Index.retrieve_l ~limit:1 idx "hell" ;;
Here we use Containers to read a dictionary file into a list of words; then we create an index that maps every string to itself (a set of strings, really), and finally we find every string at distance at most 1 from "hell" (including "hello" for instance).
module type S = sig ... end
include S with type char_ = char and type string_ = string
Edition distance between two strings. This satisfies the classical distance axioms: it is always positive, symmetric, and satisfies the formula distance a b + distance b c >= distance a c
An automaton, built from a string s
and a limit n
, that accepts every string that is at distance at most n
from s
.
Build an automaton from a string, with a maximal distance limit
. The automaton will accept strings whose edit_distance
to the parameter is at most limit
.
Build an automaton from a list, with a maximal distance limit
match_with a s
matches the string s
against a
, and returns true
if the distance from s
to the word represented by a
is smaller than the limit used to build a
module Index : sig ... end
val debug_print : Stdlib.out_channel -> automaton -> unit