package spelll
Install
dune-project
Dependency
Authors
Maintainers
Sources
sha256=bb189bc9f70d2409a2cc06112b6bd0862b6743042cef931bbb550df3d0add1ba
md5=dbe46f10969efba4b84882ba14001c1b
doc/spelll/Spelll/index.html
Module Spelll
Levenshtein distance and index
We take inspiration from this blog for the main algorithm and ideas. However some parts are adapted
Abstraction over Strings
Due to the existence of several encodings and string representations we abstract over the type of strings. A string is a finite array of characters (8-bits char, unicode runes, etc.) which provides a length operation and a function to access the n-th character.
module type STRING = sig ... endContinuation list
This data structure is used to represent a list of result that is evaluated only as far as the user wants. If the user only wants a few elements, she doesn't pay for the remaining ones.
In particular, when matching a string against a (big) set of indexed strings, we return a continuation list so that, even if there are many results, only those actually asked for are evaluated.
val klist_to_list : 'a klist -> 'a listHelper.
Signature
The signature for a given string representation provides 3 main things:
- a
edit_distancefunction to compute the edit distance between strings - an
automatontype that is built from a stringsand a maximum distancen, and only accepts the stringss'such thatedit_distance s s' <= n. - an
Indexmodule that can be used to map many strings to values, like a regular string map, but for which retrieval is fuzzy (for a given maximal distance).
A possible use of the index could be:
open Batteries;;
let words = File.with_file_in "/usr/share/dict/english"
(fun i -> IO.read_all i |> String.nsplit ~by:"\\n");;
let words = List.map (fun s->s,s) words;;
let idx = Spelll.Index.of_list words;;
Spelll.Index.retrieve ~limit:1 idx "hell" |> Spelll.klist_to_list;;Here we use Batteries to read a dictionary file into a list of words; then we create an index that maps every string to itself (a set of strings, really), and finally we find every string at distance at most 1 from "hell" (including "hello" for instance).
module type S = sig ... endinclude S with type char_ = char and type string_ = string
Edit Distance
Edition distance between two strings. This satisfies the classical distance axioms: it is always positive, symmetric, and satisfies the formula distance a b + distance b c >= distance a c
Automaton
An automaton, built from a string s and a limit n, that accepts every string that is at distance at most n from s.
Build an automaton from a string, with a maximal distance limit. The automaton will accept strings whose edit_distance to the parameter is at most limit.
Build an automaton from a list, with a maximal distance limit
match_with a s matches the string s against a, and returns true if the distance from s to the word represented by a is smaller than the limit used to build a
Index for one-to-many matching
module Index : sig ... endval debug_print : out_channel -> automaton -> unit