package search

  1. Overview
  2. Docs
Simple, in-memory search library in pure OCaml

Install

Dune Dependency

Authors

Maintainers

Sources

search-0.1.1.tbz
sha256=7114fecc6bcf0950c534cfa9a2584e1421337fe822f217779095c710de0a7192
sha512=ef1103b4571a51a285827f830f5a9a3ce6f214f1df9d40c204086be733797b39fc7f2e2de3d2298768d6c29df35bae82886c82c458257efaa93a5ac234556f0a

Description

An in-memory search library in Pure OCaml providing both search indexes for documents of the type and heterogenerous search indexes. A term-frequency inverse document-frequency implementation is provided by the library.

Published: 11 Jan 2023

README

ocaml-search

A very simple, search library for OCaml heavily inspired by js-search, originally by Craigfe's posts and Hmap.

Usage

The following is a quick guide on how to use this library. It is not particularly optimised, memory efficient or tested. Use with care!

# #require "search";;

Monomorphic Search Indexes

Monomorphism is the opposite of polymorphism. Here, we mean that your search index will only work for one type of document. This is provided via a functor along with the unique identifier module.

Unique Identifiers

Unique identifiers should uniquely identify a document amongst other documents.

# #show_module_type Search__Search_intf.Uid;;
module type Uid =
  sig type t val compare : t -> t -> int val to_string : t -> string end

The module only needs to provide a type t and a compare and to_string function. Search.Uids contains some common modules for your convenience.

Documents

Every search index is specialised to some document type. For example, it could be a record representing people.

module Doc = struct
type t = {
  uid : string;
  name : string;
  nick : string;
  age : int;
}
let docs = [
{
    uid = "0";
    name = "Alice";
    nick = "";
    age = 10;
  };
  {
    uid = "1";
    name = "Alan";
    nick = "Al";
    age = 12;
  };
  {
    uid = "2";
    name = "William";
    nick = "Bob";
    age = 13;
  }
]
end

module M = Search.Tfidf.Mono (Search.Uids.String) (Doc)

Search indexes come with the empty function which creates a new index.

# M.empty;;
- : ?santiser:(string -> string) ->
    ?strategy:(string -> string list) ->
    ?tokeniser:(string -> string list) -> unit -> M.t
= <fun>
# let search = M.empty () ;;
val search : M.t = <abstr>

There are three optional functions you can add to change the how documents are treated. The tokeniser splits a string into tokens. By default this is just by whitespace. The sanitiser creates a uniform representation of strings, by default String.lowercase_ascii. Finally, stratgey is the indexing strategy. By default this is a prefixing strategy such that abc is indexed with a, ab and abc.

From here you add the indexes. These are the functions from your document to a string that will be used to search for documents matching some string later.

# M.add_index search (fun t -> t.name);;
- : unit = ()

After you've added all your indexes, you can add some documents.

# List.iter (fun d -> M.add_document search d.Doc.uid d) Doc.docs;;
- : unit = ()

At which point you are ready to search!

# M.search search "Al";;
- : M.doc list =
[{Doc.uid = "0"; name = "Alice"; nick = ""; age = 10};
 {Doc.uid = "1"; name = "Alan"; nick = "Al"; age = 12}]

Note that this implementation uses TFIDF, if we were to also add the nick as an index, then "Alan" will come to the top. Adding a new index after already adding documents causes the documents to be re-indexed!

# M.add_index search (fun t -> t.nick);
  M.search search "Al";;
- : M.doc list =
[{Doc.uid = "1"; name = "Alan"; nick = "Al"; age = 12};
 {Doc.uid = "0"; name = "Alice"; nick = ""; age = 10}]

Heterogeneous Search Indexes

Heterogeneous search indexes allow you to store more than one type in the index. This is based on Janestreet's Universal Map and Hmap.

Type Witness

The main difference when programming with heterogeneous indexes is that you must provide a type witness when adding indexes and adding documents. A type witness is essentially a value that can be used to check the type of another value at runtime.

Search provides a low-level type witness module.

# module W = Search.Private.Witness;;
module W = Search.Private.Witness
# let int_witness : int W.t = W.make ();;
val int_witness : int W.t = <module>
# let float_witness : float W.t = W.make ();;
val float_witness : float W.t = <module>

Here we've constructed two witnesses, one for integers and one for floats.

# W.eq int_witness int_witness;;
- : (int, int) W.teq option = Some W.Teq
# W.eq int_witness float_witness;;
- : (int, float) W.teq option = None

Generic Interface

The interface is very similar to that of the monomorphic search index. With the Tfidf implementation, we only need to provide a unique identifier for documents. The type witness will take care of differentiating the different kinds of documents.

module G = Search.Tfidf.Generic (Search.Uids.String)
module Cat = struct
  type t = { name : string; lives : int }
end
module Dog = struct
  type t = { name : string; kind : string }
end

Creating a new index is straightforward.

# let search = G.empty ();;
val search : G.t = <abstr>

Generic search indexes must wrap the user-supplied unique identifier (which differentiates documents) to also differentiate the different kinds of documents. The user must generate the type witnesses using Generic.Uid.

# let cat : Cat.t G.uid = G.Uid.create ();;
val cat : Cat.t G.uid = <abstr>
# let dog : Dog.t G.uid = G.Uid.create ();;
val dog : Dog.t G.uid = <abstr>

Adding Indexes

To add an index, you must also specify the kind of document you wish the index to be used for.

# G.add_index;;
- : G.t -> 'doc G.uid -> ('doc -> string) -> unit = <fun>

This allows you to access the type from within your index.

# G.add_index search cat (fun c -> c.Cat.name);
  G.add_index search dog (fun c -> c.Dog.name);;
- : unit = ()

Adding Documents

When adding documents you provide the type witness for the kind of document you are adding along with a unique identifier for that document.

let add_cat c = G.add_document search cat c.Cat.name c
let add_dog d = G.add_document search dog d.Dog.name d

With these helper functions we can add some new documents.

# add_cat Cat.{ name = "Alice"; lives = 9 };
  add_dog Dog.{ name = "Alan"; kind = "Irish Setter" };;
- : unit = ()

Searching

Whenever you search for a collection of documents in a heterogenous search index you will get back a list of documents of different kinds. These are wrapped up in binding to hide the fact they are of different kinds.

# #show_type G.binding;;
type nonrec binding = G.doc = KV : ('v G.uid * 'v) -> G.doc

This means if you want to access a document you'll need to prove you know what kind it is first! There's a little helper function for doing that.

# G.apply;;
- : 'v G.uid -> default:'a -> ('v -> 'a) -> G.doc -> 'a = <fun>

G.apply uid ~default f doc will apply the function f to a document doc provided it of kind uid. If it is not that kind then the default value will be returned.

# let docs = G.search search "Al";;
val docs : G.doc list = [G.KV (<abstr>, <poly>); G.KV (<abstr>, <poly>)]

We'll use G.apply to get the names of the animals.

# List.filter_map
    (fun t ->
       G.apply
       cat
       ~default:(G.apply dog ~default:None (fun d -> Some (d.Dog.name ^ " (the dog)")) t)
       (fun c -> Some (c.Cat.name ^ " (the cat)")) t) docs;;
- : string list = ["Alan (the dog)"; "Alice (the cat)"]

Dependencies (2)

  1. dune >= "3.0.0"
  2. ocaml >= "4.08"

Dev Dependencies (1)

  1. mdx with-test

Used by

None

Conflicts

None

OCaml

Innovation. Community. Security.