Module `Owl_nlp_tfidf`

NLP: TFIDF module

Type definition

type tf_typ =

| Binary
| Count
| Frequency
| Log_norm

Type of term frequency.

type df_typ =

| Unary
| Idf
| Idf_Smooth

Type of inverse document frequency.

type t

Type of a TFIDF model.

Query model

val length : t -> int

Size of the TFIDF model, i.e. the number of documents it contains.

val term_freq : tf_typ -> float -> float -> float

``term_freq typ term_count num_words`` calculates the term frequency weight of type ``typ``.

val doc_freq : df_typ -> float -> float -> float

``doc_freq typ doc_count num_docs`` calculates the document frequency weight of type ``typ``.
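To make the two weighting functions concrete, here is a minimal sketch that uses the common textbook formulas for each variant. The formulas are an assumption, not taken from this page, so check the Owl source before relying on exact values. The two types are local copies of ``tf_typ`` and ``df_typ`` so that the sketch compiles without Owl.

```ocaml
(* Local mirrors of the documented variants, so the sketch needs no Owl. *)
type tf_typ = Binary | Count | Frequency | Log_norm
type df_typ = Unary | Idf | Idf_Smooth

(* Assumed textbook formulas; the library's exact definitions may differ. *)
let term_freq typ term_count num_words =
  match typ with
  | Binary -> 1.
  | Count -> term_count
  | Frequency -> term_count /. num_words
  | Log_norm -> 1. +. log term_count

let doc_freq typ doc_count num_docs =
  match typ with
  | Unary -> 1.
  | Idf -> log (num_docs /. doc_count)
  | Idf_Smooth -> log (num_docs /. (1. +. doc_count))

(* A term seen 3 times in a 12-word document, present in 10 of 100 documents. *)
let weight = term_freq Frequency 3. 12. *. doc_freq Idf 10. 100.
```

The TFIDF weight of a term is the product of the two factors, so each choice of ``tf_typ`` and ``df_typ`` changes how strongly frequent or rare terms are emphasised.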

val get_uri : t -> string

Return the path of the TFIDF model.

val get_corpus : t -> Owl_nlp_corpus.t

Return the corpus contained in the TFIDF model.

val vocab_len : t -> int

Return the size of the vocabulary contained in the TFIDF model.

val get_handle : t -> in_channel

Get the file handle associated with the TFIDF model.

val doc_count_of : t -> string -> float

``doc_count_of tfidf w`` calculates the document frequency for a given word ``w``.

val doc_count : Owl_nlp_vocabulary.t -> string -> float array * int

``doc_count vocab fname`` counts the occurrences of all words in all documents contained in the raw text corpus of file ``fname``.

val term_count : ('a, float) Hashtbl.t -> 'a array -> unit

``term_count count doc`` counts the term occurrences in a document and saves the result in the ``count`` hash table.
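The counting step can be sketched with the standard library alone. The function below is a hypothetical stand-in with the same signature, assuming each token simply adds ``1.`` to its entry in the table. Whether the library accumulates across calls in the same way is an assumption.

```ocaml
(* Hypothetical stand-in for term_count: add 1. to the counter of each token. *)
let term_count (count : ('a, float) Hashtbl.t) (doc : 'a array) : unit =
  Array.iter
    (fun w ->
      let c = Option.value (Hashtbl.find_opt count w) ~default:0. in
      Hashtbl.replace count w (c +. 1.))
    doc

(* Count the tokens of one small document. *)
let count : (string, float) Hashtbl.t = Hashtbl.create 16
let () = term_count count [| "owl"; "nlp"; "owl" |]
```

The counts are stored as floats so that they can be passed straight to a weighting function without conversion.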

val density : t -> float

Return the percentage of non-zero elements in the doc-term matrix.

val doc_to_vec : 
  (float, 'a) Bigarray.kind ->
  t ->
  (int * float) array ->
  (float, 'a) Owl_dense.Ndarray.Generic.t

``doc_to_vec kind tfidf vec`` converts a TFIDF vector from its sparse representation to a dense ndarray vector whose length equals the vocabulary size.
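The sparse-to-dense conversion can be illustrated without Owl. This sketch assumes a plain ``float array`` in place of an Owl ndarray and takes the vocabulary size as an explicit argument. ``to_dense`` is a hypothetical helper, not part of the module.

```ocaml
(* Expand a sparse (vocabulary index, weight) array into a dense vector.
   Indices that do not appear in the sparse form stay at 0. *)
let to_dense (vocab_len : int) (vec : (int * float) array) : float array =
  let dense = Array.make vocab_len 0. in
  Array.iter (fun (i, w) -> dense.(i) <- w) vec;
  dense

(* A document with two non-zero terms in a 5-word vocabulary. *)
let dense = to_dense 5 [| (1, 0.5); (4, 2.0) |]
```

The dense form costs memory proportional to the vocabulary size for every document, which is why the model stores vectors sparsely and converts only on request.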

Iteration functions

val get : t -> int -> (int * float) array

Return the ith TFIDF vector in the model. The return format is an array of ``(vocabulary index, weight)`` tuples for the document.

val next : t -> (int * float) array

Return the next document vector in the model. The return format is an array of ``(vocabulary index, weight)`` tuples for the document.

val next_batch : ?size:int -> t -> (int * float) array array

Return the next batch of document vectors in the model. The default size is 100.

val iteri : (int -> (int * float) array -> unit) -> t -> unit

Iterate over all the document vectors in a TFIDF model. Each document vector is an array of ``(vocabulary index, weight)`` tuples.

val mapi : (int -> (int * float) array -> 'a) -> t -> 'a array

Map over all the document vectors in a TFIDF model. Each document vector is an array of ``(vocabulary index, weight)`` tuples.

val reset_iterators : t -> unit

Reset the iterator to the beginning of the TFIDF model.

Core functions

val build : 
  ?norm:bool ->
  ?sort:bool ->
  ?tf:tf_typ ->
  ?df:df_typ ->
  Owl_nlp_corpus.t ->
  t

This function builds a TFIDF model according to the parameters passed in.

Parameters:

* ``norm``: whether to normalise the vectors in the TFIDF model. The default is ``false``.
* ``sort``: whether to sort the terms in a TFIDF vector in increasing order of their vocabulary indices. The default is ``false``.
* ``tf``: type of term frequency used in building the TFIDF model. The default is ``Count``.
* ``df``: type of document frequency used in building the TFIDF model. The default is ``Idf``.
* ``corpus``: the corpus built by the ``Owl_nlp_corpus`` module, on top of which the TFIDF model is built.

I/O functions

val save : t -> string -> unit

``save tfidf fname`` saves the TFIDF model to the file named ``fname``.

val load : string -> t

``load fname`` loads a TFIDF model from the file named ``fname``.

val to_string : t -> string

Convert a TFIDF model to its string representation, which contains summary information.

val print : t -> unit

Pretty-print the summary information of a TFIDF model.

Helper functions

val tf_typ_string : tf_typ -> string

Convert a term frequency type into a string.

val df_typ_string : df_typ -> string

Convert a document frequency type into a string.

val apply : t -> string -> (int * float) array

Convert a single document into a TFIDF vector according to a given model.

val normalise : ('a * float) array -> ('a * float) array

``normalise x`` makes ``x`` a unit vector by dividing it by its L2 norm.
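As a worked illustration of L2 normalisation on the sparse representation, the following sketch divides every weight by the square root of the sum of squared weights. It is a stand-alone re-implementation for explanation only. It assumes a non-zero input vector, since an all-zero vector would give a division by zero.

```ocaml
(* Divide each weight by the L2 norm of the whole vector; indices are kept.
   Assumes at least one non-zero weight. *)
let normalise (x : ('a * float) array) : ('a * float) array =
  let norm = sqrt (Array.fold_left (fun acc (_, w) -> acc +. (w *. w)) 0. x) in
  Array.map (fun (i, w) -> (i, w /. norm)) x

(* The vector (3, 4) has L2 norm 5, so the weights become 0.6 and 0.8. *)
let unit_vec = normalise [| (0, 3.); (7, 4.) |]
```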

val create : tf_typ -> df_typ -> Owl_nlp_corpus.t -> t

Wrap up a TFIDF model type. This is a low-level function that you are not supposed to use directly.

val all_pairwise_distance : 
  Owl_nlp_similarity.t ->
  t ->
  ('a * float) array ->
  (int * float) array

Calculate the pairwise distance for the whole model. The return format is an ``(id, dist)`` array.

val nearest : 
  ?typ:Owl_nlp_similarity.t ->
  t ->
  ('a * float) array ->
  int ->
  (int * float) array

Return the K nearest neighbours. It is very slow because it uses linear search.
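To show why a linear search is costly, here is a minimal sketch of a K-nearest-neighbour query over sparse vectors. It is not the library's implementation. The model is replaced by a plain array of document vectors, and the metric is a cosine distance defined as ``1 - cos(x, y)``, which is an assumption and may differ from the distances in ``Owl_nlp_similarity``. All names are hypothetical.

```ocaml
(* Dot product of two sparse vectors, via a lookup table on the first one. *)
let dot (x : (int * float) array) (y : (int * float) array) : float =
  let h = Hashtbl.create (Array.length x) in
  Array.iter (fun (i, w) -> Hashtbl.replace h i w) x;
  Array.fold_left
    (fun acc (i, w) ->
      match Hashtbl.find_opt h i with
      | Some v -> acc +. (v *. w)
      | None -> acc)
    0. y

(* Assumed metric: 1 - cos(x, y). Requires both vectors to be non-zero. *)
let cosine_distance x y = 1. -. (dot x y /. sqrt (dot x x *. dot y y))

(* Score every document against the query, sort by distance, keep the first k. *)
let nearest (docs : (int * float) array array) (x : (int * float) array)
    (k : int) : (int * float) array =
  let scored = Array.mapi (fun id d -> (id, cosine_distance d x)) docs in
  Array.sort (fun (_, a) (_, b) -> compare a b) scored;
  Array.sub scored 0 (min k (Array.length scored))

(* Three tiny documents; the query overlaps only with vocabulary index 0. *)
let docs = [| [| (0, 1.) |]; [| (0, 1.); (1, 1.) |]; [| (1, 1.) |] |]
let top2 = nearest docs [| (0, 2.) |] 2
```

Every query computes a distance to each document and then sorts all the scores, so the cost grows with the size of the model regardless of ``k``.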
