package owl
Module Owl_nlp_tfidf
NLP: TFIDF module
Type definition
Type of a TFIDF model
Query model
``term_freq term_count num_words`` calculates the term frequency weight.
``doc_freq doc_count num_docs`` calculates the document frequency weight.
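The two weights above can be illustrated with a minimal standalone sketch. It assumes the common textbook variants of term frequency and inverse document frequency; the variant types, constructor names, and the `_sketch` functions are hypothetical and do not claim to match the definitions inside this module.

```ocaml
(* Hypothetical variant types; the names are illustrative assumptions,
   not necessarily the constructors the module defines. *)
type tf_kind = Tf_binary | Tf_count | Tf_frequency | Tf_log_norm
type df_kind = Df_unary | Df_idf | Df_idf_smooth

(* Term frequency weight from a raw term count and the document length. *)
let term_freq_sketch kind term_count num_words =
  match kind with
  | Tf_binary -> 1.
  | Tf_count -> term_count
  | Tf_frequency -> term_count /. num_words
  | Tf_log_norm -> 1. +. log term_count

(* Document frequency weight from the number of documents containing the
   term and the total number of documents. *)
let doc_freq_sketch kind doc_count num_docs =
  match kind with
  | Df_unary -> 1.
  | Df_idf -> log (num_docs /. doc_count)
  | Df_idf_smooth -> log (1. +. (num_docs /. doc_count))

(* A TFIDF weight is the product of the two weights. *)
let tfidf_weight tf df term_count num_words doc_count num_docs =
  term_freq_sketch tf term_count num_words
  *. doc_freq_sketch df doc_count num_docs
```

Under these assumptions, a term that appears in every document receives an ``Df_idf`` weight of zero, so it contributes nothing to the document vector.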
Return the corpus contained in the TFIDF model.
Get the file handle associated with the TFIDF model.
``doc_count_of tfidf w`` calculates the document frequency for a given word ``w``.
``doc_count vocab fname`` counts, for all words, the occurrences in all documents contained in the raw text corpus of file ``fname``.
``term_count count doc`` counts the term occurrences in a document and saves the result in the ``count`` hash table.
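Counting into a hash table can be sketched as follows. This is a standalone illustration, not the module's implementation: it assumes a document is an array of vocabulary indices and that counts are stored as floats, and ``term_count_sketch`` is a hypothetical name.

```ocaml
(* Count how often each token id occurs in [doc], accumulating into [count].
   Assumption: a document is an array of vocabulary indices. *)
let term_count_sketch (count : (int, float) Hashtbl.t) (doc : int array) =
  Array.iter
    (fun w ->
      let c = Option.value (Hashtbl.find_opt count w) ~default:0. in
      Hashtbl.replace count w (c +. 1.))
    doc

let () =
  let count = Hashtbl.create 16 in
  term_count_sketch count [| 3; 7; 3; 3; 1 |];
  (* Print each (token id, count) pair; iteration order is unspecified. *)
  Hashtbl.iter (fun w c -> Printf.printf "%d -> %g\n" w c) count
```

Because the table is passed in and mutated, the same table can be reused across calls to accumulate counts over several documents.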
val doc_to_vec :
(float, 'a) Bigarray.kind ->
t ->
(int * float) array ->
(float, 'a) Owl_dense.Ndarray.Generic.t
``doc_to_vec kind tfidf vec`` converts a TFIDF vector from its sparse representation to a dense ndarray vector whose length equals the vocabulary size.
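The sparse-to-dense conversion can be sketched without the ndarray type. The example below is an assumption-laden illustration: it uses a plain ``float array`` in place of an Owl ndarray, takes the vocabulary size as an explicit argument, and ``sparse_to_dense`` is a hypothetical name.

```ocaml
(* Expand a sparse (index, weight) array into a dense float array of length
   [vocab_size]. Indices that do not appear in [vec] stay at 0. *)
let sparse_to_dense vocab_size (vec : (int * float) array) =
  let dense = Array.make vocab_size 0. in
  Array.iter (fun (i, w) -> dense.(i) <- w) vec;
  dense
```

For example, ``sparse_to_dense 5 [| (1, 0.5); (4, 2.) |]`` yields a five-element array with the weights at positions 1 and 4 and zeros elsewhere.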
Iteration functions
Return the ith TFIDF vector in the model. The return format is a ``(vocabulary index, weight)`` tuple array for a document.
Return the next document vector in the model. The return format is a ``(vocabulary index, weight)`` tuple array for a document.
Return the next batch of document vectors in the model. The default batch size is 100.
Iterate over all the document vectors in a TFIDF model. Each document vector is a ``(vocabulary index, weight)`` tuple array.
Map over all the document vectors in a TFIDF model. Each document vector is a ``(vocabulary index, weight)`` tuple array.
Core functions
This function builds a TFIDF model according to the passed-in parameters.
Parameters:
* ``norm``: whether to normalise the vectors in the TFIDF model. The default is ``false``.
* ``sort``: whether to sort the terms in a TFIDF vector in increasing order of their vocabulary indices. The default is ``false``.
* ``tf``: type of term frequency used in building the TFIDF model. The default is ``Count``.
* ``df``: type of document frequency used in building the TFIDF model. The default is ``Idf``.
* ``corpus``: the corpus built by the ``Owl_nlp_corpus`` module, on top of which the TFIDF model is built.
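To show how the ``norm`` and ``sort`` options affect the resulting vectors, here is a minimal standalone sketch. It does not use Owl: the corpus is assumed to be an in-memory array of documents, each an array of vocabulary indices, and the weights are assumed to be raw term counts multiplied by ``log (N / df)``. The name ``build_sketch`` is hypothetical.

```ocaml
(* Build TFIDF vectors for an in-memory corpus. Each document is an array of
   vocabulary indices. Assumes raw counts for TF and log (N / df) for IDF. *)
let build_sketch ?(norm = false) ?(sort = false) (corpus : int array array) =
  let num_docs = float_of_int (Array.length corpus) in
  (* Document frequency: number of documents that contain each term. *)
  let df = Hashtbl.create 64 in
  Array.iter
    (fun doc ->
      let seen = Hashtbl.create 16 in
      Array.iter (fun w -> Hashtbl.replace seen w ()) doc;
      Hashtbl.iter
        (fun w () ->
          let c = Option.value (Hashtbl.find_opt df w) ~default:0. in
          Hashtbl.replace df w (c +. 1.))
        seen)
    corpus;
  Array.map
    (fun doc ->
      (* Term counts within this document. *)
      let tc = Hashtbl.create 16 in
      Array.iter
        (fun w ->
          let c = Option.value (Hashtbl.find_opt tc w) ~default:0. in
          Hashtbl.replace tc w (c +. 1.))
        doc;
      (* Sparse vector of (vocabulary index, weight) pairs. *)
      let vec =
        Hashtbl.fold
          (fun w c acc -> (w, c *. log (num_docs /. Hashtbl.find df w)) :: acc)
          tc []
        |> Array.of_list
      in
      (* Optionally order the entries by vocabulary index. *)
      if sort then Array.sort (fun (a, _) (b, _) -> compare a b) vec;
      (* Optionally scale the vector to unit L2 norm. *)
      if norm then begin
        let n = sqrt (Array.fold_left (fun s (_, x) -> s +. (x *. x)) 0. vec) in
        if n > 0. then Array.map (fun (i, x) -> (i, x /. n)) vec else vec
      end
      else vec)
    corpus
```

Without ``~sort:true`` the order of entries in each vector follows hash table iteration order, which is why sorting is exposed as an option when a stable layout is needed.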
I/O functions
``save tfidf fname`` saves the TFIDF model to a file with the given file name ``fname``.
Convert a TFIDF model to its string representation, which contains summary information.
Helper functions
Convert a single document according to a given model.
``normalise x`` makes ``x`` a unit vector by dividing it by its L2 norm.
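L2 normalisation of a sparse vector can be sketched as below. This is a standalone illustration with hypothetical names; returning a zero vector unchanged is an assumption made here to avoid division by zero, not documented behaviour of ``normalise``.

```ocaml
(* L2 norm of a sparse vector given as (index, weight) pairs. *)
let l2norm_sketch (x : (int * float) array) =
  sqrt (Array.fold_left (fun s (_, w) -> s +. (w *. w)) 0. x)

(* Divide every weight by the L2 norm; a zero vector is returned unchanged
   (an assumption of this sketch). *)
let normalise_sketch x =
  let n = l2norm_sketch x in
  if n = 0. then x else Array.map (fun (i, w) -> (i, w /. n)) x
```

For instance, a vector with weights 3 and 4 has norm 5, so its normalised weights are 0.6 and 0.8.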
Wrap up a TFIDF model type. This is a low-level function that is not meant to be called directly.
val all_pairwise_distance :
Owl_nlp_similarity.t ->
t ->
('a * float) array ->
(int * float) array
Calculate the pairwise distance for the whole model. The return format is an ``(id,dist)`` array.
val nearest :
?typ:Owl_nlp_similarity.t ->
t ->
('a * float) array ->
int ->
(int * float) array
Return the K nearest neighbours. It is very slow because it uses linear search.
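The linear search mentioned above can be sketched as follows: compute the distance from the query to every document, sort, and keep the first K entries. This standalone example assumes cosine distance and an in-memory array of sparse document vectors; all function names are hypothetical and the module may use a different distance depending on the ``Owl_nlp_similarity.t`` value passed in.

```ocaml
(* Dot product of two sparse vectors given as (index, weight) arrays. *)
let sparse_dot a b =
  let tbl = Hashtbl.create (Array.length a) in
  Array.iter (fun (i, w) -> Hashtbl.replace tbl i w) a;
  Array.fold_left
    (fun s (i, w) ->
      match Hashtbl.find_opt tbl i with Some v -> s +. (v *. w) | None -> s)
    0. b

(* Cosine distance: 1 - cos(a, b); taken to be 1 when either vector is zero
   (an assumption of this sketch). *)
let cosine_distance a b =
  let na = sqrt (sparse_dot a a) and nb = sqrt (sparse_dot b b) in
  if na = 0. || nb = 0. then 1. else 1. -. (sparse_dot a b /. (na *. nb))

(* Distance from [query] to every document, as an (id, dist) array. *)
let all_pairwise_distance_sketch docs query =
  Array.mapi (fun id d -> (id, cosine_distance d query)) docs

(* K nearest neighbours by scanning and sorting all distances. *)
let nearest_sketch docs query k =
  let dists = all_pairwise_distance_sketch docs query in
  Array.sort (fun (_, a) (_, b) -> compare a b) dists;
  Array.sub dists 0 (min k (Array.length dists))
```

Every query touches all documents and then sorts the full distance array, so the cost grows with the corpus size regardless of K, which is the reason for the slowness noted above.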