package sklearn

val get_py : string -> Py.Object.t

Get an attribute of this module as a Py.Object.t. This is useful to pass a Python function to another function.

module GenericUnivariateSelect : sig ... end
module RFE : sig ... end
module RFECV : sig ... end
module SelectFdr : sig ... end
module SelectFpr : sig ... end
module SelectFromModel : sig ... end
module SelectFwe : sig ... end
module SelectKBest : sig ... end
module SelectPercentile : sig ... end
module SelectorMixin : sig ... end
module VarianceThreshold : sig ... end
val chi2 : x:[> `ArrayLike ] Np.Obj.t -> y:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Compute chi-squared stats between each non-negative feature and class.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.
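Selecting the highest-scoring features from such a score vector can be sketched as follows. This is a minimal illustration in plain Python, not the binding's or scikit-learn's own code; the helper name `top_k_features` is hypothetical.

```python
def top_k_features(scores, k):
    """Return the indices of the k highest scores, in ascending index order.

    Hypothetical helper for illustration only.
    """
    # Rank indices by descending score. Python's sort is stable, so tied
    # scores keep their original order (lower index first).
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Keep the first k ranked indices and report them in index order.
    return sorted(ranked[:k])


selected = top_k_features([0.5, 3.0, 1.2, 3.0], k=2)
print(selected)
```

In practice the SelectKBest module listed above plays this role; the sketch only shows the ranking idea.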

Recall that the chi-square test measures dependence between stochastic variables, so using this function 'weeds out' the features that are the most likely to be independent of class and therefore irrelevant for classification.

Read more in the :ref:`User Guide <univariate_feature_selection>`.

Parameters
----------
X : array-like, sparse matrix of shape (n_samples, n_features)
    Sample vectors.

y : array-like of shape (n_samples,)
    Target vector (class labels).

Returns
-------
chi2 : array, shape = (n_features,)
    chi2 statistics of each feature.

pval : array, shape = (n_features,)
    p-values of each feature.

Notes
-----
Complexity of this algorithm is O(n_classes * n_features).

See also
--------
f_classif: ANOVA F-value between label/feature for classification tasks.

f_regression: F-value between label/feature for regression tasks.
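The statistic described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `chi2_scores` is hypothetical, inputs are dense lists of non-negative numbers, p-values are omitted, and a feature whose total is zero is given a score of 0.0 for simplicity.

```python
def chi2_scores(X, y):
    """Chi-squared statistic of each feature column of X against labels y.

    Hypothetical helper: X is a list of rows of non-negative numbers,
    y is a list of class labels of the same length.
    """
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        # Total "count" carried by this feature over all samples.
        feature_total = sum(row[j] for row in X)
        stat = 0.0
        for c in classes:
            # Observed: sum of the feature over samples of class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: class frequency times feature total.
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_total
            # Assumption of this sketch: skip empty cells instead of dividing by zero.
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
        scores.append(stat)
    return scores


print(chi2_scores([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1]))
```

The nested loops visit every (class, feature) pair once, which matches the O(n_classes * n_features) shape of the complexity note, although this naive version also rescans the samples for each pair.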

val f_classif : x:[> `ArrayLike ] Np.Obj.t -> y:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Compute the ANOVA F-value for the provided sample.

Read more in the :ref:`User Guide <univariate_feature_selection>`.

Parameters
----------
X : array-like, sparse matrix, shape = (n_samples, n_features)
    The set of regressors that will be tested sequentially.

y : array of shape (n_samples,)
    The target vector.

Returns
-------
F : array, shape = (n_features,)
    The set of F values.

pval : array, shape = (n_features,)
    The set of p-values.

See also
--------
chi2: Chi-squared stats of non-negative features for classification tasks.

f_regression: F-value between label/feature for regression tasks.

val f_oneway : Py.Object.t list -> Py.Object.t

Performs a 1-way ANOVA.

The one-way ANOVA tests the null hypothesis that 2 or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.
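The F statistic of this test is the ratio of the between-group to the within-group mean square. A minimal sketch in plain Python, assuming each group is a non-empty list of numbers and the within-group variance is not zero; the name `f_oneway_statistic` is hypothetical and the p-value is omitted because it needs the F-distribution.

```python
def f_oneway_statistic(*groups):
    """One-way ANOVA F statistic for two or more groups of numbers.

    Hypothetical helper for illustration; returns only F, not the p-value.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = 0.0  # spread of the group means around the grand mean
    ss_within = 0.0   # spread of the values around their own group mean
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in g)

    # Mean squares: sums of squares divided by their degrees of freedom.
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


print(f_oneway_statistic([1, 2, 3], [2, 3, 4], [3, 4, 5]))
```

Groups of different sizes are handled because each group contributes its own length to both sums of squares.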

Read more in the :ref:`User Guide <univariate_feature_selection>`.

Parameters
----------
*args : array_like, sparse matrices
    sample1, sample2... The sample measurements should be given as arguments.

Returns
-------
F-value : float
    The computed F-value of the test.

p-value : float
    The associated p-value from the F-distribution.

Notes
-----
The ANOVA test has important assumptions that must be satisfied in order for the associated p-value to be valid.

1. The samples are independent.
2. Each sample is from a normally distributed population.
3. The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

If these assumptions are not true for a given set of data, it may still be possible to use the Kruskal-Wallis H-test (``scipy.stats.kruskal``), although with some loss of power.

The algorithm is from Heiman [2], pp. 394-7.

See ``scipy.stats.f_oneway``, which should give the same results while being less efficient.

References
----------
[1] Lowry, Richard. 'Concepts and Applications of Inferential Statistics'. Chapter 14. http://faculty.vassar.edu/lowry/ch14pt1.html

[2] Heiman, G.W. Research Methods in Statistics. 2002.

val f_regression : ?center:[ `True | `Bool of bool ] -> x:[> `ArrayLike ] Np.Obj.t -> y:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Univariate linear regression tests.

Linear model for testing the individual effect of each of many regressors. This is a scoring function to be used in a feature selection procedure, not a free-standing feature selection procedure.

This is done in 2 steps:

1. The correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)).
2. It is converted to an F score then to a p-value.
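The two steps can be sketched in plain Python as follows. This is an illustration, not the library's code: the name `f_regression_scores` is hypothetical, inputs are dense lists, the data are centered, and the conversion used is F = r^2 / (1 - r^2) * (n_samples - 2), the usual F statistic for a single regressor with an intercept. The p-value step is omitted because it needs the F-distribution, and a perfectly correlated feature (r = ±1) would divide by zero here.

```python
import math


def f_regression_scores(X, y):
    """F score of each feature column of X against the target y.

    Hypothetical helper: X is a list of rows, y a list of numbers.
    """
    n = len(y)
    mean_y = sum(y) / n
    dy = [v - mean_y for v in y]
    norm_y = math.sqrt(sum(d * d for d in dy))

    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        mean_x = sum(column) / n
        dx = [v - mean_x for v in column]
        norm_x = math.sqrt(sum(d * d for d in dx))
        # Step 1: Pearson correlation between feature j and the target.
        r = sum(a * b for a, b in zip(dx, dy)) / (norm_x * norm_y)
        # Step 2: convert the correlation to an F score (n - 2 degrees of freedom).
        scores.append(r * r / (1.0 - r * r) * (n - 2))
    return scores


X = [[1, 5], [2, 4], [3, 3], [4, 2], [5, 1]]
y = [2, 1, 4, 3, 5]
print(f_regression_scores(X, y))
```

Because the score depends on r squared, a feature that is negatively correlated with the target scores the same as one with a positive correlation of equal magnitude.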

For more on usage see the :ref:`User Guide <univariate_feature_selection>`.

Parameters
----------
X : array-like, sparse matrix, shape = (n_samples, n_features)
    The set of regressors that will be tested sequentially.

y : array of shape (n_samples,)
    The target vector.

center : bool, default True
    If True, X and y will be centered.

Returns
-------
F : array, shape = (n_features,)
    F values of features.

pval : array, shape = (n_features,)
    p-values of F-scores.

See also
--------
mutual_info_regression: Mutual information for a continuous target.

f_classif: ANOVA F-value between label/feature for classification tasks.

chi2: Chi-squared stats of non-negative features for classification tasks.

SelectKBest: Select features based on the k highest scores.

SelectFpr: Select features based on a false positive rate test.

SelectFdr: Select features based on an estimated false discovery rate.

SelectFwe: Select features based on family-wise error rate.

SelectPercentile: Select features based on percentile of the highest scores.

val mutual_info_classif : ?discrete_features: [ `Arr of [> `ArrayLike ] Np.Obj.t | `Auto | `Bool of bool ] -> ?n_neighbors:int -> ?copy:bool -> ?random_state:int -> x:[> `ArrayLike ] Np.Obj.t -> y:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t

Estimate mutual information for a discrete target variable.

Mutual information (MI) [1] between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
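For intuition, the definition can be evaluated directly when both variables are discrete. The sketch below is a plug-in estimate from observed frequencies, in nats; it is not the nearest-neighbor estimator this function uses for continuous features, and the name `discrete_mutual_information` is hypothetical.

```python
import math
from collections import Counter


def discrete_mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        # Each observed pair contributes p(x, y) * log(p(x, y) / (p(x) p(y))).
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    # Guard against tiny negative values from floating-point rounding.
    return max(mi, 0.0)


print(discrete_mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # fully dependent
print(discrete_mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # independent
```

The first call yields log(2), the entropy of a fair binary variable, because knowing one sequence determines the other; the second yields zero because every pair occurs exactly as often as independence predicts.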

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances as described in [2] and [3]. Both methods are based on the idea originally proposed in [4].

It can be used for univariate feature selection; read more in the :ref:`User Guide <univariate_feature_selection>`.

Parameters
----------
X : array_like or sparse matrix, shape (n_samples, n_features)
    Feature matrix.

y : array_like, shape (n_samples,)
    Target vector.

discrete_features : 'auto', bool, array_like, default 'auto'
    If bool, then determines whether to consider all features discrete or continuous. If array, then it should be either a boolean mask with shape (n_features,) or array with indices of discrete features. If 'auto', it is assigned to False for dense `X` and to True for sparse `X`.

n_neighbors : int, default 3
    Number of neighbors to use for MI estimation for continuous variables, see [2] and [3]. Higher values reduce variance of the estimation, but could introduce a bias.

copy : bool, default True
    Whether to make a copy of the given data. If set to False, the initial data will be overwritten.

random_state : int, RandomState instance or None, optional, default None
    Determines random number generation for adding small noise to continuous variables in order to remove repeated values. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`.

Returns
-------
mi : ndarray, shape (n_features,)
    Estimated mutual information between each feature and the target.

Notes
-----
1. The term 'discrete features' is used instead of naming them 'categorical', because it describes the essence more accurately. For example, pixel intensities of an image are discrete features (but hardly categorical) and you will get better results if you mark them as such. Also note that treating a continuous variable as discrete and vice versa will usually give incorrect results, so be careful about that.
2. True mutual information can't be negative. If its estimate turns out to be negative, it is replaced by zero.

References
----------
[1] `Mutual Information <https://en.wikipedia.org/wiki/Mutual_information>`_ on Wikipedia.

[2] A. Kraskov, H. Stogbauer and P. Grassberger, 'Estimating mutual information'. Phys. Rev. E 69, 2004.

[3] B. C. Ross 'Mutual Information between Discrete and Continuous Data Sets'. PLoS ONE 9(2), 2014.

[4] L. F. Kozachenko, N. N. Leonenko, 'Sample Estimate of the Entropy of a Random Vector', Probl. Peredachi Inf., 23:2 (1987), 9-16

val mutual_info_regression : ?discrete_features: [ `Arr of [> `ArrayLike ] Np.Obj.t | `Auto | `Bool of bool ] -> ?n_neighbors:int -> ?copy:bool -> ?random_state:int -> x:[> `ArrayLike ] Np.Obj.t -> y:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t

Estimate mutual information for a continuous target variable.

Mutual information (MI) [1] between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances as described in [2] and [3]. Both methods are based on the idea originally proposed in [4].

It can be used for univariate feature selection; read more in the :ref:`User Guide <univariate_feature_selection>`.

Parameters
----------
X : array_like or sparse matrix, shape (n_samples, n_features)
    Feature matrix.

y : array_like, shape (n_samples,)
    Target vector.

discrete_features : 'auto', bool, array_like, default 'auto'
    If bool, then determines whether to consider all features discrete or continuous. If array, then it should be either a boolean mask with shape (n_features,) or array with indices of discrete features. If 'auto', it is assigned to False for dense `X` and to True for sparse `X`.

n_neighbors : int, default 3
    Number of neighbors to use for MI estimation for continuous variables, see [2] and [3]. Higher values reduce variance of the estimation, but could introduce a bias.

copy : bool, default True
    Whether to make a copy of the given data. If set to False, the initial data will be overwritten.

random_state : int, RandomState instance or None, optional, default None
    Determines random number generation for adding small noise to continuous variables in order to remove repeated values. Pass an int for reproducible results across multiple function calls. See :term:`Glossary <random_state>`.

Returns
-------
mi : ndarray, shape (n_features,)
    Estimated mutual information between each feature and the target.

Notes
-----
1. The term 'discrete features' is used instead of naming them 'categorical', because it describes the essence more accurately. For example, pixel intensities of an image are discrete features (but hardly categorical) and you will get better results if you mark them as such. Also note that treating a continuous variable as discrete and vice versa will usually give incorrect results, so be careful about that.
2. True mutual information can't be negative. If its estimate turns out to be negative, it is replaced by zero.

References
----------
[1] `Mutual Information <https://en.wikipedia.org/wiki/Mutual_information>`_ on Wikipedia.

[2] A. Kraskov, H. Stogbauer and P. Grassberger, 'Estimating mutual information'. Phys. Rev. E 69, 2004.

[3] B. C. Ross 'Mutual Information between Discrete and Continuous Data Sets'. PLoS ONE 9(2), 2014.

[4] L. F. Kozachenko, N. N. Leonenko, 'Sample Estimate of the Entropy of a Random Vector', Probl. Peredachi Inf., 23:2 (1987), 9-16