package sklearn

module GridSearchCV : sig ... end
module GroupKFold : sig ... end
module GroupShuffleSplit : sig ... end
module KFold : sig ... end
module LeaveOneGroupOut : sig ... end
module LeaveOneOut : sig ... end
module LeavePGroupsOut : sig ... end
module LeavePOut : sig ... end
module ParameterGrid : sig ... end
module ParameterSampler : sig ... end
module PredefinedSplit : sig ... end
module RandomizedSearchCV : sig ... end
module RepeatedKFold : sig ... end
module RepeatedStratifiedKFold : sig ... end
module ShuffleSplit : sig ... end
module StratifiedKFold : sig ... end
module StratifiedShuffleSplit : sig ... end
module TimeSeriesSplit : sig ... end
val check_cv : ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?y:Ndarray.t -> ?classifier:bool -> unit -> Py.Object.t

Input checker utility for building a cross-validator

Parameters ---------- cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross-validation,
  • integer, to specify the number of folds.
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if classifier is True and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value changed from 3-fold to 5-fold.

y : array-like, optional The target variable for supervised learning problems.

classifier : boolean, optional, default False Whether the task is a classification task, in which case stratified KFold will be used.

Returns ------- checked_cv : a cross-validator instance. The return value is a cross-validator which generates the train/test splits via the ``split`` method.
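The selection rule for ``cv`` described above can be sketched in plain Python. This is only an illustration of the documented rule, not the library's code: ``choose_splitter`` and its ``target_type`` argument are hypothetical names standing in for the real inspection of ``y``.

```python
def choose_splitter(cv=None, classifier=False, target_type="continuous"):
    """Return (splitter_name, n_splits) following the documented rule.

    `target_type` is a hypothetical stand-in for inspecting ``y``.
    """
    if cv is None:
        cv = 5  # documented default since version 0.22
    if isinstance(cv, int):
        # Stratification applies only to classifiers with binary/multiclass y.
        if classifier and target_type in ("binary", "multiclass"):
            return ("StratifiedKFold", cv)
        return ("KFold", cv)
    # CV splitters and iterables of (train, test) pairs are used as given.
    return ("as given", None)


default_choice = choose_splitter()
classifier_choice = choose_splitter(3, classifier=True, target_type="binary")
```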

val cross_val_predict : ?y:Ndarray.t -> ?groups:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?n_jobs:[ `Int of int | `None ] -> ?verbose:int -> ?fit_params:Py.Object.t -> ?pre_dispatch:[ `Int of int | `String of string ] -> ?method_:string -> estimator:Py.Object.t -> x:Ndarray.t -> unit -> Ndarray.t

Generate cross-validated estimates for each input data point

The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.

Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from :func:`cross_validate` and :func:`cross_val_score` unless all test sets have equal size and the metric decomposes over samples.
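To make "each sample belongs to exactly one test set" concrete, here is a minimal pure-Python sketch. It assumes contiguous, unshuffled folds and uses a toy "estimator" that predicts the mean of its training targets; ``kfold_indices`` and ``out_of_fold_mean_predictions`` are hypothetical helpers, not part of the library.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds.

    The first n_samples % n_splits folds receive one extra sample.
    """
    start = 0
    for k in range(n_splits):
        size = n_samples // n_splits + (1 if k < n_samples % n_splits else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def out_of_fold_mean_predictions(y, n_splits):
    """Predict each sample with a model fitted without that sample."""
    preds = [None] * len(y)
    for train, test in kfold_indices(len(y), n_splits):
        fold_mean = sum(y[i] for i in train) / len(train)  # "fit"
        for i in test:
            preds[i] = fold_mean  # "predict" on the held-out part
    return preds


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
preds = out_of_fold_mean_predictions(y, n_splits=3)
# preds == [4.5, 4.5, 3.5, 3.5, 2.5, 2.5]
```

Every position in ``preds`` is filled exactly once, by the model that did not see that sample.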

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' and 'predict' The object to use to fit the data.

X : array-like The data to fit. Can be, for example, a list or an array that is at least 2d.

y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.

groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" :term:`cv` instance (e.g., :class:`GroupKFold`).

cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_jobs : int or None, optional (default=None) The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

verbose : integer, optional The verbosity level.

fit_params : dict, optional Parameters to pass to the fit method of the estimator.

pre_dispatch : int, or string, optional Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

method : string, optional, default: 'predict' Invokes the passed method name of the passed estimator. For method='predict_proba', the columns correspond to the classes in sorted order.

Returns ------- predictions : ndarray This is the result of calling ``method``

See also -------- cross_val_score : calculate score for each CV split

cross_validate : calculate one or more scores and timings for each CV split

Notes ----- In the case that one or more classes are absent in a training portion, a default score needs to be assigned to all instances for that class if ``method`` produces columns per class, as in 'decision_function', 'predict_proba', 'predict_log_proba'. For ``predict_proba`` this value is 0. In order to ensure finite output, we approximate negative infinity by the minimum finite float value for the dtype in other cases.
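The padding described in this note can be pictured as follows for ``predict_proba``. This is a hedged sketch of the idea only: ``pad_missing_classes`` is a hypothetical helper, and the class labels are made up for illustration.

```python
def pad_missing_classes(fold_probs, fold_classes, all_classes, fill=0.0):
    """Expand per-fold probability rows to one column per class overall.

    Classes absent from the fold's training portion receive `fill`.
    """
    column = {label: j for j, label in enumerate(fold_classes)}
    return [
        [row[column[label]] if label in column else fill for label in all_classes]
        for row in fold_probs
    ]


# The fold was trained without any sample of class "b".
padded = pad_missing_classes(
    [[0.7, 0.3], [0.2, 0.8]],
    fold_classes=["a", "c"],
    all_classes=["a", "b", "c"],
)
# padded == [[0.7, 0.0, 0.3], [0.2, 0.0, 0.8]]
```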

Examples --------
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_predict
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> y_pred = cross_val_predict(lasso, X, y, cv=3)

val cross_val_score : ?y:Ndarray.t -> ?groups:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?scoring:[ `String of string | `Callable of Py.Object.t | `None ] -> ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?n_jobs:[ `Int of int | `None ] -> ?verbose:int -> ?fit_params:Py.Object.t -> ?pre_dispatch:[ `Int of int | `String of string ] -> ?error_score:[ `Raise | `PyObject of Py.Object.t ] -> estimator:Py.Object.t -> x:Ndarray.t -> unit -> Py.Object.t

Evaluate a score by cross-validation

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' The object to use to fit the data.

X : array-like The data to fit. Can be, for example, a list or an array.

y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.

groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" :term:`cv` instance (e.g., :class:`GroupKFold`).

scoring : string, callable or None, optional, default: None A string (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)`` which should return only a single value.

Similar to :func:`cross_validate` but only a single metric is permitted.

If None, the estimator's default scorer (if available) is used.

cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_jobs : int or None, optional (default=None) The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

verbose : integer, optional The verbosity level.

fit_params : dict, optional Parameters to pass to the fit method of the estimator.

pre_dispatch : int, or string, optional Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

error_score : 'raise' or numeric Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
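The ``error_score`` behaviour can be sketched as a try/except around the fit step. This is an illustration under stated assumptions: ``fit_and_score`` is a hypothetical helper, and a plain ``UserWarning`` stands in for ``FitFailedWarning``.

```python
import math
import warnings


def fit_and_score(fit, score, error_score=math.nan):
    """Fit, then score; on a fit failure either re-raise or warn and substitute."""
    try:
        model = fit()
    except Exception as exc:
        if error_score == "raise":
            raise
        # Stand-in for FitFailedWarning in this sketch.
        warnings.warn(f"Fit failed, score set to {error_score!r}: {exc}", UserWarning)
        return error_score
    return score(model)


def broken_fit():
    raise ValueError("singular matrix")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    failed_score = fit_and_score(broken_fit, lambda model: 1.0, error_score=-1.0)

ok_score = fit_and_score(lambda: "model", lambda model: 0.9)
```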

Returns ------- scores : array of float, shape=(len(list(cv)),) Array of scores of the estimator for each run of the cross validation.

Examples --------
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y, cv=3))
[0.33150734 0.08022311 0.03531764]

See Also --------- :func:`sklearn.model_selection.cross_validate`: To run cross-validation on multiple metrics and also to return train scores, fit times and score times.

:func:`sklearn.model_selection.cross_val_predict`: Get predictions from each split of cross-validation for diagnostic purposes.

:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.

val cross_validate : ?y:Ndarray.t -> ?groups:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?scoring: [ `String of string | `Callable of Py.Object.t | `Dict of Py.Object.t | `None | `PyObject of Py.Object.t ] -> ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?n_jobs:[ `Int of int | `None ] -> ?verbose:int -> ?fit_params:Py.Object.t -> ?pre_dispatch:[ `Int of int | `String of string ] -> ?return_train_score:bool -> ?return_estimator:bool -> ?error_score:[ `Raise | `PyObject of Py.Object.t ] -> estimator:Py.Object.t -> x:Ndarray.t -> unit -> Py.Object.t

Evaluate metric(s) by cross-validation and also record fit/score times.

Read more in the :ref:`User Guide <multimetric_cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' The object to use to fit the data.

X : array-like The data to fit. Can be, for example, a list or an array.

y : array-like, optional, default: None The target variable to try to predict in the case of supervised learning.

groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" :term:`cv` instance (e.g., :class:`GroupKFold`).

scoring : string, callable, list/tuple, dict or None, default: None A single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See :ref:`multimetric_grid_search` for an example.

If None, the estimator's score method is used.

cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_jobs : int or None, optional (default=None) The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

verbose : integer, optional The verbosity level.

fit_params : dict, optional Parameters to pass to the fit method of the estimator.

pre_dispatch : int, or string, optional Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

return_train_score : boolean, default=False Whether to include train scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.

return_estimator : boolean, default False Whether to return the estimators fitted on each split.

error_score : 'raise' or numeric Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

Returns ------- scores : dict of float arrays of shape (n_splits,) Array of scores of the estimator for each run of the cross validation.

A dict of arrays containing the score/time arrays for each scorer is returned. The possible keys for this ``dict`` are:

``test_score`` The score array for test scores on each cv split. Suffix ``_score`` in ``test_score`` changes to a specific metric like ``test_r2`` or ``test_auc`` if there are multiple scoring metrics in the scoring parameter.

``train_score`` The score array for train scores on each cv split. Suffix ``_score`` in ``train_score`` changes to a specific metric like ``train_r2`` or ``train_auc`` if there are multiple scoring metrics in the scoring parameter. This is available only if the ``return_train_score`` parameter is ``True``.

``fit_time`` The time for fitting the estimator on the train set for each cv split.

``score_time`` The time for scoring the estimator on the test set for each cv split. (Note that the time for scoring on the train set is not included even if ``return_train_score`` is set to ``True``.)

``estimator`` The estimator objects for each cv split. This is available only if the ``return_estimator`` parameter is set to ``True``.
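The key-naming rule above can be summarised with a small sketch. ``result_keys`` is a hypothetical helper that only mirrors the documented naming; it does not run any cross-validation.

```python
def result_keys(scoring=None, return_train_score=False, return_estimator=False):
    """List the dict keys expected for a given configuration, sorted."""
    single = scoring is None or isinstance(scoring, str) or callable(scoring)
    # A single metric keeps the generic "score" suffix; several metrics
    # (list/tuple of names, or dict keyed by name) use their own names.
    names = ["score"] if single else list(scoring)
    keys = ["fit_time", "score_time"]
    if return_estimator:
        keys.append("estimator")
    keys += [f"test_{name}" for name in names]
    if return_train_score:
        keys += [f"train_{name}" for name in names]
    return sorted(keys)


single_metric_keys = result_keys()
multi_metric_keys = result_keys(("r2", "neg_mean_squared_error"), return_train_score=True)
```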

Examples --------
>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_validate
>>> from sklearn.metrics import make_scorer
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.svm import LinearSVC
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()

Single metric evaluation using ``cross_validate``

>>> cv_results = cross_validate(lasso, X, y, cv=3)
>>> sorted(cv_results.keys())
['fit_time', 'score_time', 'test_score']
>>> cv_results['test_score']
array([0.33150734, 0.08022311, 0.03531764])

Multiple metric evaluation using ``cross_validate`` (refer to the ``scoring`` parameter doc for more information)

>>> scores = cross_validate(lasso, X, y, cv=3,
...                         scoring=('r2', 'neg_mean_squared_error'),
...                         return_train_score=True)
>>> print(scores['test_neg_mean_squared_error'])
[-3635.5... -3573.3... -6114.7...]
>>> print(scores['train_r2'])
[0.28010158 0.39088426 0.22784852]

See Also --------- :func:`sklearn.model_selection.cross_val_score`: Run cross-validation for single metric evaluation.

:func:`sklearn.model_selection.cross_val_predict`: Get predictions from each split of cross-validation for diagnostic purposes.

:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.

val fit_grid_point : ?error_score:[ `Raise | `PyObject of Py.Object.t ] -> ?fit_params:(string * Py.Object.t) list -> x: [ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t | `ArrayLike of Py.Object.t ] -> y:[ `Ndarray of Ndarray.t | `None ] -> estimator:Py.Object.t -> parameters:Py.Object.t -> train:[ `Ndarray of Ndarray.t | `Bool of bool | `PyObject of Py.Object.t ] -> test:[ `Ndarray of Ndarray.t | `Bool of bool | `PyObject of Py.Object.t ] -> scorer:[ `Callable of Py.Object.t | `None ] -> verbose:int -> unit -> float * Py.Object.t * int

Run fit on one set of parameters.

Parameters ---------- X : array-like, sparse matrix or list Input data.

y : array-like or None Targets for input data.

estimator : estimator object An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a ``score`` function, or ``scoring`` must be passed.

parameters : dict Parameters to be set on estimator for this grid point.

train : ndarray, dtype int or bool Boolean mask or indices for training set.

test : ndarray, dtype int or bool Boolean mask or indices for test set.

scorer : callable or None The scorer callable object / function must have its signature as ``scorer(estimator, X, y)``.

If ``None`` the estimator's score method is used.

verbose : int Verbosity level.

**fit_params : kwargs Additional parameters passed to the fit function of the estimator.

error_score : 'raise' or numeric Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error. Default is ``np.nan``.

Returns ------- score : float Score of this parameter setting on given test split.

parameters : dict The parameters that have been evaluated.

n_samples_test : int Number of test samples in this split.

val learning_curve : ?groups:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?train_sizes: [ `Ndarray of Ndarray.t | `Int of int | `PyObject of Py.Object.t ] -> ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?scoring:[ `String of string | `Callable of Py.Object.t | `None ] -> ?exploit_incremental_learning:bool -> ?n_jobs:[ `Int of int | `None ] -> ?pre_dispatch:[ `Int of int | `String of string ] -> ?verbose:int -> ?shuffle:bool -> ?random_state:[ `Int of int | `RandomState of Py.Object.t | `None ] -> ?error_score:[ `Raise | `PyObject of Py.Object.t ] -> ?return_times:bool -> estimator:Py.Object.t -> x:Ndarray.t -> y:Ndarray.t -> unit -> Py.Object.t * Ndarray.t * Ndarray.t * Ndarray.t * Ndarray.t

Learning curve.

Determines cross-validated training and test scores for different training set sizes.

A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.

Read more in the :ref:`User Guide <learning_curve>`.

Parameters ---------- estimator : object type that implements the "fit" and "predict" methods An object of that type which is cloned for each validation.

X : array-like, shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples) or (n_samples, n_features), optional Target relative to X for classification or regression; None for unsupervised learning.

groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" :term:`cv` instance (e.g., :class:`GroupKFold`).

train_sizes : array-like, shape (n_ticks,), dtype float or int Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class. (default: np.linspace(0.1, 1.0, 5))

cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

scoring : string, callable or None, optional, default: None A string (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)``.

exploit_incremental_learning : boolean, optional, default: False If the estimator supports incremental learning, this will be used to speed up fitting for different training set sizes.

n_jobs : int or None, optional (default=None) Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

pre_dispatch : integer or string, optional Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The string can be an expression like '2*n_jobs'.

verbose : integer, optional Controls the verbosity: the higher, the more messages.

shuffle : boolean, optional Whether to shuffle training data before taking prefixes of it based on ``train_sizes``.

random_state : int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by `np.random`. Used when ``shuffle`` is True.

error_score : 'raise' or numeric Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

return_times : boolean, optional (default: False) Whether to return the fit and score times.

Returns ------- train_sizes_abs : array, shape (n_unique_ticks,), dtype int Numbers of training examples that have been used to generate the learning curve. Note that the number of ticks might be less than n_ticks because duplicate entries will be removed.

train_scores : array, shape (n_ticks, n_cv_folds) Scores on training sets.

test_scores : array, shape (n_ticks, n_cv_folds) Scores on test set.

fit_times : array, shape (n_ticks, n_cv_folds) Times spent for fitting in seconds. Only present if ``return_times`` is True.

score_times : array, shape (n_ticks, n_cv_folds) Times spent for scoring in seconds. Only present if ``return_times`` is True.
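How ``train_sizes`` relates to the returned ``train_sizes_abs`` can be sketched as below. This is a simplified illustration, not the library's routine: ``absolute_train_sizes`` is a hypothetical helper, and truncating fractional sizes and clipping them to at least one sample are assumptions of this sketch.

```python
def absolute_train_sizes(train_sizes, n_max_training_samples):
    """Translate relative or absolute sizes into sorted, unique sample counts."""
    if all(isinstance(s, float) for s in train_sizes):
        if any(s <= 0.0 or s > 1.0 for s in train_sizes):
            raise ValueError("fractional train sizes must lie in (0, 1]")
        # Assumption: truncate, then keep at least 1 and at most the maximum.
        sizes = [int(s * n_max_training_samples) for s in train_sizes]
        sizes = [min(max(s, 1), n_max_training_samples) for s in sizes]
    else:
        if any(s <= 0 or s > n_max_training_samples for s in train_sizes):
            raise ValueError("absolute train sizes must lie in (0, n_max]")
        sizes = [int(s) for s in train_sizes]
    # Duplicate ticks are removed, so fewer than n_ticks values may remain.
    return sorted(set(sizes))


large = absolute_train_sizes([0.25, 0.5, 1.0], 80)  # [20, 40, 80]
small = absolute_train_sizes([0.25, 0.5, 1.0], 3)   # [1, 3]
```

With only three training samples available, two of the three ticks collapse to the same size, which is why the output can be shorter than the input.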

Notes ----- See :ref:`examples/model_selection/plot_learning_curve.py <sphx_glr_auto_examples_model_selection_plot_learning_curve.py>`

val permutation_test_score : ?groups:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?n_permutations:int -> ?n_jobs:[ `Int of int | `None ] -> ?random_state:[ `Int of int | `RandomState of Py.Object.t | `None ] -> ?verbose:int -> ?scoring:[ `String of string | `Callable of Py.Object.t | `None ] -> estimator:Py.Object.t -> x:Py.Object.t -> y:Ndarray.t -> unit -> float * Ndarray.t * float

Evaluate the significance of a cross-validated score with permutations

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' The object to use to fit the data.

X : array-like of shape at least 2D The data to fit.

y : array-like The target variable to try to predict in the case of supervised learning.

groups : array-like, with shape (n_samples,), optional Labels to constrain permutation within groups, i.e. ``y`` values are permuted among samples with the same group identifier. When not specified, ``y`` values are permuted among all samples.

When a grouped cross-validator is used, the group labels are also passed on to the ``split`` method of the cross-validator. The cross-validator uses them for grouping the samples while splitting the dataset into train/test set.

scoring : string, callable or None, optional, default: None A single string (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set.

If None the estimator's score method is used.

cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_permutations : integer, optional Number of times to permute ``y``.

n_jobs : int or None, optional (default=None) The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

random_state : int, RandomState instance or None, optional (default=0) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by `np.random`.

verbose : integer, optional The verbosity level.

Returns ------- score : float The true score without permuting targets.

permutation_scores : array, shape (n_permutations,) The scores obtained for each permutation.

pvalue : float The p-value, which approximates the probability that the score would be obtained by chance. This is calculated as:

`(C + 1) / (n_permutations + 1)`

Where C is the number of permutations whose score >= the true score.

The best possible p-value is 1/(n_permutations + 1), the worst is 1.0.
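The p-value formula above is simple enough to check by hand. The following sketch computes it from a true score and a list of permutation scores; ``permutation_p_value`` is a hypothetical helper name and the scores are made-up numbers.

```python
def permutation_p_value(true_score, permutation_scores):
    """Compute (C + 1) / (n_permutations + 1), where C counts the
    permutations that score at least as well as the true score."""
    c = sum(1 for s in permutation_scores if s >= true_score)
    return (c + 1) / (len(permutation_scores) + 1)


p_one_tie = permutation_p_value(0.9, [0.5, 0.6, 0.95, 0.4])  # C = 1 -> 2/5
p_best = permutation_p_value(0.9, [0.5, 0.6, 0.7, 0.4])      # C = 0 -> 1/5
p_worst = permutation_p_value(0.1, [0.5, 0.6, 0.7, 0.4])     # C = 4 -> 5/5
```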

Notes ----- This function implements Test 1 in:

Ojala and Garriga. Permutation Tests for Studying Classifier Performance. The Journal of Machine Learning Research (2010) vol. 11
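The effect of the ``groups`` parameter on the permutation step can be illustrated with a small pure-Python sketch: targets are shuffled only among samples that share a group label. ``permute_within_groups`` is a hypothetical helper and the data are made up; the library's own shuffling may differ in detail.

```python
import random


def permute_within_groups(y, groups=None, seed=0):
    """Return a permuted copy of y; with groups, shuffle only inside each group."""
    rng = random.Random(seed)
    if groups is None:
        return rng.sample(list(y), len(y))  # permute among all samples
    permuted = list(y)
    for group in sorted(set(groups)):
        positions = [i for i, g in enumerate(groups) if g == group]
        sources = positions[:]
        rng.shuffle(sources)
        for dst, src in zip(positions, sources):
            permuted[dst] = y[src]
    return permuted


y = ["a", "b", "c", "d", "e", "f"]
groups = [0, 0, 0, 1, 1, 1]
permuted = permute_within_groups(y, groups)
```

Whatever the seed, the first three entries of ``permuted`` are a rearrangement of ``"a"``, ``"b"``, ``"c"`` and the last three a rearrangement of ``"d"``, ``"e"``, ``"f"``.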

val train_test_split : ?test_size:[ `Float of float | `Int of int | `None ] -> ?train_size:[ `Float of float | `Int of int | `None ] -> ?random_state:[ `Int of int | `RandomState of Py.Object.t | `None ] -> ?shuffle:bool -> ?stratify:[ `Ndarray of Ndarray.t | `None ] -> Ndarray.t list -> Ndarray.t array

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))`` and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- *arrays : sequence of indexables with same length / shape[0] Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

test_size : float, int or None, optional (default=None) If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If ``train_size`` is also None, it will be set to 0.25.

train_size : float, int, or None, (default=None) If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by `np.random`.

shuffle : boolean, optional (default=True) Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like or None (default=None) If not None, data is split in a stratified fashion, using this as the class labels.

Returns ------- splitting : list, length=2 * len(arrays) List containing train-test split of inputs.

.. versionadded:: 0.16 If the input is sparse, the output will be a ``scipy.sparse.csr_matrix``. Else, output type is the same as the input type.
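The interplay of ``test_size`` and ``train_size`` can be sketched as follows. ``split_sizes`` is a hypothetical helper; rounding the test count up and the train count down for float inputs is an assumption of this sketch, chosen because it agrees with the 3/2 split of five samples shown in the examples for this function.

```python
import math


def split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for the given size specifications."""
    if test_size is None and train_size is None:
        test_size = 0.25  # documented fallback when both are None
    n_test = n_train = None
    if isinstance(test_size, float):
        n_test = math.ceil(test_size * n_samples)    # assumption: round up
    elif test_size is not None:
        n_test = test_size
    if isinstance(train_size, float):
        n_train = math.floor(train_size * n_samples)  # assumption: round down
    elif train_size is not None:
        n_train = train_size
    # Whichever side is unspecified becomes the complement of the other.
    if n_train is None:
        n_train = n_samples - n_test
    if n_test is None:
        n_test = n_samples - n_train
    return n_train, n_test


default_split = split_sizes(5)                    # (3, 2)
fractional_split = split_sizes(5, test_size=0.33) # (3, 2)
absolute_split = split_sizes(10, test_size=3)     # (7, 3)
```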

Examples --------
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

val validation_curve : ?groups:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?cv: [ `Int of int | `CrossValGenerator of Py.Object.t | `Ndarray of Ndarray.t ] -> ?scoring:[ `String of string | `Callable of Py.Object.t | `None ] -> ?n_jobs:[ `Int of int | `None ] -> ?pre_dispatch:[ `Int of int | `String of string ] -> ?verbose:int -> ?error_score:[ `Raise | `PyObject of Py.Object.t ] -> estimator:Py.Object.t -> x:Ndarray.t -> y:Ndarray.t -> param_name:string -> param_range:Ndarray.t -> unit -> Ndarray.t * Ndarray.t

Validation curve.

Determine training and test scores for varying parameter values.

Compute scores for an estimator with different values of a specified parameter. This is similar to grid search with one parameter. However, this will also compute training scores and is merely a utility for plotting the results.
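The shape of the computation can be illustrated with a toy, pure-Python sketch. Everything here is hypothetical: the "estimator" predicts ``shrink * mean(y_train)``, ``shrink`` plays the role of the varied parameter, the score is the negative mean squared error, and the folds are contiguous with ``n_samples`` assumed divisible by ``n_splits``.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train, test) index lists for unshuffled, equally sized folds."""
    fold = n_samples // n_splits
    for k in range(n_splits):
        test = list(range(k * fold, (k + 1) * fold))
        train = [i for i in range(n_samples) if i not in test]
        yield train, test


def neg_mse(y_true, y_pred):
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_validation_curve(y, param_range, n_splits):
    """Return (train_scores, test_scores), each of shape (n_values, n_folds)."""
    train_scores, test_scores = [], []
    for shrink in param_range:
        train_row, test_row = [], []
        for train, test in contiguous_folds(len(y), n_splits):
            pred = shrink * sum(y[i] for i in train) / len(train)  # "fit"
            train_row.append(neg_mse([y[i] for i in train], [pred] * len(train)))
            test_row.append(neg_mse([y[i] for i in test], [pred] * len(test)))
        train_scores.append(train_row)
        test_scores.append(test_row)
    return train_scores, test_scores


y = [2.0, 2.0, 2.0, 2.0]
train_scores, test_scores = toy_validation_curve(y, [0.0, 0.5, 1.0], n_splits=2)
```

Each row corresponds to one parameter value and each column to one fold, so plotting the row means against the parameter values gives the training and validation curves.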

Read more in the :ref:`User Guide <learning_curve>`.

Parameters ---------- estimator : object type that implements the "fit" and "predict" methods An object of that type which is cloned for each validation.

X : array-like, shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape (n_samples) or (n_samples, n_features), optional Target relative to X for classification or regression; None for unsupervised learning.

param_name : string Name of the parameter that will be varied.

param_range : array-like, shape (n_values,) The values of the parameter that will be evaluated.

groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" :term:`cv` instance (e.g., :class:`GroupKFold`).

cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

scoring : string, callable or None, optional, default: None A string (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)``.

n_jobs : int or None, optional (default=None) Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

pre_dispatch : integer or string, optional Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The string can be an expression like '2*n_jobs'.

verbose : integer, optional Controls the verbosity: the higher, the more messages.

error_score : 'raise' or numeric Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

Returns ------- train_scores : array, shape (n_ticks, n_cv_folds) Scores on training sets.

test_scores : array, shape (n_ticks, n_cv_folds) Scores on test set.

Notes ----- See :ref:`sphx_glr_auto_examples_model_selection_plot_validation_curve.py`
