package sklearn

val get_py : string -> Py.Object.t

Get an attribute of this module as a Py.Object.t. This is useful to pass a Python function to another function.
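
For example, a hedged sketch (assuming this module is opened; ``"train_test_split"`` is just an illustrative attribute name of the underlying Python module):

  let py_train_test_split : Py.Object.t = get_py "train_test_split"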

module BaseCrossValidator : sig ... end
module GridSearchCV : sig ... end
module GroupKFold : sig ... end
module GroupShuffleSplit : sig ... end
module KFold : sig ... end
module LeaveOneGroupOut : sig ... end
module LeaveOneOut : sig ... end
module LeavePGroupsOut : sig ... end
module LeavePOut : sig ... end
module ParameterGrid : sig ... end
module ParameterSampler : sig ... end
module PredefinedSplit : sig ... end
module RandomizedSearchCV : sig ... end
module RepeatedKFold : sig ... end
module RepeatedStratifiedKFold : sig ... end
module ShuffleSplit : sig ... end
module StratifiedKFold : sig ... end
module StratifiedShuffleSplit : sig ... end
module TimeSeriesSplit : sig ... end
val check_cv : ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?y:[> `ArrayLike ] Np.Obj.t -> ?classifier:bool -> unit -> [ `BaseCrossValidator | `Object ] Np.Obj.t

Input checker utility for building a cross-validator

Parameters ---------- cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • integer, to specify the number of folds.
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if classifier is True and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value changed from 3-fold to 5-fold.

y : array-like, default=None The target variable for supervised learning problems.

classifier : bool, default=False Whether the task is a classification task, in which case stratified KFold will be used.

Returns ------- checked_cv : a cross-validator instance. The return value is a cross-validator which generates the train/test splits via the ``split`` method.
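
A minimal OCaml sketch of a call, assuming this module is opened and ``y`` is an ArrayLike of class labels built elsewhere with the Np bindings; with an int ``cv`` and ``classifier:true`` the returned splitter behaves like :class:`StratifiedKFold`, otherwise like :class:`KFold`:

  let make_splitter ~y =
    (* 5-fold splitting, stratified because classifier=true and y holds class labels *)
    check_cv ~cv:(`I 5) ~y ~classifier:true ()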

val cross_val_predict : ?y:[> `ArrayLike ] Np.Obj.t -> ?groups:[> `ArrayLike ] Np.Obj.t -> ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?n_jobs:int -> ?verbose:int -> ?fit_params:[ `Defualt_None of Py.Object.t | `Dict of Dict.t ] -> ?pre_dispatch:[ `S of string | `I of int ] -> ?method_:string -> estimator:[> `BaseEstimator ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t

Generate cross-validated estimates for each input data point

The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.

Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from :func:`cross_validate` and :func:`cross_val_score` unless all test sets have equal size and the metric decomposes over samples.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' and 'predict' The object to use to fit the data.

X : array-like of shape (n_samples, n_features) The data to fit. Can be, for example, a list or an array that is at least 2d.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a 'Group' :term:`cv` instance (e.g., :class:`GroupKFold`).

cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • int, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_jobs : int, default=None The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

verbose : int, default=0 The verbosity level.

fit_params : dict, default=None Parameters to pass to the fit method of the estimator.

pre_dispatch : int or str, default='2*n_jobs' Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A str, giving an expression as a function of n_jobs, as in '2*n_jobs'

method : str, default='predict' Invokes the passed method name of the passed estimator. For method='predict_proba', the columns correspond to the classes in sorted order.

Returns ------- predictions : ndarray This is the result of calling ``method``

See also -------- cross_val_score : calculate score for each CV split

cross_validate : calculate one or more scores and timings for each CV split

Notes ----- In the case that one or more classes are absent in a training portion, a default score needs to be assigned to all instances for that class if ``method`` produces columns per class, as in 'decision_function', 'predict_proba', 'predict_log_proba'. For ``predict_proba`` this value is 0. In order to ensure finite output, we approximate negative infinity by the minimum finite float value for the dtype in other cases.

Examples -------- >>> from sklearn import datasets, linear_model >>> from sklearn.model_selection import cross_val_predict >>> diabetes = datasets.load_diabetes() >>> X = diabetes.data[:150] >>> y = diabetes.target[:150] >>> lasso = linear_model.Lasso() >>> y_pred = cross_val_predict(lasso, X, y, cv=3)
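
The same call through this binding, as a hedged sketch: ``estimator``, ``x`` and ``y`` are assumed to be values built with the other Sklearn and Np modules (e.g. a Lasso estimator and the diabetes data), mirroring the Python example above with ``cv=3``:

  let out_of_fold_predictions ~estimator ~x ~y =
    (* pass ~method_:"predict_proba" instead to collect per-class probabilities *)
    cross_val_predict ~cv:(`I 3) ~estimator ~x ~y ()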

val cross_val_score : ?y:[> `ArrayLike ] Np.Obj.t -> ?groups:[> `ArrayLike ] Np.Obj.t -> ?scoring: [ `Neg_mean_absolute_error | `Completeness_score | `Roc_auc_ovr | `Neg_mean_squared_log_error | `Neg_mean_gamma_deviance | `Precision_macro | `R2 | `Precision_micro | `F1_weighted | `Balanced_accuracy | `Neg_mean_squared_error | `F1_samples | `Jaccard_micro | `Normalized_mutual_info_score | `F1_micro | `Roc_auc | `Mutual_info_score | `Adjusted_rand_score | `Average_precision | `Jaccard | `Homogeneity_score | `Accuracy | `Jaccard_macro | `Jaccard_weighted | `Recall_micro | `Explained_variance | `Precision | `Callable of Py.Object.t | `V_measure_score | `F1 | `Roc_auc_ovo | `Neg_mean_poisson_deviance | `Recall_samples | `Adjusted_mutual_info_score | `Neg_brier_score | `Roc_auc_ovo_weighted | `Recall | `Fowlkes_mallows_score | `Neg_log_loss | `Neg_root_mean_squared_error | `Precision_samples | `F1_macro | `Roc_auc_ovr_weighted | `Recall_weighted | `Neg_median_absolute_error | `Jaccard_samples | `Precision_weighted | `Max_error | `Recall_macro ] -> ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?n_jobs:int -> ?verbose:int -> ?fit_params:Dict.t -> ?pre_dispatch:[ `S of string | `I of int ] -> ?error_score:[ `Raise | `I of int | `F of float ] -> estimator:[> `BaseEstimator ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t

Evaluate a score by cross-validation

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' The object to use to fit the data.

X : array-like of shape (n_samples, n_features) The data to fit. Can be, for example, a list or an array.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a 'Group' :term:`cv` instance (e.g., :class:`GroupKFold`).

scoring : str or callable, default=None A str (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)`` which should return only a single value.

Similar to :func:`cross_validate` but only a single metric is permitted.

If None, the estimator's default scorer (if available) is used.

cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • int, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_jobs : int, default=None The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

verbose : int, default=0 The verbosity level.

fit_params : dict, default=None Parameters to pass to the fit method of the estimator.

pre_dispatch : int or str, default='2*n_jobs' Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A str, giving an expression as a function of n_jobs, as in '2*n_jobs'

error_score : 'raise' or numeric, default=np.nan Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

.. versionadded:: 0.20

Returns ------- scores : array of float, shape=(len(list(cv)),) Array of scores of the estimator for each run of the cross validation.

Examples -------- >>> from sklearn import datasets, linear_model >>> from sklearn.model_selection import cross_val_score >>> diabetes = datasets.load_diabetes() >>> X = diabetes.data[:150] >>> y = diabetes.target[:150] >>> lasso = linear_model.Lasso() >>> print(cross_val_score(lasso, X, y, cv=3)) [0.33150734 0.08022311 0.03531764]
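
An equivalent call through this binding, sketched under the assumption that ``lasso``, ``x`` and ``y`` were created with the corresponding Sklearn and Np modules:

  let lasso_scores ~lasso ~x ~y =
    (* one score per CV split, here with explicit R2 scoring *)
    cross_val_score ~scoring:(`R2) ~cv:(`I 3) ~estimator:lasso ~x ~y ()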

See Also --------- :func:`sklearn.model_selection.cross_validate`: To run cross-validation on multiple metrics and also to return train scores, fit times and score times.

:func:`sklearn.model_selection.cross_val_predict`: Get predictions from each split of cross-validation for diagnostic purposes.

:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.

val cross_validate : ?y:[> `ArrayLike ] Np.Obj.t -> ?groups:[> `ArrayLike ] Np.Obj.t -> ?scoring: [ `Neg_mean_absolute_error | `Completeness_score | `Roc_auc_ovr | `Neg_mean_squared_log_error | `Neg_mean_gamma_deviance | `Precision_macro | `R2 | `Precision_micro | `F1_weighted | `Balanced_accuracy | `Neg_mean_squared_error | `Scores of [ `Explained_variance | `R2 | `Max_error | `Neg_median_absolute_error | `Neg_mean_absolute_error | `Neg_mean_squared_error | `Neg_mean_squared_log_error | `Neg_root_mean_squared_error | `Neg_mean_poisson_deviance | `Neg_mean_gamma_deviance | `Accuracy | `Roc_auc | `Roc_auc_ovr | `Roc_auc_ovo | `Roc_auc_ovr_weighted | `Roc_auc_ovo_weighted | `Balanced_accuracy | `Average_precision | `Neg_log_loss | `Neg_brier_score | `Adjusted_rand_score | `Homogeneity_score | `Completeness_score | `V_measure_score | `Mutual_info_score | `Adjusted_mutual_info_score | `Normalized_mutual_info_score | `Fowlkes_mallows_score | `Precision | `Precision_macro | `Precision_micro | `Precision_samples | `Precision_weighted | `Recall | `Recall_macro | `Recall_micro | `Recall_samples | `Recall_weighted | `F1 | `F1_macro | `F1_micro | `F1_samples | `F1_weighted | `Jaccard | `Jaccard_macro | `Jaccard_micro | `Jaccard_samples | `Jaccard_weighted ] list | `F1_samples | `Jaccard_micro | `Normalized_mutual_info_score | `F1_micro | `Roc_auc | `Mutual_info_score | `Adjusted_rand_score | `Average_precision | `Jaccard | `Homogeneity_score | `Accuracy | `Jaccard_macro | `Jaccard_weighted | `Recall_micro | `Explained_variance | `Precision | `Callable of Py.Object.t | `V_measure_score | `F1 | `Roc_auc_ovo | `Neg_mean_poisson_deviance | `Recall_samples | `Adjusted_mutual_info_score | `Neg_brier_score | `Roc_auc_ovo_weighted | `Recall | `Dict of Dict.t | `Fowlkes_mallows_score | `Neg_log_loss | `Neg_root_mean_squared_error | `Precision_samples | `F1_macro | `Roc_auc_ovr_weighted | `Recall_weighted | `Neg_median_absolute_error | `Jaccard_samples | `Precision_weighted | `Max_error | `Recall_macro ] -> ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?n_jobs:int -> ?verbose:int -> ?fit_params:Dict.t -> ?pre_dispatch:[ `S of string | `I of int ] -> ?return_train_score:bool -> ?return_estimator:bool -> ?error_score:[ `Raise | `I of int | `F of float ] -> estimator:[> `BaseEstimator ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> unit -> Dict.t

Evaluate metric(s) by cross-validation and also record fit/score times.

Read more in the :ref:`User Guide <multimetric_cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' The object to use to fit the data.

X : array-like of shape (n_samples, n_features) The data to fit. Can be, for example, a list or an array.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a 'Group' :term:`cv` instance (e.g., :class:`GroupKFold`).

scoring : str, callable, list/tuple, or dict, default=None A single str (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See :ref:`multimetric_grid_search` for an example.

If None, the estimator's score method is used.

cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • int, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_jobs : int, default=None The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

verbose : int, default=0 The verbosity level.

fit_params : dict, default=None Parameters to pass to the fit method of the estimator.

pre_dispatch : int or str, default='2*n_jobs' Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A str, giving an expression as a function of n_jobs, as in '2*n_jobs'

return_train_score : bool, default=False Whether to include train scores. Computing training scores is used to get insights on how different parameter settings impact the overfitting/underfitting trade-off. However computing the scores on the training set can be computationally expensive and is not strictly required to select the parameters that yield the best generalization performance.

.. versionadded:: 0.19

.. versionchanged:: 0.21 Default value was changed from ``True`` to ``False``

return_estimator : bool, default=False Whether to return the estimators fitted on each split.

.. versionadded:: 0.20

error_score : 'raise' or numeric Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

.. versionadded:: 0.20

Returns ------- scores : dict of float arrays of shape (n_splits,) Array of scores of the estimator for each run of the cross validation.

A dict of arrays containing the score/time arrays for each scorer is returned. The possible keys for this ``dict`` are:

  • ``test_score`` The score array for test scores on each cv split. Suffix ``_score`` in ``test_score`` changes to a specific metric like ``test_r2`` or ``test_auc`` if there are multiple scoring metrics in the scoring parameter.
  • ``train_score`` The score array for train scores on each cv split. Suffix ``_score`` in ``train_score`` changes to a specific metric like ``train_r2`` or ``train_auc`` if there are multiple scoring metrics in the scoring parameter. This is available only if the ``return_train_score`` parameter is ``True``.
  • ``fit_time`` The time for fitting the estimator on the train set for each cv split.
  • ``score_time`` The time for scoring the estimator on the test set for each cv split. (Note: time for scoring on the train set is not included even if ``return_train_score`` is set to ``True``.)
  • ``estimator`` The estimator objects for each cv split. This is available only if the ``return_estimator`` parameter is set to ``True``.

Examples -------- >>> from sklearn import datasets, linear_model >>> from sklearn.model_selection import cross_validate >>> from sklearn.metrics import make_scorer >>> from sklearn.metrics import confusion_matrix >>> from sklearn.svm import LinearSVC >>> diabetes = datasets.load_diabetes() >>> X = diabetes.data[:150] >>> y = diabetes.target[:150] >>> lasso = linear_model.Lasso()

Single metric evaluation using ``cross_validate``

>>> cv_results = cross_validate(lasso, X, y, cv=3) >>> sorted(cv_results.keys()) ['fit_time', 'score_time', 'test_score'] >>> cv_results['test_score'] array([0.33150734, 0.08022311, 0.03531764])

Multiple metric evaluation using ``cross_validate`` (please refer to the ``scoring`` parameter doc for more information)

>>> scores = cross_validate(lasso, X, y, cv=3, ... scoring=('r2', 'neg_mean_squared_error'), ... return_train_score=True) >>> print(scores['test_neg_mean_squared_error']) [-3635.5... -3573.3... -6114.7...] >>> print(scores['train_r2']) [0.28010158 0.39088426 0.22784852]
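
A hedged OCaml sketch of the multi-metric call above; the result is a Dict.t whose keys include ``fit_time``, ``score_time``, ``test_r2`` and ``test_neg_mean_squared_error`` (plus ``train_*`` keys because ``return_train_score`` is set):

  let multi_metric_results ~estimator ~x ~y =
    cross_validate
      ~scoring:(`Scores [`R2; `Neg_mean_squared_error])
      ~cv:(`I 3)
      ~return_train_score:true
      ~estimator ~x ~y ()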

See Also --------- :func:`sklearn.model_selection.cross_val_score`: Run cross-validation for single metric evaluation.

:func:`sklearn.model_selection.cross_val_predict`: Get predictions from each split of cross-validation for diagnostic purposes.

:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.

val fit_grid_point : ?error_score:[ `Raise | `I of int | `F of float ] -> ?fit_params:(string * Py.Object.t) list -> x:[> `ArrayLike ] Np.Obj.t -> y:[ `Arr of [> `ArrayLike ] Np.Obj.t | `None ] -> estimator:[> `BaseEstimator ] Np.Obj.t -> parameters:Dict.t -> train: [ `Bool of bool | `Arr of [> `ArrayLike ] Np.Obj.t | `Dtype_int of Py.Object.t ] -> test: [ `Bool of bool | `Arr of [> `ArrayLike ] Np.Obj.t | `Dtype_int of Py.Object.t ] -> scorer:[ `Callable of Py.Object.t | `None ] -> verbose:int -> unit -> float * Dict.t * int

DEPRECATED: fit_grid_point is deprecated in version 0.23 and will be removed in version 0.25

Run fit on one set of parameters.

Parameters ---------- X : array-like, sparse matrix or list Input data.

y : array-like or None Targets for input data.

estimator : estimator object An object of that type is instantiated for each grid point. This is assumed to implement the scikit-learn estimator interface. Either the estimator needs to provide a ``score`` function, or ``scoring`` must be passed.

parameters : dict Parameters to be set on estimator for this grid point.

train : ndarray, dtype int or bool Boolean mask or indices for training set.

test : ndarray, dtype int or bool Boolean mask or indices for test set.

scorer : callable or None The scorer callable object / function must have its signature as ``scorer(estimator, X, y)``.

If ``None`` the estimator's score method is used.

verbose : int Verbosity level.

**fit_params : kwargs Additional parameters passed to the fit function of the estimator.

error_score : 'raise' or numeric, default=np.nan Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

Returns ------- score : float Score of this parameter setting on given test split.

parameters : dict The parameters that have been evaluated.

n_samples_test : int Number of test samples in this split.

val learning_curve : ?groups:[> `ArrayLike ] Np.Obj.t -> ?train_sizes:[> `ArrayLike ] Np.Obj.t -> ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?scoring: [ `Neg_mean_absolute_error | `Completeness_score | `Roc_auc_ovr | `Neg_mean_squared_log_error | `Neg_mean_gamma_deviance | `Precision_macro | `R2 | `Precision_micro | `F1_weighted | `Balanced_accuracy | `Neg_mean_squared_error | `F1_samples | `Jaccard_micro | `Normalized_mutual_info_score | `F1_micro | `Roc_auc | `Mutual_info_score | `Adjusted_rand_score | `Average_precision | `Jaccard | `Homogeneity_score | `Accuracy | `Jaccard_macro | `Jaccard_weighted | `Recall_micro | `Explained_variance | `Precision | `Callable of Py.Object.t | `V_measure_score | `F1 | `Roc_auc_ovo | `Neg_mean_poisson_deviance | `Recall_samples | `Adjusted_mutual_info_score | `Neg_brier_score | `Roc_auc_ovo_weighted | `Recall | `Fowlkes_mallows_score | `Neg_log_loss | `Neg_root_mean_squared_error | `Precision_samples | `F1_macro | `Roc_auc_ovr_weighted | `Recall_weighted | `Neg_median_absolute_error | `Jaccard_samples | `Precision_weighted | `Max_error | `Recall_macro ] -> ?exploit_incremental_learning:bool -> ?n_jobs:int -> ?pre_dispatch:[ `S of string | `I of int ] -> ?verbose:int -> ?shuffle:bool -> ?random_state:int -> ?error_score:[ `Raise | `I of int | `F of float ] -> ?return_times:bool -> estimator:[> `BaseEstimator ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> y:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Learning curve.

Determines cross-validated training and test scores for different training set sizes.

A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score will be computed for each training subset size and for the test set. Afterwards, the scores will be averaged over all k runs for each training subset size.

Read more in the :ref:`User Guide <learning_curve>`.

Parameters ---------- estimator : object type that implements the 'fit' and 'predict' methods An object of that type which is cloned for each validation.

X : array-like of shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples,) or (n_samples, n_outputs) Target relative to X for classification or regression; None for unsupervised learning.

groups : array-like of shape (n_samples,), default=None Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a 'Group' :term:`cv` instance (e.g., :class:`GroupKFold`).

train_sizes : array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5) Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.

cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • int, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

scoring : str or callable, default=None A str (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)``.

exploit_incremental_learning : bool, default=False If the estimator supports incremental learning, this will be used to speed up fitting for different training set sizes.

n_jobs : int, default=None Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

pre_dispatch : int or str, default='all' Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The str can be an expression like '2*n_jobs'.

verbose : int, default=0 Controls the verbosity: the higher, the more messages.

shuffle : bool, default=False Whether to shuffle training data before taking prefixes of it based on ``train_sizes``.

random_state : int or RandomState instance, default=None Used when ``shuffle`` is True. Pass an int for reproducible output across multiple function calls. See :term:`Glossary <random_state>`.

error_score : 'raise' or numeric, default=np.nan Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

.. versionadded:: 0.20

return_times : bool, default=False Whether to return the fit and score times.

Returns ------- train_sizes_abs : array of shape (n_unique_ticks,) Numbers of training examples that have been used to generate the learning curve. Note that the number of ticks might be less than n_ticks because duplicate entries will be removed.

train_scores : array of shape (n_ticks, n_cv_folds) Scores on training sets.

test_scores : array of shape (n_ticks, n_cv_folds) Scores on test set.

fit_times : array of shape (n_ticks, n_cv_folds) Times spent for fitting in seconds. Only present if ``return_times`` is True.

score_times : array of shape (n_ticks, n_cv_folds) Times spent for scoring in seconds. Only present if ``return_times`` is True.

Notes ----- See :ref:`examples/model_selection/plot_learning_curve.py <sphx_glr_auto_examples_model_selection_plot_learning_curve.py>`
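
A hedged sketch of a typical call, assuming ``estimator``, ``x``, ``y`` and a ``train_sizes`` ArrayLike (e.g. five fractions between 0.1 and 1.0) are built elsewhere with the Np bindings:

  let curve ~estimator ~x ~y ~train_sizes =
    let sizes, train_scores, test_scores, _fit_times, _score_times =
      learning_curve ~train_sizes ~cv:(`I 5) ~shuffle:true ~random_state:0
        ~return_times:true ~estimator ~x ~y ()
    in
    (sizes, train_scores, test_scores)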

val permutation_test_score : ?groups:[> `ArrayLike ] Np.Obj.t -> ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?n_permutations:int -> ?n_jobs:int -> ?random_state:int -> ?verbose:int -> ?scoring: [ `Neg_mean_absolute_error | `Completeness_score | `Roc_auc_ovr | `Neg_mean_squared_log_error | `Neg_mean_gamma_deviance | `Precision_macro | `R2 | `Precision_micro | `F1_weighted | `Balanced_accuracy | `Neg_mean_squared_error | `F1_samples | `Jaccard_micro | `Normalized_mutual_info_score | `F1_micro | `Roc_auc | `Mutual_info_score | `Adjusted_rand_score | `Average_precision | `Jaccard | `Homogeneity_score | `Accuracy | `Jaccard_macro | `Jaccard_weighted | `Recall_micro | `Explained_variance | `Precision | `Callable of Py.Object.t | `V_measure_score | `F1 | `Roc_auc_ovo | `Neg_mean_poisson_deviance | `Recall_samples | `Adjusted_mutual_info_score | `Neg_brier_score | `Roc_auc_ovo_weighted | `Recall | `Fowlkes_mallows_score | `Neg_log_loss | `Neg_root_mean_squared_error | `Precision_samples | `F1_macro | `Roc_auc_ovr_weighted | `Recall_weighted | `Neg_median_absolute_error | `Jaccard_samples | `Precision_weighted | `Max_error | `Recall_macro ] -> estimator:[> `BaseEstimator ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> y:[ `Arr of [> `ArrayLike ] Np.Obj.t | `None ] -> unit -> float * [> `ArrayLike ] Np.Obj.t * float

Evaluate the significance of a cross-validated score with permutations

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- estimator : estimator object implementing 'fit' The object to use to fit the data.

X : array-like of shape at least 2D The data to fit.

y : array-like of shape (n_samples,) or (n_samples, n_outputs) or None The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None Labels to constrain permutation within groups, i.e. ``y`` values are permuted among samples with the same group identifier. When not specified, ``y`` values are permuted among all samples.

When a grouped cross-validator is used, the group labels are also passed on to the ``split`` method of the cross-validator. The cross-validator uses them for grouping the samples while splitting the dataset into train/test set.

scoring : str or callable, default=None A single str (see :ref:`scoring_parameter`) or a callable (see :ref:`scoring`) to evaluate the predictions on the test set.

If None the estimator's score method is used.

cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • int, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

n_permutations : int, default=100 Number of times to permute ``y``.

n_jobs : int, default=None The number of CPUs to use to do the computation. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

random_state : int, RandomState instance or None, default=0 Pass an int for reproducible output for permutation of ``y`` values among samples. See :term:`Glossary <random_state>`.

verbose : int, default=0 The verbosity level.

Returns ------- score : float The true score without permuting targets.

permutation_scores : array of shape (n_permutations,) The scores obtained for each permutation.

pvalue : float The p-value, which approximates the probability that the score would be obtained by chance. This is calculated as:

`(C + 1) / (n_permutations + 1)`

where C is the number of permutations whose score >= the true score.

The best possible p-value is 1/(n_permutations + 1), the worst is 1.0.

Notes ----- This function implements Test 1 in:

Ojala and Garriga. Permutation Tests for Studying Classifier Performance. The Journal of Machine Learning Research (2010) vol. 11 `pdf <http://www.jmlr.org/papers/volume11/ojala10a/ojala10a.pdf>`_.
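
A hedged sketch of a call and of unpacking its (score, permutation_scores, pvalue) result; note that this binding expects ``y`` wrapped in ``Arr`` (inputs assumed to come from the Np bindings):

  let significance ~estimator ~x ~y =
    let score, _permutation_scores, pvalue =
      permutation_test_score ~n_permutations:100 ~cv:(`I 5)
        ~estimator ~x ~y:(`Arr y) ()
    in
    (score, pvalue)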

val train_test_split : ?test_size:[ `I of int | `F of float ] -> ?train_size:[ `I of int | `F of float ] -> ?random_state:int -> ?shuffle:bool -> ?stratify:[> `ArrayLike ] Np.Obj.t -> [> `ArrayLike ] Np.Obj.t list -> [> `ArrayLike ] Np.Obj.t list

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation, ``next(ShuffleSplit().split(X, y))``, and application to the input data into a single call for splitting (and optionally subsampling) data in a one-liner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters ---------- *arrays : sequence of indexables with same length / shape[0] Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

test_size : float or int, default=None If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If ``train_size`` is also None, it will be set to 0.25.

train_size : float or int, default=None If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int or RandomState instance, default=None Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See :term:`Glossary <random_state>`.

shuffle : bool, default=True Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like, default=None If not None, data is split in a stratified fashion, using this as the class labels.

Returns ------- splitting : list, length=2 * len(arrays) List containing train-test split of inputs.

.. versionadded:: 0.16 If the input is sparse, the output will be a ``scipy.sparse.csr_matrix``. Else, output type is the same as the input type.

Examples -------- >>> import numpy as np >>> from sklearn.model_selection import train_test_split >>> X, y = np.arange(10).reshape((5, 2)), range(5) >>> X array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]) >>> list(y) [0, 1, 2, 3, 4]

>>> X_train, X_test, y_train, y_test = train_test_split( ... X, y, test_size=0.33, random_state=42) ... >>> X_train array([[4, 5], [0, 1], [6, 7]]) >>> y_train [2, 0, 3] >>> X_test array([[2, 3], [8, 9]]) >>> y_test [1, 4]

>>> train_test_split(y, shuffle=False) [[0, 1, 2], [3, 4]]
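
The same split through this binding, as a hedged sketch: the argument is a list of ArrayLike values and the result interleaves train/test outputs in the same order, so it can be pattern-matched back into four pieces:

  let split ~x ~y =
    match train_test_split ~test_size:(`F 0.33) ~random_state:42 [x; y] with
    | [x_train; x_test; y_train; y_test] -> (x_train, x_test, y_train, y_test)
    | _ -> invalid_arg "train_test_split: unexpected number of outputs"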

val validation_curve : ?groups:[> `ArrayLike ] Np.Obj.t -> ?cv: [ `BaseCrossValidator of [> `BaseCrossValidator ] Np.Obj.t | `Arr of [> `ArrayLike ] Np.Obj.t | `I of int ] -> ?scoring: [ `Neg_mean_absolute_error | `Completeness_score | `Roc_auc_ovr | `Neg_mean_squared_log_error | `Neg_mean_gamma_deviance | `Precision_macro | `R2 | `Precision_micro | `F1_weighted | `Balanced_accuracy | `Neg_mean_squared_error | `F1_samples | `Jaccard_micro | `Normalized_mutual_info_score | `F1_micro | `Roc_auc | `Mutual_info_score | `Adjusted_rand_score | `Average_precision | `Jaccard | `Homogeneity_score | `Accuracy | `Jaccard_macro | `Jaccard_weighted | `Recall_micro | `Explained_variance | `Precision | `Callable of Py.Object.t | `V_measure_score | `F1 | `Roc_auc_ovo | `Neg_mean_poisson_deviance | `Recall_samples | `Adjusted_mutual_info_score | `Neg_brier_score | `Roc_auc_ovo_weighted | `Recall | `Fowlkes_mallows_score | `Neg_log_loss | `Neg_root_mean_squared_error | `Precision_samples | `F1_macro | `Roc_auc_ovr_weighted | `Recall_weighted | `Neg_median_absolute_error | `Jaccard_samples | `Precision_weighted | `Max_error | `Recall_macro ] -> ?n_jobs:int -> ?pre_dispatch:[ `S of string | `I of int ] -> ?verbose:int -> ?error_score:[ `Raise | `I of int | `F of float ] -> estimator:[> `BaseEstimator ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> y:[ `Arr of [> `ArrayLike ] Np.Obj.t | `None ] -> param_name:string -> param_range:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Validation curve.

Determine training and test scores for varying parameter values.

Compute scores for an estimator with different values of a specified parameter. This is similar to grid search with one parameter. However, this will also compute training scores and is merely a utility for plotting the results.

Read more in the :ref:`User Guide <validation_curve>`.

Parameters ---------- estimator : object type that implements the 'fit' and 'predict' methods An object of that type which is cloned for each validation.

X : array-like of shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples,) or (n_samples, n_outputs) or None Target relative to X for classification or regression; None for unsupervised learning.

param_name : str Name of the parameter that will be varied.

param_range : array-like of shape (n_values,) The values of the parameter that will be evaluated.

groups : array-like of shape (n_samples,), default=None Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a 'Group' :term:`cv` instance (e.g., :class:`GroupKFold`).

cv : int, cross-validation generator or an iterable, default=None Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 5-fold cross validation,
  • int, to specify the number of folds in a `(Stratified)KFold`,
  • :term:`CV splitter`,
  • An iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and ``y`` is either binary or multiclass, :class:`StratifiedKFold` is used. In all other cases, :class:`KFold` is used.

Refer to the :ref:`User Guide <cross_validation>` for the various cross-validation strategies that can be used here.

.. versionchanged:: 0.22 ``cv`` default value if None changed from 3-fold to 5-fold.

scoring : str or callable, default=None A str (see model evaluation documentation) or a scorer callable object / function with signature ``scorer(estimator, X, y)``.

n_jobs : int, default=None Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

pre_dispatch : int or str, default='all' Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The str can be an expression like '2*n_jobs'.

verbose : int, default=0 Controls the verbosity: the higher, the more messages.

error_score : 'raise' or numeric, default=np.nan Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.

.. versionadded:: 0.20

Returns ------- train_scores : array of shape (n_ticks, n_cv_folds) Scores on training sets.

test_scores : array of shape (n_ticks, n_cv_folds) Scores on test set.

Notes ----- See :ref:`sphx_glr_auto_examples_model_selection_plot_validation_curve.py`
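
A hedged sketch of a call, assuming ``estimator``, ``x``, ``y`` and a ``param_range`` ArrayLike of candidate values are built elsewhere; "alpha" is only an illustrative parameter name (use whatever hyper-parameter your estimator actually exposes):

  let alpha_curve ~estimator ~x ~y ~param_range =
    (* returns (train_scores, test_scores), each of shape (n_values, n_cv_folds) *)
    validation_curve ~cv:(`I 5) ~estimator ~x ~y:(`Arr y)
      ~param_name:"alpha" ~param_range ()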