package sklearn

type tag = [
  | `MiniBatchKMeans
]
type t = [ `BaseEstimator | `ClusterMixin | `MiniBatchKMeans | `Object | `TransformerMixin ] Obj.t
val of_pyobject : Py.Object.t -> t
val to_pyobject : [> tag ] Obj.t -> Py.Object.t
val as_transformer : t -> [ `TransformerMixin ] Obj.t
val as_estimator : t -> [ `BaseEstimator ] Obj.t
val as_cluster : t -> [ `ClusterMixin ] Obj.t
val create : ?n_clusters:int -> ?init:[ `K_means_ | `Random | `Arr of [> `ArrayLike ] Np.Obj.t ] -> ?max_iter:int -> ?batch_size:int -> ?verbose:int -> ?compute_labels:bool -> ?random_state:int -> ?tol:float -> ?max_no_improvement:int -> ?init_size:int -> ?n_init:int -> ?reassignment_ratio:float -> unit -> t

Mini-Batch K-Means clustering.

Read more in the :ref:`User Guide <mini_batch_kmeans>`.

Parameters
----------
n_clusters : int, default=8
    The number of clusters to form as well as the number of centroids to generate.

init : 'k-means++', 'random' or ndarray of shape (n_clusters, n_features), default='k-means++'
    Method for initialization:

    'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

    'random': choose k observations (rows) at random from data for the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

max_iter : int, default=100
    Maximum number of iterations over the complete dataset before stopping, independently of any early-stopping heuristic.

batch_size : int, default=100
    Size of the mini-batches.

verbose : int, default=0
    Verbosity mode.

compute_labels : bool, default=True
    Compute label assignment and inertia for the complete dataset once the mini-batch optimization has converged in fit.

random_state : int, RandomState instance, default=None
    Determines random number generation for centroid initialization and random reassignment. Use an int to make the randomness deterministic. See :term:`Glossary <random_state>`.

tol : float, default=0.0
    Controls early stopping based on the relative center changes, as measured by a smoothed, variance-normalized mean of the squared center position changes. This early-stopping heuristic is closer to the one used for the batch variant of the algorithm, but it induces a slight computational and memory overhead over the inertia heuristic.

    To disable convergence detection based on normalized center change, set tol to 0.0 (default).

max_no_improvement : int, default=10
    Controls early stopping based on the number of consecutive mini-batches that do not yield an improvement on the smoothed inertia.

    To disable convergence detection based on inertia, set max_no_improvement to None.

init_size : int, default=None
    Number of samples to randomly sample for speeding up the initialization (sometimes at the expense of accuracy): the algorithm is initialized by running a batch KMeans on a random subset of the data. This needs to be larger than n_clusters.

    If `None`, `init_size = 3 * batch_size`.

n_init : int, default=3
    Number of random initializations that are tried. In contrast to KMeans, the algorithm is only run once, using the best of the ``n_init`` initializations as measured by inertia.

reassignment_ratio : float, default=0.01
    Controls the fraction of the maximum number of counts for a center to be reassigned. A higher value means that low-count centers are more easily reassigned, which means that the model will take longer to converge, but should converge to a better clustering.

Attributes
----------
cluster_centers_ : ndarray of shape (n_clusters, n_features)
    Coordinates of cluster centers.

labels_ : int
    Labels of each point (if compute_labels is set to True).

inertia_ : float
    The value of the inertia criterion associated with the chosen partition (if compute_labels is set to True). The inertia is defined as the sum of squared distances of samples to their closest cluster center.

See Also
--------
KMeans
    The classic implementation of the clustering method based on Lloyd's algorithm. It consumes the whole set of input data at each iteration.

Notes
-----
See https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

Examples
--------
>>> from sklearn.cluster import MiniBatchKMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 0], [4, 4],
...               [4, 5], [0, 1], [2, 2],
...               [3, 2], [5, 5], [1, -1]])
>>> # manually fit on batches
>>> kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6)
>>> kmeans = kmeans.partial_fit(X[0:6,:])
>>> kmeans = kmeans.partial_fit(X[6:12,:])
>>> kmeans.cluster_centers_
array([[2. , 1. ],
       [3.5, 4.5]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> # fit on the whole data
>>> kmeans = MiniBatchKMeans(n_clusters=2,
...                          random_state=0,
...                          batch_size=6,
...                          max_iter=10).fit(X)
>>> kmeans.cluster_centers_
array([[3.95918367, 2.40816327],
       [1.12195122, 1.3902439 ]])
>>> kmeans.predict([[0, 0], [4, 4]])
array([1, 0], dtype=int32)
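
The same example can be sketched with these OCaml bindings. This is a rough, hedged translation: it assumes this module is reachable as Sklearn.Cluster.MiniBatchKMeans and that the np bindings provide Np.matrixf to build a float matrix usable as `ArrayLike; the two mini-batches are built directly instead of slicing X.

  (* Hypothetical OCaml rendering of the docstring example above. *)
  let () =
    let open Sklearn.Cluster in
    (* Np.matrixf (assumed) builds a float matrix from a float array array. *)
    let batch1 = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|1.; 0.|];
                               [|4.; 2.|]; [|4.; 0.|]; [|4.; 4.|] |] in
    let batch2 = Np.matrixf [| [|4.; 5.|]; [|0.; 1.|]; [|2.; 2.|];
                               [|3.; 2.|]; [|5.; 5.|]; [|1.; -1.|] |] in
    (* Manually fit on batches; each partial_fit returns the estimator. *)
    let kmeans = MiniBatchKMeans.create ~n_clusters:2 ~random_state:0 ~batch_size:6 () in
    let kmeans = MiniBatchKMeans.partial_fit ~x:batch1 kmeans in
    let kmeans = MiniBatchKMeans.partial_fit ~x:batch2 kmeans in
    let _centers = MiniBatchKMeans.cluster_centers_ kmeans in
    let _labels =
      MiniBatchKMeans.predict ~x:(Np.matrixf [| [|0.; 0.|]; [|4.; 4.|] |]) kmeans
    in
    Format.printf "%a@." MiniBatchKMeans.pp kmeans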

val fit : ?y:Py.Object.t -> ?sample_weight:[> `ArrayLike ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> t

Compute the centroids on X by chunking it into mini-batches.

Parameters
----------
X : array-like or sparse matrix, shape=(n_samples, n_features)
    Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.

y : Ignored
    Not used, present here for API consistency by convention.

sample_weight : array-like, shape (n_samples,), optional
    The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

    .. versionadded:: 0.20

Returns
-------
self
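
A minimal sketch of a weighted fit, under the same assumptions as the example above (Sklearn.Cluster module path; Np.matrixf and Np.vectorf helpers from the np bindings):

  let () =
    let open Sklearn.Cluster in
    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|4.; 2.|]; [|4.; 4.|] |] in
    (* Give the last observation twice the weight of the others. *)
    let sample_weight = Np.vectorf [| 1.; 1.; 1.; 2. |] in
    let model = MiniBatchKMeans.create ~n_clusters:2 ~random_state:0 () in
    let model = MiniBatchKMeans.fit ~sample_weight ~x model in
    Format.printf "%a@." MiniBatchKMeans.pp model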

val fit_predict : ?y:Py.Object.t -> ?sample_weight:[> `ArrayLike ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> [> `ArrayLike ] Np.Obj.t

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    New data to transform.

y : Ignored
    Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None
    The weights for each observation in X. If None, all observations are assigned equal weight.

Returns
-------
labels : ndarray of shape (n_samples,)
    Index of the cluster each sample belongs to.
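
A one-step sketch (same module-path and Np.matrixf assumptions as above); note that fit_predict returns the labels rather than the fitted estimator:

  let () =
    let open Sklearn.Cluster in
    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|4.; 2.|]; [|4.; 4.|] |] in
    let model = MiniBatchKMeans.create ~n_clusters:2 ~random_state:0 () in
    (* labels has shape (n_samples,): the cluster index of each row of x. *)
    let _labels = MiniBatchKMeans.fit_predict ~x model in
    ()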

val fit_transform : ?y:Py.Object.t -> ?sample_weight:[> `ArrayLike ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> [> `ArrayLike ] Np.Obj.t

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    New data to transform.

y : Ignored
    Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None
    The weights for each observation in X. If None, all observations are assigned equal weight.

Returns
-------
X_new : array of shape (n_samples, n_clusters)
    X transformed in the new space.
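
A sketch of fitting and mapping into cluster-distance space in one step, under the same assumptions as above:

  let () =
    let open Sklearn.Cluster in
    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|4.; 2.|]; [|4.; 4.|] |] in
    let model = MiniBatchKMeans.create ~n_clusters:2 ~random_state:0 () in
    (* x_new has shape (n_samples, n_clusters): the distance of each
       sample to each of the learned cluster centers. *)
    let _x_new = MiniBatchKMeans.fit_transform ~x model in
    ()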

val get_params : ?deep:bool -> [> tag ] Obj.t -> Dict.t

Get parameters for this estimator.

Parameters
----------
deep : bool, default=True
    If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
-------
params : mapping of string to any
    Parameter names mapped to their values.
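
A minimal sketch of reading back the constructor parameters as a Dict.t (module path assumed as above):

  let () =
    let open Sklearn.Cluster in
    let model = MiniBatchKMeans.create ~n_clusters:8 () in
    (* With ~deep:false, parameters of nested sub-estimators are omitted. *)
    let _params = MiniBatchKMeans.get_params ~deep:false model in
    ()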

val partial_fit : ?y:Py.Object.t -> ?sample_weight:[> `ArrayLike ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> t

Update k-means estimate on a single mini-batch X.

Parameters
----------
X : array-like of shape (n_samples, n_features)
    Coordinates of the data points to cluster. It must be noted that X will be copied if it is not C-contiguous.

y : Ignored
    Not used, present here for API consistency by convention.

sample_weight : array-like, shape (n_samples,), optional
    The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns
-------
self
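
Because partial_fit returns the estimator, streaming a sequence of mini-batches is naturally a fold. A sketch under the same Sklearn.Cluster / Np.matrixf assumptions as above:

  let () =
    let open Sklearn.Cluster in
    let batches =
      [ Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|1.; 0.|] |];
        Np.matrixf [| [|4.; 2.|]; [|4.; 0.|]; [|4.; 4.|] |] ]
    in
    (* Each partial_fit call refines the centroid estimate on one batch
       and returns the estimator, so a fold threads it through. *)
    let model =
      List.fold_left
        (fun m batch -> MiniBatchKMeans.partial_fit ~x:batch m)
        (MiniBatchKMeans.create ~n_clusters:2 ~random_state:0 ())
        batches
    in
    Format.printf "%a@." MiniBatchKMeans.pp model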

val predict : ?sample_weight:[> `ArrayLike ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> [> `ArrayLike ] Np.Obj.t

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, `cluster_centers_` is called the code book and each value returned by `predict` is the index of the closest code in the code book.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    New data to predict.

sample_weight : array-like, shape (n_samples,), optional
    The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns
-------
labels : array, shape [n_samples,]
    Index of the cluster each sample belongs to.
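
In code-book terms, predict is the encoder. A tiny sketch (module path assumed as above):

  (* Encode points as the index of their nearest code-book entry. *)
  let encode model points =
    Sklearn.Cluster.MiniBatchKMeans.predict ~x:points model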

val score : ?y:Py.Object.t -> ?sample_weight:[> `ArrayLike ] Np.Obj.t -> x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> float

Opposite of the value of X on the K-means objective.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    New data.

y : Ignored
    Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None
    The weights for each observation in X. If None, all observations are assigned equal weight.

Returns
-------
score : float
    Opposite of the value of X on the K-means objective.
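
A sketch of scoring held-out points against the fitted centers, under the same assumptions as above:

  let () =
    let open Sklearn.Cluster in
    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|4.; 2.|]; [|4.; 4.|] |] in
    let x_test = Np.matrixf [| [|0.; 0.|]; [|4.; 4.|] |] in
    let model = MiniBatchKMeans.create ~n_clusters:2 ~random_state:0 () in
    let model = MiniBatchKMeans.fit ~x model in
    (* The score is the negated K-means objective, so values closer to
       zero indicate a better fit of the held-out points. *)
    Printf.printf "score: %f\n" (MiniBatchKMeans.score ~x:x_test model)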

val set_params : ?params:(string * Py.Object.t) list -> [> tag ] Obj.t -> t

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form ``<component>__<parameter>`` so that it's possible to update each component of a nested object.

Parameters
----------
**params : dict
    Estimator parameters.

Returns
-------
self : object
    Estimator instance.
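
Since the signature takes a (string * Py.Object.t) list, parameter values are raw Python objects. A sketch assuming pyml's Py.Int.of_int for building them (module path assumed as above):

  let () =
    let open Sklearn.Cluster in
    let model = MiniBatchKMeans.create ~n_clusters:8 () in
    (* Parameter values are raw Python objects built via pyml. *)
    let _model =
      MiniBatchKMeans.set_params
        ~params:[ "batch_size", Py.Int.of_int 32;
                  "max_iter", Py.Int.of_int 200 ]
        model
    in
    ()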

val transform : x:[> `ArrayLike ] Np.Obj.t -> [> tag ] Obj.t -> [> `ArrayLike ] Np.Obj.t

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by `transform` will typically be dense.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    New data to transform.

Returns
-------
X_new : ndarray of shape (n_samples, n_clusters)
    X transformed in the new space.
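
A sketch of transforming new points into distances to the learned centers, under the same assumptions as above:

  let () =
    let open Sklearn.Cluster in
    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|4.; 2.|]; [|4.; 4.|] |] in
    let model = MiniBatchKMeans.fit ~x (MiniBatchKMeans.create ~n_clusters:2 ()) in
    (* Each row of _dists holds one point's distance to every center. *)
    let _dists = MiniBatchKMeans.transform ~x:(Np.matrixf [| [|0.; 0.|] |]) model in
    ()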

val cluster_centers_ : t -> [> `ArrayLike ] Np.Obj.t

Attribute cluster_centers_: get value or raise Not_found if None.

val cluster_centers_opt : t -> [> `ArrayLike ] Np.Obj.t option

Attribute cluster_centers_: get value as an option.

val labels_ : t -> int

Attribute labels_: get value or raise Not_found if None.

val labels_opt : t -> int option

Attribute labels_: get value as an option.

val inertia_ : t -> float

Attribute inertia_: get value or raise Not_found if None.

val inertia_opt : t -> float option

Attribute inertia_: get value as an option.
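
A sketch contrasting the two accessor styles (module path assumed as above): the plain accessor raises Not_found when the attribute is unset, while the _opt variant returns an option.

  let report model =
    match Sklearn.Cluster.MiniBatchKMeans.inertia_opt model with
    | Some inertia -> Printf.printf "inertia: %f\n" inertia
    | None -> print_endline "inertia_ not available (model not fitted?)"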

val to_string : t -> string

Return a human-readable string representation of the object.

val show : t -> string

Return a human-readable string representation of the object.

val pp : Stdlib.Format.formatter -> t -> unit

Pretty-print the object to a formatter.