package sklearn

val get_py : string -> Py.Object.t

Get an attribute of this module as a Py.Object.t. This is useful to pass a Python function to another function.
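
For instance, a minimal sketch (the enclosing module is assumed to be `Sklearn.Cluster`):

    (* Fetch the underlying Python callable for this module's k_means,
       e.g. to hand it to other Python-level code via pyml. *)
    let py_k_means : Py.Object.t = Sklearn.Cluster.get_py "k_means"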

module AffinityPropagation : sig ... end
module AgglomerativeClustering : sig ... end
module Birch : sig ... end
module DBSCAN : sig ... end
module FeatureAgglomeration : sig ... end
module KMeans : sig ... end
module MeanShift : sig ... end
module MiniBatchKMeans : sig ... end
module OPTICS : sig ... end
module SpectralBiclustering : sig ... end
module SpectralClustering : sig ... end
module SpectralCoclustering : sig ... end
val affinity_propagation : ?preference:[> `ArrayLike ] Np.Obj.t -> ?convergence_iter:int -> ?max_iter:int -> ?damping:float -> ?copy:bool -> ?verbose:int -> ?return_n_iter:bool -> ?random_state:int -> s:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * int

Perform Affinity Propagation Clustering of data

Read more in the :ref:`User Guide <affinity_propagation>`.

Parameters ----------

S : array-like, shape (n_samples, n_samples) Matrix of similarities between points

preference : array-like, shape (n_samples,) or float, optional Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities (resulting in a moderate number of clusters). For a smaller amount of clusters, this can be set to the minimum value of the similarities.

convergence_iter : int, optional, default: 15 Number of iterations with no change in the number of estimated clusters that stops the convergence.

max_iter : int, optional, default: 200 Maximum number of iterations

damping : float, optional, default: 0.5 Damping factor between 0.5 and 1.

copy : boolean, optional, default: True If copy is False, the affinity matrix is modified inplace by the algorithm, for memory efficiency

verbose : boolean, optional, default: False The verbosity level

return_n_iter : bool, default False Whether or not to return the number of iterations.

random_state : int or np.random.RandomState instance, default: 0 Pseudo-random number generator to control the starting state. Use an int for reproducible results across function calls. See the :term:`Glossary <random_state>`.

.. versionadded:: 0.23 this parameter was previously hardcoded as 0.

Returns -------

cluster_centers_indices : array, shape (n_clusters,) Indices of cluster centers.

labels : array, shape (n_samples,) Cluster labels for each point.

n_iter : int number of iterations run. Returned only if `return_n_iter` is set to True.

Notes -----

For an example, see :ref:`examples/cluster/plot_affinity_propagation.py <sphx_glr_auto_examples_cluster_plot_affinity_propagation.py>`.

When the algorithm does not converge, it returns an empty array as ``cluster_centers_indices`` and ``-1`` as label for each training sample.

When all training samples have equal similarities and equal preferences, the assignment of cluster centers and labels depends on the preference. If the preference is smaller than the similarities, a single cluster center and label ``0`` for every sample will be returned. Otherwise, every training sample becomes its own cluster center and is assigned a unique label.

References ----------

Brendan J. Frey and Delbert Dueck, 'Clustering by Passing Messages Between Data Points', Science Feb. 2007
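
Examples --------

A minimal sketch of calling this function through the binding, using the signature above. The `Np.matrixf` array constructor is an assumed helper name; substitute whatever your version of `Np` provides for building an `ArrayLike` from floats.

    let s = Np.matrixf [| [|  0.; -3.; -10.|];
                          [| -3.;  0.;  -9.|];
                          [|-10.; -9.;   0.|] |] in
    (* s is a symmetric similarity matrix; n_iter is meaningful because
       return_n_iter is set to true. *)
    let cluster_centers_indices, labels, n_iter =
      affinity_propagation ~s ~return_n_iter:true ()
    in
    ignore (cluster_centers_indices, labels);
    Printf.printf "converged after %d iterations\n" n_iter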

val cluster_optics_dbscan : reachability:[> `ArrayLike ] Np.Obj.t -> core_distances:[> `ArrayLike ] Np.Obj.t -> ordering:[> `ArrayLike ] Np.Obj.t -> eps:float -> unit -> [> `ArrayLike ] Np.Obj.t

Performs DBSCAN extraction for an arbitrary epsilon.

Extracting the clusters runs in linear time. Note that this results in ``labels_`` which are close to a :class:`~sklearn.cluster.DBSCAN` with similar settings and ``eps``, only if ``eps`` is close to ``max_eps``.

Parameters ----------

reachability : array, shape (n_samples,) Reachability distances calculated by OPTICS (``reachability_``)

core_distances : array, shape (n_samples,) Distances at which points become core (``core_distances_``)

ordering : array, shape (n_samples,) OPTICS ordered point indices (``ordering_``)

eps : float DBSCAN ``eps`` parameter. Must be set to < ``max_eps``. Results will be close to DBSCAN algorithm if ``eps`` and ``max_eps`` are close to one another.

Returns -------

labels_ : array, shape (n_samples,) The estimated labels.
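
Examples --------

A hedged sketch: re-extract DBSCAN-style labels at a new eps from an already fitted OPTICS model. The attribute accessors (`OPTICS.reachability_`, `OPTICS.core_distances_`, `OPTICS.ordering_`) are assumed to follow this binding's usual `Module.attr_ model` convention.

    let labels_at_eps ~(model : OPTICS.t) ~eps =
      cluster_optics_dbscan
        ~reachability:(OPTICS.reachability_ model)
        ~core_distances:(OPTICS.core_distances_ model)
        ~ordering:(OPTICS.ordering_ model)
        ~eps ()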

val cluster_optics_xi : ?min_cluster_size:[ `I of int | `F of float ] -> ?xi:[ `Between_0_and_1 of Py.Object.t | `F of float ] -> ?predecessor_correction:bool -> reachability:[> `ArrayLike ] Np.Obj.t -> predecessor:[> `ArrayLike ] Np.Obj.t -> ordering:[> `ArrayLike ] Np.Obj.t -> min_samples:[ `I of int | `F of float ] -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Automatically extract clusters according to the Xi-steep method.

Parameters ----------

reachability : array, shape (n_samples,) Reachability distances calculated by OPTICS (`reachability_`)

predecessor : array, shape (n_samples,) Predecessors calculated by OPTICS.

ordering : array, shape (n_samples,) OPTICS ordered point indices (`ordering_`)

min_samples : int > 1 or float between 0 and 1 The same as the min_samples given to OPTICS. Up and down steep regions can't have more than ``min_samples`` consecutive non-steep points. Expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2).

min_cluster_size : int > 1 or float between 0 and 1 (default=None) Minimum number of samples in an OPTICS cluster, expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2). If ``None``, the value of ``min_samples`` is used instead.

xi : float, between 0 and 1, optional (default=0.05) Determines the minimum steepness on the reachability plot that constitutes a cluster boundary. For example, an upwards point in the reachability plot is defined by the ratio from one point to its successor being at most 1-xi.

predecessor_correction : bool, optional (default=True) Correct clusters based on the calculated predecessors.

Returns -------

labels : array, shape (n_samples) The labels assigned to samples. Points which are not included in any cluster are labeled as -1.

clusters : array, shape (n_clusters, 2) The list of clusters in the form of ``start, end`` in each row, with all indices inclusive. The clusters are ordered according to ``(end, -start)`` (ascending) so that larger clusters encompassing smaller clusters come after such nested smaller clusters. Since ``labels`` does not reflect the hierarchy, usually ``len(clusters) > np.unique(labels)``.
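
Examples --------

A hedged sketch along the same lines as cluster_optics_dbscan above, again assuming `OPTICS.reachability_`, `OPTICS.predecessor_` and `OPTICS.ordering_` accessors on a fitted model:

    (* Returns (labels, clusters): one label per sample (-1 for noise),
       and (start, end) index pairs, one row per extracted cluster. *)
    let xi_clusters (model : OPTICS.t) =
      cluster_optics_xi ~xi:(`F 0.05)
        ~reachability:(OPTICS.reachability_ model)
        ~predecessor:(OPTICS.predecessor_ model)
        ~ordering:(OPTICS.ordering_ model)
        ~min_samples:(`I 5) ()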

val compute_optics_graph : x:[> `ArrayLike ] Np.Obj.t -> min_samples:[ `I of int | `F of float ] -> max_eps:float -> metric:[ `S of string | `Callable of Py.Object.t ] -> p:int -> metric_params:Dict.t -> algorithm:[ `Auto | `Ball_tree | `Kd_tree | `Brute ] -> leaf_size:int -> n_jobs:[ `I of int | `None ] -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Computes the OPTICS reachability graph.

Read more in the :ref:`User Guide <optics>`.

Parameters ----------

X : array, shape (n_samples, n_features), or (n_samples, n_samples) if metric='precomputed'. A feature array, or array of distances between samples if metric='precomputed'.

min_samples : int > 1 or float between 0 and 1 The number of samples in a neighborhood for a point to be considered as a core point. Expressed as an absolute number or a fraction of the number of samples (rounded to be at least 2).

max_eps : float, optional (default=np.inf) The maximum distance between two samples for one to be considered as in the neighborhood of the other. Default value of ``np.inf`` will identify clusters across all scales; reducing ``max_eps`` will result in shorter run times.

metric : string or callable, optional (default='minkowski') Metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy's metrics, but is less efficient than passing the metric name as a string. If metric is 'precomputed', X is assumed to be a distance matrix and must be square.

Valid values for metric are:

  • from scikit-learn: 'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'
  • from scipy.spatial.distance: 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'

See the documentation for scipy.spatial.distance for details on these metrics.

p : integer, optional (default=2) Parameter for the Minkowski metric from :class:`sklearn.metrics.pairwise_distances`. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric_params : dict, optional (default=None) Additional keyword arguments for the metric function.

algorithm : 'auto', 'ball_tree', 'kd_tree', 'brute', optional Algorithm used to compute the nearest neighbors:

  • 'ball_tree' will use :class:`BallTree`
  • 'kd_tree' will use :class:`KDTree`
  • 'brute' will use a brute-force search.
  • 'auto' will attempt to decide the most appropriate algorithm based on the values passed to :meth:`fit` method. (default)

Note: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_size : int, optional (default=30) Leaf size passed to :class:`BallTree` or :class:`KDTree`. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

n_jobs : int or None, optional (default=None) The number of parallel jobs to run for neighbors search. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

Returns -------

ordering_ : array, shape (n_samples,) The cluster ordered list of sample indices.

core_distances_ : array, shape (n_samples,) Distance at which each sample becomes a core point, indexed by object order. Points which will never be core have a distance of inf. Use ``clust.core_distances_[clust.ordering_]`` to access in cluster order.

reachability_ : array, shape (n_samples,) Reachability distances per sample, indexed by object order. Use ``clust.reachability_[clust.ordering_]`` to access in cluster order.

predecessor_ : array, shape (n_samples,) Point that a sample was reached from, indexed by object order. Seed points have a predecessor of -1.

References ----------

.. [1] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. 'OPTICS: ordering points to identify the clustering structure.' ACM SIGMOD Record 28, no. 2 (1999): 49-60.
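
Examples --------

In this binding every argument is mandatory, so a sketch has to spell out the scikit-learn defaults. `Np.matrixf` and `Dict.of_param_list` are assumed helper names for building the array and an empty keyword dict.

    let x = Np.matrixf [| [|1.; 2.|]; [|2.; 5.|]; [|3.; 6.|];
                          [|8.; 7.|]; [|8.; 8.|]; [|7.; 3.|] |] in
    let ordering, core_distances, reachability, predecessor =
      compute_optics_graph ~x ~min_samples:(`I 2) ~max_eps:infinity
        ~metric:(`S "minkowski") ~p:2
        ~metric_params:(Dict.of_param_list [])  (* assumed constructor *)
        ~algorithm:`Auto ~leaf_size:30 ~n_jobs:`None ()
    in
    ignore (ordering, core_distances, reachability, predecessor)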

val dbscan : ?eps:float -> ?min_samples:int -> ?metric:[ `S of string | `Callable of Py.Object.t ] -> ?metric_params:Dict.t -> ?algorithm:[ `Auto | `Ball_tree | `Kd_tree | `Brute ] -> ?leaf_size:int -> ?p:float -> ?sample_weight:[> `ArrayLike ] Np.Obj.t -> ?n_jobs:int -> x:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Perform DBSCAN clustering from vector array or distance matrix.

Read more in the :ref:`User Guide <dbscan>`.

Parameters ----------

X : {array-like, sparse (CSR) matrix} of shape (n_samples, n_features) or (n_samples, n_samples) A feature array, or array of distances between samples if ``metric='precomputed'``.

eps : float, default=0.5 The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.

min_samples : int, default=5 The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

metric : string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by :func:`sklearn.metrics.pairwise_distances` for its metric parameter. If metric is 'precomputed', X is assumed to be a distance matrix and must be square during fit. X may be a :term:`sparse graph <sparse graph>`, in which case only 'nonzero' elements may be considered neighbors.

metric_params : dict, default=None Additional keyword arguments for the metric function.

.. versionadded:: 0.19

algorithm : 'auto', 'ball_tree', 'kd_tree', 'brute', default='auto' The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.

leaf_size : int, default=30 Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : float, default=2 The power of the Minkowski metric to be used to calculate distance between points.

sample_weight : array-like of shape (n_samples,), default=None Weight of each sample, such that a sample with a weight of at least ``min_samples`` is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

n_jobs : int, default=None The number of parallel jobs to run for neighbors search. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details. If precomputed distance are used, parallel execution is not available and thus n_jobs will have no effect.

Returns -------

core_samples : ndarray of shape (n_core_samples,) Indices of core samples.

labels : ndarray of shape (n_samples,) Cluster labels for each point. Noisy samples are given the label -1.

See also --------

DBSCAN : An estimator interface for this clustering algorithm.

OPTICS : A similar estimator interface clustering at multiple values of eps. Our implementation is optimized for memory usage.

Notes -----

For an example, see :ref:`examples/cluster/plot_dbscan.py <sphx_glr_auto_examples_cluster_plot_dbscan.py>`.

This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n). It may attract a higher memory complexity when querying these nearest neighborhoods, depending on the ``algorithm``.

One way to avoid the query complexity is to pre-compute sparse neighborhoods in chunks using :func:`NearestNeighbors.radius_neighbors_graph <sklearn.neighbors.NearestNeighbors.radius_neighbors_graph>` with ``mode='distance'``, then using ``metric='precomputed'`` here.

Another way to reduce memory and computation time is to remove (near-)duplicate points and use ``sample_weight`` instead.

:func:`cluster.optics <sklearn.cluster.optics>` provides a similar clustering with lower memory usage.

References ----------

Ester, M., H. P. Kriegel, J. Sander, and X. Xu, 'A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise'. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
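
Examples --------

A minimal sketch mirroring the scikit-learn doctest data (`Np.matrixf` is an assumed array helper):

    let x = Np.matrixf [| [|1.; 2.|]; [|2.; 2.|]; [|2.; 3.|];
                          [|8.; 7.|]; [|8.; 8.|]; [|25.; 80.|] |] in
    (* core_samples: indices of core points; labels: one cluster id per
       point, with -1 for the (25, 80) outlier. *)
    let core_samples, labels = dbscan ~eps:3.0 ~min_samples:2 ~x () in
    ignore (core_samples, labels)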

val estimate_bandwidth : ?quantile:float -> ?n_samples:int -> ?random_state:int -> ?n_jobs:int -> x:[> `ArrayLike ] Np.Obj.t -> unit -> float

Estimate the bandwidth to use with the mean-shift algorithm.

Note that this function takes time at least quadratic in n_samples. For large datasets, it is wise to set the n_samples parameter to a small value.

Parameters ----------

X : array-like of shape (n_samples, n_features) Input points.

quantile : float, default=0.3 Should be between 0 and 1; 0.5 means that the median of all pairwise distances is used.

n_samples : int, default=None The number of samples to use. If not given, all samples are used.

random_state : int, RandomState instance, default=None The generator used to randomly select the samples from input points for bandwidth estimation. Use an int to make the randomness deterministic. See :term:`Glossary <random_state>`.

n_jobs : int, default=None The number of parallel jobs to run for neighbors search. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

Returns -------

bandwidth : float The bandwidth parameter.
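
Examples --------

A short sketch; the result feeds directly into mean_shift below (`Np.matrixf` assumed as elsewhere on this page):

    let x = Np.matrixf [| [|1.; 1.|]; [|2.; 1.|]; [|1.; 0.|];
                          [|4.; 7.|]; [|3.; 5.|]; [|3.; 6.|] |] in
    let bandwidth = estimate_bandwidth ~quantile:0.5 ~x () in
    Printf.printf "estimated bandwidth: %g\n" bandwidth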

val get_bin_seeds : ?min_bin_freq:int -> x:[> `ArrayLike ] Np.Obj.t -> bin_size:float -> unit -> [> `ArrayLike ] Np.Obj.t

Finds seeds for mean_shift.

Finds seeds by first binning data onto a grid whose lines are spaced bin_size apart, and then choosing those bins with at least min_bin_freq points.

Parameters ----------

X : array-like of shape (n_samples, n_features) Input points, the same points that will be used in mean_shift.

bin_size : float Controls the coarseness of the binning. Smaller values lead to more seeding (which is computationally more expensive). If you're not sure how to set this, set it to the value of the bandwidth used in clustering.mean_shift.

min_bin_freq : int, default=1 Only bins with at least min_bin_freq will be selected as seeds. Raising this value decreases the number of seeds found, which makes mean_shift computationally cheaper.

Returns -------

bin_seeds : array-like of shape (n_samples, n_features) Points used as initial kernel positions in clustering.mean_shift.
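
Examples --------

A small sketch: bin a few toy points on a unit grid (`Np.matrixf` assumed):

    let x = Np.matrixf [| [|1. ; 1. |]; [|1.5; 1.5|]; [|1.8; 1.2|];
                          [|2. ; 1. |]; [|2.1; 1.1|]; [|0. ; 0. |] |] in
    (* Bins of width 1.0 holding at least one point become seeds. *)
    let bin_seeds = get_bin_seeds ~min_bin_freq:1 ~x ~bin_size:1.0 () in
    ignore bin_seeds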

val k_means : ?sample_weight:[> `ArrayLike ] Np.Obj.t -> ?init: [ `K_means_ | `Random | `Callable of Py.Object.t | `Arr of [> `ArrayLike ] Np.Obj.t ] -> ?precompute_distances:[ `Auto | `Bool of bool ] -> ?n_init:int -> ?max_iter:int -> ?verbose:int -> ?tol:float -> ?random_state:int -> ?copy_x:bool -> ?n_jobs:int -> ?algorithm:[ `Auto | `Full | `Elkan ] -> ?return_n_iter:bool -> x:Py.Object.t -> n_clusters:int -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t * float * int

K-means clustering algorithm.

Read more in the :ref:`User Guide <k_means>`.

Parameters ----------

X : {array-like, sparse matrix} of shape (n_samples, n_features) The observations to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.

n_clusters : int The number of clusters to form as well as the number of centroids to generate.

sample_weight : array-like of shape (n_samples,), default=None The weights for each observation in X. If None, all observations are assigned equal weight.

init : 'k-means++', 'random', ndarray, callable, default='k-means++' Method for initialization:

'k-means++' : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

'random': choose `n_clusters` observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

precompute_distances : 'auto', True, False Precompute distances (faster but takes more memory).

'auto' : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances

.. deprecated:: 0.23 'precompute_distances' was deprecated in version 0.23 and will be removed in 0.25. It has no effect.

n_init : int, default=10 Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default=300 Maximum number of iterations of the k-means algorithm to run.

verbose : bool, default=False Verbosity mode.

tol : float, default=1e-4 Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.

random_state : int, RandomState instance, default=None Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See :term:`Glossary <random_state>`.

copy_x : bool, default=True When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.

n_jobs : int, default=None The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center.

``None`` or ``-1`` means using all processors.

.. deprecated:: 0.23 ``n_jobs`` was deprecated in version 0.23 and will be removed in 0.25.

algorithm : 'auto', 'full', 'elkan', default='auto' K-means algorithm to use. The classical EM-style algorithm is 'full'. The 'elkan' variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it's more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

For now 'auto' (kept for backward compatibility) chooses 'elkan' but it might change in the future for a better heuristic.

return_n_iter : bool, default=False Whether or not to return the number of iterations.

Returns -------

centroid : ndarray of shape (n_clusters, n_features) Centroids found at the last iteration of k-means.

label : ndarray of shape (n_samples,) label[i] is the code or index of the centroid the i'th observation is closest to.

inertia : float The final value of the inertia criterion (sum of squared distances to the closest centroid for all observations in the training set).

best_n_iter : int Number of iterations corresponding to the best results. Returned only if `return_n_iter` is set to True.
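
Examples --------

A minimal sketch. Note that this binding takes `x` as a raw `Py.Object.t`, so the array is passed through `Np.Obj.to_pyobject` (both `Np.matrixf` and `Np.Obj.to_pyobject` are assumed helper names):

    let x = Np.matrixf [| [| 1.; 2.|]; [| 1.; 4.|]; [| 1.; 0.|];
                          [|10.; 2.|]; [|10.; 4.|]; [|10.; 0.|] |] in
    let centroids, labels, inertia, best_n_iter =
      k_means ~return_n_iter:true ~x:(Np.Obj.to_pyobject x) ~n_clusters:2 ()
    in
    ignore (centroids, labels);
    Printf.printf "inertia %g after %d iterations\n" inertia best_n_iter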

val linkage_tree : ?connectivity:[> `Spmatrix ] Np.Obj.t -> ?n_clusters:int -> ?linkage:[ `Average | `Complete | `Single ] -> ?affinity:[ `S of string | `Callable of Py.Object.t ] -> x:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * int * int * [> `ArrayLike ] Np.Obj.t option * [> `ArrayLike ] Np.Obj.t

Linkage agglomerative clustering based on a Feature matrix.

The inertia matrix uses a Heapq-based representation.

This is the structured version, that takes into account some topological structure between samples.

Read more in the :ref:`User Guide <hierarchical_clustering>`.

Parameters ----------

X : array, shape (n_samples, n_features) Feature matrix representing n_samples samples to be clustered.

connectivity : sparse matrix (optional) Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. The matrix is assumed to be symmetric and only the upper triangular half is used. Default is None, i.e., the Ward algorithm is unstructured.

n_clusters : int (optional) Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. In this case, the complete tree is not computed, thus the 'children' output is of limited use, and the 'parents' output should rather be used. This option is valid only when specifying a connectivity matrix.

linkage : 'average', 'complete', 'single', optional, default: 'complete' Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations.

  • average uses the average of the distances of each observation of the two sets
  • complete or maximum linkage uses the maximum distances between all observations of the two sets.
  • single uses the minimum of the distances between all observations of the two sets.

affinity : string or callable, optional, default: 'euclidean' Which metric to use. Can be 'euclidean', 'manhattan', or any distance known to paired distances (see metrics.pairwise).

return_distance : bool, default False Whether or not to return the distances between the clusters.

Returns -------

children : 2D array, shape (n_nodes-1, 2) The children of each non-leaf node. Values less than `n_samples` correspond to leaves of the tree which are the original samples. A node `i` greater than or equal to `n_samples` is a non-leaf node and has children `children_[i - n_samples]`. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node `n_samples + i`.

n_connected_components : int The number of connected components in the graph.

n_leaves : int The number of leaves in the tree.

parents : 1D array, shape (n_nodes, ) or None The parent of each node. Only returned when a connectivity matrix is specified, elsewhere 'None' is returned.

distances : ndarray, shape (n_nodes-1,) Returned when return_distance is set to True.

distances[i] refers to the distance between children[i][0] and children[i][1] when they are merged.

See also --------

ward_tree : hierarchical clustering with ward linkage
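
Examples --------

A hedged sketch of the unstructured case (no connectivity matrix), with `Np.matrixf` assumed as elsewhere on this page:

    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|1.; 0.|];
                          [|4.; 2.|]; [|4.; 4.|]; [|4.; 0.|] |] in
    let children, n_connected_components, n_leaves, parents, distances =
      linkage_tree ~linkage:`Average ~x ()
    in
    ignore (children, distances);
    (* parents is None here because no connectivity matrix was given. *)
    (match parents with
     | None -> Printf.printf "%d leaves, %d connected component(s)\n"
                 n_leaves n_connected_components
     | Some _ -> ())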

val mean_shift : ?bandwidth:float -> ?seeds:[> `ArrayLike ] Np.Obj.t -> ?bin_seeding:bool -> ?min_bin_freq:int -> ?cluster_all:bool -> ?max_iter:int -> ?n_jobs:int -> x:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * [> `ArrayLike ] Np.Obj.t

Perform mean shift clustering of data using a flat kernel.

Read more in the :ref:`User Guide <mean_shift>`.

Parameters ----------

X : array-like of shape (n_samples, n_features) Input data.

bandwidth : float, default=None Kernel bandwidth.

If bandwidth is not given, it is determined using a heuristic based on the median of all pairwise distances. This will take quadratic time in the number of samples. The sklearn.cluster.estimate_bandwidth function can be used to do this more efficiently.

seeds : array-like of shape (n_seeds, n_features) or None Points used as initial kernel locations. If None and bin_seeding=False, each data point is used as a seed. If None and bin_seeding=True, see bin_seeding.

bin_seeding : boolean, default=False If true, initial kernel locations are not locations of all points, but rather the location of the discretized version of points, where points are binned onto a grid whose coarseness corresponds to the bandwidth. Setting this option to True will speed up the algorithm because fewer seeds will be initialized. Ignored if seeds argument is not None.

min_bin_freq : int, default=1 To speed up the algorithm, accept only those bins with at least min_bin_freq points as seeds.

cluster_all : bool, default=True If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.

max_iter : int, default=300 Maximum number of iterations per seed point before the clustering operation terminates (for that seed point), if it has not converged yet.

n_jobs : int, default=None The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

.. versionadded:: 0.17 Parallel Execution using *n_jobs*.

Returns -------

cluster_centers : array, shape [n_clusters, n_features] Coordinates of cluster centers.

labels : array, shape [n_samples] Cluster labels for each point.

Notes -----

For an example, see :ref:`examples/cluster/plot_mean_shift.py <sphx_glr_auto_examples_cluster_plot_mean_shift.py>`.
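
Examples --------

A sketch chaining estimate_bandwidth into mean_shift (`Np.matrixf` assumed):

    let x = Np.matrixf [| [|1.; 1.|]; [|2.; 1.|]; [|1.; 0.|];
                          [|4.; 7.|]; [|3.; 5.|]; [|3.; 6.|] |] in
    let bandwidth = estimate_bandwidth ~quantile:0.5 ~x () in
    let cluster_centers, labels = mean_shift ~bandwidth ~x () in
    ignore (cluster_centers, labels)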

val spectral_clustering : ?n_clusters:int -> ?n_components:int -> ?eigen_solver:[ `Arpack | `PyObject of Py.Object.t | `Lobpcg ] -> ?random_state:int -> ?n_init:int -> ?eigen_tol:float -> ?assign_labels:[ `Kmeans | `Discretize ] -> affinity:[> `ArrayLike ] Np.Obj.t -> unit -> Py.Object.t

Apply clustering to a projection of the normalized Laplacian.

In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster. For instance, when clusters are nested circles on the 2D plane.

If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

Read more in the :ref:`User Guide <spectral_clustering>`.

Parameters ----------

affinity : array-like or sparse matrix, shape: (n_samples, n_samples) The affinity matrix describing the relationship of the samples to embed. **Must be symmetric**.

Possible examples:

  • adjacency matrix of a graph,
  • heat kernel of the pairwise distance matrix of the samples,
  • symmetric k-nearest neighbours connectivity matrix of the samples.

n_clusters : integer, optional Number of clusters to extract.

n_components : integer, optional, default is n_clusters Number of eigen vectors to use for the spectral embedding

eigen_solver : None, 'arpack', 'lobpcg', or 'amg' The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities

random_state : int, RandomState instance, default=None A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization. Use an int to make the randomness deterministic. See :term:`Glossary <random_state>`.

n_init : int, optional, default: 10 Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.

eigen_tol : float, optional, default: 0.0 Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.

assign_labels : 'kmeans', 'discretize', default: 'kmeans' The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization. See the 'Multiclass spectral clustering' paper referenced below for more details on the discretization approach.

Returns -------

labels : array of integers, shape: n_samples The labels of the clusters.

References ----------

  • Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
  • A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
  • Multiclass spectral clustering, 2003 Stella X. Yu, Jianbo Shi https://www1.icsi.berkeley.edu/~stellayu/publication/doc/2003kwayICCV.pdf

Notes -----

The graph should contain only one connected component; otherwise the results make little sense.

This algorithm solves the normalized cut for k=2: it is a normalized spectral clustering.
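
Examples --------

A hedged sketch clustering directly from a tiny hand-written symmetric affinity matrix (`Np.matrixf` assumed). Note that in this binding the labels come back as a raw `Py.Object.t`.

    let affinity = Np.matrixf [| [|1.; 1.; 0.|];
                                 [|1.; 1.; 0.|];
                                 [|0.; 0.; 1.|] |] in
    let labels : Py.Object.t =
      spectral_clustering ~n_clusters:2 ~assign_labels:`Discretize
        ~random_state:0 ~affinity ()
    in
    ignore labels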

val ward_tree : ?connectivity:[> `Spmatrix ] Np.Obj.t -> ?n_clusters:int -> x:[> `ArrayLike ] Np.Obj.t -> unit -> [> `ArrayLike ] Np.Obj.t * int * int * [> `ArrayLike ] Np.Obj.t option * [> `ArrayLike ] Np.Obj.t * Py.Object.t

Ward clustering based on a Feature matrix.

Recursively merges the pair of clusters that minimally increases within-cluster variance.

The inertia matrix uses a Heapq-based representation.

This is the structured version, that takes into account some topological structure between samples.

Read more in the :ref:`User Guide <hierarchical_clustering>`.

Parameters ----------

X : array, shape (n_samples, n_features) Feature matrix representing n_samples samples to be clustered.

connectivity : sparse matrix (optional) Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. The matrix is assumed to be symmetric and only the upper triangular half is used. Default is None, i.e., the Ward algorithm is unstructured.

n_clusters : int (optional) Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. In this case, the complete tree is not computed, thus the 'children' output is of limited use, and the 'parents' output should rather be used. This option is valid only when specifying a connectivity matrix.

return_distance : bool (optional) If True, return the distance between the clusters.

Returns -------

children : 2D array, shape (n_nodes-1, 2) The children of each non-leaf node. Values less than `n_samples` correspond to leaves of the tree which are the original samples. A node `i` greater than or equal to `n_samples` is a non-leaf node and has children `children_[i - n_samples]`. Alternatively at the i-th iteration, children[i][0] and children[i][1] are merged to form node `n_samples + i`.

n_connected_components : int The number of connected components in the graph.

n_leaves : int The number of leaves in the tree

parents : 1D array, shape (n_nodes, ) or None The parent of each node. Only returned when a connectivity matrix is specified, elsewhere 'None' is returned.

distances : 1D array, shape (n_nodes-1,) Only returned if return_distance is set to True (for compatibility). The distances between the centers of the nodes. `distances[i]` corresponds to a weighted euclidean distance between the nodes `children[i, 1]` and `children[i, 2]`. If the nodes refer to leaves of the tree, then `distances[i]` is their unweighted euclidean distance. Distances are updated in the following way (from scipy.hierarchy.linkage):

The new entry :math:`d(u,v)` is computed as follows,

.. math::

    d(u,v) = \sqrt{\frac{|v|+|s|}{T}d(v,s)^2 + \frac{|v|+|t|}{T}d(v,t)^2 - \frac{|v|}{T}d(s,t)^2}

where :math:`u` is the newly joined cluster consisting of clusters :math:`s` and :math:`t`, :math:`v` is an unused cluster in the forest, :math:`T=|v|+|s|+|t|`, and :math:`|*|` is the cardinality of its argument. This is also known as the incremental algorithm.
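
Examples --------

A hedged sketch of the unstructured case, mirroring the linkage_tree example above (`Np.matrixf` assumed). The trailing component of the returned tuple is ignored here.

    let x = Np.matrixf [| [|1.; 2.|]; [|1.; 4.|]; [|1.; 0.|];
                          [|4.; 2.|]; [|4.; 4.|]; [|4.; 0.|] |] in
    let children, n_connected_components, n_leaves, parents, distances, _ =
      ward_tree ~x ()
    in
    ignore (children, parents, distances);
    Printf.printf "%d leaves, %d connected component(s)\n"
      n_leaves n_connected_components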