Agglomerate features.
Similar to AgglomerativeClustering, but recursively merges features instead of samples.
Read more in the :ref:`User Guide <hierarchical_clustering>`.
Parameters ---------- n_clusters : int, default=2 The number of clusters to find. It must be ``None`` if ``distance_threshold`` is not ``None``.
affinity : str or callable, default='euclidean' Metric used to compute the linkage. Can be 'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If linkage is 'ward', only 'euclidean' is accepted.
memory : str or object with the joblib.Memory interface, default=None Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
connectivity : array-like or callable, default=None Connectivity matrix. Defines for each feature the neighboring features following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as derived from kneighbors_graph. Default is None, i.e, the hierarchical clustering algorithm is unstructured.
compute_full_tree : 'auto' or bool, optional, default='auto' Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of features. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be ``True`` if ``distance_threshold`` is not ``None``. By default `compute_full_tree` is 'auto', which is equivalent to `True` when `distance_threshold` is not `None` or that `n_clusters` is inferior to the maximum between 100 or `0.02 * n_samples`. Otherwise, 'auto' is equivalent to `False`.
linkage : 'ward', 'complete', 'average', 'single'
, default='ward' Which linkage criterion to use. The linkage criterion determines which distance to use between sets of features. The algorithm will merge the pairs of cluster that minimize this criterion.
- ward minimizes the variance of the clusters being merged.
- average uses the average of the distances of each feature of the two sets.
- complete or maximum linkage uses the maximum distances between all features of the two sets.
- single uses the minimum of the distances between all observations of the two sets.
pooling_func : callable, default=np.mean This combines the values of agglomerated features into a single value, and should accept an array of shape M, N
and the keyword argument `axis=1`, and reduce it to an array of size M
.
distance_threshold : float, default=None The linkage distance threshold above which, clusters will not be merged. If not ``None``, ``n_clusters`` must be ``None`` and ``compute_full_tree`` must be ``True``.
.. versionadded:: 0.21
Attributes ---------- n_clusters_ : int The number of clusters found by the algorithm. If ``distance_threshold=None``, it will be equal to the given ``n_clusters``.
labels_ : array-like of (n_features,) cluster labels for each feature.
n_leaves_ : int Number of leaves in the hierarchical tree.
n_connected_components_ : int The estimated number of connected components in the graph.
.. versionadded:: 0.21 ``n_connected_components_`` was added to replace ``n_components_``.
children_ : array-like of shape (n_nodes-1, 2) The children of each non-leaf node. Values less than `n_features` correspond to leaves of the tree which are the original samples. A node `i` greater than or equal to `n_features` is a non-leaf node and has children `children_i - n_features
`. Alternatively at the i-th iteration, childreni
0
and childreni
1
are merged to form node `n_features + i`
distances_ : array-like of shape (n_nodes-1,) Distances between nodes in the corresponding place in `children_`. Only computed if distance_threshold is not None.
Examples -------- >>> import numpy as np >>> from sklearn import datasets, cluster >>> digits = datasets.load_digits() >>> images = digits.images >>> X = np.reshape(images, (len(images), -1)) >>> agglo = cluster.FeatureAgglomeration(n_clusters=32) >>> agglo.fit(X) FeatureAgglomeration(n_clusters=32) >>> X_reduced = agglo.transform(X) >>> X_reduced.shape (1797, 32)