divik.cluster module¶

Clustering methods
- class divik.cluster.DiviK(kmeans, fast_kmeans=None, distance='correlation', minimal_size=None, rejection_size=None, rejection_percentage=None, minimal_features_percentage=0.01, features_percentage=0.05, normalize_rows=None, neutral=None, use_logfilters=False, filter_type='gmm', n_jobs=None, verbose=False)[source]¶
DiviK clustering
- Parameters
- kmeans: AutoKMeans
A self-tuning KMeans estimator for the purpose of clustering. Two implementations are provided in divik.cluster package: DunnSearch and GAPSearch.
- fast_kmeans: GAPSearch, optional, default: None
A self-tuning KMeans estimator for the purpose of stop condition check. If None, the kmeans parameter is assumed to be the GAPSearch instance.
- distance: str, optional, default: ‘correlation’
The distance metric between points and centroids, also used for GAP index estimation. One of the distances supported by the scipy package.
- minimal_size: int or float, optional, default: None
The minimum size of the region (the number of observations) to be considered for any further divisions. If provided number is between 0 and 1, it is considered a rate of training dataset size. When left None, defaults to 0.1% of the training dataset size.
- rejection_size: int, optional, default: None
Size under which a split will be rejected - if a cluster that appears in the split is below rejection_size, the split is considered improper and discarded. This may be useful for some domains (e.g., there is no justification for a 3-cell cluster in biological data). By default, no segmentation is discarded, as careful post-processing provides the same advantage.
- rejection_percentage: float, optional, default: None
An alternative to rejection_size, with the same behavior, but expressed as a percentage of the training dataset size. By default, no segmentation is discarded.
- minimal_features_percentage: float, optional, default: 0.01
The minimal percentage of features that must be preserved after GMM-based feature selection. By default, at least 1% of the features are preserved in the filtration process.
- features_percentage: float, optional, default: 0.05
The target percentage of features used by the fallback percentage filter for the ‘outlier’ filter.
- normalize_rows: bool, optional, default: None
Whether to normalize each row of the data to the norm of 1. By default, rows are normalized for the correlation metric and left unchanged otherwise.
- neutral: float, optional, default: None
Element skipped when filtering.
- use_logfilters: bool, optional, default: False
Whether to compute the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- filter_type: {‘gmm’, ‘outlier’, ‘auto’, ‘none’}, default: ‘gmm’
- ‘gmm’ - usual Gaussian Mixture Model-based filtering, useful for high-dimensional cases
- ‘outlier’ - robust outlier detection-based filtering, useful for low-dimensional cases. In the case of no outliers, percentage-based filtering is applied.
- ‘auto’ - automatically selects between ‘gmm’ and ‘outlier’ based on the dimensionality. When more than 250 features are present, ‘gmm’ is chosen.
- ‘none’ - feature selection is disabled
- n_jobs: int, optional, default: None
The number of jobs to use for the computation. This works by computing each of the GAP index evaluations in parallel and by making predictions in parallel.
- verbose: bool, optional, default: False
Whether to report the progress of the computations.
Examples
>>> from divik.cluster import DiviK, DunnSearch, KMeans
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=1_000,
...                   n_features=2,
...                   centers=7,
...                   random_state=42,
...                   )
>>> divik = DiviK(
...     kmeans=DunnSearch(  # we want to use Dunn's method for finding the optimal number of clusters
...         kmeans=KMeans(
...             n_clusters=2,  # it is required, like in scikit-learn, but you can provide any number here,
...                            # DunnSearch will override it anyway
...         ),
...         max_clusters=5,  # for the sake of the example I'll keep it low
...     ),
...     minimal_size=100,  # for the sake of the example, I won't split clusters with less than 100 elements
...     filter_type='none',  # we have 2 features in sample data, feature selection would be pointless
... ).fit(X)
>>> divik.n_clusters_
22
>>> divik.labels_
array([1, 1, 1, 0, ..., 0, 0], dtype=int32)
>>> divik.predict([[0, ..., 0], [12, ..., 3]])
array([1, 0], dtype=int32)
>>> divik.cluster_centers_
array([[10., ..., 2.],
       ...,
       [ 1, ..., 2.]])
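For illustration, a GAPSearch instance might be wired in as the fast_kmeans stop-condition estimator described above. The following sketch continues the example; the max_clusters values are illustrative only and not taken from the original docstring.
>>> from divik.cluster import GAPSearch
>>> divik = DiviK(
...     kmeans=DunnSearch(KMeans(n_clusters=2), max_clusters=5),
...     fast_kmeans=GAPSearch(KMeans(n_clusters=2), max_clusters=2),
...     minimal_size=100,
...     filter_type='none',
... ).fit(X)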
- Attributes
- result_: divik.DivikResult
Hierarchical structure describing all the consecutive segmentations.
- labels_:
Labels of each point
- centroids_: array, [n_clusters, n_features]
Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_. Also, the distance between points and respective centroids must be captured in the appropriate features subspace. This is realized by the transform method.
- filters_: array, [n_clusters, n_features]
Filters that were applied to the feature space on the level that was the final segmentation for a subset.
- depth_: int
The number of hierarchy levels in the segmentation.
- n_clusters_: int
The final number of clusters in the segmentation, on the tree leaf level.
- paths_: Dict[int, Tuple[int]]
Describes how the cluster number corresponds to the path in the tree. Element of the tuple indicates the sub-segment number on each tree level.
- reverse_paths_: Dict[Tuple[int], int]
Describes how the path in the tree corresponds to the cluster number. For more details see paths_.
Methods
- fit(X[, y]): Compute DiviK clustering.
- fit_predict(X[, y]): Compute cluster centers and predict cluster index for each sample.
- fit_transform(X[, y]): Compute clustering and transform X to cluster-distance space.
- get_params([deep]): Get parameters for this estimator.
- predict(X): Predict the closest cluster each sample in X belongs to.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform X to a cluster-distance space.
- fit(X, y=None)[source]¶
Compute DiviK clustering.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
- fit_predict(X, y=None)[source]¶
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- y: Ignored
Not used, present here for API consistency by convention.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
- fit_transform(X, y=None, **fit_params)[source]¶
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- y: Ignored
Not used, present here for API consistency by convention.
- Returns
- X_new: array, shape [n_samples, self.n_clusters_]
X transformed in the new space.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- predict(X)[source]¶
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
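For illustration, a parameter of the nested kmeans estimator might be updated with a kmeans__-prefixed key; the names and values below are arbitrary and only sketch the nested-parameter convention.
>>> from divik.cluster import DiviK, DunnSearch, KMeans
>>> divik = DiviK(kmeans=DunnSearch(KMeans(n_clusters=2), max_clusters=5))
>>> divik = divik.set_params(kmeans__max_clusters=10, minimal_size=200)
>>> divik.get_params()['kmeans__max_clusters']
10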
- transform(X)[source]¶
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, self.n_clusters_]
X transformed in the new space.
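For illustration, continuing with the divik estimator fitted in the class-level example above, the transformed array has one column per leaf cluster; the shape follows from the Returns description.
>>> distances = divik.transform(X)
>>> distances.shape
(1000, 22)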
- class divik.cluster.DunnSearch(kmeans, max_clusters, min_clusters=2, method='full', inter='centroid', intra='avg', sample_size=1000, n_trials=10, seed=42, n_jobs=1, drop_unfit=False, verbose=False)[source]¶
Select best number of clusters for k-means
- Parameters
- kmeans: KMeans
KMeans object to tune.
- max_clusters: int
The maximal number of clusters to form and score.
- min_clusters: int, default: 2
The minimal number of clusters to form and score.
- method: {‘full’, ‘sampled’, ‘auto’}, default: ‘full’
Whether to run full computations or approximate:
- full - always computes full Dunn’s index, without sampling
- sampled - samples the clusters to reduce computational overhead
- auto - switches between the above methods to provide the best performance-quality trade-off
- inter: {‘centroid’, ‘closest’}, default: ‘centroid’
How the distance between clusters is computed. For more details see dunn.
- intra: {‘avg’, ‘furthest’}, default: ‘avg’
How the cluster internal distance is computed. For more details see dunn.
- sample_size: int, default: 1000
Size of the sample used to compute Dunn index in auto or sampled scenario.
- n_trials: int, default: 10
Number of trials to use when computing Dunn index in auto or sampled scenario.
- seed: int, default: 42
Random seed for the reproducibility of subset draws in Dunn auto or sampled scenario.
Random seed for the reproducibility of subset draws in Dunn auto or sampled scenario.
- n_jobs: int, default: 1
The number of jobs to use for the computation. This works by computing each of the clustering & scoring runs in parallel.
- drop_unfit: bool, default: False
If True, drops the estimators that did not fit the data.
- verbose: bool, default: False
If True, shows progress with tqdm.
- Attributes
- cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
- labels_:
Labels of each point.
- estimators_: List[KMeans]
KMeans instances for n_clusters in range [min_clusters, max_clusters].
- scores_: array, [max_clusters - min_clusters + 1,]
Array with scores for each estimator.
- n_clusters_: int
Estimated optimal number of clusters.
- best_score_: float
Score of the optimal estimator.
- best_: KMeans
The optimal estimator.
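A minimal usage sketch; the generated data and parameter values are illustrative only and not taken from the original docstring.
>>> from divik.cluster import DunnSearch, KMeans
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=1_000, n_features=2, centers=4, random_state=42)
>>> search = DunnSearch(KMeans(n_clusters=2), max_clusters=10).fit(X)
>>> best_kmeans = search.best_  # the optimal KMeans estimator
>>> labels = search.predict(X)  # index of the closest cluster for each sample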
Methods
- fit(X[, y]): Compute k-means clustering and estimate optimal number of clusters.
- fit_predict(X[, y]): Perform clustering on X and return cluster labels.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- predict(X): Predict the closest cluster each sample in X belongs to.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform X to a cluster-distance space.
- fit(X, y=None)[source]¶
Compute k-means clustering and estimate optimal number of clusters.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
- fit_predict(X, y=None)¶
Perform clustering on X and return cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- predict(X)[source]¶
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- transform(X)[source]¶
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, k]
X transformed in the new space.
- class divik.cluster.GAPSearch(kmeans, max_clusters, min_clusters=1, reference_sampler=None, n_jobs=1, seed=0, n_trials=10, sample_size=1000, drop_unfit=False, verbose=False)[source]¶
Select best number of clusters for k-means
- Parameters
- kmeans: KMeans
KMeans object to tune.
- max_clusters: int
The maximal number of clusters to form and score.
- min_clusters: int, default: 1
The minimal number of clusters to form and score.
- reference_sampler: BaseSampler, default: None
Sampler for reference dataset sampling in GAP statistic computations.
- n_jobs: int, default: 1
The number of jobs to use for the computation. This works by computing each of the clustering & scoring runs in parallel.
- seed: int, default: 0
Random seed for generating uniform data sets.
- n_trials: int, default: 10
Number of data sets drawn as a reference.
- sample_size: int, default: 1000
Size of the sample used for GAP statistic computation. Used only if it introduces a speedup.
- drop_unfit: bool, default: False
If True, drops the estimators that did not fit the data.
- verbose: bool, default: False
If True, shows progress with tqdm.
- Attributes
- cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
- labels_:
Labels of each point.
- estimators_: List[KMeans]
KMeans instances for n_clusters in range [min_clusters, max_clusters].
- scores_: array, [max_clusters - min_clusters + 1, ?]
Array with scores for each estimator in each row.
- n_clusters_: int
Estimated optimal number of clusters.
- best_score_: float
Score of the optimal estimator.
- best_: KMeans
The optimal estimator.
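A minimal usage sketch; the generated data and parameter values are illustrative only and not taken from the original docstring.
>>> from divik.cluster import GAPSearch, KMeans
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=1_000, n_features=2, centers=3, random_state=42)
>>> gap = GAPSearch(KMeans(n_clusters=2), max_clusters=5).fit(X)
>>> k = gap.n_clusters_      # estimated optimal number of clusters
>>> labels = gap.predict(X)  # index of the closest cluster for each sample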
Methods
- fit(X[, y]): Compute k-means clustering and estimate optimal number of clusters.
- fit_predict(X[, y]): Perform clustering on X and return cluster labels.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- predict(X): Predict the closest cluster each sample in X belongs to.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform X to a cluster-distance space.
- fit(X, y=None)[source]¶
Compute k-means clustering and estimate optimal number of clusters.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
- fit_predict(X, y=None)¶
Perform clustering on X and return cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- predict(X)[source]¶
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- transform(X)[source]¶
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, k]
X transformed in the new space.
- class divik.cluster.KMeans(n_clusters, distance='euclidean', init='percentile', percentile=95.0, leaf_size=0.01, max_iter=100, normalize_rows=False, allow_dask=False)[source]¶
K-Means clustering
- Parameters
- n_clusters: int
The number of clusters to form as well as the number of centroids to generate.
- distance: str, optional, default: ‘euclidean’
Distance measure. One of the distances supported by scipy package.
- init: {‘percentile’, ‘extreme’, ‘kdtree’, ‘kdtree_percentile’}
Method for initialization, defaults to ‘percentile’:
‘percentile’: selects initial cluster centers for k-means clustering starting from a specified percentile of distance to already selected clusters
‘extreme’: selects initial cluster centers for k-means clustering starting from the furthest points to already specified clusters
‘kdtree’: selects initial cluster centers for k-means clustering starting from centroids of KD-Tree boxes
‘kdtree_percentile’: selects initial cluster centers for k-means clustering starting from centroids of KD-Tree boxes containing specified percentile. This should be more robust against outliers.
- percentile: float, default: 95.0
Specifies the starting percentile for ‘percentile’ initialization. Must be within range [0.0, 100.0]. At 100.0 it is equivalent to ‘extreme’ initialization.
- leaf_size: int or float, optional (default 0.01)
Desired leaf size in kdtree initialization. When int, the box size will be between leaf_size and 2 * leaf_size. When float, it will be between leaf_size * n_samples and 2 * leaf_size * n_samples.
- max_iter: int, default: 100
Maximum number of iterations of the k-means algorithm for a single run.
- normalize_rows: bool, default: False
If True, rows are translated to mean of 0.0 and scaled to norm of 1.0.
- allow_dask: bool, default: False
If True, automatically selects dask as the computations backend whenever reasonable. Defaults to False, since it cannot be used together with multiprocessing.Pool and then n_jobs must be set to 1 everywhere.
- Attributes
- cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
- labels_
Labels of each point
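A minimal usage sketch; the generated data and parameter values are illustrative only and not taken from the original docstring. The shape of cluster_centers_ follows from the attribute description above.
>>> from divik.cluster import KMeans
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=1_000, n_features=2, centers=3, random_state=42)
>>> km = KMeans(n_clusters=3, distance='euclidean', init='percentile').fit(X)
>>> km.cluster_centers_.shape
(3, 2)
>>> labels = km.predict(X)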
Methods
- fit(X[, y]): Compute k-means clustering.
- fit_predict(X[, y]): Perform clustering on X and return cluster labels.
- fit_transform(X[, y]): Fit to data, then transform it.
- get_params([deep]): Get parameters for this estimator.
- predict(X): Predict the closest cluster each sample in X belongs to.
- set_params(**params): Set the parameters of this estimator.
- transform(X): Transform X to a cluster-distance space.
- fit(X, y=None)[source]¶
Compute k-means clustering.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
- fit_predict(X, y=None)¶
Perform clustering on X and return cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- predict(X)[source]¶
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
- transform(X)[source]¶
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, k]
X transformed in the new space.
- class divik.cluster.TwoStep(clusterer, n_subsets=10, random_state=42)[source]¶
Perform a two-step clustering with a given clusterer
Separates a dataset into n_subsets, processes each of them separately, and then combines the results.
Works with centroid-based clustering methods, as it requires cluster representatives to combine the result.
- Parameters
- clusterer: Union[AutoKMeans, Pipeline, KMeans]
A centroid-based estimator for the purpose of clustering.
- n_subsets: int, default 10
The number of subsets into which the original dataset should be separated.
- random_state: int, default 42
Random state to use for seeding the random number generator.
Random state to use for seeding the random number generator.
Examples
>>> from sklearn.datasets import make_blobs
>>> from divik.cluster import KMeans, TwoStep
>>> X, _ = make_blobs(
...     n_samples=10_000, n_features=2, centers=3, random_state=42
... )
>>> kmeans = KMeans(n_clusters=3)
>>> ctr = TwoStep(kmeans).fit(X)
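For illustration, the combined labels might also be obtained in a single call, continuing the example above; the n_subsets value is arbitrary and the shape follows from the fit_predict description below.
>>> labels = TwoStep(kmeans, n_subsets=10).fit_predict(X)
>>> labels.shape
(10000,)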
Methods
- fit_predict(X[, y]): Perform clustering on X and return cluster labels.
- get_params([deep]): Get parameters for this estimator.
- set_params(**params): Set the parameters of this estimator.
- fit
- predict
- fit_predict(X, y=None)[source]¶
Perform clustering on X and return cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.