Welcome to divik’s documentation!
Here you can find a list of documentation topics covered by this page.
Cluster analysis with fitclusters executable
Note
fitclusters requires installation with gin extras, e.g.
pip install divik[gin]
fitclusters is a single CLI executable that lets you run the DiviK algorithm, any other clustering algorithm supported by scikit-learn, or even a pipeline with preprocessing.
Usage
CLI interface
There are two types of parameters:
--param - sets the value of a parameter when launching the fitclusters executable, i.e. it overrides a parameter provided in a config file or a default.
--config - provides a list of config files. Their content is treated as one big (ordered) list of settings. In case of conflict, a later file overrides a setting provided by an earlier one.
These go directly to the CLI.
usage: fitclusters [-h] [--param [PARAM [PARAM ...]]]
                   [--config [CONFIG [CONFIG ...]]]
optional arguments:
  -h, --help            show this help message and exit
  --param [PARAM [PARAM ...]]
                        List of Gin parameter bindings
  --config [CONFIG [CONFIG ...]]
                        List of paths to the config files
Sample fitclusters call:
fitclusters \
  --param \
    load_data.path='/data/my_data.csv' \
    DiviK.distance='euclidean' \
    DiviK.use_logfilters=False \
    DiviK.n_jobs=1 \
  --config \
    mydefaults.gin \
    myoverrides.gin
All the parameters are described in Experiment configuration and Model setup.
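The override order of the two option types can be sketched with the standard library alone. This is an illustrative stand-in, not divik's source: the real executable passes both lists to Gin, whereas `merge_settings` and the pre-parsed dictionaries below are hypothetical.

```python
import argparse


def build_parser():
    """Declare a CLI with the same two list-valued options as the usage text."""
    parser = argparse.ArgumentParser(prog="fitclusters")
    parser.add_argument("--param", nargs="*", default=[],
                        help="List of Gin parameter bindings")
    parser.add_argument("--config", nargs="*", default=[],
                        help="List of paths to the config files")
    return parser


def merge_settings(config_contents, bindings):
    """Merge settings so that later sources win.

    config_contents: list of dicts, one per config file, in CLI order
                     (hypothetical stand-in for parsed .gin files).
    bindings: list of "name=value" strings passed through --param.
    """
    merged = {}
    for content in config_contents:   # later files override earlier ones
        merged.update(content)
    for binding in bindings:          # --param overrides every config file
        name, _, value = binding.partition("=")
        merged[name] = value
    return merged


args = build_parser().parse_args([
    "--param", "DiviK.n_jobs=1", "DiviK.distance='euclidean'",
    "--config", "mydefaults.gin", "myoverrides.gin",
])

# Pretend these are the parsed contents of the two config files.
defaults = {"DiviK.n_jobs": "4", "DiviK.verbose": "False"}
overrides = {"DiviK.verbose": "True"}
settings = merge_settings([defaults, overrides], args.param)
```

Here `DiviK.verbose` ends up taken from the later file and `DiviK.n_jobs` from the command line, which matches the precedence described for --config and --param.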
Experiment configuration
The following parameters are available when launching experiments:
load_data.path - path to the file with data for clustering. Observations in rows, features in columns.
load_xy.path - path to the file with X and Y coordinates for the observations. The number of coordinate pairs must be the same as the number of observations. Only integer coordinates are supported now.
experiment.model - the clustering model to fit to the data. See more in Model setup.
experiment.steps_that_require_xy - when using scikit-learn Pipeline, it may be required to provide spatial coordinates to fit specific algorithms. This parameter accepts the list of the steps that should be provided with spatial coordinates during pipeline execution (e.g. EximsSelector).
experiment.destination - the destination directory for the experiment outputs. Default result.
experiment.omit_datetime - if True, the destination directory will be directly populated with the results of the experiment. Otherwise, a subdirectory with date and time will be created to keep separation between runs. Default False.
experiment.verbose - if True, extends the messaging on the console. Default False.
experiment.exist_ok - if True, the experiment will not fail if the destination directory exists. The default of False guards against overwriting earlier results.
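How destination, omit_datetime and exist_ok interact can be sketched as follows. `prepare_destination` is a hypothetical helper, and the timestamp format is an assumption; divik's actual directory naming may differ.

```python
import os
import tempfile
from datetime import datetime


def prepare_destination(destination="result", omit_datetime=False,
                        exist_ok=False):
    """Create and return the output directory for one experiment run.

    Hypothetical helper; the timestamp format is an assumption.
    """
    if not omit_datetime:
        # Keep runs apart by nesting a timestamped subdirectory.
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        destination = os.path.join(destination, stamp)
    # With exist_ok=False this raises FileExistsError for an existing
    # directory, which protects earlier results from being overwritten.
    os.makedirs(destination, exist_ok=exist_ok)
    return destination


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "result")
    first = prepare_destination(target, omit_datetime=True)
    try:
        prepare_destination(target, omit_datetime=True)
        second_run_failed = False
    except FileExistsError:
        second_run_failed = True
    reused = prepare_destination(target, omit_datetime=True, exist_ok=True)
    nested = prepare_destination(target, omit_datetime=False, exist_ok=True)
    nested_parent = os.path.dirname(nested)
```

The second call fails because the directory already exists, while the call with exist_ok=True reuses it.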
Model setup
divik models
To use the DiviK algorithm in the experiment, a config file must:
Import the algorithms to the scope, e.g.:
import divik.cluster
Tell the experiment which algorithm to use, e.g.:
experiment.model = @DiviK()
Configure the algorithm, e.g.:
DiviK.distance = 'euclidean'
DiviK.verbose = True
Sample config with KMeans
Below is a sample configuration file that sets up a simple KMeans:
import divik.cluster
KMeans.n_clusters = 3
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True
experiment.model = @KMeans()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
Sample config with DiviK
Below is a configuration file with the full setup of DiviK. DiviK requires an automated clustering method for the stop condition and a separate one for clustering. Here we use GAPSearch for the stop condition and DunnSearch for selecting the number of clusters. These in turn require a KMeans method set for a specific distance method, etc.:
import divik.cluster
KMeans.n_clusters = 1
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True
GAPSearch.kmeans = @KMeans()
GAPSearch.max_clusters = 2
GAPSearch.n_jobs = 1
GAPSearch.seed = 42
GAPSearch.n_trials = 10
GAPSearch.sample_size = 1000
GAPSearch.drop_unfit = True
GAPSearch.verbose = True
DunnSearch.kmeans = @KMeans()
DunnSearch.max_clusters = 10
DunnSearch.method = "auto"
DunnSearch.inter = "closest"
DunnSearch.intra = "furthest"
DunnSearch.sample_size = 1000
DunnSearch.seed = 42
DunnSearch.n_jobs = 1
DunnSearch.drop_unfit = True
DunnSearch.verbose = True
DiviK.kmeans = @DunnSearch()
DiviK.fast_kmeans = @GAPSearch()
DiviK.distance = "correlation"
DiviK.minimal_size = 200
DiviK.rejection_size = 2
DiviK.minimal_features_percentage = 0.005
DiviK.features_percentage = 1.0
DiviK.normalize_rows = True
DiviK.use_logfilters = True
DiviK.filter_type = "gmm"
DiviK.n_jobs = 1
DiviK.verbose = True
experiment.model = @DiviK()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
scikit-learn models
For a model to be used with fitclusters, it needs to be marked as gin.configurable. While this is true for DiviK and the remaining algorithms within the divik package, scikit-learn requires additional setup.
Import helper module:
import divik.core.gin_sklearn_configurables
Tell the experiment which algorithm to use, e.g.:
experiment.model = @MeanShift()
Configure the algorithm, e.g.:
MeanShift.n_jobs = 1
MeanShift.max_iter = 300
Warning
Importing both scikit-learn and divik will result in an ambiguity when using e.g. KMeans. In such a case it is necessary to refer to specific algorithms by their full name, e.g. divik.cluster._kmeans._core.KMeans.
Sample config with MeanShift
Below is a sample configuration file that sets up a simple MeanShift:
import divik.core.gin_sklearn_configurables
MeanShift.cluster_all = True
MeanShift.n_jobs = 1
MeanShift.max_iter = 300
experiment.model = @MeanShift()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
Pipelines
scikit-learn Pipelines have a separate section to provide an additional explanation, even though these are part of scikit-learn.
Import helper module:
import divik.core.gin_sklearn_configurables
Import the algorithms into the scope:
import divik.feature_extraction
Tell the experiment which algorithm to use, e.g.:
experiment.model = @Pipeline()
Configure the algorithms, e.g.:
MeanShift.n_jobs = 1
MeanShift.max_iter = 300
Configure the pipeline:
Pipeline.steps = [
    ('histogram_equalization', @HistogramEqualization()),
    ('exims', @EximsSelector()),
    ('pca', @KneePCA()),
    ('mean_shift', @MeanShift()),
]
(If needed) configure steps that require spatial coordinates:
experiment.steps_that_require_xy = ['exims']
Sample config with Pipeline
Below is a sample configuration file that sets up a simple Pipeline:
import divik.core.gin_sklearn_configurables
import divik.feature_extraction
MeanShift.n_jobs = 1
MeanShift.max_iter = 300
Pipeline.steps = [
('histogram_equalization', @HistogramEqualization()),
('exims', @EximsSelector()),
('pca', @KneePCA()),
('mean_shift', @MeanShift()),
]
experiment.model = @Pipeline()
experiment.steps_that_require_xy = ['exims']
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
Custom models
The fitclusters executable can work with custom algorithms as well.
Mark an algorithm class gin.configurable at the definition time:
import gin

@gin.configurable
class MyClustering:
    pass
or when importing them from a library:
import gin
gin.external_configurable(MyClustering)
Define artifacts saving methods:
from divik.core.io import saver

@saver
def save_my_clustering(model, fname_fn, **kwargs):
    if not hasattr(model, 'my_custom_field_'):
        return
    # custom saving logic comes here
There are some default savers defined, which are compatible with lots of divik and scikit-learn algorithms, supporting things like:
model pickling
JSON summary saving
labels saving (.npy, .csv)
centroids saving (.npy, .csv)
pipeline saving
A saver should be highly reusable and could be a pleasant contribution to the divik library.
In config, import the module which marks your algorithm configurable:
import myclustering
Continue with the algorithm setup and plumbing as in the previous scenarios.
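The saver pattern from the steps above can be tried out without divik installed. The sketch below uses a stand-in `saver` decorator and a hypothetical `save_all` driver; it assumes that `fname_fn` maps a file name to a full path in the destination directory, which the text does not state explicitly.

```python
import json
import os
import tempfile

_SAVERS = []  # hypothetical registry; divik keeps its own


def saver(func):
    """Register a function to be offered every fitted model (stand-in)."""
    _SAVERS.append(func)
    return func


def save_all(model, destination):
    """Call each registered saver; savers decide whether the model applies."""
    def fname_fn(name):
        return os.path.join(destination, name)
    for save in _SAVERS:
        save(model, fname_fn)


@saver
def save_my_clustering(model, fname_fn, **kwargs):
    if not hasattr(model, "my_custom_field_"):
        return  # not our kind of model, leave it to other savers
    with open(fname_fn("my_custom_field.json"), "w") as handle:
        json.dump(model.my_custom_field_, handle)


class MyClustering:
    """Toy model exposing the fitted attribute the saver looks for."""
    def fit(self, X):
        self.my_custom_field_ = {"n_samples": len(X)}
        return self


class OtherModel:
    """A model without the custom field; the saver must skip it."""


with tempfile.TemporaryDirectory() as out:
    save_all(OtherModel(), out)
    skipped = os.listdir(out) == []
    save_all(MyClustering().fit([[0.0], [1.0]]), out)
    with open(os.path.join(out, "my_custom_field.json")) as handle:
        saved = json.load(handle)
```

The early return on a missing attribute is what makes a saver reusable: it can be registered once and silently ignores every model it does not understand.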
Computational Modules
divik.cluster module
Clustering methods

class divik.cluster.DiviK(kmeans, fast_kmeans=None, distance='correlation', minimal_size=None, rejection_size=None, rejection_percentage=None, minimal_features_percentage=0.01, features_percentage=0.05, normalize_rows=None, use_logfilters=False, filter_type='gmm', n_jobs=None, verbose=False)
DiviK clustering
 Parameters
 kmeans: AutoKMeans
A self-tuning KMeans estimator for the purpose of clustering.
 fast_kmeans: GAPSearch, optional, default: None
A self-tuning KMeans estimator for the purpose of stop condition check. If None, the kmeans parameter is assumed to be the GAPSearch instance.
 distance: str, optional, default: ‘correlation’
The distance metric between points, centroids and for GAP index estimation. One of the distances supported by scipy package.
 minimal_size: int or float, optional, default: None
The minimum size of the region (the number of observations) to be considered for any further divisions. If the provided number is between 0 and 1, it is considered a rate of training dataset size. When left None, defaults to 0.1% of the training dataset size.
 rejection_size: int, optional, default: None
Size under which split will be rejected - if a cluster appears in the split that is below rejection_size, the split is considered improper and discarded. This may be useful for some domains (e.g. there is no justification for a 3-cells cluster in biological data). By default, no segmentation is discarded, as careful post-processing provides the same advantage.
 rejection_percentage: float, optional, default: None
An alternative to rejection_size, with the same behavior, but this parameter is related to the training data size percentage. By default, no segmentation is discarded.
 minimal_features_percentage: float, optional, default: 0.01
The minimal percentage of features that must be preserved after GMM-based feature selection. By default at least 1% of features is preserved in the filtration process.
 features_percentage: float, optional, default: 0.05
The target percentage of features that are used by fallback percentage filter for ‘outlier’ filter.
 normalize_rows: bool, optional, default: None
Whether to normalize each row of the data to the norm of 1. By default, it normalizes rows for correlation metric, does no normalization otherwise.
 use_logfilters: bool, optional, default: False
Whether to compute logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
 filter_type: {‘gmm’, ‘outlier’, ‘auto’, ‘none’}, default: ‘gmm’
‘gmm’ - usual Gaussian Mixture Model-based filtering, useful for high dimensional cases
‘outlier’ - robust outlier detection-based filtering, useful for low dimensional cases. In the case of no outliers, percentage-based filtering is applied.
‘auto’ - automatically selects between ‘gmm’ and ‘outlier’ based on the dimensionality. When more than 250 features are present, ‘gmm’ is chosen.
‘none’ - feature selection is disabled
 n_jobs: int, optional, default: None
The number of jobs to use for the computation. This works by computing each of the GAP index evaluations in parallel and by making predictions in parallel.
 verbose: bool, optional, default: False
Whether to report the progress of the computations.
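Two of the rules in the parameter list, the interpretation of minimal_size and the 'auto' choice of filter_type, can be written down as small functions. Both helpers are hypothetical and only restate the documented rules; in particular the rounding of fractional sizes is an assumption.

```python
def resolve_minimal_size(minimal_size, n_samples):
    """Turn the minimal_size setting into an absolute number of observations."""
    if minimal_size is None:
        # Documented default: 0.1% of the training dataset size.
        return int(round(0.001 * n_samples))
    if 0 < minimal_size < 1:
        # A fraction is read as a rate of the training dataset size.
        # Rounding to the nearest integer is an assumption.
        return int(round(minimal_size * n_samples))
    return int(minimal_size)


def resolve_filter_type(filter_type, n_features):
    """Pick the concrete filter for 'auto'; other values pass through."""
    if filter_type == "auto":
        # Documented threshold: more than 250 features selects 'gmm'.
        return "gmm" if n_features > 250 else "outlier"
    return filter_type
```

For a dataset of 10000 observations, leaving minimal_size at None would therefore stop dividing regions smaller than 10 observations.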
Examples
>>> from divik.cluster import DiviK
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=200, n_features=100, centers=20,
...                   random_state=42)
>>> divik = DiviK(distance='euclidean').fit(X)
>>> divik.labels_
array([1, 1, 1, 0, ..., 0, 0], dtype=int32)
>>> divik.predict([[0, ..., 0], [12, ..., 3]])
array([1, 0], dtype=int32)
>>> divik.cluster_centers_
array([[10., ..., 2.], ..., [ 1, ..., 2.]])
 Attributes
 result_: divik.DivikResult
Hierarchical structure describing all the consecutive segmentations.
 labels_:
Labels of each point
 centroids_: array, [n_clusters, n_features]
Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_. Also, the distance between points and respective centroids must be captured in appropriate features subspace. This is realized by the transform method.
 filters_: array, [n_clusters, n_features]
Filters that were applied to the feature space on the level that was the final segmentation for a subset.
 depth_: int
The number of hierarchy levels in the segmentation.
 n_clusters_: int
The final number of clusters in the segmentation, on the tree leaf level.
 paths_: Dict[int, Tuple[int]]
Describes how the cluster number corresponds to the path in the tree. Element of the tuple indicates the subsegment number on each tree level.
 reverse_paths_: Dict[Tuple[int], int]
Describes how the path in the tree corresponds to the cluster number. For more details see paths_.
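The relation between paths_ and reverse_paths_ is easiest to see on a small, made-up tree. The dictionary below is hypothetical data shaped like the documented types, not output of a real fit, and `clusters_under` is an illustrative helper.

```python
# Hypothetical paths_ for a two-level tree: the root splits into
# subsegments 0 and 1, and subsegment 1 splits again into 0 and 1.
paths = {
    0: (0,),
    1: (1, 0),
    2: (1, 1),
}

# reverse_paths_ is the inverse mapping: tree path -> cluster number.
reverse_paths = {path: cluster for cluster, path in paths.items()}


def clusters_under(prefix, paths):
    """List final clusters whose path starts with the given subtree prefix."""
    return sorted(
        cluster for cluster, path in paths.items()
        if path[:len(prefix)] == prefix
    )
```

With such a mapping, all leaf clusters that descend from a given first-level segment can be collected by matching on the path prefix.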
Methods
fit(X[, y]): Compute DiviK clustering.
fit_predict(X[, y]): Compute cluster centers and predict cluster index for each sample.
fit_transform(X[, y]): Compute clustering and transform X to cluster-distance space.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict the closest cluster each sample in X belongs to.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X to a cluster-distance space.

fit(X, y=None)
Compute DiviK clustering.
 Parameters
 X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
 y: Ignored
Not used, present here for API consistency by convention.

fit_predict(X, y=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
 y: Ignored
Not used, present here for API consistency by convention.
 Returns
 labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.

fit_transform(X, y=None, **fit_params)
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
 y: Ignored
Not used, present here for API consistency by convention.
 Returns
 X_new: array, shape [n_samples, self.n_clusters_]
X transformed in the new space.

get_params(deep=True)
Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
 Returns
 labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.
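The `<component>__<parameter>` naming can be illustrated with a toy router. This is a sketch of the convention only, not scikit-learn's implementation; `Component` and `set_nested_params` are hypothetical names.

```python
class Component:
    """Tiny stand-in for an estimator with plain attributes as parameters."""
    def __init__(self, **params):
        self.__dict__.update(params)


def set_nested_params(root, **params):
    """Route '<component>__<parameter>' keys to the named sub-object."""
    for key, value in params.items():
        name, delimiter, rest = key.partition("__")
        if delimiter:
            # Recurse into the component, passing the remainder of the key.
            set_nested_params(getattr(root, name), **{rest: value})
        else:
            setattr(root, name, value)
    return root


search = Component(max_clusters=10,
                   kmeans=Component(n_clusters=2, max_iter=100))
set_nested_params(search, max_clusters=5, kmeans__max_iter=300)
```

The key `kmeans__max_iter` reaches the inner object, while `max_clusters` is set on the outer one and untouched parameters keep their values.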

transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
 Returns
 X_new: array, shape [n_samples, self.n_clusters_]
X transformed in the new space.

class divik.cluster.DunnSearch(kmeans, max_clusters, min_clusters=2, method='full', inter='centroid', intra='avg', sample_size=1000, n_trials=10, seed=42, n_jobs=1, drop_unfit=False, verbose=False)
Select best number of clusters for k-means
 Parameters
 kmeans: KMeans
KMeans object to tune
 max_clusters: int
The maximal number of clusters to form and score.
 min_clusters: int, default: 2
The minimal number of clusters to form and score.
 method: {‘full’, ‘sampled’, ‘auto’}, default: ‘full’
Whether to run full computations or approximate.
‘full’ - always computes full Dunn’s index, without sampling
‘sampled’ - samples the clusters to reduce computational overhead
‘auto’ - switches the above methods to provide best performance-quality trade-off.
 inter: {‘centroid’, ‘closest’}, default: ‘centroid’
How the distance between clusters is computed. For more details see dunn.
 intra: {‘avg’, ‘furthest’}, default: ‘avg’
How the cluster internal distance is computed. For more details see dunn.
 sample_size: int, default: 1000
Size of the sample used to compute Dunn index in auto or sampled scenario.
 n_trials: int, default: 10
Number of trials to use when computing Dunn index in auto or sampled scenario.
 seed: int, default: 42
Random seed for the reproducibility of subset draws in Dunn auto or sampled scenario.
 n_jobs: int, default: 1
The number of jobs to use for the computation. This works by computing each of the clustering & scoring runs in parallel.
 drop_unfit: bool, default: False
If True, drops the estimators that did not fit the data.
 verbose: bool, default: False
If True, shows progress with tqdm.
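For orientation, the Dunn index is the smallest between-cluster distance divided by the largest within-cluster distance, and inter and intra select how each of the two is measured. The sketch below is a plain-Python illustration with Euclidean distance, not divik's dunn function; the reading of 'avg' as the mean distance to the centroid and of 'furthest' as the largest pairwise distance are assumptions.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def inter_distance(a, b, inter="centroid"):
    if inter == "centroid":
        return dist(centroid(a), centroid(b))
    # 'closest': smallest distance between points of the two clusters
    return min(dist(p, q) for p in a for q in b)


def intra_distance(cluster, intra="avg"):
    if intra == "avg":
        # assumed meaning: mean distance of the points to their centroid
        center = centroid(cluster)
        return sum(dist(p, center) for p in cluster) / len(cluster)
    # 'furthest' (assumed): largest distance between two points of the cluster
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def dunn(clusters, inter="centroid", intra="avg"):
    """Smallest between-cluster distance over largest within-cluster one."""
    separation = min(inter_distance(a, b, inter)
                     for a, b in combinations(clusters, 2))
    spread = max(intra_distance(c, intra) for c in clusters)
    return separation / spread


clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
default_score = dunn(clusters)
strict_score = dunn(clusters, inter="closest", intra="furthest")
```

On this toy data the 'closest'/'furthest' combination gives the lower score, since it measures separation and spread by the least favourable points rather than by centroids.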
 Attributes
 cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
 labels_:
Labels of each point.
 estimators_: List[KMeans]
KMeans instances for n_clusters in range [min_clusters, max_clusters].
 scores_: array, [max_clusters - min_clusters + 1,]
Array with scores for each estimator.
 n_clusters_: int
Estimated optimal number of clusters.
 best_score_: float
Score of the optimal estimator.
 best_: KMeans
The optimal estimator.
Methods
fit(X[, y]): Compute k-means clustering and estimate optimal number of clusters.
fit_predict(X[, y]): Perform clustering on X and return cluster labels.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict the closest cluster each sample in X belongs to.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X to a cluster-distance space.

fit(X, y=None)
Compute k-means clustering and estimate optimal number of clusters.
 Parameters
 X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
 y: Ignored
Not used, present here for API consistency by convention.

fit_predict(X, y=None)
Perform clustering on X and return cluster labels.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input data.
 y: Ignored
Not used, present for API consistency by convention.
 Returns
 labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.

fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_params(deep=True)
Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
 Returns
 labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
 Returns
 X_new: array, shape [n_samples, k]
X transformed in the new space.

class divik.cluster.GAPSearch(kmeans, max_clusters, min_clusters=1, n_jobs=1, seed=0, n_trials=10, sample_size=1000, drop_unfit=False, verbose=False)
Select best number of clusters for k-means
 Parameters
 kmeans: KMeans
KMeans object to tune
 max_clusters: int
The maximal number of clusters to form and score.
 min_clusters: int, default: 1
The minimal number of clusters to form and score.
 n_jobs: int, default: 1
The number of jobs to use for the computation. This works by computing each of the clustering & scoring runs in parallel.
 seed: int, default: 0
Random seed for generating uniform data sets.
 n_trials: int, default: 10
Number of data sets drawn as a reference.
 sample_size: int, default: 1000
Size of the sample used for GAP statistic computation. Used only if it introduces a speedup.
 drop_unfit: bool, default: False
If True, drops the estimators that did not fit the data.
 verbose: bool, default: False
If True, shows progress with tqdm.
 Attributes
 cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
 labels_:
Labels of each point.
 estimators_: List[KMeans]
KMeans instances for n_clusters in range [min_clusters, max_clusters].
 scores_: array, [max_clusters - min_clusters + 1, ?]
Array with scores for each estimator in each row.
 n_clusters_: int
Estimated optimal number of clusters.
 best_score_: float
Score of the optimal estimator.
 best_: KMeans
The optimal estimator.
Methods
fit(X[, y]): Compute k-means clustering and estimate optimal number of clusters.
fit_predict(X[, y]): Perform clustering on X and return cluster labels.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict the closest cluster each sample in X belongs to.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X to a cluster-distance space.

fit(X, y=None)
Compute k-means clustering and estimate optimal number of clusters.
 Parameters
 X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
 y: Ignored
Not used, present here for API consistency by convention.

fit_predict(X, y=None)
Perform clustering on X and return cluster labels.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input data.
 y: Ignored
Not used, present for API consistency by convention.
 Returns
 labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.

fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_params(deep=True)
Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
 Returns
 labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
 Returns
 X_new: array, shape [n_samples, k]
X transformed in the new space.

class divik.cluster.KMeans(n_clusters, distance='euclidean', init='percentile', percentile=95.0, leaf_size=0.01, max_iter=100, normalize_rows=False, allow_dask=False)
KMeans clustering
 Parameters
 n_clusters: int
The number of clusters to form as well as the number of centroids to generate.
 distance: str, optional, default: ‘euclidean’
Distance measure. One of the distances supported by scipy package.
 init: {‘percentile’, ‘extreme’, ‘kdtree’, ‘kdtree_percentile’}
Method for initialization, defaults to ‘percentile’:
‘percentile’: selects initial cluster centers for k-means clustering starting from the specified percentile of distance to already selected clusters
‘extreme’: selects initial cluster centers for k-means clustering starting from the furthest points to already specified clusters
‘kdtree’: selects initial cluster centers for k-means clustering starting from centroids of KDTree boxes
‘kdtree_percentile’: selects initial cluster centers for k-means clustering starting from centroids of KDTree boxes containing the specified percentile. This should be more robust against outliers.
 percentile: float, default: 95.0
Specifies the starting percentile for ‘percentile’ initialization. Must be within range [0.0, 100.0]. At 100.0 it is equivalent to ‘extreme’ initialization.
 leaf_size: int or float, optional (default 0.01)
Desired leaf size in kdtree initialization. When int, the box size will be between leaf_size and 2 * leaf_size. When float, it will be between leaf_size * n_samples and 2 * leaf_size * n_samples.
 max_iter: int, default: 100
Maximum number of iterations of the k-means algorithm for a single run.
 normalize_rows: bool, default: False
If True, rows are translated to mean of 0.0 and scaled to norm of 1.0.
 allow_dask: bool, default: False
If True, automatically selects dask as computations backend whenever reasonable. Defaults to False, since dask cannot be used together with multiprocessing.Pool; when it is enabled, n_jobs must be set to 1 everywhere.
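The effect of normalize_rows on a single row can be sketched in plain Python. `normalize_row` is a hypothetical helper that restates the documented rule; the handling of a constant row is an assumption.

```python
from math import sqrt


def normalize_row(row):
    """Translate a row to mean 0.0 and scale it to Euclidean norm 1.0."""
    mean = sum(row) / len(row)
    centered = [value - mean for value in row]
    norm = sqrt(sum(value * value for value in centered))
    if norm == 0.0:
        # A constant row has no direction; returning zeros is an assumption.
        return centered
    return [value / norm for value in centered]


row = normalize_row([1.0, 2.0, 3.0, 6.0])
```

After this transformation the dot product of two rows equals their Pearson correlation, which is presumably why the sample configs pair normalize_rows = True with the correlation distance.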
 Attributes
 cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
 labels_:
Labels of each point
Methods
fit(X[, y]): Compute k-means clustering.
fit_predict(X[, y]): Perform clustering on X and return cluster labels.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict the closest cluster each sample in X belongs to.
set_params(**params): Set the parameters of this estimator.
transform(X): Transform X to a cluster-distance space.

fit(X, y=None)
Compute k-means clustering.
 Parameters
 X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. Note that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
 y: Ignored
Not used, present here for API consistency by convention.

fit_predict(X, y=None)
Perform clustering on X and return cluster labels.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input data.
 y: Ignored
Not used, present for API consistency by convention.
 Returns
 labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.

fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_params(deep=True)
Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
 Returns
 labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
 Parameters
 X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
 Returns
 X_new: array, shape [n_samples, k]
X transformed in the new space.
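The relation between transform and predict for a plain k-means model can be sketched as follows. This is a pure-Python illustration with Euclidean distance and made-up centers; it is not the divik implementation, which supports other distances and array inputs.

```python
from math import dist  # Euclidean distance, Python 3.8+


def transform(X, cluster_centers):
    """Map each sample to its distances from every cluster center."""
    return [[dist(x, center) for center in cluster_centers] for x in X]


def predict(X, cluster_centers):
    """Index of the closest center (the closest code in the code book)."""
    return [row.index(min(row)) for row in transform(X, cluster_centers)]


centers = [(0.0, 0.0), (3.0, 4.0)]   # hypothetical fitted cluster centers
samples = [(0.0, 1.0), (3.0, 3.0)]
distances = transform(samples, centers)
labels = predict(samples, centers)
```

Each row of `distances` has one entry per cluster, and the predicted label is the position of the smallest entry in that row.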

class divik.cluster.TwoStep(clusterer, n_subsets=10, random_state=42)
Perform a two-step clustering with a given clusterer
Separates a dataset into n_subsets, processes each of them separately and then combines the results.
Works with centroid-based clustering methods, as it requires cluster representatives to combine the result.
 Parameters
 clusterer: Union[AutoKMeans, Pipeline, KMeans]
A centroid-based estimator for the purpose of clustering.
 n_subsets: int, default 10
The number of subsets into which the original dataset should be separated
 random_state: int, default 42
Random state to use for seeding the random number generator.
Examples
>>> from sklearn.datasets import make_blobs
>>> from divik.cluster import KMeans, TwoStep
>>> X, _ = make_blobs(
...     n_samples=10_000, n_features=2, centers=3, random_state=42
... )
>>> kmeans = KMeans(n_clusters=3)
>>> ctr = TwoStep(kmeans).fit(X)
Methods
fit_predict
(X[, y])Perform clustering on X and returns cluster labels.
get_params
([deep])Get parameters for this estimator.
set_params
(**params)Set the parameters of this estimator.
fit
predict

fit_predict
(X, y=None)[source]¶ Perform clustering on X and returns cluster labels.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input data.
 yIgnored
Not used, present for API consistency by convention.
 Returns
 labelsndarray of shape (n_samples,), dtype=np.int64
Cluster labels.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.
divik.feature_extraction
module¶
Unsupervised feature extraction methods

class
divik.feature_extraction.
HistogramEqualization
(n_bins=256, n_jobs= 1)[source]¶ Equalize histogram of the features to increase contrast
Based on https://github.com/scikit-image/scikit-image/blob/master/skimage/exposure/exposure.py#L187-L223
 Parameters
 n_binsint, default 256
Number of bins for histogram equalization.
 n_jobsint, default 1
Number of CPU cores to use during equalization
 Attributes
 cdf_array
Values of cumulative distribution function for all the features
 bins_array
Bin centers for all the features
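How the cdf_ and bins_ attributes can drive equalization is sketched below for a single feature. This follows the classic histogram-equalization recipe (histogram, cumulative sum, interpolation); `fit_equalizer` and `equalize` are hypothetical helpers, and the handling of multiple features is not covered.

```python
from bisect import bisect_right

def fit_equalizer(values, n_bins=4):
    """Return (bin_centers, cdf) for one feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls into the last bin.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    running, cdf = 0, []
    for c in counts:
        running += c
        cdf.append(running / len(values))
    return centers, cdf

def equalize(values, centers, cdf):
    """Map values through the CDF, interpolating linearly between bin centers."""
    out = []
    for v in values:
        if v <= centers[0]:
            out.append(cdf[0])
        elif v >= centers[-1]:
            out.append(cdf[-1])
        else:
            i = bisect_right(centers, v) - 1
            t = (v - centers[i]) / (centers[i + 1] - centers[i])
            out.append(cdf[i] + t * (cdf[i + 1] - cdf[i]))
    return out

feature = [0.0, 0.1, 0.2, 0.3, 0.4, 4.0]
centers, cdf = fit_equalizer(feature)
equalized = equalize(feature, centers, cdf)
print(cdf, equalized)
```

After the mapping, values lie in the range of the CDF, so a single extreme value no longer dominates the scale of the feature.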
Methods
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
set_params
(**params)Set the parameters of this estimator.
fit
transform

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

class
divik.feature_extraction.
KneePCA
(whiten=False, refit=False)[source]¶ Principal component analysis (PCA) with knee method
PCA with automated components selection based on knee method over cumulative explained variance. Remaining components are discarded.
 Parameters
 whitenbool, optional (default False)
When True (False by default) the
pca_.components_
vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
 refitbool, optional (default False)
When True (False by default) the pca_ is refit with the smaller number of components. This could reduce memory footprint, but requires fitting the PCA a second time.
 Attributes
 pca_PCA
Fit PCA estimator.
 n_components_int
The number of selected components.
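The knee method over cumulative explained variance can be sketched as follows. The exact knee criterion used by KneePCA is not stated here, so this sketch assumes a common one: the point of the curve that lies farthest from the straight line joining its ends. The variance values are made up.

```python
def cumulative_ratio(variances):
    """Cumulative share of the total variance explained by the leading components."""
    total = sum(variances)
    running, out = 0.0, []
    for v in variances:
        running += v
        out.append(running / total)
    return out

def knee_index(curve):
    """Index of the point farthest from the chord joining the curve's end points."""
    n = len(curve)
    x0, y0, x1, y1 = 0, curve[0], n - 1, curve[-1]

    def distance(i):
        # Unnormalised perpendicular distance from (i, curve[i]) to the chord.
        return abs((y1 - y0) * i - (x1 - x0) * curve[i] + x1 * y0 - y1 * x0)

    return max(range(n), key=distance)

explained_variance = [50.0, 30.0, 15.0, 3.0, 1.0, 1.0]  # hypothetical values
curve = cumulative_ratio(explained_variance)
n_components = knee_index(curve) + 1  # index is zero-based
print(n_components)
```

Components past the knee add little explained variance, which is the rationale for discarding them.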
Methods
fit
(X[, y])Fit the model from data in X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
inverse_transform
(X)Transform data back to its original space.
set_params
(**params)Set the parameters of this estimator.
transform
(X[, y])Apply dimensionality reduction to X.

fit
(X, y=None)[source]¶ Fit the model from data in X.
 Parameters
 Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
 Y: Ignored.
 Returns
 selfobject
Returns the instance itself.

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

inverse_transform
(X)[source]¶ Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
 Parameters
 Xarray-like, shape (n_samples, n_components)
New data, where n_samples is the number of samples and n_components is the number of components.
 Returns
 X_original array-like, shape (n_samples, n_features)
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes reversing whitening.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

transform
(X, y=None)[source]¶ Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
 Parameters
 Xarray-like, shape (n_samples, n_features)
New data, where n_samples is the number of samples and n_features is the number of features.
 Returns
 X_newarray-like, shape (n_samples, n_components)
Examples
>>> import numpy as np
>>> from divik.feature_extraction import KneePCA
>>> X = np.array([[1, 1], [2, 1], [3, 2], [1, 1], [2, 1], [3, 2]])
>>> pca = KneePCA(refit=True)
>>> pca.fit(X)
KneePCA(refit=True)
>>> pca.transform(X)

class
divik.feature_extraction.
LocallyAdjustedRbfSpectralEmbedding
(distance='euclidean', n_components=2, random_state=None, eigen_solver=None, n_neighbors=None, n_jobs=1)[source]¶ Spectral embedding for nonlinear dimensionality reduction.
Forms an affinity matrix given by the specified function and applies spectral decomposition to the corresponding graph laplacian. The resulting transformation is given by the value of the eigenvectors for each data point.
Note : Laplacian Eigenmaps is the actual algorithm implemented here.
 Parameters
 distance{‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’}
Distance measure, defaults to euclidean. These are the distances supported by the scipy package.
 n_componentsinteger, default: 2
The dimension of the projected subspace.
 random_stateint, RandomState instance or None, optional, default: None
A pseudo random number generator used for the initialization of the lobpcg eigenvectors. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when solver == ‘amg’.
 eigen_solver{None, ‘arpack’, ‘lobpcg’, or ‘amg’}
The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities.
 n_neighborsint, default
Number of nearest neighbors for nearest_neighbors graph building.
 n_jobsint, optional (default = 1)
The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.
References
A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
On Spectral Clustering: Analysis and an algorithm, 2001 Andrew Y. Ng, Michael I. Jordan, Yair Weiss http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8100
Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
 Attributes
 embedding_array, shape = (n_samples, n_components)
Spectral embedding of the training matrix.
Methods
fit
(X[, y])Fit the model from data in X.
fit_transform
(X[, y])Fit the model from data in X and transform X.
get_params
([deep])Get parameters for this estimator.
save
(destination)Save embedding to a directory
set_params
(**params)Set the parameters of this estimator.
transform

fit
(X, y=None)[source]¶ Fit the model from data in X.
 Parameters
 Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
 Y: Ignored.
 Returns
 selfobject
Returns the instance itself.

fit_transform
(X, y=None)[source]¶ Fit the model from data in X and transform X.
 Parameters
 Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
 Y: Ignored.
 Returns
 X_newarray-like, shape (n_samples, n_components)

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

save
(destination)[source]¶ Save embedding to a directory
 Parameters
 destinationstr
Directory to save the embedding.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.
divik.feature_selection
module¶
Unsupervised feature selection methods

class
divik.feature_selection.
EximsSelector
[source]¶ Select features based on their spatial distribution
Preserves features that yield biologically plausible structures.
References
Wijetunge, Chalini D., et al. “EXIMS: an improved data analysis pipeline based on a new peak picking method for EXploring Imaging Mass Spectrometry data.” Bioinformatics 31.19 (2015): 3198-3206. https://academic.oup.com/bioinformatics/article/31/19/3198/212150
Methods
fit
(X[, y, xy])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.

fit
(X, y=None, xy=None)[source]¶ Learn data-driven feature thresholds from X.
 Parameters
 X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
 yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 xyarray-like, shape (n_samples, 2)
Spatial coordinates of the samples. Expects integers, indices over an image.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
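The two return forms of get_support can be compared with a short plain-Python sketch. The `get_support` function below is a hypothetical stand-in operating on a list of flags, not the selector method itself.

```python
def get_support(selected, indices=False):
    """Return a boolean mask, or the integer positions of the retained features."""
    if indices:
        return [i for i, keep in enumerate(selected) if keep]
    return list(selected)

selected = [True, False, True, True, False]  # one flag per input feature
mask = get_support(selected)
positions = get_support(selected, indices=True)

row = [10, 20, 30, 40, 50]
reduced = [row[i] for i in positions]
print(mask, positions, reduced)
```

The mask has one entry per input feature, while the index form has one entry per retained feature; both select the same columns.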

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 Xarray of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.
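The zero-filling behaviour described above can be sketched in plain Python. `inverse_transform` here is a hypothetical helper that takes the selection mask explicitly.

```python
def inverse_transform(rows, selected):
    """Put kept columns back in their original positions; removed columns become 0."""
    restored = []
    for row in rows:
        kept = iter(row)
        restored.append([next(kept) if keep else 0 for keep in selected])
    return restored

selected = [True, False, True, True, False]  # mask over the original features
X_reduced = [[10, 30, 40], [11, 31, 41]]
X_r = inverse_transform(X_reduced, selected)
print(X_r)
```

The original values of the removed features are not recovered; only the shape of the input is restored.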

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 Xarray of shape [n_samples, n_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.


class
divik.feature_selection.
GMMSelector
(stat, use_log=False, n_candidates=None, min_features=1, min_features_rate=0.0, preserve_high=True, max_components=10)[source]¶ Feature selector that removes low or high mean or variance features
Gaussian Mixture Modeling is applied to the features’ characteristics and components are obtained. Crossing points of the components are considered candidate thresholds. Out of these up to n_candidates components are removed in such a way that at least min_features or min_features_rate features are retained.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
 Parameters
 stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
 use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that, as filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
 n_candidates: int, optional, default: None
How many candidate thresholds to use at most. 0 preserves all the features (all candidate thresholds are discarded), None allows to remove all but one component (all candidate thresholds are retained). A negative value means to discard up to all but n_candidates candidates, e.g. -1 will retain at least two components (one candidate threshold is removed).
 min_features: int, optional, default: 1
How many features must be preserved. Candidate thresholds are tested against this value, and if they retain fewer features, a less conservative threshold is selected.
 min_features_rate: float, optional, default: 0.0
Similar to min_features but relative to the input data features number.
 preserve_high: bool, optional, default: True
Whether to preserve the high-characteristic features or low-characteristic ones.
 max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
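How min_features and min_features_rate can constrain the threshold choice is sketched below. This is an illustration under stated assumptions: candidates are tried from the strictest down, features with a value at or above the threshold are kept (as with preserve_high=True), and the two minima are combined by taking the larger. `pick_threshold` and all values are hypothetical.

```python
import math

def pick_threshold(values, candidates, min_features=1, min_features_rate=0.0):
    """Pick the strictest candidate threshold that still keeps enough features."""
    # Assumption: the two constraints are combined by taking the larger one.
    required = max(min_features, math.ceil(min_features_rate * len(values)))
    for threshold in sorted(candidates, reverse=True):
        kept = sum(v >= threshold for v in values)
        if kept >= required:
            return threshold
    return None  # no candidate satisfies the constraint, so nothing is filtered

feature_means = [0.1, 0.2, 0.3, 5.0, 5.1, 9.8, 9.9, 10.0]
candidates = [2.5, 7.5]  # hypothetical crossing points of the GMM components
strict = pick_threshold(feature_means, candidates)
relaxed = pick_threshold(feature_means, candidates, min_features_rate=0.5)
print(strict, relaxed)
```

With the default settings the strictest threshold is accepted; asking for at least half of the features forces a fallback to the lower one.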
Examples
>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = labels * 5 + np.random.randn(*labels.shape)
>>> fs.GMMSelector('mean').fit_transform(data)
array([[14.78032811 15.35711257 ... 15.75193303]])
>>> fs.GMMSelector('mean', preserve_high=False).fit_transform(data)
array([[ 0.49671415 0.1382643 ... 0.29169375]])
>>> fs.GMMSelector('mean', n_discard=1).fit_transform(data)
array([[10.32408397 9.61491772 ... 15.75193303]])
 Attributes
 vals_: array, shape (n_features,)
Computed characteristic of each feature.
 threshold_: float
Threshold value to filter the features by the characteristic.
 raw_threshold_: float
Threshold value mapped back to characteristic space (no logarithm, etc.)
 selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.

fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
 Parameters
 X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
 yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 Xarray of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 Xarray of shape [n_samples, n_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.

class
divik.feature_selection.
HighAbundanceAndVarianceSelector
(use_log=False, min_features=1, min_features_rate=0.0, max_components=10)[source]¶ Feature selector that removes low-mean and low-variance features
Uses GMMSelector to filter out the low-abundance noise features and select high-variance informative features.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
 Parameters
 use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that, as filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
 min_features: int, optional, default: 1
How many features must be preserved.
 min_features_rate: float, optional, default: 0.0
Similar to min_features but relative to the input data features number.
 max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
Examples
>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> # Data in this case must be carefully crafted
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = np.vstack(100 * [labels * 10.])
>>> data += np.random.randn(*data.shape)
>>> sub = data[:, :-40]
>>> sub += 5 * np.random.randn(*sub.shape)
>>> # Label 0 has low abundance but high variance
>>> # Label 3 has low variance but high abundance
>>> # Labels 1 and 2 have not-lowest abundance and high variance
>>> selector = fs.HighAbundanceAndVarianceSelector().fit(data)
>>> selector.transform(labels.reshape(1, -1))
array([[1 1 1 1 1 ...2 2 2]])
 Attributes
 abundance_selector_: GMMSelector
Selector used to filter out the noise component.
 variance_selector_: GMMSelector
Selector used to filter out the non-informative features.
 selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.

fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
 Parameters
 X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
 yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 Xarray of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 Xarray of shape [n_samples, n_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.

class
divik.feature_selection.
NoSelector
[source]¶ Dummy selector to use when no selection is supposed to be made.
Methods
fit
(X[, y])Pass data forward
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.

fit
(X, y=None)[source]¶ Pass data forward
 Parameters
 X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors to pass.
 yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 Xarray of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 Xarray of shape [n_samples, n_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.


class
divik.feature_selection.
OutlierAbundanceAndVarianceSelector
(use_log=False, min_features_rate=0.01, p=0.2)[source]¶
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.

fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
 Parameters
 X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
 yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 Xarray-like of shape (n_samples, n_features)
Input samples.
 yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_paramsdict
Additional fit parameters.
 Returns
 X_newndarray array of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 paramsdict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 Xarray of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **paramsdict
Estimator parameters.
 Returns
 selfestimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 Xarray of shape [n_samples, n_features]
The input samples.
 Returns
 X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.


class
divik.feature_selection.
OutlierSelector
(stat, use_log=False, keep_outliers=False)[source]¶ Feature selector that removes outlier features w.r.t. mean or variance
Hubert’s outlier detection is applied to the features’ characteristics and the outlying features are removed.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
 Parameters
 stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
 use_log: bool, optional, default: False
Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features. However, all the characteristics (mean, variance) have to be positive for that; filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
 keep_outliers: bool, optional, default: False
When True, keeps outliers instead of inlier features.
 Attributes
 vals_: array, shape (n_features,)
Computed characteristic of each feature.
 selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods
fit(X[, y]): Learn data-driven feature thresholds from X.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.

fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
 Parameters
 X: {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
 y: any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indices: bool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 support: array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 X: array of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 X: array of shape [n_samples, n_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_selected_features]
The input samples with only the selected features.

class
divik.feature_selection.
PercentageSelector
(stat, use_log=False, keep_top=True, p=0.2)[source]¶ Feature selector that removes or preserves a given top percentage of features
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
 Parameters
 stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
 use_log: bool, optional, default: False
Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features. However, all the characteristics (mean, variance) have to be positive for that; filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
 keep_top: bool, optional, default: True
When True, keeps features with highest value of the characteristic.
 p: float, optional, default: 0.2
Rate of features to keep.
 Attributes
 vals_: array, shape (n_features,)
Computed characteristic of each feature.
 threshold_: float
Value of the threshold used for filtering
 selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
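The idea behind the parameters above can be sketched in plain Python: compute one characteristic per feature, optionally take its logarithm, and keep the fraction p of features with the highest (or lowest) values. This is an illustration under assumptions, not the library's implementation: the function name select_top_fraction is hypothetical, ties and rounding of p may be handled differently in divik, and threshold_ is not computed here.

```python
import math
from statistics import mean, pvariance


def select_top_fraction(X, stat="mean", use_log=False, keep_top=True, p=0.2):
    """Return (vals, selected) for a row-major list of samples X."""
    columns = list(zip(*X))
    compute = mean if stat == "mean" else pvariance
    vals = [compute(col) for col in columns]
    if use_log:
        # math.log raises ValueError for non-positive characteristics.
        vals = [math.log(v) for v in vals]
    n_keep = max(1, round(p * len(vals)))
    # Rank features by characteristic; highest first when keep_top is True.
    order = sorted(range(len(vals)), key=vals.__getitem__, reverse=keep_top)
    kept = set(order[:n_keep])
    selected = [i in kept for i in range(len(vals))]
    return vals, selected


X = [
    [1.0, 10.0, 100.0, 2.0, 5.0],
    [3.0, 30.0, 300.0, 4.0, 5.0],
]
vals, selected = select_top_fraction(X, stat="mean", p=0.4)
_, selected_low = select_top_fraction(X, stat="mean", keep_top=False, p=0.4)
```

With five features and p=0.4, two features are kept: the two with the highest means by default, or the two with the lowest means when keep_top is False.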
Methods
fit(X[, y]): Learn data-driven feature thresholds from X.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.

fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
 Parameters
 X: {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
 y: any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
 Returns
 self

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indices: bool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 support: array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 X: array of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

transform
(X)¶ Reduce X to the selected features.
 Parameters
 X: array of shape [n_samples, n_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_selected_features]
The input samples with only the selected features.

class
divik.feature_selection.
SelectorMixin
[source]¶ Transformer mixin that performs feature selection given a support mask
This mixin provides a feature selector implementation with transform and inverse_transform functionality given an implementation of _get_support_mask.
Methods
fit_transform(X[, y]): Fit to data, then transform it.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
transform(X): Reduce X to the selected features.

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_support
(indices=False)[source]¶ Get a mask, or integer index, of the features selected
 Parameters
 indices: bool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 support: array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)[source]¶ Reverse the transformation operation
 Parameters
 X: array of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().


class
divik.feature_selection.
StatSelectorMixin
[source]¶ Transformer mixin that performs feature selection given a support mask
This mixin provides a feature selector implementation with transform and inverse_transform functionality given that selected_ is specified during fit.
Additionally, provides _to_characteristics and _to_raw implementations given stat, optionally use_log and preserve_high.
Methods
fit_transform(X[, y]): Fit to data, then transform it.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
transform(X): Reduce X to the selected features.

fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
 Parameters
 X: array-like of shape (n_samples, n_features)
Input samples.
 y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
 **fit_params: dict
Additional fit parameters.
 Returns
 X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.

get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
 Parameters
 indices: bool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
 Returns
 support: array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform
(X)¶ Reverse the transformation operation
 Parameters
 X: array of shape [n_samples, n_selected_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().

transform
(X)¶ Reduce X to the selected features.
 Parameters
 X: array of shape [n_samples, n_features]
The input samples.
 Returns
 X_r: array of shape [n_samples, n_selected_features]
The input samples with only the selected features.


divik.feature_selection.
huberta_outliers
(v)[source]¶ Outlier detection method based on medcouple statistic.
 Parameters
 v: array-like
An array to filter outliers from.
 Returns
 Binary vector indicating all the outliers.
References
M. Hubert, E. Vandervieren (2008) An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52 (2008) 5186–5201
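The adjusted boxplot rule from the referenced paper can be sketched with the standard library alone. The sketch below uses a naive quadratic-time medcouple and the fence constants published by Hubert and Vandervieren; whether huberta_outliers uses exactly these constants, the same quantile definition, or the same tie handling is an assumption, and the function names are hypothetical.

```python
import math
from statistics import median, quantiles


def medcouple(values):
    """Naive O(n^2) medcouple; pairs where both points equal the median are skipped."""
    m = median(values)
    lower = [x for x in values if x <= m]
    upper = [x for x in values if x >= m]
    kernel = [
        ((xj - m) - (m - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    return median(kernel)


def adjusted_boxplot_outliers(values):
    """Flag points outside the skewness-adjusted boxplot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(values)
    if mc >= 0:
        low = q1 - 1.5 * math.exp(-4 * mc) * iqr
        high = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:
        low = q1 - 1.5 * math.exp(-3 * mc) * iqr
        high = q3 + 1.5 * math.exp(4 * mc) * iqr
    return [x < low or x > high for x in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
mask = adjusted_boxplot_outliers(data)
```

When the medcouple is zero, as for this data, the fences reduce to the classic 1.5 * IQR boxplot rule; a positive medcouple (right skew) widens the upper fence and narrows the lower one.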
divik.sampler
module¶
Sampling methods for statistical indices computation purposes

class
divik.sampler.
BaseSampler
[source]¶ Base class for all the samplers
Sampler is Pool-safe, i.e. it can simply store a dataset. If handled properly, the dataset will not be serialized by pickle when going to another process.
Before you spawn a pool, the data must be moved to a module-level variable. To simplify that process, a contract has been prepared: you open a context and operate within it:
>>> with sampler.parallel() as sampler_, \
...         Pool(initializer=sampler_.initializer,
...              initargs=sampler_.initargs) as pool:
...     pool.map(sampler_.get_sample, range(10))
Keep in mind that __iter__ and fit are not accessible in the parallel context. __iter__ would yield the same values independently in all the workers, so iteration now needs to be done consciously and in a well-thought-out manner. fit could lead to unpredictable behaviour. If you need the original sampler, you can get a clone (not fit to the data).
Methods
fit(X[, y]): Fit sampler to data
get_params([deep]): Get parameters for this estimator.
get_sample(seed): Return specific sample
parallel(): Create parallel context for the sampler to operate
set_params(**params): Set the parameters of this estimator.

fit
(X, y=None)[source]¶ Fit sampler to data
It’s a base for both supervised and unsupervised samplers.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

abstract
get_sample
(seed)[source]¶ Return specific sample
The following assumptions should be met:
a) sampler.get_sample(x) == sampler.get_sample(x)
b) x != y should yield sampler.get_sample(x) != sampler.get_sample(y)
 Parameters
 seed: int
The seed to use to draw the sample
 Returns
 sample: array_like, (*self.shape_)
Returns the drawn sample
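A minimal sketch of a sampler that satisfies both assumptions is given below. ToyUniformSampler is a hypothetical class written for this illustration and does not derive from BaseSampler; the point is only that creating a private random generator from the seed makes get_sample a pure function of that seed.

```python
import random


class ToyUniformSampler:
    """Hypothetical sampler; not part of divik."""

    def __init__(self, n_rows=3, n_cols=2):
        self.shape_ = (n_rows, n_cols)

    def get_sample(self, seed):
        # A private generator per call keeps the result a pure function of seed.
        rng = random.Random(seed)
        n_rows, n_cols = self.shape_
        return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


sampler = ToyUniformSampler()
same = sampler.get_sample(7) == sampler.get_sample(7)
different = sampler.get_sample(7) != sampler.get_sample(8)
```

Because no state is shared between calls, such a get_sample can be mapped over a range of seeds in separate worker processes without the workers influencing each other.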

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.


class
divik.sampler.
ParallelSampler
(sampler)[source]¶ Helper class for sharing the sampler functionality
 Attributes
 initargs
Methods
clone(): Clones the original sampler
get_sample(seed): Return specific sample
initializer

property
initargs
¶

class
divik.sampler.
StratifiedSampler
(n_rows=100, n_samples=None)[source]¶ Sample the original data preserving proportions of groups
 Parameters
 n_rows: int or float, optional (default 100)
Limits the number of rows in the drawn samples. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the sample. If int, it represents the absolute number of rows.
 n_samples: int, optional (default None)
Limits the number of samples when iterating.
 Attributes
 X_: array_like, shape (n_rows, n_features)
Data to sample from
 y_: array_like, shape (n_rows,)
Group labels
Methods
fit(X, y): Fit the model from data in X.
get_params([deep]): Get parameters for this estimator.
get_sample(seed): Return specific sample
parallel(): Create parallel context for the sampler to operate
set_params(**params): Set the parameters of this estimator.

fit
(X, y)[source]¶ Fit the model from data in X.
Both inputs are preserved inside to sample from the data.
 Parameters
 X: array-like, shape (n_rows, n_features)
Training vector, where n_rows is the number of rows and n_features is the number of features.
 y: array-like, shape (n_rows,)
 Returns
 self: StratifiedSampler
Returns the instance itself.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

get_sample
(seed)[source]¶ Return specific sample
Sample is drawn from the set of existing rows. The proportions of groups should be more or less the same as in the original data, depending on the size of the sample.
 Parameters
 seed: int
The seed to use to draw the sample
 Returns
 sample: array_like, (*self.shape_)
Returns the drawn sample
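Proportion-preserving sampling can be sketched as follows. This is an illustration of the general technique with a hypothetical function stratified_sample, not the library's code; in particular, the per-group rounding used here is an assumption and means the sample size is only approximately n_rows.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(X, y, n_rows, seed):
    """Draw about n_rows rows of X, keeping group proportions from y."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(y):
        groups[label].append(index)
    chosen = []
    for label in sorted(groups):
        members = groups[label]
        # Each group contributes in proportion to its share of the data.
        quota = round(n_rows * len(members) / len(y))
        chosen.extend(rng.sample(members, quota))
    chosen.sort()
    return [X[i] for i in chosen], [y[i] for i in chosen]


X = [[i] for i in range(20)]
y = ["a"] * 15 + ["b"] * 5
X_s, y_s = stratified_sample(X, y, n_rows=8, seed=0)
counts = Counter(y_s)
```

Here the groups make up 75% and 25% of the data, so a sample of eight rows contains six rows of group "a" and two rows of group "b".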

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

class
divik.sampler.
UniformPCASampler
(n_rows=None, n_samples=None, whiten=False, refit=False, pca='knee')[source]¶ Rotation-invariant uniform sampling
 Parameters
 n_rows: int, optional (default None)
Limits the number of rows in the drawn samples.
 n_samples: int, optional (default None)
Limits the number of samples when iterating.
 whiten: bool, optional (default False)
When True, the pca_.components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
 refit: bool, optional (default False)
When True, the pca_ is refit with the smaller number of components. This could reduce memory footprint, but requires fitting PCA again.
 pca: {‘knee’, ‘full’}, default ‘knee’
Specifies whether to train full or knee PCA.
 Attributes
 pca_: KneePCA or PCA
PCA transform which provides rotation invariance
 sampler_: UniformSampler
Sampler from the transformed distribution
Methods
fit(X[, y]): Fit the model from data in X.
get_params([deep]): Get parameters for this estimator.
get_sample(seed): Return specific sample
parallel(): Create parallel context for the sampler to operate
set_params(**params): Set the parameters of this estimator.

fit
(X, y=None)[source]¶ Fit the model from data in X.
PCA is fit to estimate the rotation and UniformSampler is fit to transformed data.
 Parameters
 X: array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
 y: Ignored.
 Returns
 self: UniformPCASampler
Returns the instance itself.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

get_sample
(seed)[source]¶ Return specific sample
Sample is generated from transformed distribution and transformed back to the original space.
 Parameters
 seed: int
The seed to use to draw the sample
 Returns
 sample: array_like, (*self.shape_)
Returns the drawn sample

parallel
()¶ Create parallel context for the sampler to operate

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.

class
divik.sampler.
UniformSampler
(n_rows=None, n_samples=None)[source]¶ Samples uniformly within the boundaries of the data
 Parameters
 n_rows: int, optional (default None)
Limits the number of rows in the drawn samples.
 n_samples: int, optional (default None)
Limits the number of samples when iterating.
 Attributes
 shape_: (n_rows, n_cols)
Shape of the drawn samples
 scaler_: MinMaxScaler
Scaler ensuring the proper ranges
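Sampling uniformly within the data boundaries can be sketched without any dependencies: record the minimum and maximum of every feature, then draw each coordinate uniformly from that range. The helper names fit_bounds and uniform_sample are hypothetical; the library itself relies on a MinMaxScaler for the ranges, which this sketch replaces with explicit (min, max) pairs.

```python
import random


def fit_bounds(X):
    """Return per-feature (min, max) pairs, i.e. the data boundaries."""
    return [(min(col), max(col)) for col in zip(*X)]


def uniform_sample(bounds, n_rows, seed):
    """Draw n_rows points uniformly inside the fitted bounding box."""
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_rows)]


X = [[0.0, 10.0], [2.0, 30.0], [1.0, 20.0]]
bounds = fit_bounds(X)
sample = uniform_sample(bounds, n_rows=5, seed=42)
```

The drawn sample has as many columns as the original data and the requested number of rows, which corresponds to the shape_ attribute described above.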
Methods
fit(X[, y]): Fit the model from data in X.
get_params([deep]): Get parameters for this estimator.
get_sample(seed): Return specific sample
parallel(): Create parallel context for the sampler to operate
set_params(**params): Set the parameters of this estimator.

fit
(X, y=None)[source]¶ Fit the model from data in X.
 Parameters
 X: array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
 y: Ignored.
 Returns
 self: UniformSampler
Returns the instance itself.

get_params
(deep=True)¶ Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: dict
Parameter names mapped to their values.

get_sample
(seed)[source]¶ Return specific sample
 Parameters
 seed: int
The seed to use to draw the sample
 Returns
 sample: array_like, (*self.shape_)
Returns the drawn sample

parallel
()¶ Create parallel context for the sampler to operate

set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: estimator instance
Estimator instance.
Utility Packages¶
divik
package¶
Unsupervised high-throughput data analysis methods

divik.
reject_split
(tree, rejection_size=0)[source]¶ Reapply rejection condition on known result tree.
 Return type
Optional[DivikResult]
Modules
divik.cluster: Clustering methods
divik.core: Reusable utilities used for building divik library
divik.feature_extraction: Unsupervised feature extraction methods
divik.feature_selection: Unsupervised feature selection methods
divik.sampler: Sampling methods for statistical indices computation purposes


divik.core
module¶
Reusable utilities used for building divik library

divik.core.
Centroids
¶ alias of
numpy.ndarray

divik.core.
Data
¶ alias of
numpy.ndarray

class
divik.core.
DivikResult
(clustering: Union[divik.cluster.GAPSearch, divik.cluster.DunnSearch], feature_selector: divik.feature_selection.StatSelectorMixin, merged: numpy.ndarray, subregions: List[Optional[DivikResult]])[source]¶ Result of DiviK clustering
 Attributes
clustering
Alias for field number 0
feature_selector
Alias for field number 1
merged
Alias for field number 2
subregions
Alias for field number 3
Methods
count(value, /): Return number of occurrences of value.
index(value[, start, stop]): Return first index of value.

property
clustering
¶ Fitted automated clustering estimator

count
(value, /)¶ Return number of occurrences of value.

property
feature_selector
¶ Fitted feature selector

index
(value, start=0, stop=sys.maxsize, /)¶ Return first index of value.
Raises ValueError if the value is not present.

property
merged
¶ Recursively merged clustering labels

property
subregions
¶ DivikResults for all obtained subregions
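Since the result is a named tuple whose subregions field holds further results, it can be walked recursively. The sketch below uses a hypothetical stand-in, ToyResult, that only mirrors the four field names listed above; it assumes that a None entry in subregions marks a region that was not split further, which the signature suggests but does not state.

```python
from typing import List, NamedTuple, Optional


class ToyResult(NamedTuple):
    """Hypothetical stand-in mirroring the DivikResult field names."""
    clustering: object
    feature_selector: object
    merged: list
    subregions: List[Optional["ToyResult"]]


def depth(tree: Optional[ToyResult]) -> int:
    """Number of nested split levels; None is assumed to mark an unsplit region."""
    if tree is None:
        return 0
    return 1 + max((depth(sub) for sub in tree.subregions), default=0)


leaf = ToyResult(None, None, [0, 1], [None, None])
root = ToyResult(None, None, [0, 1, 2], [leaf, None])
levels = depth(root)
```

The count and index methods listed above are the ordinary tuple methods: on this toy root, root.index(root.merged) gives the field position of merged.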

divik.core.
IntLabels
¶ alias of
numpy.ndarray

class
divik.core.
Subsets
(n_splits=10, random_state=42)[source]¶ Scatter dataset to disjoint random subsets and combine them back
 Parameters
 n_splits: int, default 10
Number of subsets that will be generated.
 random_state: int, default 42
Random state to use for seeding the random number generator.
Examples
>>> from divik.core import Subsets
>>> subsets = Subsets(n_splits=10, random_state=42)
>>> X_list = subsets.scatter(X)
>>> len(X_list)
10
>>> # do some computations on each subset
>>> y = subsets.combine(y_list)
Methods
combine
scatter
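The scatter and combine round trip can be illustrated with a small stand-in class. ToySubsets is hypothetical and written only for this sketch; how the library assigns rows to subsets is not documented here, so the shuffled round-robin assignment and the assignment_ attribute are assumptions.

```python
import random


class ToySubsets:
    """Hypothetical stand-in for the scatter/combine idea; not divik code."""

    def __init__(self, n_splits=10, random_state=42):
        self.n_splits = n_splits
        self.random_state = random_state

    def scatter(self, X):
        # Shuffle row indices once, then deal them round-robin into subsets.
        order = list(range(len(X)))
        random.Random(self.random_state).shuffle(order)
        self.assignment_ = [order[i::self.n_splits] for i in range(self.n_splits)]
        return [[X[i] for i in part] for part in self.assignment_]

    def combine(self, y_list):
        # Put per-subset results back at the original row positions.
        n_rows = sum(len(part) for part in self.assignment_)
        y = [None] * n_rows
        for part, values in zip(self.assignment_, y_list):
            for index, value in zip(part, values):
                y[index] = value
        return y


X = list(range(25))
subsets = ToySubsets(n_splits=10, random_state=42)
X_list = subsets.scatter(X)
y = subsets.combine([[x * 2 for x in part] for part in X_list])
```

The subsets are disjoint and together cover every row, so combining per-subset results yields one value per original row, in the original order.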

divik.core.
cached_fit
(cls)[source]¶ Decorate a sklearn-compatible estimator to cache the fitting result
It is a wrapper over joblib.Memory.cache that supports runtime cache path definition.
Set the path definition through gin config with the cache_path.path identifier.

divik.core.
configurable
(name_or_fn=None, module=None, allowlist=None, denylist=None, whitelist=None, blacklist=None)[source]¶ Decorator to make a function or class configurable.
This decorator registers the decorated function/class as configurable, which allows its parameters to be supplied from the global configuration (i.e., set through bind_parameter or parse_config). The decorated function is associated with a name in the global configuration, which by default is simply the name of the function or class, but can be specified explicitly to avoid naming collisions or improve clarity.
If some parameters should not be configurable, they can be specified in denylist. If only a restricted set of parameters should be configurable, they can be specified in allowlist.
The decorator can be used without any parameters as follows:
    @config.configurable
    def some_configurable_function(param1, param2='a default value'):
        ...
In this case, the function is associated with the name ‘some_configurable_function’ in the global configuration, and both param1 and param2 are configurable.
The decorator can be supplied with parameters to specify the configurable name or supply an allowlist/denylist:
    @config.configurable('explicit_configurable_name', allowlist='param2')
    def some_configurable_function(param1, param2='a default value'):
        ...
In this case, the configurable is associated with the name ‘explicit_configurable_name’ in the global configuration, and only param2 is configurable.
Classes can be decorated as well, in which case parameters of their constructors are made configurable:
    @config.configurable
    class SomeClass:
        def __init__(self, param1, param2='a default value'):
            ...
In this case, the name of the configurable is ‘SomeClass’, and both param1 and param2 are configurable.
 Args:
 name_or_fn: A name for this configurable, or a function to decorate (in which case the name will be taken from that function). If not set, defaults to the name of the function/class that is being made configurable. If a name is provided, it may also include module components to be used for disambiguation (these will be appended to any components explicitly specified by module).
 module: The module to associate with the configurable, to help handle naming collisions. By default, the module of the function or class being made configurable will be used (if no module is specified as part of the name).
 allowlist: An allowlisted set of kwargs that should be configurable. All other kwargs will not be configurable. Only one of allowlist or denylist should be specified.
 denylist: A denylisted set of kwargs that should not be configurable. All other kwargs will be configurable. Only one of allowlist or denylist should be specified.
 whitelist: Deprecated version of allowlist for backwards compatibility.
 blacklist: Deprecated version of denylist for backwards compatibility.
 Returns:
When used with no parameters (or with a function/class supplied as the first parameter), it returns the decorated function or class. When used with parameters, it returns a function that can be applied to decorate the target function or class.

divik.core.
context_if
(condition, context, *args, **kwargs)[source]¶ Create context with given params only if the condition is True
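The behaviour described for context_if can be approximated with contextlib. This is a sketch of an equivalent helper under assumptions, not the library's source: in particular, yielding None when the condition is false is a guess, and the recording context manager exists only to make the effect observable.

```python
from contextlib import contextmanager


@contextmanager
def context_if(condition, context, *args, **kwargs):
    """Enter context(*args, **kwargs) only when condition is truthy."""
    if condition:
        with context(*args, **kwargs) as value:
            yield value
    else:
        # Assumption: without the context, the managed value is None.
        yield None


events = []


@contextmanager
def recording(name):
    """Toy context manager that logs when it is entered and exited."""
    events.append(f"enter {name}")
    yield name
    events.append(f"exit {name}")


with context_if(True, recording, "a") as first:
    pass
with context_if(False, recording, "b") as second:
    pass
```

Only the first block enters the wrapped context, so only "a" is recorded; the body of the second block still runs, just without the context around it.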

divik.core.
dump_gin_args
(destination)[source]¶ Dump gin-config effective configuration
If you have gin extras installed, you can call dump_gin_args to save the effective gin configuration to a file.

divik.core.
maybe_pool
(processes=None, *args, **kwargs)[source]¶ Create multiprocessing.Pool if multiple CPUs are allowed
Examples
>>> from divik.core import maybe_pool
>>> with maybe_pool(processes=1) as pool:
...     # Runs sequentially
...     pool.map(id, range(10000))
>>> with maybe_pool(processes=-1) as pool:
...     # Runs with all cores
...     pool.map(id, range(10000))

divik.core.
normalize_rows
(data)[source]¶ Translate and scale rows to zero mean and vector length equal one
 Return type
ndarray
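The described normalization can be written out for plain lists of floats. The sketch below assumes that each row is first centered at zero mean and then divided by its Euclidean norm, so that both properties hold at once; the function name normalize_rows_sketch is hypothetical, and the library version operates on numpy arrays.

```python
import math


def normalize_rows_sketch(data):
    """Center each row at zero mean and scale it to unit Euclidean length."""
    result = []
    for row in data:
        mean = sum(row) / len(row)
        centered = [value - mean for value in row]
        norm = math.sqrt(sum(value * value for value in centered))
        # A constant row has zero norm after centering, so this division would fail.
        result.append([value / norm for value in centered])
    return result


rows = normalize_rows_sketch([[1.0, 2.0, 3.0], [10.0, 0.0, 20.0]])
```

After this transformation, the dot product of two rows equals their Pearson correlation, which is a common reason for normalizing rows this way before correlation-based clustering.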

divik.core.
seeded
(wrapped_requires_seed=False)[source]¶ Create seeded scope for function call.
 Parameters
 wrapped_requires_seed: bool, optional, default: False
If True, passes the seed parameter to the inner function.
Share a numpy array between
multiprocessing.Pool
processes

divik.core.
visualize
(label, xy, shape=None)[source]¶ Create an RGB map of labels at the given coordinates
Modules
divik.core.gin_sklearn_configurables: Mark scikit-learn classes as configurable
divik.core.io: Reusable utilities for data and model I/O
divik.core.io
module¶
Reusable utilities for data and model I/O

divik.core.io.
save
(model, destination, **kwargs)[source]¶ Save model and related summaries into specified destination directory

divik.core.io.
saver
(fn)[source]¶ Register the function as handler for saving model and related summaries
The saver function should be reusable for different models exhibiting the required variables. Prefer checking for the required attributes rather than checking the model class.
Examples
>>> from divik.core.io import saver
>>> @saver
... def my_saver(model, destination, **kwargs):
...     if not hasattr(model, 'my_custom_field_'):
...         return
...     if 'my_param' not in kwargs:
...         return
...     # custom saving logic comes here
You can also make this function configurable:
>>> import gin
>>> from divik.core.io import saver
>>> @saver
... @gin.configurable(allowlist=['my_param'])
... def configurable_saver(model, destination, my_param=None, **kwargs):
...     if not hasattr(model, 'my_custom_field_'):
...         return
...     if my_param is None:
...         return
...     # custom saving logic comes here
divik.core.gin_sklearn_configurables
module¶
Mark scikit-learn classes as configurable