Welcome to divik’s documentation!
Here you can find a list of documentation topics covered by this page.
Cluster analysis with fit-clusters executable
Note
fit-clusters requires installation with gin extras, e.g.
pip install divik[gin]
fit-clusters is a single CLI executable that lets you run the DiviK algorithm, any other clustering algorithm supported by scikit-learn, or even a pipeline with pre-processing.
Usage
CLI interface
There are two types of parameters:
--param - sets the value of a parameter when launching the fit-clusters executable, i.e. it overwrites a parameter provided in a config file or a default.
--config - provides a list of config files. Their content is treated as one big (ordered) list of settings. In case of conflict, a later file overwrites a setting provided by an earlier one.
These go directly to the CLI.
usage: fit-clusters [-h] [--param [PARAM [PARAM ...]]]
[--config [CONFIG [CONFIG ...]]]
optional arguments:
-h, --help show this help message and exit
--param [PARAM [PARAM ...]]
List of Gin parameter bindings
--config [CONFIG [CONFIG ...]]
List of paths to the config files
Sample fit-clusters call:
fit-clusters \
--param \
load_data.path='/data/my_data.csv' \
DiviK.distance='euclidean' \
DiviK.use_logfilters=False \
DiviK.n_jobs=-1 \
--config \
my-defaults.gin \
my-overrides.gin
All the parameters are described in detail in Experiment configuration and Model setup.
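The precedence between config files and --param bindings can be pictured with a small Python sketch. It uses plain dictionaries only and is not how fit-clusters or Gin resolve bindings internally; the function name merge_settings and the sample values are hypothetical.

```python
def merge_settings(config_files, params):
    """Merge settings so that later sources override earlier ones.

    config_files: list of dicts, in the order passed to --config.
    params: dict of bindings passed to --param (applied last).
    """
    merged = {}
    for settings in config_files:
        merged.update(settings)  # a later file overwrites an earlier one
    merged.update(params)  # --param overwrites anything from the files
    return merged


# Hypothetical content of two config files and one CLI binding.
defaults = {"DiviK.distance": "correlation", "DiviK.n_jobs": 1}
overrides = {"DiviK.n_jobs": 4}
cli_params = {"DiviK.distance": "euclidean"}

settings = merge_settings([defaults, overrides], cli_params)
print(settings)
```

Here n_jobs comes from the second file and distance from the command line, which mirrors the ordering rules described for --config and --param.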
Experiment configuration
The following parameters are available when launching experiments:
load_data.path - path to the file with data for clustering. Observations in rows, features in columns.
load_xy.path - path to the file with X and Y coordinates for the observations. The number of coordinate pairs must be the same as the number of observations. Only integer coordinates are currently supported.
experiment.model - the clustering model to fit to the data. See more in Model setup.
experiment.steps_that_require_xy - when using a scikit-learn Pipeline, it may be required to provide spatial coordinates to fit specific algorithms. This parameter accepts the list of the steps that should be provided with spatial coordinates during pipeline execution (e.g. EximsSelector).
experiment.destination - the destination directory for the experiment outputs. Default result.
experiment.omit_datetime - if True, the destination directory will be directly populated with the results of the experiment. Otherwise, a subdirectory with date and time will be created to keep separation between runs. Default False.
experiment.verbose - if True, extends the messaging on the console. Default False.
experiment.exist_ok - if True, the experiment will not fail if the destination directory exists. The default False makes the experiment fail instead, which protects existing results from being overwritten.
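The interplay of experiment.destination, experiment.omit_datetime and experiment.exist_ok can be sketched in Python as below. This is an illustration of the described behaviour, not the code used by fit-clusters; the function name prepare_destination and the date-time naming scheme of the subdirectory are assumptions.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def prepare_destination(destination="result", omit_datetime=False, exist_ok=False):
    """Create and return the directory where results would be written."""
    target = Path(destination)
    if not omit_datetime:
        # Assumed naming scheme: one date-time subdirectory per run.
        target = target / datetime.now().strftime("%Y%m%d-%H%M%S")
    # Raises FileExistsError when the directory exists and exist_ok is False.
    target.mkdir(parents=True, exist_ok=exist_ok)
    return target


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "result"
    first = prepare_destination(base, omit_datetime=True)
    try:
        prepare_destination(base, omit_datetime=True)
        second_run_failed = False
    except FileExistsError:
        second_run_failed = True
    reused = prepare_destination(base, omit_datetime=True, exist_ok=True)
    dated_parent = prepare_destination(base, exist_ok=True).parent
```

The second call fails because the directory already exists, while the call with exist_ok=True reuses it.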
Model setup
divik models
To use the DiviK algorithm in the experiment, a config file must:
Import the algorithms to the scope, e.g.:
import divik.cluster
Point the experiment to the algorithm to use, e.g.:
experiment.model = @DiviK()
Configure the algorithm, e.g.:
DiviK.distance = 'euclidean'
DiviK.verbose = True
Sample config with KMeans
Below is a sample configuration file that sets up a simple KMeans:
import divik.cluster
KMeans.n_clusters = 3
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True
experiment.model = @KMeans()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
Sample config with DiviK
Below is the configuration file with a full setup of DiviK. DiviK requires an automated clustering method for the stop condition and a separate one for clustering. Here we use GAPSearch for the stop condition and DunnSearch for selecting the number of clusters. These in turn require a KMeans method set for a specific distance method, etc.:
import divik.cluster
KMeans.n_clusters = 1
KMeans.distance = "correlation"
KMeans.init = "kdtree_percentile"
KMeans.leaf_size = 0.01
KMeans.percentile = 99.0
KMeans.max_iter = 100
KMeans.normalize_rows = True
GAPSearch.kmeans = @KMeans()
GAPSearch.max_clusters = 2
GAPSearch.n_jobs = 1
GAPSearch.seed = 42
GAPSearch.n_trials = 10
GAPSearch.sample_size = 1000
GAPSearch.drop_unfit = True
GAPSearch.verbose = True
DunnSearch.kmeans = @KMeans()
DunnSearch.max_clusters = 10
DunnSearch.method = "auto"
DunnSearch.inter = "closest"
DunnSearch.intra = "furthest"
DunnSearch.sample_size = 1000
DunnSearch.seed = 42
DunnSearch.n_jobs = 1
DunnSearch.drop_unfit = True
DunnSearch.verbose = True
DiviK.kmeans = @DunnSearch()
DiviK.fast_kmeans = @GAPSearch()
DiviK.distance = "correlation"
DiviK.minimal_size = 200
DiviK.rejection_size = 2
DiviK.minimal_features_percentage = 0.005
DiviK.features_percentage = 1.0
DiviK.normalize_rows = True
DiviK.use_logfilters = True
DiviK.filter_type = "gmm"
DiviK.n_jobs = 1
DiviK.verbose = True
experiment.model = @DiviK()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
scikit-learn models
For a model to be used with fit-clusters, it needs to be marked as gin.configurable. While this is true for DiviK and the remaining algorithms within the divik package, scikit-learn requires additional setup.
Import the helper module:
import divik.core.gin_sklearn_configurables
Point the experiment to the algorithm to use, e.g.:
experiment.model = @MeanShift()
Configure the algorithm, e.g.:
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
Warning
Importing both scikit-learn and divik will result in an ambiguity when using e.g. KMeans. In such a case it is necessary to point to specific algorithms by their full name, e.g. divik.cluster._kmeans._core.KMeans.
Sample config with MeanShift
Below is a sample configuration file that sets up a simple MeanShift:
import divik.core.gin_sklearn_configurables
MeanShift.cluster_all = True
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
experiment.model = @MeanShift()
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
Pipelines
scikit-learn Pipelines have a separate section to provide an additional explanation, even though these are part of scikit-learn.
Import the helper module:
import divik.core.gin_sklearn_configurables
Import the algorithms into the scope:
import divik.feature_extraction
Point the experiment to the algorithm to use, e.g.:
experiment.model = @Pipeline()
Configure the algorithms, e.g.:
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
Configure the pipeline:
Pipeline.steps = [
('histogram_equalization', @HistogramEqualization()),
('exims', @EximsSelector()),
('pca', @KneePCA()),
('mean_shift', @MeanShift()),
]
(If needed) configure steps that require spatial coordinates:
experiment.steps_that_require_xy = ['exims']
Sample config with Pipeline
Below is a sample configuration file that sets up a simple Pipeline:
import divik.core.gin_sklearn_configurables
import divik.feature_extraction
MeanShift.n_jobs = -1
MeanShift.max_iter = 300
Pipeline.steps = [
('histogram_equalization', @HistogramEqualization()),
('exims', @EximsSelector()),
('pca', @KneePCA()),
('mean_shift', @MeanShift()),
]
experiment.model = @Pipeline()
experiment.steps_that_require_xy = ['exims']
experiment.omit_datetime = True
experiment.verbose = True
experiment.exist_ok = True
Custom models
The fit-clusters executable can work with custom algorithms as well.
Mark an algorithm class gin.configurable at the definition time:
import gin
@gin.configurable
class MyClustering:
    pass
or when importing them from a library:
import gin
gin.external_configurable(MyClustering)
Define artifacts saving methods:
from divik.core.io import saver
@saver
def save_my_clustering(model, fname_fn, **kwargs):
    if not hasattr(model, 'my_custom_field_'):
        return
    # custom saving logic comes here
There are some default savers defined, which are compatible with lots of divik and scikit-learn algorithms, supporting things like:
model pickling
JSON summary saving
labels saving (.npy, .csv)
centroids saving (.npy, .csv)
pipeline saving
A saver should be highly reusable and could be a pleasant contribution to the divik library.
In config, import the module which marks your algorithm configurable:
import myclustering
Continue with the algorithm setup and plumbing as in the previous scenarios.
Computational Modules
divik.cluster module
Clustering methods
class divik.cluster.DiviK(kmeans, fast_kmeans=None, distance='correlation', minimal_size=None, rejection_size=None, rejection_percentage=None, minimal_features_percentage=0.01, features_percentage=0.05, normalize_rows=None, use_logfilters=False, filter_type='gmm', n_jobs=None, verbose=False)
DiviK clustering
- Parameters
- kmeans: AutoKMeans
A self-tuning KMeans estimator for the purpose of clustering
- fast_kmeans: GAPSearch, optional, default: None
A self-tuning KMeans estimator for the purpose of stop condition check. If None, the kmeans parameter is assumed to be the GAPSearch instance.
- distance: str, optional, default: ‘correlation’
The distance metric between points, centroids and for GAP index estimation. One of the distances supported by scipy package.
- minimal_size: int or float, optional, default: None
The minimum size of the region (the number of observations) to be considered for any further divisions. If provided number is between 0 and 1, it is considered a rate of training dataset size. When left None, defaults to 0.1% of the training dataset size.
- rejection_size: int, optional, default: None
Size under which split will be rejected - if a cluster appears in the split that is below rejection_size, the split is considered improper and discarded. This may be useful for some domains (like there is no justification for a 3-cells cluster in biological data). By default, no segmentation is discarded, as careful post-processing provides the same advantage.
- rejection_percentage: float, optional, default: None
An alternative to rejection_size, with the same behavior, but this parameter is related to the training data size percentage. By default, no segmentation is discarded.
- minimal_features_percentage: float, optional, default: 0.01
The minimal percentage of features that must be preserved after GMM-based feature selection. By default at least 1% of features is preserved in the filtration process.
- features_percentage: float, optional, default: 0.05
The target percentage of features that are used by fallback percentage filter for ‘outlier’ filter.
- normalize_rows: bool, optional, default: None
Whether to normalize each row of the data to the norm of 1. By default, it normalizes rows for correlation metric, does no normalization otherwise.
- use_logfilters: bool, optional, default: False
Whether to compute logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- filter_type: {‘gmm’, ‘outlier’, ‘auto’, ‘none’}, default: ‘gmm’
‘gmm’ - usual Gaussian Mixture Model-based filtering, useful for high dimensional cases
‘outlier’ - robust outlier detection-based filtering, useful for low dimensional cases. In the case of no outliers, percentage-based filtering is applied.
‘auto’ - automatically selects between ‘gmm’ and ‘outlier’ based on the dimensionality. When more than 250 features are present, ‘gmm’ is chosen.
‘none’ - feature selection is disabled
- n_jobs: int, optional, default: None
The number of jobs to use for the computation. This works by computing each of the GAP index evaluations in parallel and by making predictions in parallel.
- verbose: bool, optional, default: False
Whether to report the progress of the computations.
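The rules stated for minimal_size, rejection_size and the ‘auto’ filter type can be written down as small Python helpers. This is a sketch of the documented rules only, not DiviK's code: the helper names are hypothetical, and the rounding of fractional sizes is an assumption.

```python
from collections import Counter


def resolve_minimal_size(minimal_size, n_samples):
    """Turn minimal_size into an absolute number of observations."""
    if minimal_size is None:
        minimal_size = 0.001  # 0.1% of the training dataset size
    if 0 < minimal_size < 1:
        # Assumption: round down, but never below one observation.
        return max(int(minimal_size * n_samples), 1)
    return int(minimal_size)


def split_is_rejected(labels, rejection_size):
    """Reject a split when any of its clusters is below rejection_size."""
    if rejection_size is None:
        return False  # by default no segmentation is discarded
    return min(Counter(labels).values()) < rejection_size


def resolve_filter_type(filter_type, n_features):
    """Pick the concrete filter for 'auto' based on dimensionality."""
    if filter_type == "auto":
        return "gmm" if n_features > 250 else "outlier"
    return filter_type


print(resolve_minimal_size(None, 5000))
print(split_is_rejected([0, 0, 0, 1], rejection_size=2))
print(resolve_filter_type("auto", n_features=100))
```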
Examples
>>> from divik.cluster import DiviK
>>> from sklearn.datasets import make_blobs
>>> X, _ = make_blobs(n_samples=200, n_features=100, centers=20,
...                   random_state=42)
>>> divik = DiviK(distance='euclidean').fit(X)
>>> divik.labels_
array([1, 1, 1, 0, ..., 0, 0], dtype=int32)
>>> divik.predict([[0, ..., 0], [12, ..., 3]])
array([1, 0], dtype=int32)
>>> divik.cluster_centers_
array([[10., ..., 2.],
       ...,
       [ 1, ..., 2.]])
- Attributes
- result_: divik.DivikResult
Hierarchical structure describing all the consecutive segmentations.
- labels_:
Labels of each point
- centroids_: array, [n_clusters, n_features]
Coordinates of cluster centers. If the algorithm stops before fully converging, these will not be consistent with labels_. Also, the distance between points and respective centroids must be captured in appropriate features subspace. This is realized by the transform method.
- filters_: array, [n_clusters, n_features]
Filters that were applied to the feature space on the level that was the final segmentation for a subset.
- depth_: int
The number of hierarchy levels in the segmentation.
- n_clusters_: int
The final number of clusters in the segmentation, on the tree leaf level.
- paths_: Dict[int, Tuple[int]]
Describes how the cluster number corresponds to the path in the tree. Element of the tuple indicates the sub-segment number on each tree level.
- reverse_paths_: Dict[Tuple[int], int]
Describes how the path in the tree corresponds to the cluster number. For more details see paths_.
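The relation between paths_ and reverse_paths_ is a plain dictionary inversion, as the short Python sketch below shows. The path values are made up for illustration and do not come from a real DiviK run.

```python
# Made-up example: three final clusters in a tree with up to two levels.
# Each tuple element is the sub-segment number on the given tree level.
paths = {0: (0, 0), 1: (0, 1), 2: (1,)}

# reverse_paths_ is the inverse mapping: path in the tree -> cluster number.
reverse_paths = {path: cluster for cluster, path in paths.items()}

# Cluster 1 is sub-segment 1 of top-level segment 0.
print(paths[1])
print(reverse_paths[(0, 1)])
```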
Methods
fit(X[, y]) - Compute DiviK clustering.
fit_predict(X[, y]) - Compute cluster centers and predict cluster index for each sample.
fit_transform(X[, y]) - Compute clustering and transform X to cluster-distance space.
get_params([deep]) - Get parameters for this estimator.
predict(X) - Predict the closest cluster each sample in X belongs to.
set_params(**params) - Set the parameters of this estimator.
transform(X) - Transform X to a cluster-distance space.
fit(X, y=None)
Compute DiviK clustering.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
fit_predict(X, y=None)
Compute cluster centers and predict cluster index for each sample.
Convenience method; equivalent to calling fit(X) followed by predict(X).
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- y: Ignored
Not used, present here for API consistency by convention.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
fit_transform(X, y=None, **fit_params)
Compute clustering and transform X to cluster-distance space.
Equivalent to fit(X).transform(X), but more efficiently implemented.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- y: Ignored
Not used, present here for API consistency by convention.
- Returns
- X_new: array, shape [n_samples, self.n_clusters_]
X transformed in the new space.
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
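The <component>__<parameter> naming convention can be illustrated with a few lines of Python that separate an estimator's own parameters from those addressed to a nested component. This is only a sketch of the convention, not scikit-learn's implementation; the function name and the sample parameter names are hypothetical.

```python
def split_nested_params(params):
    """Separate own parameters from '<component>__<parameter>' ones."""
    own, nested = {}, {}
    for key, value in params.items():
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            # The part before the first '__' names the nested component.
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"distance": "euclidean", "kmeans__max_clusters": 5}
)
print(own)
print(nested)
```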
transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, self.n_clusters_]
X transformed in the new space.
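What a cluster-distance space looks like, and how the closest cluster follows from it, can be shown with a dependency-free Python sketch. It assumes the euclidean distance and ignores the per-cluster feature filtering that DiviK applies, so it illustrates the idea rather than the actual transform and predict methods; the function names are hypothetical.

```python
import math


def to_cluster_distance_space(X, centers):
    """Return one row per sample with its distance to every center."""
    return [[math.dist(row, center) for center in centers] for row in X]


def closest_cluster(X, centers):
    """Return the index of the nearest center for every sample."""
    distances = to_cluster_distance_space(X, centers)
    return [row.index(min(row)) for row in distances]


centers = [(0.0, 0.0), (3.0, 4.0)]
X = [(0.0, 0.0), (3.0, 0.0), (3.0, 5.0)]

X_new = to_cluster_distance_space(X, centers)  # shape [n_samples, n_clusters]
labels = closest_cluster(X, centers)
print(X_new)
print(labels)
```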
class divik.cluster.DunnSearch(kmeans, max_clusters, min_clusters=2, method='full', inter='centroid', intra='avg', sample_size=1000, n_trials=10, seed=42, n_jobs=1, drop_unfit=False, verbose=False)
Select best number of clusters for k-means
- Parameters
- kmeans: KMeans
KMeans object to tune
- max_clusters: int
The maximal number of clusters to form and score.
- min_clusters: int, default: 2
The minimal number of clusters to form and score.
- method: {‘full’, ‘sampled’, ‘auto’}, default: ‘full’
Whether to run full computations or approximate:
‘full’ - always computes full Dunn’s index, without sampling
‘sampled’ - samples the clusters to reduce computational overhead
‘auto’ - switches the above methods to provide best performance-quality trade-off.
- inter: {‘centroid’, ‘closest’}, default: ‘centroid’
How the distance between clusters is computed. For more details see dunn.
- intra: {‘avg’, ‘furthest’}, default: ‘avg’
How the cluster internal distance is computed. For more details see dunn.
- sample_size: int, default: 1000
Size of the sample used to compute Dunn index in auto or sampled scenario.
- n_trials: int, default: 10
Number of trials to use when computing Dunn index in auto or sampled scenario.
- seed: int, default: 42
Random seed for the reproducibility of subset draws in Dunn auto or sampled scenario.
- n_jobs: int, default: 1
The number of jobs to use for the computation. This works by computing each of the clustering & scoring runs in parallel.
- drop_unfit: bool, default: False
If True, drops the estimators that did not fit the data.
- verbose: bool, default: False
If True, shows progress with tqdm.
- Attributes
- cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
- labels_:
Labels of each point.
- estimators_: List[KMeans]
KMeans instances for n_clusters in range [min_clusters, max_clusters].
- scores_: array, [max_clusters - min_clusters + 1,]
Array with scores for each estimator.
- n_clusters_: int
Estimated optimal number of clusters.
- best_score_: float
Score of the optimal estimator.
- best_: KMeans
The optimal estimator.
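For orientation, the Dunn index that DunnSearch scores is the ratio of the smallest between-cluster distance to the largest within-cluster distance. The Python sketch below computes one variant of it. The reading of inter='centroid' as the distance between centroids and of intra='avg' as the mean distance of points to their own centroid is an assumption, as is the euclidean metric; the divik implementation may differ.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dunn_index(clusters):
    """clusters: list of at least two lists of points (tuples)."""
    centers = [centroid(points) for points in clusters]
    # Assumed meaning of inter='centroid': distance between centroids.
    inter = min(math.dist(a, b) for a, b in combinations(centers, 2))
    # Assumed meaning of intra='avg': mean distance of points to own centroid.
    intra = max(
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centers)
    )
    return inter / intra


well_separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_together = [[(0.0, 0.0), (0.0, 2.0)], [(2.0, 0.0), (2.0, 2.0)]]
print(dunn_index(well_separated), dunn_index(close_together))
```

A higher value means compact clusters that lie far apart. The index needs at least two clusters to be defined.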
Methods
fit(X[, y]) - Compute k-means clustering and estimate optimal number of clusters.
fit_predict(X[, y]) - Perform clustering on X and returns cluster labels.
fit_transform(X[, y]) - Fit to data, then transform it.
get_params([deep]) - Get parameters for this estimator.
predict(X) - Predict the closest cluster each sample in X belongs to.
set_params(**params) - Set the parameters of this estimator.
transform(X) - Transform X to a cluster-distance space.
fit(X, y=None)
Compute k-means clustering and estimate optimal number of clusters.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
fit_predict(X, y=None)
Perform clustering on X and returns cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, k]
X transformed in the new space.
class divik.cluster.GAPSearch(kmeans, max_clusters, min_clusters=1, n_jobs=1, seed=0, n_trials=10, sample_size=1000, drop_unfit=False, verbose=False)
Select best number of clusters for k-means
- Parameters
- kmeans: KMeans
KMeans object to tune
- max_clusters: int
The maximal number of clusters to form and score.
- min_clusters: int, default: 1
The minimal number of clusters to form and score.
- n_jobs: int, default: 1
The number of jobs to use for the computation. This works by computing each of the clustering & scoring runs in parallel.
- seed: int, default: 0
Random seed for generating uniform data sets.
- n_trials: int, default: 10
Number of data sets drawn as a reference.
- sample_size: int, default: 1000
Size of the sample used for GAP statistic computation. Used only if it introduces a speedup.
- drop_unfit: bool, default: False
If True, drops the estimators that did not fit the data.
- verbose: bool, default: False
If True, shows progress with tqdm.
- Attributes
- cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
- labels_:
Labels of each point.
- estimators_: List[KMeans]
KMeans instances for n_clusters in range [min_clusters, max_clusters].
- scores_: array, [max_clusters - min_clusters + 1, ?]
Array with scores for each estimator in each row.
- n_clusters_: int
Estimated optimal number of clusters.
- best_score_: float
Score of the optimal estimator.
- best_: KMeans
The optimal estimator.
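As background for the scores, the GAP statistic in its usual definition compares the logarithm of the within-cluster dispersion of the data with the mean logarithm of the dispersion obtained for reference data sets (n_trials of them here). The Python sketch below follows that textbook definition with squared euclidean distances; whether GAPSearch computes it in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def within_cluster_dispersion(clusters):
    """Sum of squared euclidean distances of points to their cluster centroid."""
    total = 0.0
    for points in clusters:
        center = [sum(coords) / len(points) for coords in zip(*points)]
        total += sum(math.dist(p, center) ** 2 for p in points)
    return total


def gap_statistic(dispersion, reference_dispersions):
    """Mean log-dispersion of the reference data sets minus the observed one."""
    reference = [math.log(w) for w in reference_dispersions]
    return sum(reference) / len(reference) - math.log(dispersion)


observed = within_cluster_dispersion([[(0.0, 0.0), (0.0, 2.0)]])
# Hypothetical dispersions measured on two uniformly drawn reference sets.
gap = gap_statistic(observed, [8.0, 8.0])
print(observed, gap)
```

A larger gap indicates that the clustering is much tighter than what would be expected from structureless reference data.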
Methods
fit(X[, y]) - Compute k-means clustering and estimate optimal number of clusters.
fit_predict(X[, y]) - Perform clustering on X and returns cluster labels.
fit_transform(X[, y]) - Fit to data, then transform it.
get_params([deep]) - Get parameters for this estimator.
predict(X) - Predict the closest cluster each sample in X belongs to.
set_params(**params) - Set the parameters of this estimator.
transform(X) - Transform X to a cluster-distance space.
fit(X, y=None)
Compute k-means clustering and estimate optimal number of clusters.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
fit_predict(X, y=None)
Perform clustering on X and returns cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, k]
X transformed in the new space.
class divik.cluster.KMeans(n_clusters, distance='euclidean', init='percentile', percentile=95.0, leaf_size=0.01, max_iter=100, normalize_rows=False, allow_dask=False)
K-Means clustering
- Parameters
- n_clusters: int
The number of clusters to form as well as the number of centroids to generate.
- distance: str, optional, default: ‘euclidean’
Distance measure. One of the distances supported by scipy package.
- init: {‘percentile’, ‘extreme’, ‘kdtree’, ‘kdtree_percentile’}
Method for initialization, defaults to ‘percentile’:
‘percentile’ - selects initial cluster centers for k-means clustering starting from the specified percentile of distance to already selected clusters
‘extreme’ - selects initial cluster centers for k-means clustering starting from the furthest points to already specified clusters
‘kdtree’ - selects initial cluster centers for k-means clustering starting from centroids of KD-Tree boxes
‘kdtree_percentile’ - selects initial cluster centers for k-means clustering starting from centroids of KD-Tree boxes containing the specified percentile. This should be more robust against outliers.
- percentile: float, default: 95.0
Specifies the starting percentile for ‘percentile’ initialization. Must be within range [0.0, 100.0]. At 100.0 it is equivalent to ‘extreme’ initialization.
- leaf_size: int or float, optional, default: 0.01
Desired leaf size in kdtree initialization. When int, the box size will be between leaf_size and 2 * leaf_size. When float, it will be between leaf_size * n_samples and 2 * leaf_size * n_samples.
- max_iter: int, default: 100
Maximum number of iterations of the k-means algorithm for a single run.
- normalize_rows: bool, default: False
If True, rows are translated to mean of 0.0 and scaled to norm of 1.0.
- allow_dask: bool, default: False
If True, automatically selects dask as the computations backend whenever reasonable. It defaults to False because dask cannot be used together with multiprocessing.Pool, so n_jobs must be set to 1 everywhere when it is enabled.
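Two of the parameters above describe simple arithmetic, which the following Python sketch spells out: the row normalization behind normalize_rows and the box size bounds implied by leaf_size. The helper names are hypothetical and the code illustrates the documented rules rather than the library's implementation.

```python
import math


def normalize_row(row):
    """Translate a row to mean 0.0 and scale it to norm 1.0."""
    mean = sum(row) / len(row)
    centered = [value - mean for value in row]
    norm = math.sqrt(sum(value * value for value in centered))
    return [value / norm for value in centered]  # assumes a non-constant row


def leaf_size_bounds(leaf_size, n_samples):
    """Return the (lower, upper) bound of the KD-Tree box size."""
    if isinstance(leaf_size, float):
        lower = leaf_size * n_samples  # a float is a fraction of the data
    else:
        lower = leaf_size  # an int is an absolute number of samples
    return lower, 2 * lower


row = normalize_row([1.0, 3.0])
print(row)
print(leaf_size_bounds(0.01, 1000))
print(leaf_size_bounds(5, 1000))
```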
- Attributes
- cluster_centers_: array, [n_clusters, n_features]
Coordinates of cluster centers.
- labels_:
Labels of each point
Methods
fit(X[, y]) - Compute k-means clustering.
fit_predict(X[, y]) - Perform clustering on X and returns cluster labels.
fit_transform(X[, y]) - Fit to data, then transform it.
get_params([deep]) - Get parameters for this estimator.
predict(X) - Predict the closest cluster each sample in X belongs to.
set_params(**params) - Set the parameters of this estimator.
transform(X) - Transform X to a cluster-distance space.
fit(X, y=None)
Compute k-means clustering.
- Parameters
- X: array-like or sparse matrix, shape=(n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.
- y: Ignored
Not used, present here for API consistency by convention.
fit_predict(X, y=None)
Perform clustering on X and returns cluster labels.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input data.
- y: Ignored
Not used, present for API consistency by convention.
- Returns
- labels: ndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- X: array-like of shape (n_samples, n_features)
Input samples.
- y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params: dict
Additional fit parameters.
- Returns
- X_new: ndarray of shape (n_samples, n_features_new)
Transformed array.
get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: dict
Parameter names mapped to their values.
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
- Returns
- labels: array, shape [n_samples,]
Index of the cluster each sample belongs to.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: estimator instance
Estimator instance.
transform(X)
Transform X to a cluster-distance space.
In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.
- Parameters
- X: {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
- Returns
- X_new: array, shape [n_samples, k]
X transformed in the new space.
class divik.cluster.TwoStep(clusterer, n_subsets=10, random_state=42)
Perform a two-step clustering with a given clusterer
Separates a dataset into n_subsets, processes each of them separately and then combines the results.
Works with centroid-based clustering methods, as it requires cluster representatives to combine the result.
- Parameters
- clustererUnion[AutoKMeans, Pipeline, KMeans]
A centroid-based estimator for the purpose of clustering.
- n_subsetsint, default 10
The number of subsets into which the original dataset should be separated
- random_stateint, default 42
Random state to use for seeding the random number generator.
Examples
>>> from sklearn.datasets import make_blobs
>>> from divik.cluster import KMeans, TwoStep
>>> X, _ = make_blobs(
...     n_samples=10_000, n_features=2, centers=3, random_state=42
... )
>>> kmeans = KMeans(n_clusters=3)
>>> ctr = TwoStep(kmeans).fit(X)
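To make the two-step idea concrete, here is one plausible reading of the description as a dependency-free, one-dimensional sketch. It is an assumption about the general scheme, not divik's implementation: subsets are taken by striding instead of random sampling, and `kmeans_1d` and `two_step_1d` are hypothetical helpers.

```python
from statistics import mean

def kmeans_1d(values, k, n_iter=10):
    """Tiny 1-D Lloyd's k-means with a deterministic quantile-based init."""
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def two_step_1d(values, k, n_subsets):
    # Step 1: cluster each subset separately and collect its centroids.
    representatives = []
    for s in range(n_subsets):
        representatives.extend(kmeans_1d(values[s::n_subsets], k))
    # Step 2: cluster the representatives to obtain the final centers.
    centers = sorted(kmeans_1d(representatives, k))
    labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
    return centers, labels

offsets = (-0.2, -0.1, 0.0, 0.1, 0.2, 0.3)
data = [c + d for c in (0.0, 10.0, 20.0) for d in offsets]
centers, labels = two_step_1d(data, k=3, n_subsets=2)
```

The second step only ever sees the centroids from the first step, which is why the approach needs a centroid-based clusterer.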
Methods
fit_predict
(X[, y])Perform clustering on X and returns cluster labels.
get_params
([deep])Get parameters for this estimator.
set_params
(**params)Set the parameters of this estimator.
fit
predict
-
fit_predict
(X, y=None)[source]¶ Perform clustering on X and returns cluster labels.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input data.
- yIgnored
Not used, present for API consistency by convention.
- Returns
- labelsndarray of shape (n_samples,), dtype=np.int64
Cluster labels.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
divik.feature_extraction
module¶
Unsupervised feature extraction methods
-
class
divik.feature_extraction.
HistogramEqualization
(n_bins=256, n_jobs=-1)[source]¶ Equalize histogram of the features to increase contrast
Based on https://github.com/scikit-image/scikit-image/blob/master/skimage/exposure/exposure.py#L187-L223
- Parameters
- n_binsint, default 256
Number of bins for histogram equalization.
- n_jobsint, default -1
Number of CPU cores to use during equalization
- Attributes
- cdf_array
Values of cumulative distribution function for all the features
- bins_array
Bin centers for all the features
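The roles of the bins and the cumulative distribution function can be sketched for a single feature. This simplified version maps each value to the CDF of its bin without interpolation; the helper `equalize_feature` and the small bin count are illustrative assumptions, not the library's code.

```python
def equalize_feature(values, n_bins=4):
    """Replace each value by the empirical CDF of the histogram bin it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_of(v):
        # The maximum value belongs to the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for v in values:
        counts[bin_of(v)] += 1

    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / len(values))
    return [cdf[bin_of(v)] for v in values]

feature = [1.0, 1.1, 1.2, 1.3, 5.0, 9.0]
equalized = equalize_feature(feature)
```

Values that share a crowded bin end up at the same level, and the output always lies in the range from 0 to 1.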
Methods
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
set_params
(**params)Set the parameters of this estimator.
fit
transform
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
class
divik.feature_extraction.
KneePCA
(whiten=False, refit=False)[source]¶ Principal component analysis (PCA) with knee method
PCA with automated components selection based on knee method over cumulative explained variance. Remaining components are discarded.
- Parameters
- whitenbool, optional (default False)
When True (False by default) the pca_.components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
- refitbool, optional (default False)
When True (False by default) the pca_ is re-fit with the smaller number of components. This could reduce memory footprint, but requires fitting PCA a second time.
- Attributes
- pca_PCA
Fit PCA estimator.
- n_components_int
The number of selected components.
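One common way to locate a knee on the cumulative explained variance curve is to take the point farthest from the straight line joining its ends. The sketch below uses that rule as an assumption; the exact criterion used by KneePCA is not stated here, and `knee_index` and the variance ratios are hypothetical.

```python
def knee_index(curve):
    """Index of the point farthest from the chord joining the curve's ends."""
    n = len(curve)
    x0, y0, x1, y1 = 0, curve[0], n - 1, curve[-1]

    def distance(i):
        # Unnormalised distance of (i, curve[i]) from the chord.
        return abs((y1 - y0) * (i - x0) - (x1 - x0) * (curve[i] - y0))

    return max(range(n), key=distance)

# Hypothetical explained variance ratios of six principal components.
explained_variance_ratio = [0.70, 0.20, 0.05, 0.03, 0.01, 0.01]
cumulative, total = [], 0.0
for ratio in explained_variance_ratio:
    total += ratio
    cumulative.append(total)

n_components = knee_index(cumulative) + 1  # components kept; the rest are discarded
```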
Methods
fit
(X[, y])Fit the model from data in X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
inverse_transform
(X)Transform data back to its original space.
set_params
(**params)Set the parameters of this estimator.
transform
(X[, y])Apply dimensionality reduction to X.
-
fit
(X, y=None)[source]¶ Fit the model from data in X.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- selfobject
Returns the instance itself.
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
inverse_transform
(X)[source]¶ Transform data back to its original space.
In other words, return an input X_original whose transform would be X.
- Parameters
- Xarray-like, shape (n_samples, n_components)
New data, where n_samples is the number of samples and n_components is the number of components.
- Returns
- X_original array-like, shape (n_samples, n_features)
Notes
If whitening is enabled, inverse_transform will compute the exact inverse operation, which includes reversing whitening.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X, y=None)[source]¶ Apply dimensionality reduction to X.
X is projected on the first principal components previously extracted from a training set.
- Parameters
- Xarray-like, shape (n_samples, n_features)
New data, where n_samples is the number of samples and n_features is the number of features.
- Returns
- X_newarray-like, shape (n_samples, n_components)
Examples
>>> import numpy as np
>>> from divik.feature_extraction import KneePCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = KneePCA(refit=True)
>>> pca.fit(X)
KneePCA(refit=True)
>>> pca.transform(X)
-
class
divik.feature_extraction.
LocallyAdjustedRbfSpectralEmbedding
(distance='euclidean', n_components=2, random_state=None, eigen_solver=None, n_neighbors=None, n_jobs=1)[source]¶ Spectral embedding for non-linear dimensionality reduction.
Forms an affinity matrix given by the specified function and applies spectral decomposition to the corresponding graph laplacian. The resulting transformation is given by the value of the eigenvectors for each data point.
Note : Laplacian Eigenmaps is the actual algorithm implemented here.
- Parameters
- distance{‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’}
Distance measure, defaults to euclidean. These are the distances supported by the scipy package.
- n_componentsinteger, default: 2
The dimension of the projected subspace.
- random_stateint, RandomState instance or None, optional, default: None
A pseudo random number generator used for the initialization of the lobpcg eigenvectors. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when eigen_solver == ‘amg’.
- eigen_solver{None, ‘arpack’, ‘lobpcg’, or ‘amg’}
The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities.
- n_neighborsint, default
Number of nearest neighbors for nearest_neighbors graph building.
- n_jobsint, optional (default = 1)
The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.
References
A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
On Spectral Clustering: Analysis and an algorithm, 2001 Andrew Y. Ng, Michael I. Jordan, Yair Weiss http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8100
Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
- Attributes
- embedding_array, shape = (n_samples, n_components)
Spectral embedding of the training matrix.
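The class name suggests an RBF affinity whose bandwidth adapts to the local neighbourhood of each point. The sketch below shows one well-known form of such a kernel, exp(-d_ij^2 / (sigma_i * sigma_j)), where sigma_i is the distance from point i to its n_neighbors-th nearest neighbour. Treat the formula, the helper `locally_scaled_affinity` and the sample points as assumptions rather than a description of the class internals.

```python
import math

def locally_scaled_affinity(points, n_neighbors=2):
    """RBF affinity with a per-point scale taken from the local neighbourhood."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # sorted(row)[0] is the zero self-distance, so index n_neighbors
    # is the distance to the n_neighbors-th nearest other point.
    sigma = [sorted(row)[n_neighbors] for row in dist]
    return [
        [math.exp(-dist[i][j] ** 2 / (sigma[i] * sigma[j])) for j in range(n)]
        for i in range(n)
    ]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
affinity = locally_scaled_affinity(points)
```

Points within the same tight group receive a much higher affinity than points from different groups, which is what the spectral decomposition of the graph Laplacian then exploits.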
Methods
fit
(X[, y])Fit the model from data in X.
fit_transform
(X[, y])Fit the model from data in X and transform X.
get_params
([deep])Get parameters for this estimator.
save
(destination)Save embedding to a directory
set_params
(**params)Set the parameters of this estimator.
transform
-
fit
(X, y=None)[source]¶ Fit the model from data in X.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- selfobject
Returns the instance itself.
-
fit_transform
(X, y=None)[source]¶ Fit the model from data in X and transform X.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- X_newarray-like, shape (n_samples, n_components)
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
save
(destination)[source]¶ Save embedding to a directory
- Parameters
- destinationstr
Directory to save the embedding.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
divik.feature_selection
module¶
Unsupervised feature selection methods
-
class
divik.feature_selection.
EximsSelector
[source]¶ Select features based on their spatial distribution
Preserves features that yield biologically plausible structures.
References
Wijetunge, Chalini D., et al. “EXIMS: an improved data analysis pipeline based on a new peak picking method for EXploring Imaging Mass Spectrometry data.” Bioinformatics 31.19 (2015): 3198-3206. https://academic.oup.com/bioinformatics/article/31/19/3198/212150
Methods
fit
(X[, y, xy])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None, xy=None)[source]¶ Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- xyarray-like, shape (n_samples, 2)
Spatial coordinates of the samples. Expects integers, indices over an image.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
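The two return forms can be compared with a stand-alone sketch. The function below only mirrors the documented behaviour on a plain list; it is a hypothetical stand-in, not the selector's method.

```python
def get_support(selected, indices=False):
    """Return the selection as a boolean mask or as integer indices."""
    if indices:
        return [i for i, keep in enumerate(selected) if keep]
    return list(selected)

# Hypothetical selection over five input features.
selected = [True, False, True, True, False]
mask = get_support(selected)                # one entry per input feature
kept = get_support(selected, indices=True)  # one entry per retained feature
```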
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
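How transform and inverse_transform relate can be shown on plain lists. This is a sketch of the documented behaviour (dropped columns come back as zeros), with hypothetical helper functions and a hand-written selection mask.

```python
def transform(rows, selected):
    """Keep only the columns flagged in the selection mask."""
    return [[v for v, keep in zip(row, selected) if keep] for row in rows]

def inverse_transform(rows, selected):
    """Restore the original width, filling removed columns with zeros."""
    restored = []
    for row in rows:
        values = iter(row)
        restored.append([next(values) if keep else 0 for keep in selected])
    return restored

selected = [True, False, True]
X = [[1, 2, 3], [4, 5, 6]]
X_r = transform(X, selected)
X_back = inverse_transform(X_r, selected)
```

The round trip does not recover the removed values, only the original shape.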
-
-
class
divik.feature_selection.
GMMSelector
(stat, use_log=False, n_candidates=None, min_features=1, min_features_rate=0.0, preserve_high=True, max_components=10)[source]¶ Feature selector that removes low- or high- mean or variance features
Gaussian Mixture Modeling is applied to the features’ characteristics and components are obtained. Crossing points of the components are considered candidate thresholds. Out of these up to n_candidates components are removed in such a way that at least min_features or min_features_rate features are retained.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- n_candidates: int, optional, default: None
How many candidate thresholds to use at most. 0 preserves all the features (all candidate thresholds are discarded), None allows removing all but one component (all candidate thresholds are retained). A negative value means to discard up to all but -n_candidates candidates, e.g. -1 will retain at least two components (one candidate threshold is removed).
- min_features: int, optional, default: 1
How many features must be preserved. Candidate thresholds are tested against this value, and if they retain fewer features, a less conservative threshold is selected.
- min_features_rate: float, optional, default: 0.0
Similar to min_features but relative to the number of input features.
- preserve_high: bool, optional, default: True
Whether to preserve the high-characteristic features or low-characteristic ones.
- max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
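The interplay between candidate thresholds, min_features and min_features_rate can be sketched as follows. The candidate thresholds are given by hand here (in the selector they come from the crossing points of the GMM components), only the preserve_high=True case is covered, and `pick_threshold` is a hypothetical helper rather than the library's logic.

```python
import math

def pick_threshold(vals, candidates, min_features=1, min_features_rate=0.0):
    """Pick the highest candidate threshold that still keeps enough features."""
    required = max(min_features, math.ceil(min_features_rate * len(vals)))
    # Try the highest threshold first, then fall back to lower ones.
    for threshold in sorted(candidates, reverse=True):
        kept = sum(v >= threshold for v in vals)
        if kept >= required:
            return threshold
    return -math.inf  # no candidate is acceptable: keep every feature

# Hypothetical per-feature characteristic (e.g. mean) and candidate thresholds.
vals = [0.1, 0.2, 4.9, 5.1, 9.8, 10.2]
candidates = [2.5, 7.5]

default_choice = pick_threshold(vals, candidates)
at_least_three = pick_threshold(vals, candidates, min_features=3)
keep_everything = pick_threshold(vals, candidates, min_features_rate=1.0)
```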
Examples
>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = labels * 5 + np.random.randn(*labels.shape)
>>> fs.GMMSelector('mean').fit_transform(data)
array([[14.78032811 15.35711257 ... 15.75193303]])
>>> fs.GMMSelector('mean', preserve_high=False).fit_transform(data)
array([[ 0.49671415 -0.1382643 ... -0.29169375]])
>>> fs.GMMSelector('mean', n_candidates=-1).fit_transform(data)
array([[10.32408397 9.61491772 ... 15.75193303]])
- Attributes
- vals_: array, shape (n_features,)
Computed characteristic of each feature.
- threshold_: float
Threshold value to filter the features by the characteristic.
- raw_threshold_: float
Threshold value mapped back to characteristic space (no logarithm, etc.)
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
-
class
divik.feature_selection.
HighAbundanceAndVarianceSelector
(use_log=False, min_features=1, min_features_rate=0.0, max_components=10)[source]¶ Feature selector that removes low-mean and low-variance features
Uses GMMSelector to filter out the low-abundance noise features and select high-variance informative features.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- min_features: int, optional, default: 1
How many features must be preserved.
- min_features_rate: float, optional, default: 0.0
Similar to min_features but relative to the number of input features.
- max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
Examples
>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> # Data in this case must be carefully crafted
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = np.vstack(100 * [labels * 10.])
>>> data += np.random.randn(*data.shape)
>>> sub = data[:, :-40]
>>> sub += 5 * np.random.randn(*sub.shape)
>>> # Label 0 has low abundance but high variance
>>> # Label 3 has low variance but high abundance
>>> # Labels 1 and 2 have not-lowest abundance and high variance
>>> selector = fs.HighAbundanceAndVarianceSelector().fit(data)
>>> selector.transform(labels.reshape(1,-1))
array([[1 1 1 1 1 ...2 2 2]])
- Attributes
- abundance_selector_: GMMSelector
Selector used to filter out the noise component.
- variance_selector_: GMMSelector
Selector used to filter out the non-informative features.
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
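The combination of an abundance filter and a variance filter can be sketched with fixed thresholds. In the selector the thresholds are learned through GMMSelector; here they are passed in by hand, and the function name and data are hypothetical.

```python
from statistics import mean, pvariance

def select_high_abundance_and_variance(columns, mean_threshold, var_threshold):
    """Flag features that pass the abundance filter and then the variance filter."""
    abundant = [mean(c) >= mean_threshold for c in columns]
    return [a and pvariance(c) >= var_threshold for a, c in zip(abundant, columns)]

# One list per feature (column): noise, constant high signal, informative signal.
columns = [
    [0.0, 0.1, 0.0, 0.1],  # low abundance
    [5.0, 5.0, 5.0, 5.0],  # high abundance, no variance
    [1.0, 9.0, 2.0, 8.0],  # high abundance, high variance
]
selected = select_high_abundance_and_variance(columns, mean_threshold=1.0, var_threshold=1.0)
```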
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
-
class
divik.feature_selection.
NoSelector
[source]¶ Dummy selector to use when no selection is supposed to be made.
Methods
fit
(X[, y])Pass data forward
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None)[source]¶ Pass data forward
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors to pass.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
-
-
class
divik.feature_selection.
OutlierAbundanceAndVarianceSelector
(use_log=False, min_features_rate=0.01, p=0.2)[source]¶ Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
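The relationship between get_support, transform and inverse_transform can be illustrated with a minimal sketch in plain Python. The helper names (mask_to_indices, reduce_columns, restore_columns) are hypothetical and are not part of divik or scikit-learn; the real selectors operate on arrays rather than nested lists.

```python
def mask_to_indices(mask):
    """Convert a boolean support mask into integer column indices."""
    return [i for i, keep in enumerate(mask) if keep]


def reduce_columns(X, mask):
    """Keep only the selected columns, mimicking what transform returns."""
    keep = mask_to_indices(mask)
    return [[row[i] for i in keep] for row in X]


def restore_columns(X_reduced, mask):
    """Re-insert zero columns for removed features, mimicking inverse_transform."""
    keep = mask_to_indices(mask)
    restored = []
    for row in X_reduced:
        full = [0] * len(mask)
        for value, i in zip(row, keep):
            full[i] = value
        restored.append(full)
    return restored


mask = [True, False, True]                  # one entry per input feature
X = [[1, 2, 3], [4, 5, 6]]
X_reduced = reduce_columns(X, mask)         # second column dropped
X_back = restore_columns(X_reduced, mask)   # second column refilled with zeros
```

The boolean mask has one entry per input feature, while the index form has one entry per retained feature, which matches the two shapes described for get_support.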
-
-
class
divik.feature_selection.
OutlierSelector
(stat, use_log=False, keep_outliers=False)[source]¶ Feature selector that removes outlier features w.r.t. mean or variance
Hubert’s outlier detection (the adjusted boxplot, see huberta_outliers) is applied to the features’ characteristics and the outlying features are removed.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- keep_outliers: bool, optional, default: False
When True, keeps outliers instead of inlier features.
- Attributes
- vals_: array, shape (n_features,)
Computed characteristic of each feature.
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
-
class
divik.feature_selection.
PercentageSelector
(stat, use_log=False, keep_top=True, p=0.2)[source]¶ Feature selector that removes or preserves a given top percentage of features
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- keep_top: bool, optional, default: True
When True, keeps features with the highest values of the characteristic.
- p: float, optional, default: 0.2
Fraction of features to keep.
- Attributes
- vals_: array, shape (n_features,)
Computed characteristic of each feature.
- threshold_: float
Value of the threshold used for filtering
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
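As a rough illustration of how stat, p and keep_top interact, the sketch below ranks features by a per-column statistic and keeps a fraction of them. It is an assumption-laden stand-in: the function name select_top_fraction is hypothetical, and the rule of keeping ceil(p * n_features) features is assumed here; divik's own threshold computation may differ.

```python
import math
from statistics import mean, pvariance


def select_top_fraction(X, stat="var", p=0.2, keep_top=True):
    """Return (vals, selected) for the columns of X.

    Assumption: ceil(p * n_features) features are kept.
    """
    columns = list(zip(*X))
    score = pvariance if stat == "var" else mean
    vals = [score(column) for column in columns]
    n_keep = math.ceil(p * len(vals))
    # Rank feature indices by their statistic, highest first when keep_top.
    ranked = sorted(range(len(vals)), key=lambda i: vals[i], reverse=keep_top)
    kept = set(ranked[:n_keep])
    selected = [i in kept for i in range(len(vals))]
    return vals, selected


X = [
    [0, 0, 0, 1, 5],
    [0, 1, 2, 1, 0],
    [0, 2, 4, 1, 5],
    [0, 3, 6, 1, 0],
]
vals, selected = select_top_fraction(X, stat="var", p=0.4)
```

Here the third and fifth columns have the largest variance, so they are the two features retained for p=0.4.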
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
-
fit
(X, y=None)[source]¶ Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
-
class
divik.feature_selection.
SelectorMixin
[source]¶ Transformer mixin that performs feature selection given a support mask
This mixin provides a feature selector implementation with transform and inverse_transform functionality given an implementation of _get_support_mask.
Methods
fit_transform
(X[, y])Fit to data, then transform it.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
transform
(X)Reduce X to the selected features.
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_support
(indices=False)[source]¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)[source]¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.
-
-
class
divik.feature_selection.
StatSelectorMixin
[source]¶ Transformer mixin that performs feature selection given a support mask
This mixin provides a feature selector implementation with transform and inverse_transform functionality given that selected_ is specified during fit.
Additionally, provides _to_characteristics and _to_raw implementations given stat, optionally use_log and preserve_high.
Methods
fit_transform
(X[, y])Fit to data, then transform it.
get_support
([indices])Get a mask, or integer index, of the features selected
inverse_transform
(X)Reverse the transformation operation
transform
(X)Reduce X to the selected features.
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
-
get_support
(indices=False)¶ Get a mask, or integer index, of the features selected
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
-
inverse_transform
(X)¶ Reverse the transformation operation
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by
transform()
.
-
transform
(X)¶ Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
-
-
divik.feature_selection.
huberta_outliers
(v)[source]¶ Outlier detection method based on medcouple statistic.
- Parameters
- v: array-like
An array to filter outliers from.
- Returns
- Binary vector indicating all the outliers.
References
M. Hubert, E. Vandervieren (2008) An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186–5201
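The adjusted boxplot rule from the reference above can be sketched in plain Python as follows. This is an illustrative version, not divik's implementation: the function names are hypothetical, the medcouple is computed with a naive quadratic loop, and pairs of identical values are skipped instead of being handled with the special tie kernel.

```python
import math
from statistics import median, quantiles


def medcouple_naive(values):
    """O(n^2) medcouple; identical pairs are skipped (a simplification)."""
    med = median(values)
    lower = [x for x in values if x <= med]
    upper = [x for x in values if x >= med]
    kernel = [
        ((hi - med) - (med - lo)) / (hi - lo)
        for lo in lower
        for hi in upper
        if hi != lo
    ]
    return median(kernel) if kernel else 0.0


def adjusted_boxplot_outliers(values):
    """Flag values outside the skewness-adjusted boxplot fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple_naive(values)
    if mc >= 0:
        low = q1 - 1.5 * math.exp(-4 * mc) * iqr
        high = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:
        low = q1 - 1.5 * math.exp(-3 * mc) * iqr
        high = q3 + 1.5 * math.exp(4 * mc) * iqr
    return [x < low or x > high for x in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flags = adjusted_boxplot_outliers(values)
```

For a medcouple of zero the fences reduce to the classical 1.5 IQR boxplot rule; positive skew widens the upper fence and narrows the lower one.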
divik.sampler
module¶
Sampling methods for statistical indices computation purposes
-
class
divik.sampler.
BaseSampler
[source]¶ Base class for all the samplers
Sampler is Pool-safe, i.e. can simply store a dataset. It will not be serialized by pickle when going to another process, if handled properly.
Before you spawn a pool, the data must be moved to a module-level variable. To simplify that process, a contract has been prepared: you open a context and operate within it:
>>> from multiprocessing import Pool
>>> with sampler.parallel() as sampler_, \
...         Pool(initializer=sampler_.initializer,
...              initargs=sampler_.initargs) as pool:
...     pool.map(sampler_.get_sample, range(10))
Keep in mind that __iter__ and fit are not accessible in the parallel context. __iter__ would yield the same values independently in all the workers, so iteration needs to be done consciously and in a well-thought-out manner. fit could lead to unpredictable behaviour. If you need the original sampler, you can get a clone (not fit to the data).
Methods
fit
(X[, y])Fit sampler to data
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
-
fit
(X, y=None)[source]¶ Fit sampler to data
It’s a base for both supervised and unsupervised samplers.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
abstract
get_sample
(seed)[source]¶ Return specific sample
The following assumptions should be met:
a) sampler.get_sample(x) == sampler.get_sample(x)
b) x != y should yield sampler.get_sample(x) != sampler.get_sample(y)
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
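The two assumptions on get_sample amount to making the seed the only source of randomness. A minimal sketch of a sampler honouring that contract is shown below; ToyUniformSampler is a hypothetical stand-in written with the standard library, not a divik class.

```python
import random


class ToyUniformSampler:
    """Hypothetical sampler that draws uniformly within per-column data bounds."""

    def __init__(self, n_rows):
        self.n_rows = n_rows

    def fit(self, X):
        columns = list(zip(*X))
        self.bounds_ = [(min(column), max(column)) for column in columns]
        return self

    def get_sample(self, seed):
        # A fresh generator per call, so the seed fully determines the sample.
        rng = random.Random(seed)
        return [
            [rng.uniform(low, high) for low, high in self.bounds_]
            for _ in range(self.n_rows)
        ]


sampler = ToyUniformSampler(n_rows=3).fit([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
first = sampler.get_sample(7)
again = sampler.get_sample(7)
other = sampler.get_sample(8)
```

Because no state is shared between calls, the same seed can be evaluated in any worker process and still yield the same sample.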
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
-
class
divik.sampler.
ParallelSampler
(sampler)[source]¶ Helper class for sharing the sampler functionality
- Attributes
- initargs
Methods
clone
()Clones the original sampler
get_sample
(seed)Return specific sample
initializer
-
property
initargs
¶
-
class
divik.sampler.
StratifiedSampler
(n_rows=100, n_samples=None)[source]¶ Sample the original data preserving proportions of groups
- Parameters
- n_rowsint or float, optional (default 100)
Allows to limit the number of rows in the drawn samples. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the sample. If int, represents the absolute number of rows.
- n_samplesint, optional (default None)
Allows to limit the number of samples when iterating
- Attributes
- X_array_like, shape (n_rows, n_features)
Data to sample from
- y_array_like, shape (n_rows,)
Group labels
Methods
fit
(X, y)Fit the model from data in X.
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
-
fit
(X, y)[source]¶ Fit the model from data in X.
Both inputs are preserved inside to sample from the data.
- Parameters
- Xarray-like, shape (n_rows, n_features)
Training vector, where n_rows is the number of rows and n_features is the number of features.
- y: array-like, shape (n_rows,)
- Returns
- selfStratifiedSampler
Returns the instance itself.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_sample
(seed)[source]¶ Return specific sample
Sample is drawn from the set of existing rows. The proportions of groups should be more or less the same as in the original data, depending on the size of the sample.
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
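Drawing rows while preserving group proportions can be sketched as sampling the same fraction from every group. The function below is a hypothetical illustration using only the standard library; it assumes a float-style n_rows (here called fraction) and does not claim to reproduce divik's implementation.

```python
import random
from collections import defaultdict


def stratified_sample(X, y, fraction, seed):
    """Draw roughly `fraction` of the rows from each group in y."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # At least one row per group, otherwise the rounded share.
        size = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, size))
    chosen.sort()
    return [X[i] for i in chosen], [y[i] for i in chosen]


X = [[i] for i in range(100)]
y = ["a"] * 80 + ["b"] * 20
X_sample, y_sample = stratified_sample(X, y, fraction=0.25, seed=0)
```

With an 80/20 split and a quarter of the data requested, the sample keeps the same 4:1 ratio between the groups.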
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
class
divik.sampler.
UniformPCASampler
(n_rows=None, n_samples=None, whiten=False, refit=False, pca='knee')[source]¶ Rotation-invariant uniform sampling
- Parameters
- n_rowsint, optional (default None)
Allows to limit the number of rows in the drawn samples
- n_samplesint, optional (default None)
Allows to limit the number of samples when iterating
- whitenbool, optional (default False)
When True (False by default) the pca_.components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
- refitbool, optional (default False)
When True (False by default) the pca_ is re-fit with the smaller number of components. This could reduce memory footprint, but requires fitting PCA again.
- pca: {‘knee’, ‘full’}, default ‘knee’
Specifies whether to train full or knee PCA.
- Attributes
- pca_KneePCA or PCA
PCA transform which provided rotation-invariance
- sampler_UniformSampler
Sampler from the transformed distribution
Methods
fit
(X[, y])Fit the model from data in X.
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
-
fit
(X, y=None)[source]¶ Fit the model from data in X.
PCA is fit to estimate the rotation and UniformSampler is fit to transformed data.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- selfUniformPCASampler
Returns the instance itself.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_sample
(seed)[source]¶ Return specific sample
Sample is generated from transformed distribution and transformed back to the original space.
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
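The idea of sampling in a rotated frame and transforming back can be sketched for two-dimensional data, where the principal axis is available in closed form. Everything below is a simplified assumption: the function names are hypothetical, only the 2-D case is handled, and no whitening or component selection is performed.

```python
import math
import random


def fit_rotation(points):
    """Return (center, angle) of the first principal axis of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), angle


def rotated_uniform_sample(points, n_rows, seed):
    (mx, my), angle = fit_rotation(points)
    c, s = math.cos(angle), math.sin(angle)
    # Express the data in the principal-axis frame.
    rotated = [
        (c * (x - mx) + s * (y - my), -s * (x - mx) + c * (y - my))
        for x, y in points
    ]
    u_lo, u_hi = min(u for u, _ in rotated), max(u for u, _ in rotated)
    v_lo, v_hi = min(v for _, v in rotated), max(v for _, v in rotated)
    rng = random.Random(seed)
    sample = []
    for _ in range(n_rows):
        u, v = rng.uniform(u_lo, u_hi), rng.uniform(v_lo, v_hi)
        # Rotate the drawn point back to the original coordinates.
        sample.append((mx + c * u - s * v, my + s * u + c * v))
    return sample


diagonal = [(float(t), float(t)) for t in range(11)]
sample = rotated_uniform_sample(diagonal, n_rows=5, seed=0)
```

For data lying on a diagonal line, an axis-aligned bounding box would cover a whole square, whereas the rotated box collapses onto the line, so the drawn points stay on it.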
-
parallel
()¶ Create parallel context for the sampler to operate
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
-
class
divik.sampler.
UniformSampler
(n_rows=None, n_samples=None)[source]¶ Samples uniformly from the boundaries of the data
- Parameters
- n_rowsint, optional (default None)
Allows to limit the number of rows in the drawn samples
- n_samplesint, optional (default None)
Allows to limit the number of samples when iterating
- Attributes
- shape_(n_rows, n_cols)
Shape of the drawn samples
- scaler_MinMaxScaler
Scaler ensuring the proper ranges
Methods
fit
(X[, y])Fit the model from data in X.
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
-
fit
(X, y=None)[source]¶ Fit the model from data in X.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- selfUniformSampler
Returns the instance itself.
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
-
get_sample
(seed)[source]¶ Return specific sample
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
-
parallel
()¶ Create parallel context for the sampler to operate
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline
). The latter have parameters of the form <component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
Utility Packages¶
divik
package¶
Unsupervised high-throughput data analysis methods
-
divik.
reject_split
(tree, rejection_size=0)[source]¶ Re-apply rejection condition on known result tree.
- Return type
Optional[DivikResult]
Modules
divik.cluster | Clustering methods
divik.core | Reusable utilities used for building divik library
divik.feature_extraction | Unsupervised feature extraction methods
divik.feature_selection | Unsupervised feature selection methods
divik.sampler | Sampling methods for statistical indices computation purposes
divik.core
module¶
Reusable utilities used for building divik library
-
divik.core.
Centroids
¶ alias of
numpy.ndarray
-
divik.core.
Data
¶ alias of
numpy.ndarray
-
class
divik.core.
DivikResult
(clustering: Union[divik.cluster.GAPSearch, divik.cluster.DunnSearch], feature_selector: divik.feature_selection.StatSelectorMixin, merged: numpy.ndarray, subregions: List[Optional[DivikResult]])[source]¶ Result of DiviK clustering
- Attributes
clustering
Alias for field number 0
feature_selector
Alias for field number 1
merged
Alias for field number 2
subregions
Alias for field number 3
Methods
count
(value, /)Return number of occurrences of value.
index
(value[, start, stop])Return first index of value.
-
property
clustering
¶ Fitted automated clustering estimator
-
count
(value, /)¶ Return number of occurrences of value.
-
property
feature_selector
¶ Fitted feature selector
-
index
(value, start=0, stop=sys.maxsize, /)¶ Return first index of value.
Raises ValueError if the value is not present.
-
property
merged
¶ Recursively merged clustering labels
-
property
subregions
¶ DivikResults for all obtained subregions
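Since subregions holds nested results, a result tree is naturally processed by recursion. The sketch below uses a hypothetical stand-in tuple with the same four field names and assumes that a None entry in subregions marks a region that was not split further; both the class ResultSketch and the helper count_leaves are illustrative only.

```python
from typing import Any, List, NamedTuple, Optional


class ResultSketch(NamedTuple):
    """Hypothetical stand-in mirroring the four DivikResult fields."""

    clustering: Any
    feature_selector: Any
    merged: Any
    subregions: List[Optional["ResultSketch"]]


def count_leaves(result: Optional[ResultSketch]) -> int:
    """Count regions that were not split further (assumed to be None)."""
    if result is None:
        return 1
    return sum(count_leaves(sub) for sub in result.subregions)


inner = ResultSketch(None, None, [0, 1, 1], [None, None])
tree = ResultSketch(None, None, [0, 0, 1, 1, 1], [None, inner])
n_leaves = count_leaves(tree)
```

As with any named tuple, fields are reachable both by name and by position, so tree.subregions and tree[3] refer to the same list.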
-
divik.core.
IntLabels
¶ alias of
numpy.ndarray
-
class
divik.core.
Subsets
(n_splits=10, random_state=42)[source]¶ Scatter dataset to disjoint random subsets and combine them back
- Parameters
- n_splitsint, default 10
Number of subsets that will be generated.
- random_stateint, default 42
Random state to use for seeding the random number generator.
Examples
>>> from divik.core import Subsets
>>> subsets = Subsets(n_splits=10, random_state=42)
>>> X_list = subsets.scatter(X)
>>> len(X_list)
10
>>> # do some computations on each subset
>>> y = subsets.combine(y_list)
Methods
combine
scatter
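One way to picture the scatter and combine pair is as a seeded shuffle of row indices that is remembered so the per-subset results can be put back in the original order. The class ToySubsets below is a hypothetical sketch of that idea with the standard library; how divik assigns rows to subsets is not stated here and may differ.

```python
import random


class ToySubsets:
    """Hypothetical sketch: split rows into disjoint random subsets and back."""

    def __init__(self, n_splits=10, random_state=42):
        self.n_splits = n_splits
        self.random_state = random_state

    def scatter(self, X):
        order = list(range(len(X)))
        random.Random(self.random_state).shuffle(order)
        # Every n_splits-th shuffled index goes to the same subset.
        self.chunks_ = [order[i::self.n_splits] for i in range(self.n_splits)]
        return [[X[i] for i in chunk] for chunk in self.chunks_]

    def combine(self, y_list):
        total = sum(len(chunk) for chunk in self.chunks_)
        y = [None] * total
        for chunk, values in zip(self.chunks_, y_list):
            for index, value in zip(chunk, values):
                y[index] = value
        return y


X = list(range(25))
subsets = ToySubsets(n_splits=10, random_state=42)
X_list = subsets.scatter(X)
y_list = [[value * 2 for value in part] for part in X_list]
y = subsets.combine(y_list)
```

The combined result lines up with the original rows even though each subset was processed separately.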
-
divik.core.
cached_fit
(cls)[source]¶ Decorate a sklearn-compatible estimator to cache the fitting result
It is a wrapper over joblib.Memory.cache that supports runtime cache path definition.
Set the path through gin config with the
cache_path.path
identifier.
-
divik.core.
configurable
(name_or_fn=None, module=None, allowlist=None, denylist=None, whitelist=None, blacklist=None)[source]¶ Decorator to make a function or class configurable.
This decorator registers the decorated function/class as configurable, which allows its parameters to be supplied from the global configuration (i.e., set through bind_parameter or parse_config). The decorated function is associated with a name in the global configuration, which by default is simply the name of the function or class, but can be specified explicitly to avoid naming collisions or improve clarity.
If some parameters should not be configurable, they can be specified in denylist. If only a restricted set of parameters should be configurable, they can be specified in allowlist.
The decorator can be used without any parameters as follows:
@config.configurable
def some_configurable_function(param1, param2='a default value'):
    ...
In this case, the function is associated with the name ‘some_configurable_function’ in the global configuration, and both param1 and param2 are configurable.
The decorator can be supplied with parameters to specify the configurable name or supply an allowlist/denylist:
@config.configurable('explicit_configurable_name', allowlist='param2')
def some_configurable_function(param1, param2='a default value'):
    ...
In this case, the configurable is associated with the name ‘explicit_configurable_name’ in the global configuration, and only param2 is configurable.
Classes can be decorated as well, in which case parameters of their constructors are made configurable:
@config.configurable
class SomeClass:
    def __init__(self, param1, param2='a default value'):
        ...
In this case, the name of the configurable is ‘SomeClass’, and both param1 and param2 are configurable.
- Args:
- name_or_fn: A name for this configurable, or a function to decorate (in which case the name will be taken from that function). If not set, defaults to the name of the function/class that is being made configurable. If a name is provided, it may also include module components to be used for disambiguation (these will be appended to any components explicitly specified by module).
- module: The module to associate with the configurable, to help handle naming collisions. By default, the module of the function or class being made configurable will be used (if no module is specified as part of the name).
- allowlist: An allowlisted set of kwargs that should be configurable. All other kwargs will not be configurable. Only one of allowlist or denylist should be specified.
- denylist: A denylisted set of kwargs that should not be configurable. All other kwargs will be configurable. Only one of allowlist or denylist should be specified.
- whitelist: Deprecated version of allowlist for backwards compatibility.
- blacklist: Deprecated version of denylist for backwards compatibility.
- Returns:
When used with no parameters (or with a function/class supplied as the first parameter), it returns the decorated function or class. When used with parameters, it returns a function that can be applied to decorate the target function or class.
-
divik.core.
context_if
(condition, context, *args, **kwargs)[source]¶ Create context with given params only if the condition is
True
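A conditional context of this kind can be sketched with contextlib. The version below is an assumption about the behaviour, not divik's code: the name context_if_sketch is hypothetical, and it is assumed that nothing is entered (and None is bound) when the condition is False.

```python
from contextlib import contextmanager, nullcontext

events = []


@contextmanager
def recording(name):
    """Small context manager that records when it is entered and exited."""
    events.append(f"enter {name}")
    try:
        yield name
    finally:
        events.append(f"exit {name}")


def context_if_sketch(condition, context, *args, **kwargs):
    """Build context(*args, **kwargs) only when condition holds."""
    if condition:
        return context(*args, **kwargs)
    return nullcontext()  # a do-nothing context that binds None


with context_if_sketch(True, recording, "a") as entered:
    pass
with context_if_sketch(False, recording, "b") as skipped:
    pass
```

Only the first block touches the recording context; the second runs its body without ever constructing it.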
-
divik.core.
dump_gin_args
(destination)[source]¶ Dump gin-config effective configuration
If you have gin extras installed, you can call dump_gin_args to save the effective gin configuration to a file.
-
divik.core.
maybe_pool
(processes=None, *args, **kwargs)[source]¶ Create
multiprocessing.Pool
if multiple CPUs are allowed
Examples
>>> from divik.core import maybe_pool
>>> with maybe_pool(processes=1) as pool:
...     # Runs sequentially
...     pool.map(id, range(10000))
>>> with maybe_pool(processes=-1) as pool:
...     # Runs with all cores
...     pool.map(id, range(10000))
-
divik.core.
normalize_rows
(data)[source]¶ Translate and scale rows to zero mean and vector length equal to one
- Return type
ndarray
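The row normalization described here subtracts each row's mean and then divides the row by its Euclidean length. A minimal sketch on nested lists follows; normalize_rows_sketch is a hypothetical name, and the real function works on a numpy array.

```python
import math


def normalize_rows_sketch(data):
    """Center every row at zero and scale it to unit Euclidean length."""
    normalized = []
    for row in data:
        center = sum(row) / len(row)
        shifted = [value - center for value in row]
        # Note: a constant row has zero length and cannot be scaled.
        length = math.sqrt(sum(value * value for value in shifted))
        normalized.append([value / length for value in shifted])
    return normalized


rows = normalize_rows_sketch([[1.0, 2.0, 3.0], [10.0, 0.0, 5.0]])
```

After this step the dot product of two rows equals their Pearson correlation, which is a common reason for normalizing rows this way.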
-
divik.core.
seeded
(wrapped_requires_seed=False)[source]¶ Create seeded scope for function call.
- Parameters
- wrapped_requires_seed: bool, optional, default: False
If True, passes the seed parameter to the inner function.
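A seeded scope can be pictured as a decorator that seeds a random generator before the call and restores its previous state afterwards. The sketch below is heavily assumption-based: seeded_sketch is a hypothetical name, it uses Python's random module, it assumes the seed arrives as a seed keyword argument with a default of 0, and none of this is confirmed to match divik's implementation.

```python
import functools
import random


def seeded_sketch(wrapped_requires_seed=False):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, seed=0, **kwargs):
            state = random.getstate()
            random.seed(seed)
            try:
                if wrapped_requires_seed:
                    # Forward the seed to the wrapped function as well.
                    return func(*args, seed=seed, **kwargs)
                return func(*args, **kwargs)
            finally:
                # Leave the global generator as it was before the call.
                random.setstate(state)

        return wrapper

    return decorator


@seeded_sketch()
def draw(n):
    return [random.random() for _ in range(n)]


@seeded_sketch(wrapped_requires_seed=True)
def draw_with_seed(n, seed):
    return seed, [random.random() for _ in range(n)]
```

Calling draw(3, seed=1) twice gives the same numbers, while code outside the decorated call sees an unchanged generator state.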
Share a numpy array between
multiprocessing.Pool
processes
-
divik.core.
visualize
(label, xy, shape=None)[source]¶ Create an RGB map of labels over the given coordinates
Modules
divik.core.gin_sklearn_configurables | Mark scikit-learn classes as configurable
divik.core.io | Reusable utilities for data and model I/O
divik.core.io
module¶
Reusable utilities for data and model I/O
-
divik.core.io.
save
(model, destination, **kwargs)[source]¶ Save model and related summaries into specified destination directory
-
divik.core.io.
saver
(fn)[source]¶ Register the function as handler for saving model and related summaries
The saver function should be reusable for different models exhibiting the required variables. Prefer checking for the required attributes rather than the model class.
Examples
>>> from divik.core.io import saver
>>> @saver
... def my_saver(model, destination, **kwargs):
...     if not hasattr(model, 'my_custom_field_'):
...         return
...     if 'my_param' not in kwargs:
...         return
...     # custom saving logic comes here
You can also make this function configurable:
>>> import gin
>>> from divik.core.io import saver
>>> @saver
... @gin.configurable(allowlist=['my_param'])
... def configurable_saver(model, destination, my_param=None, **kwargs):
...     if not hasattr(model, 'my_custom_field_'):
...         return
...     if my_param is None:
...         return
...     # custom saving logic comes here
divik.core.gin_sklearn_configurables
module¶
Mark scikit-learn classes as configurable