divik.sampler module
Sampling methods for computing statistical indices
- class divik.sampler.BaseSampler
Base class for all the samplers
A sampler is Pool-safe, i.e. it can simply store a dataset. If handled properly, the dataset will not be serialized by pickle when the sampler goes to another process.
Before you spawn a pool, the data must be moved to a module-level variable. To simplify that process, a contract has been prepared: you open a context and operate within it:
>>> with sampler.parallel() as sampler_, \
...         Pool(initializer=sampler_.initializer,
...              initargs=sampler_.initargs) as pool:
...     pool.map(sampler_.get_sample, range(10))
Keep in mind that __iter__ and fit are not accessible in a parallel context. __iter__ would yield the same values independently in all the workers, so iteration has to be done consciously and in a well-thought-out manner. fit could lead to unpredictable behaviour. If you need the original sampler, you can get a clone (not fit to the data).
Methods
fit
(X[, y])Fit sampler to data
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
- fit(X, y=None)
Fit sampler to data
It’s a base for both supervised and unsupervised samplers.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- abstract get_sample(seed)
Return specific sample
The following assumptions should be met:
a) sampler.get_sample(x) == sampler.get_sample(x)
b) x != y should yield sampler.get_sample(x) != sampler.get_sample(y)
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
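The two assumptions can be illustrated with a toy sampler. The sketch below is only an assumption-laden example: `ToySampler` is a hypothetical class, not part of divik, and it uses the standard-library `random` module instead of any real data.

```python
import random


class ToySampler:
    """Hypothetical sampler, used only to illustrate the contract."""

    def __init__(self, n_rows=5):
        self.n_rows = n_rows

    def get_sample(self, seed):
        # A fresh generator per call keeps the result a pure function of seed.
        rng = random.Random(seed)
        return [rng.random() for _ in range(self.n_rows)]


sampler = ToySampler()
same = sampler.get_sample(0) == sampler.get_sample(0)       # assumption a)
different = sampler.get_sample(0) != sampler.get_sample(1)  # assumption b)
```

Creating the generator inside `get_sample` rather than storing it on the instance is what makes repeated calls with the same seed agree.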
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- class divik.sampler.ParallelSampler(sampler)
Helper class for sharing the sampler functionality
- Attributes
- initargs
Methods
clone
()Clones the original sampler
get_sample
(seed)Return specific sample
initializer
- property initargs
- class divik.sampler.StratifiedSampler(n_rows=100, n_samples=None)
Sample the original data preserving proportions of groups
- Parameters
- n_rowsint or float, optional (default 100)
Limits the number of rows in the drawn samples. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the sample. If int, it represents the absolute number of rows.
- n_samplesint, optional (default None)
Limits the number of samples when iterating
- Attributes
- X_array_like, shape (n_rows, n_features)
Data to sample from
- y_array_like, shape (n_rows,)
Group labels
Methods
fit
(X, y)Fit the model from data in X.
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
- fit(X, y)
Fit the model from data in X.
Both inputs are preserved inside to sample from the data.
- Parameters
- Xarray-like, shape (n_rows, n_features)
Training vector, where n_rows is the number of rows and n_features is the number of features.
- y: array-like, shape (n_rows,)
Group labels.
- Returns
- selfStratifiedSampler
Returns the instance itself.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_sample(seed)
Return specific sample
The sample is drawn from the set of existing rows. The proportions of groups should be more or less the same as in the original data, depending on the size of the sample.
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
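A rough picture of proportion-preserving sampling is sketched below in plain Python. It is an illustration under stated assumptions, not necessarily how divik draws its samples: `stratified_sample` is a hypothetical helper, and rounding the per-group counts is one of several possible allocation rules.

```python
import random
from collections import defaultdict


def stratified_sample(X, y, n_rows, seed):
    """Draw about n_rows existing rows, keeping group proportions close to y's."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(y):
        groups[label].append(index)
    chosen = []
    for label in sorted(groups):
        indices = groups[label]
        # Rounding is why proportions are only approximately preserved.
        k = max(1, round(n_rows * len(indices) / len(y)))
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return [X[i] for i in chosen], [y[i] for i in chosen]


X = [[float(i)] for i in range(100)]
y = [0] * 80 + [1] * 20
rows, labels = stratified_sample(X, y, n_rows=10, seed=0)
```

With 80 rows in one group and 20 in the other, a request for 10 rows yields 8 and 2 rows respectively; for small samples or many groups the rounding makes the match less exact.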
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- class divik.sampler.UniformPCASampler(n_rows=None, n_samples=None, whiten=False, refit=False, pca='knee')
Rotation-invariant uniform sampling
- Parameters
- n_rowsint, optional (default None)
Limits the number of rows in the drawn samples
- n_samplesint, optional (default None)
Limits the number of samples when iterating
- whitenbool, optional (default False)
When True (False by default) the pca_.components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
- refitbool, optional (default False)
When True (False by default), pca_ is re-fit with the smaller number of components. This could reduce the memory footprint, but requires fitting PCA again.
- pca: {‘knee’, ‘full’}, default ‘knee’
Specifies whether to train full or knee PCA.
- Attributes
- pca_KneePCA or PCA
PCA transform which provided rotation-invariance
- sampler_UniformSampler
Sampler from the transformed distribution
Methods
fit
(X[, y])Fit the model from data in X.
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
- fit(X, y=None)
Fit the model from data in X.
PCA is fit to estimate the rotation and UniformSampler is fit to transformed data.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- selfUniformPCASampler
Returns the instance itself.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_sample(seed)
Return specific sample
Sample is generated from transformed distribution and transformed back to the original space.
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
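The "sample in the transformed space, then transform back" idea can be sketched with NumPy alone. This is a simplified illustration, not divik's code: `fit_rotation` and `pca_uniform_sample` are hypothetical helpers, the rotation comes from a plain SVD, and component selection (the 'knee' option) and whitening are left out.

```python
import numpy as np


def fit_rotation(X):
    """Estimate a PCA rotation and the data range in the rotated space."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    Z = (X - mean) @ vt.T
    return mean, vt, Z.min(axis=0), Z.max(axis=0)


def pca_uniform_sample(fitted, n_rows, seed):
    """Sample uniformly in the rotated space, then rotate back."""
    mean, vt, low, high = fitted
    rng = np.random.default_rng(seed)
    Z = rng.uniform(low, high, size=(n_rows, low.shape[0]))
    # vt is orthonormal, so multiplying by it undoes the rotation.
    return Z @ vt + mean


X = np.random.default_rng(0).normal(size=(50, 3)) * [3.0, 1.0, 0.2]
fitted = fit_rotation(X)
sample = pca_uniform_sample(fitted, n_rows=20, seed=1)
```

Because the bounding box is taken along the principal axes rather than the original feature axes, rotating the input data rotates the drawn samples with it.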
- parallel()
Create parallel context for the sampler to operate
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- class divik.sampler.UniformSampler(n_rows=None, n_samples=None)
Samples uniformly within the boundaries of the data
- Parameters
- n_rowsint, optional (default None)
Limits the number of rows in the drawn samples
- n_samplesint, optional (default None)
Limits the number of samples when iterating
- Attributes
- shape_(n_rows, n_cols)
Shape of the drawn samples
- scaler_MinMaxScaler
Scaler ensuring the proper ranges
Methods
fit
(X[, y])Fit the model from data in X.
get_params
([deep])Get parameters for this estimator.
get_sample
(seed)Return specific sample
parallel
()Create parallel context for the sampler to operate
set_params
(**params)Set the parameters of this estimator.
- fit(X, y=None)
Fit the model from data in X.
- Parameters
- Xarray-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and n_features is the number of features.
- y: Ignored.
- Returns
- selfUniformSampler
Returns the instance itself.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_sample(seed)
Return specific sample
- Parameters
- seedint
The seed to use to draw the sample
- Returns
- samplearray_like, (*self.shape_)
Returns the drawn sample
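As a rough illustration of uniform sampling within the data boundaries, the sketch below records per-feature minima and maxima and draws from those ranges. It is an assumption-based example in plain Python: `fit_bounds` and `uniform_sample` are hypothetical helpers, and divik itself relies on the `scaler_` attribute listed above rather than on this code.

```python
import random


def fit_bounds(X):
    """Record the per-feature minimum and maximum of the data."""
    columns = list(zip(*X))
    return [min(c) for c in columns], [max(c) for c in columns]


def uniform_sample(bounds, n_rows, seed):
    """Draw n_rows rows uniformly inside the fitted per-feature ranges."""
    lows, highs = bounds
    rng = random.Random(seed)
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        for _ in range(n_rows)
    ]


X = [[0.0, 10.0], [1.0, 20.0], [0.5, 15.0]]
bounds = fit_bounds(X)
sample = uniform_sample(bounds, n_rows=4, seed=42)
```

The drawn rows are new points, not rows of `X`; only the ranges of the features are taken from the data.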
- parallel()
Create parallel context for the sampler to operate
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.