`divik.sampler` module

Sampling methods for statistical indices computation purposes

class divik.sampler.BaseSampler

Base class for all the samplers

A sampler is Pool-safe: it can simply store a dataset and, if handled properly, the dataset is not pickled when the sampler is sent to another process.

Before you spawn a pool, the data must be moved to a module-level variable. A contract simplifies this: open a parallel context and operate within it:

>>> from multiprocessing import Pool
>>> with sampler.parallel() as sampler_, \
...         Pool(initializer=sampler_.initializer,
...              initargs=sampler_.initargs) as pool:
...     pool.map(sampler_.get_sample, range(10))

Keep in mind that __iter__ and fit are not accessible in the parallel context. __iter__ would yield the same values independently in every worker, so iteration has to be distributed consciously and in a well-thought-out manner. fit could lead to unpredictable behaviour. If you need the original sampler, you can get a clone (not fit to the data).
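The idea behind this contract can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not divik's implementation: `_DATA`, `init_worker`, `get_sample` and `draw_in_pool` are hypothetical names. The pool initializer stores the dataset in a module-level variable once per worker, so each task only has to carry an integer seed.

```python
import random
from multiprocessing import Pool

# Module-level slot that every worker process fills once (hypothetical name).
_DATA = None


def init_worker(data):
    """Pool initializer: keep the dataset in a module-level variable."""
    global _DATA
    _DATA = data


def get_sample(seed):
    """Draw a seeded sample with replacement from the shared dataset."""
    rng = random.Random(seed)
    return [rng.choice(_DATA) for _ in range(len(_DATA))]


def draw_in_pool(data, seeds, processes=2):
    """Hand the data to each worker via initargs, then map over seeds only."""
    with Pool(processes, initializer=init_worker, initargs=(data,)) as pool:
        return pool.map(get_sample, seeds)
```

In a script, `draw_in_pool(data, range(10))` would be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.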

Methods

`fit`(X[, y])	Fit sampler to data
`get_params`([deep])	Get parameters for this estimator.
`get_sample`(seed)	Return specific sample
`parallel`()	Create parallel context for the sampler to operate
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y=None)

Fit sampler to data

It’s a base for both supervised and unsupervised samplers.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict): Parameter names mapped to their values.

abstract get_sample(seed)

Return specific sample

The following assumptions should be met: a) the draw is deterministic, i.e. sampler.get_sample(x) == sampler.get_sample(x); b) different seeds give different samples, i.e. x != y should yield sampler.get_sample(x) != sampler.get_sample(y).

Parameters

seed (int): The seed to use to draw the sample

Returns

sample (array_like, shape (*self.shape_)): Returns the drawn sample
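The two assumptions on `get_sample` describe a deterministic, seed-indexed draw. A minimal sketch, assuming a hypothetical `SeededSampler` class that is not part of divik and relies only on the standard library:

```python
import random


class SeededSampler:
    """Hypothetical sampler that draws n_rows x n_cols uniform values."""

    def __init__(self, n_rows, n_cols):
        self.shape_ = (n_rows, n_cols)

    def get_sample(self, seed):
        # A fresh generator per call keeps the result a pure function of seed.
        rng = random.Random(seed)
        n_rows, n_cols = self.shape_
        return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


sampler = SeededSampler(n_rows=3, n_cols=2)
same = sampler.get_sample(7) == sampler.get_sample(7)       # assumption a)
different = sampler.get_sample(7) != sampler.get_sample(8)  # assumption b)
```

Assumption a) holds by construction here. Assumption b) holds only with overwhelming probability, which is usually what a seeded generator can offer.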

parallel(): Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict): Estimator parameters.

Returns

self (estimator instance): Estimator instance.

class divik.sampler.ParallelSampler(sampler)

Helper class for sharing the sampler functionality

Attributes

initargs

Methods

`clone`()	Clones the original sampler
`get_sample`(seed)	Return specific sample
`initializer`(*args)

clone(): Clones the original sampler

get_sample(seed): Return specific sample

property initargs

initializer(*args)

class divik.sampler.StratifiedSampler(n_rows=100, n_samples=None)

Sample the original data preserving proportions of groups

Parameters

n_rows (int or float, optional, default 100): Limits the number of rows in the drawn samples. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the sample. If int, it represents the absolute number of rows.
n_samples (int, optional, default None): Limits the number of samples when iterating

Attributes

X_ (array_like, shape (n_rows, n_features)): Data to sample from
y_ (array_like, shape (n_rows,)): Group labels

Methods

`fit`(X, y)	Fit the model from data in X.
`get_params`([deep])	Get parameters for this estimator.
`get_sample`(seed)	Return specific sample
`parallel`()	Create parallel context for the sampler to operate
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y)

Fit the model from data in X.

Both inputs are preserved inside to sample from the data.

Parameters

X (array-like, shape (n_rows, n_features)): Training vector, where n_rows is the number of rows and n_features is the number of features.
y (array-like, shape (n_rows,)): Group labels.

Returns

self (StratifiedSampler): Returns the instance itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict): Parameter names mapped to their values.

get_sample(seed)

Return specific sample

The sample is drawn from the set of existing rows. The proportions of groups should stay more or less the same as in the original data, depending on the size of the sample.

Parameters

seed (int): The seed to use to draw the sample

Returns

sample (array_like, shape (*self.shape_)): Returns the drawn sample
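Proportion-preserving sampling can be illustrated with a short, standard-library sketch. It is not divik's implementation; `stratified_sample` is a hypothetical helper that interprets `n_rows` as described above (float as a fraction, int as an absolute count) and draws the same fraction from every group.

```python
import random
from collections import defaultdict


def stratified_sample(X, y, n_rows, seed):
    """Draw rows of X so that each label keeps roughly its share of y."""
    # A float is a fraction of the dataset, an int an absolute row count.
    fraction = n_rows if isinstance(n_rows, float) else n_rows / len(X)
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(y):
        by_label[label].append(index)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Rounding means the total can differ slightly from the request.
        k = min(len(indices), round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    chosen.sort()
    return [X[i] for i in chosen], [y[i] for i in chosen]


X = [[float(i)] for i in range(10)]
y = ["a"] * 6 + ["b"] * 4
rows, labels = stratified_sample(X, y, n_rows=0.5, seed=0)
```

With six rows labelled "a" and four labelled "b", a fraction of 0.5 keeps three "a" rows and two "b" rows, so the 60/40 split survives in the sample.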

parallel(): Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict): Estimator parameters.

Returns

self (estimator instance): Estimator instance.

class divik.sampler.UniformPCASampler(n_rows=None, n_samples=None, whiten=False, refit=False, pca='knee')

Rotation-invariant uniform sampling

Parameters

n_rows (int, optional, default None)

Limits the number of rows in the drawn samples

n_samples (int, optional, default None)

Limits the number of samples when iterating

whiten (bool, optional, default False)

When True, the pca_.components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

refit (bool, optional, default False)

When True, the pca_ is re-fit with the smaller number of components. This could reduce the memory footprint, but requires fitting PCA a second time.

pca ({‘knee’, ‘full’}, default ‘knee’)

Specifies whether to train full or knee PCA.

Attributes

pca_ (KneePCA or PCA): PCA transform which provides rotation-invariance
sampler_ (UniformSampler): Sampler from the transformed distribution

Methods

`fit`(X[, y])	Fit the model from data in X.
`get_params`([deep])	Get parameters for this estimator.
`get_sample`(seed)	Return specific sample
`parallel`()	Create parallel context for the sampler to operate
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y=None)

Fit the model from data in X.

PCA is fit to estimate the rotation and UniformSampler is fit to the transformed data.

Parameters

X (array-like, shape (n_samples, n_features)): Training vector, where n_samples is the number of samples and n_features is the number of features.
y: Ignored.

Returns

self (UniformPCASampler): Returns the instance itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict): Parameter names mapped to their values.

get_sample(seed)

Return specific sample

The sample is generated from the transformed distribution and then transformed back to the original space.

Parameters

seed (int): The seed to use to draw the sample

Returns

sample (array_like, shape (*self.shape_)): Returns the drawn sample
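The "sample in the rotated space, then rotate back" idea can be shown in two dimensions without any dependencies. This is only a sketch under simplifying assumptions: it uses the closed-form principal-axis angle of a 2x2 covariance matrix instead of the library's KneePCA or PCA, and `fit_rotation`, `rotate` and `pca_uniform_sample` are hypothetical names.

```python
import math
import random


def fit_rotation(X):
    """Return the mean and principal-axis angle of 2-D data."""
    n = len(X)
    mx = sum(p[0] for p in X) / n
    my = sum(p[1] for p in X) / n
    sxx = sum((p[0] - mx) ** 2 for p in X) / n
    syy = sum((p[1] - my) ** 2 for p in X) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in X) / n
    # Closed-form orientation of the first principal axis.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mx, my), angle


def rotate(point, angle):
    """Rotate a 2-D point counter-clockwise by angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * point[0] - s * point[1], s * point[0] + c * point[1])


def pca_uniform_sample(X, seed):
    """Sample uniformly in the bounding box aligned with the principal axes."""
    (mx, my), angle = fit_rotation(X)
    # Forward transform: centre, then rotate the principal axis onto x.
    Z = [rotate((p[0] - mx, p[1] - my), -angle) for p in X]
    lo = [min(z[d] for z in Z) for d in range(2)]
    hi = [max(z[d] for z in Z) for d in range(2)]
    rng = random.Random(seed)
    sample = []
    for _ in range(len(X)):
        z = (rng.uniform(lo[0], hi[0]), rng.uniform(lo[1], hi[1]))
        # Inverse transform: rotate back and restore the mean.
        x, y = rotate(z, angle)
        sample.append((x + mx, y + my))
    return sample


X = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
sample = pca_uniform_sample(X, seed=1)
```

For data lying on the diagonal, the drawn points stay on the diagonal as well, whereas an axis-aligned bounding box would cover the whole square. That is the practical meaning of rotation invariance here.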

parallel(): Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict): Estimator parameters.

Returns

self (estimator instance): Estimator instance.

class divik.sampler.UniformSampler(n_rows=None, n_samples=None)

Samples uniformly from the boundaries of the data

Parameters

n_rows (int, optional, default None): Limits the number of rows in the drawn samples
n_samples (int, optional, default None): Limits the number of samples when iterating

Attributes

shape_ (tuple (n_rows, n_cols)): Shape of the drawn samples
scaler_ (MinMaxScaler): Scaler ensuring the proper ranges

Methods

`fit`(X[, y])	Fit the model from data in X.
`get_params`([deep])	Get parameters for this estimator.
`get_sample`(seed)	Return specific sample
`parallel`()	Create parallel context for the sampler to operate
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y=None)

Fit the model from data in X.

Parameters

X (array-like, shape (n_samples, n_features)): Training vector, where n_samples is the number of samples and n_features is the number of features.
y: Ignored.

Returns

self (UniformSampler): Returns the instance itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params (dict): Parameter names mapped to their values.

get_sample(seed)

Return specific sample

Parameters

seed (int): The seed to use to draw the sample

Returns

sample (array_like, shape (*self.shape_)): Returns the drawn sample
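Sampling uniformly from the boundaries of the data means drawing each feature independently between its observed minimum and maximum. A minimal sketch with hypothetical helpers `fit_bounds` and `uniform_sample`; the documented `scaler_` attribute suggests the library rescales draws through a MinMaxScaler, while this sketch draws directly in the data range instead.

```python
import random


def fit_bounds(X):
    """Per-feature minimum and maximum of the data."""
    columns = list(zip(*X))
    return [min(c) for c in columns], [max(c) for c in columns]


def uniform_sample(bounds, n_rows, seed):
    """Draw n_rows points uniformly from the box spanned by the bounds."""
    lo, hi = bounds
    rng = random.Random(seed)
    return [[rng.uniform(a, b) for a, b in zip(lo, hi)] for _ in range(n_rows)]


X = [[0.0, 10.0], [1.0, 30.0], [0.5, 20.0]]
bounds = fit_bounds(X)  # ([0.0, 10.0], [1.0, 30.0])
sample = uniform_sample(bounds, n_rows=5, seed=42)
```

The sample keeps the number of columns of the fitted data, and its number of rows is chosen freely, which mirrors the role of `n_rows` and `shape_` above.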

parallel(): Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict): Estimator parameters.

Returns

self (estimator instance): Estimator instance.
