divik.sampler module

Sampling methods for statistical indices computation purposes

class divik.sampler.BaseSampler[source]

Base class for all the samplers

A sampler is Pool-safe, i.e. it can simply store a dataset. If handled properly, it will not be serialized by pickle when moved to another process.

Before you spawn a pool, the data must be moved to a module-level variable. To simplify that process, a contract has been prepared: you open a context and operate within it:

>>> with sampler.parallel() as sampler_, \
...         Pool(initializer=sampler_.initializer,
...              initargs=sampler_.initargs) as pool:
...     pool.map(sampler_.get_sample, range(10))

Keep in mind that __iter__ and fit are not accessible in the parallel context. __iter__ would yield the same values independently in all the workers, so iteration now has to be done consciously and in a well-thought-out manner. fit could lead to unpredictable behaviour. If you need the original sampler, you can get a clone (not fit to the data).

Methods

fit(X[, y])

Fit sampler to data

get_params([deep])

Get parameters for this estimator.

get_sample(seed)

Return specific sample

parallel()

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None)[source]

Fit sampler to data

It’s a base for both supervised and unsupervised samplers.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

abstract get_sample(seed)[source]

Return specific sample

The following assumptions should be met:

a) sampler.get_sample(x) == sampler.get_sample(x)

b) x != y should yield sampler.get_sample(x) != sampler.get_sample(y)

Parameters
seed: int

The seed to use to draw the sample

Returns
sample: array_like, (*self.shape_)

Returns the drawn sample
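The two assumptions on get_sample can be illustrated with a toy sampler. This is a sketch under stated assumptions: `ToySampler` is a hypothetical class, not part of divik, and it uses the standard-library `random.Random` where a real sampler would likely use an array library.

```python
import random


class ToySampler:
    """Hypothetical sampler: draws `n_rows` values from a fixed list."""

    def __init__(self, data, n_rows):
        self.data = list(data)
        self.n_rows = n_rows

    def get_sample(self, seed):
        # A fresh generator per call makes the result depend only on the seed.
        rng = random.Random(seed)
        return [rng.choice(self.data) for _ in range(self.n_rows)]


sampler = ToySampler(range(1000), n_rows=20)
same = sampler.get_sample(7) == sampler.get_sample(7)
different = sampler.get_sample(7) != sampler.get_sample(8)
```

In this sketch assumption (a) holds exactly, while assumption (b) holds only with very high probability, since two seeds could in principle produce identical draws.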

parallel()[source]

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.
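The <component>__<parameter> naming used by set_params can be sketched with plain string handling. The helper `split_param` below is hypothetical and only illustrates how such a name separates into its two parts; the parameter name `pca__whiten` in the usage is likewise just an example.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter)."""
    component, sep, rest = name.partition("__")
    # No separator: the parameter belongs to the top-level estimator.
    return (None, component) if not sep else (component, rest)


top_level = split_param("n_rows")
nested = split_param("pca__whiten")
```

A single underscore, as in `n_rows`, is not treated as a separator; only the double underscore marks a nested component.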

class divik.sampler.ParallelSampler(sampler)[source]

Helper class for sharing the sampler functionality

Attributes
initargs

Methods

clone()

Clones the original sampler

get_sample(seed)

Return specific sample

initializer

clone()[source]

Clones the original sampler

get_sample(seed)[source]

Return specific sample

property initargs

initializer(*args)[source]

class divik.sampler.StratifiedSampler(n_rows=100, n_samples=None)[source]

Sample the original data preserving proportions of groups

Parameters
n_rows: int or float, optional (default 100)

Limits the number of rows in the drawn samples. If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the sample. If int, it represents the absolute number of rows.

n_samples: int, optional (default None)

Limits the number of samples when iterating
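The int-or-float meaning of n_rows can be sketched as a small helper. `resolve_n_rows` is a hypothetical function, not part of divik, and truncating the fractional row count is an assumption made here; the library may round differently.

```python
def resolve_n_rows(n_rows, n_total):
    """Turn an int count or a float fraction into an absolute row count."""
    if isinstance(n_rows, float):
        if not 0.0 <= n_rows <= 1.0:
            raise ValueError("float n_rows must lie between 0.0 and 1.0")
        # Assumption: fractional row counts are truncated towards zero.
        return int(n_rows * n_total)
    return int(n_rows)


from_fraction = resolve_n_rows(0.25, 200)
from_count = resolve_n_rows(100, 200)
```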

Attributes
X_: array_like, shape (n_rows, n_features)

Data to sample from

y_: array_like, shape (n_rows,)

Group labels

Methods

fit(X, y)

Fit the model from data in X.

get_params([deep])

Get parameters for this estimator.

get_sample(seed)

Return specific sample

parallel()

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

fit(X, y)[source]

Fit the model from data in X.

Both inputs are preserved inside to sample from the data.

Parameters
X: array-like, shape (n_rows, n_features)

Training vector, where n_rows is the number of rows and n_features is the number of features.

y: array-like, shape (n_rows,)

Group labels.

Returns
self: StratifiedSampler

Returns the instance itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

get_sample(seed)[source]

Return specific sample

The sample is drawn from the set of existing rows. The proportions of groups should stay more or less the same, depending on the size of the sample.

Parameters
seed: int

The seed to use to draw the sample

Returns
sample: array_like, (*self.shape_)

Returns the drawn sample
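The idea of drawing existing rows while keeping group proportions can be sketched with the standard library. This is not divik's implementation: `stratified_sample` is a hypothetical function, and drawing the same fraction from every group is one assumed way of preserving proportions.

```python
import random
from collections import defaultdict


def stratified_sample(rows, labels, fraction, seed):
    """Draw about `fraction` of the rows from every group of labels."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    chosen = []
    for label in sorted(groups):
        members = groups[label]
        # Rounding makes the proportions approximate for small groups.
        k = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, k))
    return [rows[i] for i in sorted(chosen)]


rows = list(range(100))
labels = ["a"] * 80 + ["b"] * 20
sample = stratified_sample(rows, labels, fraction=0.1, seed=0)
```

With 80 rows in group "a" and 20 in group "b", a 10% sample holds 8 rows from "a" and 2 from "b", so the 4:1 ratio is preserved. For smaller samples the rounding step is what makes the proportions only approximately equal.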

parallel()[source]

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.

class divik.sampler.UniformPCASampler(n_rows=None, n_samples=None, whiten=False, refit=False, pca='knee')[source]

Rotation-invariant uniform sampling

Parameters
n_rows: int, optional (default None)

Limits the number of rows in the drawn samples

n_samples: int, optional (default None)

Limits the number of samples when iterating

whiten: bool, optional (default False)

When True (False by default) the pca_.components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

refit: bool, optional (default False)

When True (False by default) the pca_ is re-fit with the smaller number of components. This could reduce the memory footprint, but requires fitting the PCA again.

pca: {‘knee’, ‘full’}, default ‘knee’

Specifies whether to train full or knee PCA.

Attributes
pca_: KneePCA or PCA

PCA transform which provided rotation-invariance

sampler_: UniformSampler

Sampler from the transformed distribution

Methods

fit(X[, y])

Fit the model from data in X.

get_params([deep])

Get parameters for this estimator.

get_sample(seed)

Return specific sample

parallel()

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None)[source]

Fit the model from data in X.

PCA is fit to estimate the rotation and UniformSampler is fit to transformed data.

Parameters
X: array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: Ignored.

Returns
self: UniformPCASampler

Returns the instance itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

get_sample(seed)[source]

Return specific sample

Sample is generated from transformed distribution and transformed back to the original space.

Parameters
seed: int

The seed to use to draw the sample

Returns
sample: array_like, (*self.shape_)

Returns the drawn sample

parallel()

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.

class divik.sampler.UniformSampler(n_rows=None, n_samples=None)[source]

Samples uniformly from the boundaries of the data

Parameters
n_rows: int, optional (default None)

Limits the number of rows in the drawn samples

n_samples: int, optional (default None)

Limits the number of samples when iterating

Attributes
shape_: (n_rows, n_cols)

Shape of the drawn samples

scaler_: MinMaxScaler

Scaler ensuring the proper ranges

Methods

fit(X[, y])

Fit the model from data in X.

get_params([deep])

Get parameters for this estimator.

get_sample(seed)

Return specific sample

parallel()

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None)[source]

Fit the model from data in X.

Parameters
X: array-like, shape (n_samples, n_features)

Training vector, where n_samples is the number of samples and n_features is the number of features.

y: Ignored.

Returns
self: UniformSampler

Returns the instance itself.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

get_sample(seed)[source]

Return specific sample

Parameters
seed: int

The seed to use to draw the sample

Returns
sample: array_like, (*self.shape_)

Returns the drawn sample
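Sampling uniformly from the boundaries of the data can be sketched as follows. This assumes that "boundaries" means the per-feature minimum and maximum, which is consistent with the MinMaxScaler attribute but is not confirmed by the text above. `uniform_within_bounds` is a hypothetical function, not part of divik.

```python
import random


def uniform_within_bounds(data, n_rows, seed):
    """Draw rows uniformly from the per-column [min, max] box of `data`."""
    rng = random.Random(seed)
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        for _ in range(n_rows)
    ]


data = [[0.0, 10.0], [1.0, 30.0], [0.5, 20.0]]
sample = uniform_within_bounds(data, n_rows=5, seed=42)
```

The drawn rows are new points inside the bounding box of the data, not copies of existing rows, and the same seed always reproduces the same sample.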

parallel()

Create parallel context for the sampler to operate

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.