`divik.feature_selection` module

Unsupervised feature selection methods

class divik.feature_selection.EximsSelector

Select features based on their spatial distribution

Preserves features that yield biologically plausible structures.

References

Wijetunge, Chalini D., et al. “EXIMS: an improved data analysis pipeline based on a new peak picking method for EXploring Imaging Mass Spectrometry data.” Bioinformatics 31.19 (2015): 3198-3206. https://academic.oup.com/bioinformatics/article/31/19/3198/212150

Methods

`fit`(X[, y, xy])	Learn data-driven feature thresholds from X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None, xy=None)

Learn data-driven feature thresholds from X.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors from which to compute feature characteristic.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
xy: array-like, shape (n_samples, 2): Spatial coordinates of the samples. Expects integers, indices over an image.

Returns

self
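
The `xy` argument pairs every row of X with a pixel position. As a minimal sketch, the coordinates of a full rectangular image flattened row by row could be built as below. The helper name `pixel_coordinates` is hypothetical, and the column order (x first, then y) is an assumption that the source does not confirm.

```python
def pixel_coordinates(height, width):
    """Return one (x, y) integer pair per pixel of a row-major flattened image."""
    # Assumption: the first column holds x (column index), the second holds y (row index).
    return [(x, y) for y in range(height) for x in range(width)]


xy = pixel_coordinates(height=2, width=3)
print(xy)
```

The resulting list has one entry per sample, so it can be checked against the number of rows in X before fitting.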

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Mask feature names according to selected features.

Parameters

input_features: array-like of str or None, default=None: Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects: Transformed feature names.
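
The name masking described above can be sketched in plain Python. This is an illustration of the documented behaviour, not the library's code; the function name `masked_feature_names` and the example feature names are made up for the sketch.

```python
def masked_feature_names(support, input_features=None):
    """Return the names of the selected features.

    When no names are given, generate x0 .. x(n-1), one per input feature.
    """
    if input_features is None:
        input_features = [f"x{i}" for i in range(len(support))]
    if len(input_features) != len(support):
        raise ValueError("input_features must have one name per input feature")
    return [name for name, keep in zip(input_features, support) if keep]


support = [True, False, True, True]
names_default = masked_feature_names(support)
names_custom = masked_feature_names(support, ["mz100", "mz101", "mz102", "mz103"])
print(names_default, names_custom)
```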

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

get_support(indices=False)

Get a mask, or integer index, of the features selected.

Parameters

indices: bool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

support: array: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
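
The two return forms carry the same information. The sketch below, written with plain lists rather than arrays, shows how a boolean mask maps to integer indices and how either one selects columns; the helper names are hypothetical.

```python
def support_indices(mask):
    """Convert a boolean mask over input features to integer indices."""
    return [i for i, keep in enumerate(mask) if keep]


def select_columns(rows, mask):
    """Keep only the columns whose mask entry is True."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


mask = [False, True, True, False, True]   # one entry per input feature
indices = support_indices(mask)           # one entry per output feature
X = [[10, 11, 12, 13, 14],
     [20, 21, 22, 23, 24]]
X_selected = select_columns(X, mask)
print(indices, X_selected)
```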

inverse_transform(X)

Reverse the transformation operation.

Parameters

X: array of shape [n_samples, n_selected_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().
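
A small sketch of the zero-filling described above, using plain lists and a hypothetical helper. It only illustrates the shape change and is not the library's implementation.

```python
def inverse_transform(rows, mask):
    """Re-insert zero columns at the positions of removed features."""
    restored = []
    for row in rows:
        values = iter(row)
        # Selected positions take the next reduced value, removed ones become 0.
        restored.append([next(values) if keep else 0 for keep in mask])
    return restored


mask = [False, True, True, False, True]
X_reduced = [[11, 12, 14],
             [21, 22, 24]]
X_restored = inverse_transform(X_reduced, mask)
print(X_restored)
```

The original values of the removed features are not recovered; only the original column layout is.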

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.
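
To make the `<component>__<parameter>` convention concrete, here is a toy sketch of how such a name could be routed to a nested object. `ToyEstimator` is a hypothetical stand-in written for this example; it is neither a divik nor a scikit-learn class.

```python
class ToyEstimator:
    """Hypothetical estimator that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for name, value in params.items():
            component, delimiter, sub_name = name.partition("__")
            if delimiter:
                # Nested name: forward the remainder to the named component.
                self.params[component].set_params(**{sub_name: value})
            else:
                self.params[name] = value
        return self  # returning self allows call chaining


selector = ToyEstimator(p=0.2)
pipeline = ToyEstimator(selector=selector, verbose=False)
pipeline.set_params(selector__p=0.5, verbose=True)
print(selector.params, pipeline.params["verbose"])
```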

transform(X)

Reduce X to the selected features.

Parameters

X: array of shape [n_samples, n_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.GMMSelector(stat, neutral=None, use_log=False, n_candidates=None, min_features=1, min_features_rate=0.0, preserve_high=True, max_components=10)

Feature selector that removes features with low or high mean or variance

Gaussian Mixture Modeling is applied to the features’ characteristics and components are obtained. Crossing points of the components are considered candidate thresholds. Out of these up to n_candidates components are removed in such a way that at least min_features or min_features_rate features are retained.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters

stat: {‘mean’, ‘var’, ‘cv’}: Kind of statistic to be computed out of the feature.
neutral: float, optional, default: None: Value to be omitted from the computation of the statistic.
use_log: bool, optional, default: False: Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
n_candidates: int, optional, default: None: How many candidate thresholds to use at most. 0 preserves all the features (all candidate thresholds are discarded), None allows removing all but one component (all candidate thresholds are retained). A negative value means discarding up to all but -n_candidates candidates, e.g. -1 will retain at least two components (one candidate threshold is removed).
min_features: int, optional, default: 1: How many features must be preserved. Candidate thresholds are tested against this value, and if a threshold retains fewer features, a threshold that retains more features is selected instead.
min_features_rate: float, optional, default: 0.0: Similar to min_features but relative to the input data features number.
preserve_high: bool, optional, default: True: Whether to preserve the high-characteristic features or low-characteristic ones.
max_components: int, optional, default: 10: The maximum number of components used in the GMM decomposition.
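
The per-feature characteristic that the parameters above refer to can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the library's code: 'cv' is taken to be the coefficient of variation (standard deviation divided by mean), variance is the population variance, the logarithm is natural, and `feature_characteristic` is a hypothetical helper name.

```python
import math
import statistics


def feature_characteristic(column, stat, neutral=None, use_log=False):
    """Compute 'mean', 'var' or 'cv' for one feature column."""
    # Values equal to `neutral` are left out of the statistic.
    values = [v for v in column if neutral is None or v != neutral]
    if stat == "mean":
        result = statistics.fmean(values)
    elif stat == "var":
        result = statistics.pvariance(values)
    elif stat == "cv":
        # Assumption: coefficient of variation = std / mean.
        result = statistics.pstdev(values) / statistics.fmean(values)
    else:
        raise ValueError(f"unknown stat: {stat!r}")
    if use_log:
        if result <= 0:
            raise ValueError("use_log requires a positive characteristic")
        result = math.log(result)
    return result


column = [0, 2, 4, 0, 6]
print(feature_characteristic(column, "mean"))             # zeros included
print(feature_characteristic(column, "mean", neutral=0))  # zeros omitted
```

The explicit check before `math.log` mirrors the documented requirement that characteristics be positive when `use_log` is enabled.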

Examples

>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = labels * 5 + np.random.randn(*labels.shape)
>>> fs.GMMSelector('mean').fit_transform(data)
array([[14.78032811 15.35711257 ... 15.75193303]])
>>> fs.GMMSelector('mean', preserve_high=False).fit_transform(data)
array([[ 0.49671415 -0.1382643  ... -0.29169375]])
>>> fs.GMMSelector('mean', n_candidates=-1).fit_transform(data)
array([[10.32408397  9.61491772 ... 15.75193303]])

Attributes

vals_: array, shape (n_features,): Computed characteristic of each feature.
threshold_: float: Threshold value to filter the features by the characteristic.
raw_threshold_: float: Threshold value mapped back to characteristic space (no logarithm, etc.)
selected_: array, shape (n_features,): Vector of binary selections of the informative features.
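
How `n_candidates`, `min_features`, `min_features_rate` and `preserve_high` might interact when a threshold is picked can be sketched as below. The candidate thresholds are passed in directly instead of being derived from a fitted mixture, and two points are assumptions rather than documented behaviour: candidates are ordered from least to most aggressive, and `n_candidates` trims that list like a Python slice bound. `choose_threshold` is a hypothetical helper.

```python
import math


def choose_threshold(vals, candidates, n_candidates=None, min_features=1,
                     min_features_rate=0.0, preserve_high=True):
    """Pick the most aggressive allowed threshold that keeps enough features."""
    required = max(min_features, math.ceil(min_features_rate * len(vals)))
    # Least aggressive first: low thresholds when high values are preserved,
    # high thresholds when low values are preserved.
    ordered = sorted(candidates, reverse=not preserve_high)
    allowed = ordered[:n_candidates]  # None keeps all, 0 keeps none, -1 drops one
    for threshold in reversed(allowed):
        if preserve_high:
            selected = [v >= threshold for v in vals]
        else:
            selected = [v < threshold for v in vals]
        if sum(selected) >= required:
            return threshold, selected
    # No usable threshold: every feature is preserved.
    return None, [True] * len(vals)


vals = [0.1, 0.2, 5.0, 5.1, 10.0, 10.2]
candidates = [2.5, 7.5]
print(choose_threshold(vals, candidates))
print(choose_threshold(vals, candidates, min_features_rate=0.5))
```

With the defaults the highest candidate is used; raising `min_features_rate` forces a fallback to a threshold that retains more features.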

Methods

`fit`(X[, y])	Learn data-driven feature thresholds from X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None)

Learn data-driven feature thresholds from X.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors from which to compute feature characteristic.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Mask feature names according to selected features.

Parameters

input_features: array-like of str or None, default=None: Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects: Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

get_support(indices=False)

Get a mask, or integer index, of the features selected.

Parameters

indices: bool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

support: array: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)

Reverse the transformation operation.

Parameters

X: array of shape [n_samples, n_selected_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)

Reduce X to the selected features.

Parameters

X: array of shape [n_samples, n_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.HighAbundanceAndVarianceSelector(use_log=False, min_features=1, min_features_rate=0.0, max_components=10)

Feature selector that removes low-mean and low-variance features

Uses GMMSelector to filter out the low-abundance noise features and select high-variance informative features.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters

use_log: bool, optional, default: False: Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
min_features: int, optional, default: 1: How many features must be preserved.
min_features_rate: float, optional, default: 0.0: Similar to min_features but relative to the input data features number.
max_components: int, optional, default: 10: The maximum number of components used in the GMM decomposition.

Examples

>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> # Data in this case must be carefully crafted
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = np.vstack(100 * [labels * 10.])
>>> data += np.random.randn(*data.shape)
>>> sub = data[:, :-40]
>>> sub += 5 * np.random.randn(*sub.shape)
>>> # Label 0 has low abundance but high variance
>>> # Label 3 has low variance but high abundance
>>> # Labels 1 and 2 have not-lowest abundance and high variance
>>> selector = fs.HighAbundanceAndVarianceSelector().fit(data)
>>> selector.transform(labels.reshape(1,-1))
array([[1 1 1 1 1 ...2 2 2]])

Attributes

abundance_selector_: GMMSelector: Selector used to filter out the noise component.
variance_selector_: GMMSelector: Selector used to filter out the non-informative features.
selected_: array, shape (n_features,): Vector of binary selections of the informative features.
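
The two-stage idea behind `abundance_selector_` and `variance_selector_` can be sketched with fixed thresholds standing in for the GMM-derived ones. This is a simplification for illustration only; the function name, the thresholds and the toy data are all hypothetical.

```python
import statistics


def high_abundance_and_variance(columns, mean_threshold, var_threshold):
    """Keep features whose mean and variance both reach the given thresholds."""
    # Stage 1: drop low-abundance (low-mean) features.
    abundant = [statistics.fmean(c) >= mean_threshold for c in columns]
    # Stage 2: among the survivors, keep only high-variance features.
    selected = [
        keep and statistics.pvariance(c) >= var_threshold
        for c, keep in zip(columns, abundant)
    ]
    return abundant, selected


columns = [
    [0.0, 0.4, -0.4, 0.0],      # low abundance
    [10.0, 14.0, 6.0, 10.0],    # high abundance, high variance
    [30.0, 30.1, 29.9, 30.0],   # high abundance, low variance
]
abundant, selected = high_abundance_and_variance(columns, 5.0, 1.0)
print(abundant, selected)
```

Only the second feature passes both stages, which corresponds to the combined mask stored in `selected_`.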

Methods

`fit`(X[, y])	Learn data-driven feature thresholds from X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None)

Learn data-driven feature thresholds from X.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors from which to compute feature characteristic.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Mask feature names according to selected features.

Parameters

input_features: array-like of str or None, default=None: Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects: Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

get_support(indices=False)

Get a mask, or integer index, of the features selected.

Parameters

indices: bool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

support: array: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)

Reverse the transformation operation.

Parameters

X: array of shape [n_samples, n_selected_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)

Reduce X to the selected features.

Parameters

X: array of shape [n_samples, n_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.NoSelector

Dummy selector to use when no selection is supposed to be made.

Methods

`fit`(X[, y])	Pass data forward
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None)

Pass data forward

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors to pass.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Mask feature names according to selected features.

Parameters

input_features: array-like of str or None, default=None: Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects: Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

get_support(indices=False)

Get a mask, or integer index, of the features selected.

Parameters

indices: bool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

support: array: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)

Reverse the transformation operation.

Parameters

X: array of shape [n_samples, n_selected_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)

Reduce X to the selected features.

Parameters

X: array of shape [n_samples, n_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.OutlierAbundanceAndVarianceSelector(use_log=False, min_features_rate=0.01, p=0.2)

Methods

`fit`(X[, y])	Learn data-driven feature thresholds from X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None)

Learn data-driven feature thresholds from X.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors from which to compute feature characteristic.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Mask feature names according to selected features.

Parameters

input_features: array-like of str or None, default=None: Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects: Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

get_support(indices=False)

Get a mask, or integer index, of the features selected.

Parameters

indices: bool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

support: array: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)

Reverse the transformation operation.

Parameters

X: array of shape [n_samples, n_selected_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)

Reduce X to the selected features.

Parameters

X: array of shape [n_samples, n_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.OutlierSelector(stat, use_log=False, keep_outliers=False)

Feature selector that removes outlier features w.r.t. mean or variance

Hubert’s outlier detection is applied to the features’ characteristics and the outlying features are removed.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters

stat: {‘mean’, ‘var’}: Kind of statistic to be computed out of the feature.
use_log: bool, optional, default: False: Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
keep_outliers: bool, optional, default: False: When True, keeps outliers instead of inlier features.

Attributes

vals_: array, shape (n_features,): Computed characteristic of each feature.
selected_: array, shape (n_features,): Vector of binary selections of the informative features.
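
The relation between `vals_`, `keep_outliers` and `selected_` can be illustrated with a simpler detector swapped in. The sketch below uses plain Tukey boxplot fences (1.5 times the interquartile range), which is not the outlier detection method this class names; the helper `inlier_mask` and its `whisker` argument are hypothetical.

```python
import statistics


def inlier_mask(vals, keep_outliers=False, whisker=1.5):
    """Flag features whose characteristic lies inside the boxplot fences."""
    q1, _, q3 = statistics.quantiles(vals, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    inliers = [lower <= v <= upper for v in vals]
    if keep_outliers:
        # Invert the selection: keep only the outlying features.
        return [not inlier for inlier in inliers]
    return inliers


vals = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 9.0]  # last feature is far from the rest
print(inlier_mask(vals))
print(inlier_mask(vals, keep_outliers=True))
```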

Methods

`fit`(X[, y])	Learn data-driven feature thresholds from X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None)

Learn data-driven feature thresholds from X.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors from which to compute feature characteristic.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)

Mask feature names according to selected features.

Parameters

input_features: array-like of str or None, default=None: Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out: ndarray of str objects: Transformed feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: dict: Parameter names mapped to their values.

get_support(indices=False)

Get a mask, or integer index, of the features selected.

Parameters

indices: bool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

support: array: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)

Reverse the transformation operation.

Parameters

X: array of shape [n_samples, n_selected_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: estimator instance: Estimator instance.

transform(X)

Reduce X to the selected features.

Parameters

X: array of shape [n_samples, n_features]: The input samples.

Returns

X_r: array of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.PercentageSelector(stat, use_log=False, keep_top=True, p=0.2)

Feature selector that removes or preserves the top percentage of features

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters

stat: {‘mean’, ‘var’}: Kind of statistic to be computed out of the feature.
use_log: bool, optional, default: False: Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
keep_top: bool, optional, default: True: When True, keeps features with highest value of the characteristic.
p: float, optional, default: 0.2: Rate of features to keep.

Attributes

vals_: array, shape (n_features,): Computed characteristic of each feature.
threshold_: float: Value of the threshold used for filtering
selected_: array, shape (n_features,): Vector of binary selections of the informative features.
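
A rank-based reading of `p` and `keep_top` can be sketched as follows. The rounding (ceiling, with at least one feature kept), the handling of ties, and the behaviour for `keep_top=False` (keeping the lowest-valued share) are assumptions of this illustration rather than documented behaviour, and `percentage_select` is a hypothetical helper.

```python
import math
import statistics


def percentage_select(columns, stat="mean", keep_top=True, p=0.2):
    """Return (vals, threshold, selected) for a rate p of features to keep."""
    compute = statistics.fmean if stat == "mean" else statistics.pvariance
    vals = [compute(c) for c in columns]
    n_keep = max(1, math.ceil(p * len(vals)))
    # Rank from the side that is kept; the n_keep-th value becomes the threshold.
    ranked = sorted(vals, reverse=keep_top)
    threshold = ranked[n_keep - 1]
    if keep_top:
        selected = [v >= threshold for v in vals]
    else:
        selected = [v <= threshold for v in vals]
    return vals, threshold, selected


columns = [[1, 1], [2, 2], [3, 3], [4, 4]]  # feature means 1, 2, 3, 4
print(percentage_select(columns, p=0.5))
print(percentage_select(columns, keep_top=False, p=0.5))
```

The three returned values play the roles of `vals_`, `threshold_` and `selected_` listed above.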

Methods

`fit`(X[, y])	Learn data-driven feature thresholds from X.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_params`([deep])	Get parameters for this estimator.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(X)	Reduce X to the selected features.

fit(X, y=None)

Learn data-driven feature thresholds from X.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features): Sample vectors from which to compute feature characteristic.
y: any: Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X: array-like of shape (n_samples, n_features): Input samples.
y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_params: dict: Additional fit parameters.

Returns

X_new: ndarray of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)¶

Mask feature names according to selected features.

Parameters

input_featuresarray-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_outndarray of str objects: Transformed feature names.
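
As a rough illustration of the masking described above, the following sketch applies a boolean support mask to a list of names. The helper `mask_feature_names` and the example names are hypothetical; the sketch only mirrors the documented behaviour and skips the validation against `feature_names_in_`.

```python
import numpy as np


def mask_feature_names(support_mask, input_features=None):
    """Hypothetical helper: return the names of the selected features."""
    support_mask = np.asarray(support_mask, dtype=bool)
    if input_features is None:
        # No names given: generate x0, x1, ... for every input feature.
        input_features = [f"x{i}" for i in range(support_mask.size)]
    names = np.asarray(input_features, dtype=object)
    # Boolean indexing keeps only the names of the selected features.
    return names[support_mask]


mask = [True, False, True, False]
default_names = mask_feature_names(mask)
custom_names = mask_feature_names(mask, ["mz_100", "mz_200", "mz_300", "mz_400"])
```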

get_params(deep=True)¶

Get parameters for this estimator.

Parameters

deepbool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

paramsdict: Parameter names mapped to their values.

get_support(indices=False)¶

Get a mask, or integer index, of the features selected.

Parameters

indicesbool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

supportarray: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
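
The two return forms carry the same information. A small NumPy sketch, with an invented mask standing in for the result of a fitted selector, shows how they relate:

```python
import numpy as np

# Invented example of a boolean support mask over 5 input features.
selected = np.array([True, False, False, True, True])

# Integer form: positions of the selected features.
indices = np.flatnonzero(selected)

# Both forms select the same columns of a data matrix.
X = np.arange(10).reshape(2, 5)
same_columns = np.array_equal(X[:, selected], X[:, indices])
```

The mask has one entry per input feature, while the index array has one entry per retained feature.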

inverse_transform(X)¶

Reverse the transformation operation.

Parameters

Xarray of shape [n_samples, n_selected_features]: The input samples.

Returns

X_rarray of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**paramsdict: Estimator parameters.

Returns

selfestimator instance: Estimator instance.
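
The `<component>__<parameter>` convention can be sketched with a scikit-learn pipeline. The example uses scikit-learn's own `VarianceThreshold` as a stand-in selector rather than a selector from this module, and the step names `select` and `scale` are arbitrary.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("select", VarianceThreshold(threshold=0.0)),
    ("scale", StandardScaler()),
])

# <component>__<parameter>: the part before the double underscore names
# the pipeline step, the part after it names that step's parameter.
pipeline.set_params(select__threshold=0.5, scale__with_mean=False)
```

Because `set_params` returns the estimator itself, such calls can be chained with `fit`.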

transform(X)¶

Reduce X to the selected features.

Parameters

Xarray of shape [n_samples, n_features]: The input samples.

Returns

X_rarray of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.SelectorMixin[source]¶

Transformer mixin that performs feature selection given a support mask

This mixin provides a feature selector implementation with transform and inverse_transform functionality given an implementation of _get_support_mask.
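
The idea of deriving every method from a single support mask can be sketched in plain NumPy. The classes `MaskSelectorSketch` and `KeepEveryOther` below are hypothetical stand-ins written for illustration; they omit the input validation a real mixin would perform.

```python
import numpy as np


class MaskSelectorSketch:
    """Hypothetical stand-in for a selector mixin; not the library class."""

    def _get_support_mask(self):
        # Subclasses return a boolean mask with one entry per input feature.
        raise NotImplementedError

    def get_support(self, indices=False):
        mask = np.asarray(self._get_support_mask(), dtype=bool)
        return np.flatnonzero(mask) if indices else mask

    def transform(self, X):
        # Keep only the selected columns.
        return np.asarray(X)[:, self.get_support()]

    def inverse_transform(self, X):
        # Put the kept columns back and fill removed columns with zeros.
        mask = self.get_support()
        X = np.asarray(X)
        restored = np.zeros((X.shape[0], mask.size), dtype=X.dtype)
        restored[:, mask] = X
        return restored


class KeepEveryOther(MaskSelectorSketch):
    """Toy selector that keeps features 0, 2, 4, ..."""

    def __init__(self, n_features):
        self.n_features = n_features

    def _get_support_mask(self):
        return np.arange(self.n_features) % 2 == 0


X = np.array([[1, 2, 3], [4, 5, 6]])
selector = KeepEveryOther(n_features=3)
reduced = selector.transform(X)
restored = selector.inverse_transform(reduced)
```

Only `_get_support_mask` differs between selectors in this design, which is why the mixin can supply the remaining methods.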

Methods

`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`transform`(X)	Reduce X to the selected features.

fit_transform(X, y=None, **fit_params)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

Xarray-like of shape (n_samples, n_features): Input samples.
yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_paramsdict: Additional fit parameters.

Returns

X_newndarray array of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)[source]¶

Mask feature names according to selected features.

Parameters

input_featuresarray-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_outndarray of str objects: Transformed feature names.

get_support(indices=False)[source]¶

Get a mask, or integer index, of the features selected.

Parameters

indicesbool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

supportarray: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)[source]¶

Reverse the transformation operation.

Parameters

Xarray of shape [n_samples, n_selected_features]: The input samples.

Returns

X_rarray of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

transform(X)[source]¶

Reduce X to the selected features.

Parameters

Xarray of shape [n_samples, n_features]: The input samples.

Returns

X_rarray of shape [n_samples, n_selected_features]: The input samples with only the selected features.

class divik.feature_selection.StatSelectorMixin[source]¶

Transformer mixin that performs feature selection given a support mask

This mixin provides a feature selector implementation with transform and inverse_transform functionality given that selected_ is specified during fit.

Additionally provides _to_characteristics and _to_raw implementations given stat and, optionally, use_log and preserve_high.

Methods

`fit_transform`(X[, y])	Fit to data, then transform it.
`get_feature_names_out`([input_features])	Mask feature names according to selected features.
`get_support`([indices])	Get a mask, or integer index, of the features selected.
`inverse_transform`(X)	Reverse the transformation operation.
`transform`(X)	Reduce X to the selected features.

fit_transform(X, y=None, **fit_params)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

Xarray-like of shape (n_samples, n_features): Input samples.
yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None: Target values (None for unsupervised transformations).
**fit_paramsdict: Additional fit parameters.

Returns

X_newndarray array of shape (n_samples, n_features_new): Transformed array.

get_feature_names_out(input_features=None)¶

Mask feature names according to selected features.

Parameters

input_featuresarray-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_outndarray of str objects: Transformed feature names.

get_support(indices=False)¶

Get a mask, or integer index, of the features selected.

Parameters

indicesbool, default=False: If True, the return value will be an array of integers, rather than a boolean mask.

Returns

supportarray: An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(X)¶

Reverse the transformation operation.

Parameters

Xarray of shape [n_samples, n_selected_features]: The input samples.

Returns

X_rarray of shape [n_samples, n_original_features]: X with columns of zeros inserted where features would have been removed by transform().

transform(X)¶

Reduce X to the selected features.

Parameters

Xarray of shape [n_samples, n_features]: The input samples.

Returns

X_rarray of shape [n_samples, n_selected_features]: The input samples with only the selected features.

divik.feature_selection.huberta_outliers(v)[source]¶

Outlier detection method based on medcouple statistic.

Parameters

v: array-like: An array to filter outliers from.

Returns

Binary vector indicating all the outliers.

References

M. Hubert, E. Vandervieren (2008). An adjusted boxplot for skewed distributions. Computational Statistics and Data Analysis 52, 5186–5201.
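
A minimal sketch of the adjusted boxplot rule from the cited paper is given below. It is an illustration under stated assumptions, not the code behind `huberta_outliers`: the medcouple is computed naively in O(n^2), pairs tied exactly at the median are skipped rather than handled with the special tie kernel, and quartiles use NumPy's default interpolation. Both function names are hypothetical.

```python
import numpy as np


def medcouple_naive(v):
    """Naive O(n^2) medcouple; pairs tied at the median are skipped."""
    x = np.sort(np.asarray(v, dtype=float))
    med = np.median(x)
    lower = x[x <= med][:, None]  # values at or below the median (column)
    upper = x[x >= med][None, :]  # values at or above the median (row)
    diff = upper - lower
    valid = diff > 0              # simplification: drop pairs with equal values
    h = ((upper - med) - (med - lower))[valid] / diff[valid]
    return float(np.median(h))


def adjusted_boxplot_outliers(v):
    """Flag values outside the skewness-adjusted boxplot fences."""
    v = np.asarray(v, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    mc = medcouple_naive(v)
    # Fences are stretched on the side of the longer tail.
    if mc >= 0:
        low = q1 - 1.5 * np.exp(-4 * mc) * iqr
        high = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        low = q1 - 1.5 * np.exp(-3 * mc) * iqr
        high = q3 + 1.5 * np.exp(4 * mc) * iqr
    return (v < low) | (v > high)


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = adjusted_boxplot_outliers(values)
```

With a medcouple of zero the fences reduce to the classic 1.5 × IQR boxplot rule, so the adjustment only matters for skewed data.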

divik.feature_selection.make_specialized_selector(name, n_features, **kwargs)[source]¶

Create a selector by name (gmm, outlier, none or auto).

auto switches to gmm if there are more than 250 features, and to outlier otherwise.
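
The dispatch rule for auto can be sketched as follows. The helper `resolve_selector_name` is hypothetical and only resolves the name; it does not build a selector. Treating exactly 250 features as the outlier case is an assumption read from "more than 250".

```python
def resolve_selector_name(name, n_features, cutoff=250):
    """Hypothetical helper: map 'auto' to a concrete selector name."""
    allowed = {"gmm", "outlier", "none", "auto"}
    if name not in allowed:
        raise ValueError(f"Unknown selector name: {name!r}")
    if name == "auto":
        # More than `cutoff` features: gmm; otherwise: outlier.
        return "gmm" if n_features > cutoff else "outlier"
    return name


many = resolve_selector_name("auto", n_features=1000)
few = resolve_selector_name("auto", n_features=100)
```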
