feature_selection module

Unsupervised feature selection methods

class divik.feature_selection.StatSelectorMixin[source]

Transformer mixin that performs feature selection given a support mask

This mixin provides a feature selector implementation with transform and inverse_transform functionality given that selected_ is specified during fit.

Additionally, it provides _to_characteristics and _to_raw implementations, given stat and, optionally, use_log and preserve_high.
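The mask-based behaviour described above can be sketched with plain NumPy. This is only an illustration of the idea, not the mixin's code: the helper names select and restore are hypothetical, and the boolean array stands in for a fitted selected_ attribute.

```python
import numpy as np

def select(X, selected):
    """Keep only the columns flagged in the boolean mask `selected`."""
    return np.asarray(X)[:, selected]

def restore(X_reduced, selected):
    """Put reduced columns back in place; removed columns become zeros."""
    X_reduced = np.asarray(X_reduced)
    restored = np.zeros((X_reduced.shape[0], selected.size), dtype=X_reduced.dtype)
    restored[:, selected] = X_reduced
    return restored

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
selected = np.array([True, False, True])  # stand-in for a fitted `selected_`
reduced = select(X, selected)             # shape (2, 2)
roundtrip = restore(reduced, selected)    # shape (2, 3), middle column zeroed
```

Note that the round trip is lossy: the removed column comes back as zeros, not as its original values.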

Methods

fit_transform(self, X[, y]) Fit to data, then transform it.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
transform(self, X) Reduce X to the selected features.
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
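The two return forms are interchangeable for indexing. A small NumPy sketch with a made-up mask (not produced by any divik selector) shows the relation between them:

```python
import numpy as np

# Hypothetical support mask over five input features.
mask = np.array([False, True, True, False, True])

# Integer form: positions of the True entries, one per retained feature.
indices = np.flatnonzero(mask)

features = np.array([10, 20, 30, 40, 50])
by_mask = features[mask]
by_index = features[indices]
```

The mask has one entry per input feature, while the index array has one entry per output feature; both select the same values.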

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.

class divik.feature_selection.NoSelector[source]

Dummy selector to use when no selection is supposed to be made.

Methods

fit(self, X[, y]) Pass data forward
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Reduce X to the selected features.
fit(self, X, y=None)[source]

Pass data forward

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Sample vectors to pass.

y : any

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
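To illustrate the <component>__<parameter> naming convention only, the sketch below splits a flat parameter dictionary into the estimator's own parameters and those addressed to nested components. The function split_nested and the example parameter names are hypothetical; this is not scikit-learn's or divik's implementation.

```python
def split_nested(params):
    """Separate plain parameters from '<component>__<parameter>' ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # ("a__b__c") is passed down to component "a" as "b__c".
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested

own, nested = split_nested(
    {"stat": "var", "selector__p": 0.1, "selector__keep_top": False}
)
```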

Returns:
self
transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.

class divik.feature_selection.GMMSelector(stat: str, use_log: bool = False, n_candidates: int = None, min_features: int = 1, min_features_rate: float = 0.0, preserve_high: bool = True, max_components: int = 10)[source]

Feature selector that removes features with low or high mean or variance

Gaussian mixture modeling is applied to the features’ characteristics to obtain components. Crossing points of the components are considered candidate thresholds. Of these, up to n_candidates components are removed, in such a way that at least min_features features (or a min_features_rate fraction of the features) are retained.
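The text does not say how crossing points are computed. As a sketch under that caveat, the crossing of two weighted Gaussian components can be found analytically: equating the log-densities gives a quadratic in x. The function crossing_points and its arguments (weight w, mean m, standard deviation s per component) are illustrative names, not part of divik.

```python
import numpy as np

def weighted_pdf(x, w, m, s):
    """Weighted normal density w * N(x; m, s)."""
    return w * np.exp(-((x - m) ** 2) / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def crossing_points(w1, m1, s1, w2, m2, s2):
    """Solve w1*N(x; m1, s1) == w2*N(x; m2, s2) for x, sorted ascending."""
    # Coefficients of a*x^2 + b*x + c = log(w1*N1) - log(w2*N2) = 0.
    a = 1.0 / (2 * s2**2) - 1.0 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2) + np.log((w1 * s2) / (w2 * s1))
    if np.isclose(a, 0.0):
        # Equal variances: the equation is linear (or degenerate).
        return np.array([]) if np.isclose(b, 0.0) else np.array([-c / b])
    disc = b**2 - 4 * a * c
    if disc < 0:
        return np.array([])  # the weighted densities never cross
    roots = (-b + np.array([-1.0, 1.0]) * np.sqrt(disc)) / (2 * a)
    return np.sort(roots)

same_width = crossing_points(0.5, 0.0, 1.0, 0.5, 4.0, 1.0)
diff_width = crossing_points(0.5, 0.0, 1.0, 0.5, 5.0, 2.0)
```

With equal weights and widths the single crossing is the midpoint of the means; with different widths there can be two crossings, and a selector would typically care about the one lying between the means.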

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:
stat: {‘mean’, ‘var’}

Kind of statistic to be computed out of the feature.

use_log: bool, optional, default: False

Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering, depending on the distribution of features. However, all the characteristics (mean, variance) have to be positive for this to work; filtering will fail otherwise. This is useful in specific cases in biology, where the distribution of the data may require this option for effective filtering.
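A minimal sketch of what a log-transformed characteristic looks like, assuming the characteristic is a per-feature mean or variance. The helper characteristic is hypothetical, and raising ValueError on non-positive values is this sketch's choice; the text only states that filtering fails in that case.

```python
import numpy as np

def characteristic(X, stat="mean", use_log=False):
    """Per-feature mean or variance, optionally log-transformed."""
    X = np.asarray(X, dtype=float)
    if stat == "mean":
        vals = X.mean(axis=0)
    elif stat == "var":
        vals = X.var(axis=0)
    else:
        raise ValueError(f"unsupported stat: {stat!r}")
    if use_log:
        # Assumption of this sketch: fail loudly instead of producing nan/-inf.
        if np.any(vals <= 0):
            raise ValueError("log requires strictly positive characteristics")
        vals = np.log(vals)
    return vals

# Feature means span three orders of magnitude: 2, 20, 200.
X = np.array([[1.0, 10.0, 100.0], [3.0, 30.0, 300.0]])
raw_means = characteristic(X, "mean")
log_means = characteristic(X, "mean", use_log=True)
```

On the log scale the three means are evenly spaced, which is the kind of reshaping that can make heavily skewed characteristics easier to threshold.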

n_candidates: int, optional, default: None

How many candidate thresholds to use at most. 0 preserves all the features (all candidate thresholds are discarded). None allows removing all but one component (all candidate thresholds are retained). A negative value means discarding up to all but -n_candidates candidates, e.g. -1 will retain at least two components (one candidate threshold is removed).
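The semantics of None, 0, positive and negative values line up with Python slice semantics. The sketch below assumes the candidate thresholds are kept in an ordered list and that limiting them amounts to a slice; limit_candidates is a hypothetical helper, and whether divik does exactly this is not stated here.

```python
def limit_candidates(thresholds, n_candidates=None):
    """Keep at most `n_candidates` thresholds using slice semantics."""
    return list(thresholds)[:n_candidates]

candidates = [0.5, 1.5, 2.5]  # made-up thresholds between four components

keep_all = limit_candidates(candidates, None)   # every threshold usable
keep_none = limit_candidates(candidates, 0)     # no threshold, nothing removed
drop_last = limit_candidates(candidates, -1)    # one threshold discarded
first_two = limit_candidates(candidates, 2)     # at most two thresholds
```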

min_features: int, optional, default: 1

How many features must be preserved. Candidate thresholds are tested against this value, and if they retain fewer features, a less conservative threshold is selected.

min_features_rate: float, optional, default: 0.0

Similar to min_features, but relative to the number of features in the input data.
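One way to read the two constraints together is sketched below: the required count is the larger of min_features and the rate applied to the number of features, and candidates are tried until one retains enough features. The helper pick_threshold, the rounding up of the rate, the high-values-kept direction and the fallback of keeping everything are all assumptions of this sketch.

```python
import math
import numpy as np

def pick_threshold(vals, candidates, min_features=1, min_features_rate=0.0):
    """Highest candidate threshold that still keeps enough features."""
    vals = np.asarray(vals, dtype=float)
    # Assumption: the rate is converted to a count by rounding up.
    required = max(min_features, math.ceil(min_features_rate * vals.size))
    for threshold in sorted(candidates, reverse=True):
        if np.sum(vals >= threshold) >= required:
            return threshold
    return -np.inf  # assumption: no candidate qualifies, so keep everything

vals = np.arange(10.0)          # made-up characteristics 0..9
candidates = [2.5, 5.5, 8.5]    # made-up candidate thresholds

strict = pick_threshold(vals, candidates, min_features=1)
three = pick_threshold(vals, candidates, min_features=3)
half = pick_threshold(vals, candidates, min_features_rate=0.5)
nearly_all = pick_threshold(vals, candidates, min_features_rate=0.95)
```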

preserve_high: bool, optional, default: True

Whether to preserve the high-characteristic features or low-characteristic ones.

max_components: int, optional, default: 10

The maximum number of components used in the GMM decomposition.

Examples

>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = labels * 5 + np.random.randn(*labels.shape)
>>> fs.GMMSelector('mean').fit_transform(data)
array([[14.78032811 15.35711257 ... 15.75193303]])
>>> fs.GMMSelector('mean', preserve_high=False).fit_transform(data)
array([[ 0.49671415 -0.1382643  ... -0.29169375]])
>>> fs.GMMSelector('mean', n_candidates=-1).fit_transform(data)
array([[10.32408397  9.61491772 ... 15.75193303]])
Attributes:
vals_: array, shape (n_features,)

Computed characteristic of each feature.

threshold_: float

Threshold value to filter the features by the characteristic.

raw_threshold_: float

Threshold value mapped back to characteristic space (no logarithm, etc.)

selected_: array, shape (n_features,)

Vector of binary selections of the informative features.

Methods

fit(self, X[, y]) Learn data-driven feature thresholds from X.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Reduce X to the selected features.
fit(self, X, y=None)[source]

Learn data-driven feature thresholds from X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Sample vectors from which to compute feature characteristic.

y : any

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.

divik.feature_selection.huberta_outliers(v)[source]

M. Hubert, E. Vandervieren (2008) An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186–5201

Parameters:
v: array-like

An array to filter outliers from.

Returns:
Binary vector indicating all the outliers.
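For orientation, the adjusted boxplot from the cited paper widens or narrows the usual 1.5 × IQR fences according to the medcouple (MC), a robust skewness measure. The sketch below is a simplified NumPy illustration, not divik's implementation: naive_medcouple is a quadratic-time version that skips pairs tied at the median, and both function names are hypothetical.

```python
import numpy as np

def naive_medcouple(v):
    """O(n^2) medcouple; pairs tied at the median are skipped for simplicity."""
    v = np.asarray(v, dtype=float)
    med = np.median(v)
    upper = v[v >= med]
    lower = v[v <= med]
    xi, xj = np.meshgrid(upper, lower, indexing="ij")
    spread = xi - xj
    valid = spread > 0
    h = ((xi - med) - (med - xj))[valid] / spread[valid]
    return float(np.median(h)) if h.size else 0.0

def adjusted_boxplot_outliers(v):
    """Boolean mask of points outside the skewness-adjusted fences."""
    v = np.asarray(v, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    mc = naive_medcouple(v)
    if mc >= 0:
        low = q1 - 1.5 * np.exp(-4 * mc) * iqr
        high = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        low = q1 - 1.5 * np.exp(-3 * mc) * iqr
        high = q3 + 1.5 * np.exp(4 * mc) * iqr
    return (v < low) | (v > high)

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
outliers = adjusted_boxplot_outliers(values)  # only the value 100 is flagged
```

When the medcouple is zero, as it is for this sample, the fences reduce to those of the standard boxplot.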
class divik.feature_selection.OutlierSelector(stat: str, use_log: bool = False, keep_outliers: bool = False)[source]

Feature selector that removes outlier features w.r.t. mean or variance

Hubert’s outlier detection (see huberta_outliers) is applied to the features’ characteristics and the outlying features are removed.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:
stat: {‘mean’, ‘var’}

Kind of statistic to be computed out of the feature.

use_log: bool, optional, default: False

Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering, depending on the distribution of features. However, all the characteristics (mean, variance) have to be positive for this to work; filtering will fail otherwise. This is useful in specific cases in biology, where the distribution of the data may require this option for effective filtering.

keep_outliers: bool, optional, default: False

When True, keeps outliers instead of inlier features.

Attributes:
vals_: array, shape (n_features,)

Computed characteristic of each feature.

selected_: array, shape (n_features,)

Vector of binary selections of the informative features.

Methods

fit(self, X[, y]) Learn data-driven feature thresholds from X.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Reduce X to the selected features.
fit(self, X, y=None)[source]

Learn data-driven feature thresholds from X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Sample vectors from which to compute feature characteristic.

y : any

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.

class divik.feature_selection.PercentageSelector(stat: str, use_log: bool = False, keep_top: bool = True, p: float = 0.2)[source]

Feature selector that removes or preserves a given top percentage of features

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:
stat: {‘mean’, ‘var’}

Kind of statistic to be computed out of the feature.

use_log: bool, optional, default: False

Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering, depending on the distribution of features. However, all the characteristics (mean, variance) have to be positive for this to work; filtering will fail otherwise. This is useful in specific cases in biology, where the distribution of the data may require this option for effective filtering.

keep_top: bool, optional, default: True

When True, keeps the features with the highest values of the characteristic.

p: float, optional, default: 0.2

Rate of features to keep.
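A quantile-based reading of keep_top and p can be sketched as follows. The helper percentage_mask is hypothetical, and the use of np.quantile with inclusive comparisons is an assumption about how the threshold might be derived, not a description of the class internals.

```python
import numpy as np

def percentage_mask(vals, p=0.2, keep_top=True):
    """Mask keeping roughly a fraction `p` of features, plus the threshold."""
    vals = np.asarray(vals, dtype=float)
    if keep_top:
        threshold = np.quantile(vals, 1.0 - p)
        return vals >= threshold, threshold
    threshold = np.quantile(vals, p)
    return vals <= threshold, threshold

vals = np.arange(10.0)  # made-up characteristics 0..9
top_mask, top_threshold = percentage_mask(vals, p=0.2, keep_top=True)
bottom_mask, bottom_threshold = percentage_mask(vals, p=0.2, keep_top=False)
```

With ten features and p = 0.2, each call keeps two features: the two highest when keep_top is True, the two lowest otherwise.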

Attributes:
vals_: array, shape (n_features,)

Computed characteristic of each feature.

threshold_: float

Value of the threshold used for filtering

selected_: array, shape (n_features,)

Vector of binary selections of the informative features.

Methods

fit(self, X[, y]) Learn data-driven feature thresholds from X.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Reduce X to the selected features.
fit(self, X, y=None)[source]

Learn data-driven feature thresholds from X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Sample vectors from which to compute feature characteristic.

y : any

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.

class divik.feature_selection.HighAbundanceAndVarianceSelector(use_log: bool = False, min_features: int = 1, min_features_rate: float = 0.0, max_components: int = 10)[source]

Feature selector that removes low-mean and low-variance features

Uses GMMSelector to filter out the low-abundance noise features and to select the high-variance informative features.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
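The combination of an abundance filter and a variance filter can be sketched as two masks composed in the original feature space. In this illustration fixed thresholds stand in for the GMM-derived ones, the variance filter is applied only to features that pass the abundance filter, and two_stage_mask is a hypothetical helper; the actual ordering and thresholds used by the class are not specified here.

```python
import numpy as np

def two_stage_mask(X, mean_threshold, var_threshold):
    """Keep features that are abundant and, among those, highly variable."""
    X = np.asarray(X, dtype=float)
    # Stage 1: drop low-abundance (low-mean) features.
    abundant = X.mean(axis=0) >= mean_threshold
    # Stage 2: among the survivors, keep the high-variance ones.
    variable = X[:, abundant].var(axis=0) >= var_threshold
    selected = np.zeros(X.shape[1], dtype=bool)
    selected[np.flatnonzero(abundant)[variable]] = True
    return selected

# Column 0: low mean; column 1: high mean, zero variance; column 2: both high.
X = np.array([[0.0, 10.0, 10.0],
              [0.0, 10.0, 20.0],
              [1.0, 10.0, 30.0]])
selected = two_stage_mask(X, mean_threshold=5.0, var_threshold=1.0)
```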

Parameters:
use_log: bool, optional, default: False

Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering, depending on the distribution of features. However, all the characteristics (mean, variance) have to be positive for this to work; filtering will fail otherwise. This is useful in specific cases in biology, where the distribution of the data may require this option for effective filtering.

min_features: int, optional, default: 1

How many features must be preserved.

min_features_rate: float, optional, default: 0.0

Similar to min_features, but relative to the number of features in the input data.

max_components: int, optional, default: 10

The maximum number of components used in the GMM decomposition.

Examples

>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> # Data in this case must be carefully crafted
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = np.vstack(100 * [labels * 10.])
>>> data += np.random.randn(*data.shape)
>>> sub = data[:, :-40]
>>> sub += 5 * np.random.randn(*sub.shape)
>>> # Label 0 has low abundance but high variance
>>> # Label 3 has low variance but high abundance
>>> # Labels 1 and 2 have not-lowest abundance and high variance
>>> selector = fs.HighAbundanceAndVarianceSelector().fit(data)
>>> selector.transform(labels.reshape(1,-1))
array([[1 1 1 1 1 ...2 2 2]])
Attributes:
abundance_selector_: GMMSelector

Selector used to filter out the noise component.

variance_selector_: GMMSelector

Selector used to filter out the non-informative features.

selected_: array, shape (n_features,)

Vector of binary selections of the informative features.

Methods

fit(self, X[, y]) Learn data-driven feature thresholds from X.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Reduce X to the selected features.
fit(self, X, y=None)[source]

Learn data-driven feature thresholds from X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Sample vectors from which to compute feature characteristic.

y : any

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.

class divik.feature_selection.OutlierAbundanceAndVarianceSelector(use_log: bool = False, min_features_rate: float = 0.01, p: float = 0.2)[source]

Methods

fit(self, X[, y]) Learn data-driven feature thresholds from X.
fit_transform(self, X[, y]) Fit to data, then transform it.
get_params(self[, deep]) Get parameters for this estimator.
get_support(self[, indices]) Get a mask, or integer index, of the features selected
inverse_transform(self, X) Reverse the transformation operation
set_params(self, **params) Set the parameters of this estimator.
transform(self, X) Reduce X to the selected features.
fit(self, X, y=None)[source]

Learn data-driven feature thresholds from X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Sample vectors from which to compute feature characteristic.

y : any

Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

Returns:
self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_support(self, indices=False)

Get a mask, or integer index, of the features selected

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.

inverse_transform(self, X)

Reverse the transformation operation

Parameters:
X : array of shape [n_samples, n_selected_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_original_features]

X with columns of zeros inserted where features would have been removed by transform.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(self, X)

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.