feature_selection module
Unsupervised feature selection methods
class divik.feature_selection.StatSelectorMixin

Transformer mixin that performs feature selection given a support mask.

This mixin provides a feature selector implementation with transform and inverse_transform functionality, given that selected_ is specified during fit. Additionally, it provides _to_characteristics and _to_raw implementations given stat, and optionally use_log and preserve_high.
Methods

fit_transform(self, X[, y])    Fit to data, then transform it.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
transform(self, X)    Reduce X to the selected features.
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
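The two return forms carry the same information; a minimal numpy sketch of how they relate (illustrative only, not the library call):

```python
import numpy as np

# A boolean support mask over 5 input features, of which 3 are selected
mask = np.array([True, False, True, False, True])

# The indices=True form lists the positions of the retained features
indices = np.flatnonzero(mask)

# Round-trip: rebuild the boolean mask from the integer indices
rebuilt = np.zeros(mask.size, dtype=bool)
rebuilt[indices] = True
assert np.array_equal(rebuilt, mask)
```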
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
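The zero-padding behaviour can be reproduced with plain numpy; a sketch under the assumption that the fitted support mask is available (e.g. via get_support):

```python
import numpy as np

mask = np.array([True, False, True])  # support mask over 3 original features
X_sel = np.array([[1., 2.],
                  [3., 4.]])          # 2 samples x 2 selected features

# Equivalent of inverse_transform: zeros where features were removed
X_r = np.zeros((X_sel.shape[0], mask.size))
X_r[:, mask] = X_sel
# X_r is [[1., 0., 2.], [3., 0., 4.]]
```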
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
class divik.feature_selection.NoSelector

Dummy selector to use when no selection is supposed to be made.
Methods

fit(self, X[, y])    Pass data forward.
fit_transform(self, X[, y])    Fit to data, then transform it.
get_params(self[, deep])    Get parameters for this estimator.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X)    Reduce X to the selected features.
fit(self, X, y=None)

Pass data forward.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors to pass.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns: - self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(self, deep=True)

Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns: - self
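The <component>__<parameter> convention simply routes each key to the named sub-estimator; a simplified sketch of that routing (assumed logic for illustration, not the actual sklearn implementation):

```python
def split_nested_params(params):
    """Split flat '<component>__<parameter>' keys into own vs nested params."""
    own, nested = {}, {}
    for key, value in params.items():
        if "__" in key:
            # Route 'selector__p' to sub-estimator 'selector', parameter 'p'
            component, _, sub_key = key.partition("__")
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested

own, nested = split_nested_params(
    {"use_log": True, "selector__p": 0.1, "selector__keep_top": False})
# own is {'use_log': True}
# nested is {'selector': {'p': 0.1, 'keep_top': False}}
```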
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
class divik.feature_selection.GMMSelector(stat: str, use_log: bool = False, n_candidates: int = None, min_features: int = 1, min_features_rate: float = 0.0, preserve_high: bool = True, max_components: int = 10)

Feature selector that removes features with low or high mean or variance.

Gaussian Mixture Modeling is applied to the features' characteristics and components are obtained. Crossing points of the components are considered candidate thresholds. Out of these, up to n_candidates components are removed, in such a way that at least min_features or min_features_rate features are retained.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
Parameters: - stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that, and filtering will fail otherwise. This is useful for specific cases in biology, where the distribution of data may actually require this option for efficient filtering.
- n_candidates: int, optional, default: None
How many candidate thresholds to use at most. 0 preserves all the features (all candidate thresholds are discarded), None allows removing all but one component (all candidate thresholds are retained). A negative value means discarding up to all but -n_candidates candidates, e.g. -1 will retain at least two components (one candidate threshold is removed).
- min_features: int, optional, default: 1
How many features must be preserved. Candidate thresholds are tested against this value, and if they retain fewer features, a less conservative threshold is selected.
- min_features_rate: float, optional, default: 0.0
Similar to min_features, but relative to the number of input features.
- preserve_high: bool, optional, default: True
Whether to preserve the high-characteristic features or the low-characteristic ones.
- max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
Examples

>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = labels * 5 + np.random.randn(*labels.shape)
>>> fs.GMMSelector('mean').fit_transform(data)
array([[14.78032811 15.35711257 ... 15.75193303]])
>>> fs.GMMSelector('mean', preserve_high=False).fit_transform(data)
array([[ 0.49671415 -0.1382643 ... -0.29169375]])
>>> fs.GMMSelector('mean', n_candidates=-1).fit_transform(data)
array([[10.32408397 9.61491772 ... 15.75193303]])
Attributes: - vals_: array, shape (n_features,)
Computed characteristic of each feature.
- threshold_: float
Threshold value to filter the features by the characteristic.
- raw_threshold_: float
Threshold value mapped back to characteristic space (no logarithm, etc.)
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods

fit(self, X[, y])    Learn data-driven feature thresholds from X.
fit_transform(self, X[, y])    Fit to data, then transform it.
get_params(self[, deep])    Get parameters for this estimator.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X)    Reduce X to the selected features.
fit(self, X, y=None)

Learn data-driven feature thresholds from X.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns: - self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(self, deep=True)

Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns: - self
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
divik.feature_selection.huberta_outliers(v)

Outlier detection based on the adjusted boxplot for skewed distributions.

Reference: M. Hubert, E. Vandervieren (2008). An adjusted boxplot for skewed distributions. Computational Statistics and Data Analysis 52, 5186–5201.

Parameters: - v: array-like
An array to filter outliers from.
Returns: - Binary vector indicating all the outliers.
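The adjusted boxplot widens the usual Tukey fences according to the medcouple, a robust skewness measure. The following is a naive O(n^2) sketch of the idea; it simplifies the reference algorithm (ties at the median are not handled specially) and the actual divik implementation may differ:

```python
import numpy as np

def medcouple(v):
    # Robust skewness: median of h(x_i, x_j) over pairs x_i <= med <= x_j
    med = np.median(v)
    lo = v[v <= med][:, None]
    hi = v[v >= med][None, :]
    with np.errstate(invalid="ignore", divide="ignore"):
        h = ((hi - med) - (med - lo)) / (hi - lo)
    return np.median(h[hi != lo])  # drop degenerate pairs with x_i == x_j

def adjusted_boxplot_outliers(v):
    # Tukey-style fences, skew-adjusted as in Hubert & Vandervieren (2008)
    v = np.asarray(v, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    mc = medcouple(v)
    if mc >= 0:
        lower = q1 - 1.5 * np.exp(-4 * mc) * iqr
        upper = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lower = q1 - 1.5 * np.exp(-3 * mc) * iqr
        upper = q3 + 1.5 * np.exp(4 * mc) * iqr
    return (v < lower) | (v > upper)

v = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 100.])
outliers = adjusted_boxplot_outliers(v)  # flags only the value 100
```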
class divik.feature_selection.OutlierSelector(stat: str, use_log: bool = False, keep_outliers: bool = False)

Feature selector that removes outlier features with respect to mean or variance.

Adjusted-boxplot outlier detection (huberta_outliers) is applied to the features' characteristics, and the outlying features are removed.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
Parameters: - stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that, and filtering will fail otherwise. This is useful for specific cases in biology, where the distribution of data may actually require this option for efficient filtering.
- keep_outliers: bool, optional, default: False
When True, keeps outliers instead of inlier features.
Attributes: - vals_: array, shape (n_features,)
Computed characteristic of each feature.
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods

fit(self, X[, y])    Learn data-driven feature thresholds from X.
fit_transform(self, X[, y])    Fit to data, then transform it.
get_params(self[, deep])    Get parameters for this estimator.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X)    Reduce X to the selected features.
fit(self, X, y=None)

Learn data-driven feature thresholds from X.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns: - self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(self, deep=True)

Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns: - self
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
class divik.feature_selection.PercentageSelector(stat: str, use_log: bool = False, keep_top: bool = True, p: float = 0.2)

Feature selector that keeps or removes the top fraction (p) of features.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
Parameters: - stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that, and filtering will fail otherwise. This is useful for specific cases in biology, where the distribution of data may actually require this option for efficient filtering.
- keep_top: bool, optional, default: True
When True, keeps features with highest value of the characteristic.
- p: float, optional, default: 0.2
Rate of features to keep.
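Conceptually, this amounts to thresholding the characteristic at a quantile; a numpy sketch of that idea (whether ties are kept with >= or >, and the quantile interpolation used, are assumptions here, not taken from the library):

```python
import numpy as np

X = np.arange(50, dtype=float).reshape(5, 10)  # 5 samples, 10 features
vals = X.mean(axis=0)                          # stat='mean' characteristic
threshold = np.quantile(vals, 1 - 0.2)         # keep_top=True, p=0.2
selected = vals >= threshold                   # boolean support mask
X_new = X[:, selected]                         # keeps the top ~20% of features
```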
Attributes: - vals_: array, shape (n_features,)
Computed characteristic of each feature.
- threshold_: float
Value of the threshold used for filtering
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods

fit(self, X[, y])    Learn data-driven feature thresholds from X.
fit_transform(self, X[, y])    Fit to data, then transform it.
get_params(self[, deep])    Get parameters for this estimator.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X)    Reduce X to the selected features.
fit(self, X, y=None)

Learn data-driven feature thresholds from X.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns: - self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(self, deep=True)

Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns: - self
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
class divik.feature_selection.HighAbundanceAndVarianceSelector(use_log: bool = False, min_features: int = 1, min_features_rate: float = 0.0, max_components: int = 10)

Feature selector that removes low-mean and low-variance features.

Uses GMMSelector to filter out the low-abundance noise features and to select the high-variance informative features.

This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
Parameters: - use_log: bool, optional, default: False
Whether to use the logarithm of the feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features; however, all the characteristics (mean, variance) have to be positive for that, and filtering will fail otherwise. This is useful for specific cases in biology, where the distribution of data may actually require this option for efficient filtering.
- min_features: int, optional, default: 1
How many features must be preserved.
- min_features_rate: float, optional, default: 0.0
Similar to min_features, but relative to the number of input features.
- max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
Examples

>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> # Data in this case must be carefully crafted
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = np.vstack(100 * [labels * 10.])
>>> data += np.random.randn(*data.shape)
>>> sub = data[:, :-40]
>>> sub += 5 * np.random.randn(*sub.shape)
>>> # Label 0 has low abundance but high variance
>>> # Label 3 has low variance but high abundance
>>> # Labels 1 and 2 have not-lowest abundance and high variance
>>> selector = fs.HighAbundanceAndVarianceSelector().fit(data)
>>> selector.transform(labels.reshape(1, -1))
array([[1 1 1 1 1 ... 2 2 2]])
Attributes: - abundance_selector_: GMMSelector
Selector used to filter out the noise component.
- variance_selector_: GMMSelector
Selector used to filter out the non-informative features.
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods

fit(self, X[, y])    Learn data-driven feature thresholds from X.
fit_transform(self, X[, y])    Fit to data, then transform it.
get_params(self[, deep])    Get parameters for this estimator.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X)    Reduce X to the selected features.
fit(self, X, y=None)

Learn data-driven feature thresholds from X.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns: - self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(self, deep=True)

Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns: - self
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.
class divik.feature_selection.OutlierAbundanceAndVarianceSelector(use_log: bool = False, min_features_rate: float = 0.01, p: float = 0.2)

Methods

fit(self, X[, y])    Learn data-driven feature thresholds from X.
fit_transform(self, X[, y])    Fit to data, then transform it.
get_params(self[, deep])    Get parameters for this estimator.
get_support(self[, indices])    Get a mask, or integer index, of the features selected.
inverse_transform(self, X)    Reverse the transformation operation.
set_params(self, **params)    Set the parameters of this estimator.
transform(self, X)    Reduce X to the selected features.
fit(self, X, y=None)

Learn data-driven feature thresholds from X.
Parameters: - X : {array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- y : any
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
Returns: - self
fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns: - X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(self, deep=True)

Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
get_support(self, indices=False)

Get a mask, or integer index, of the features selected.
Parameters: - indices : boolean (default False)
If True, the return value will be an array of integers, rather than a boolean mask.
Returns: - support : array
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
inverse_transform(self, X)

Reverse the transformation operation.
Parameters: - X : array of shape [n_samples, n_selected_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so that it is possible to update each component of a nested object.

Returns: - self
transform(self, X)

Reduce X to the selected features.
Parameters: - X : array of shape [n_samples, n_features]
The input samples.
Returns: - X_r : array of shape [n_samples, n_selected_features]
The input samples with only the selected features.