divik.feature_selection module
Unsupervised feature selection methods
- class divik.feature_selection.EximsSelector[source]¶
Select features based on their spatial distribution
Preserves features that yield biologically plausible structures.
References
Wijetunge, Chalini D., et al. “EXIMS: an improved data analysis pipeline based on a new peak picking method for EXploring Imaging Mass Spectrometry data.” Bioinformatics 31.19 (2015): 3198-3206. https://academic.oup.com/bioinformatics/article/31/19/3198/212150
Methods
fit
(X[, y, xy])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None, xy=None)[source]¶
Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- xyarray-like, shape (n_samples, 2)
Spatial coordinates of the samples. Expects integer indices over an image.
- Returns
- self
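The xy argument ties each sample (each row of X) to a pixel. A minimal sketch of how one feature column could be laid out as an image from integer coordinates follows. The helper name feature_image and the axis order (first column of xy as the horizontal index, second as the vertical index) are assumptions made for illustration, not part of divik.

```python
import numpy as np


def feature_image(values, xy, fill=0.0):
    """Place per-sample values of one feature on a 2-D pixel grid.

    Hypothetical helper: assumes xy[:, 0] is the column (x) index and
    xy[:, 1] is the row (y) index of each sample.
    """
    xy = np.asarray(xy, dtype=int)
    values = np.asarray(values, dtype=float)
    height = xy[:, 1].max() + 1
    width = xy[:, 0].max() + 1
    image = np.full((height, width), fill)
    image[xy[:, 1], xy[:, 0]] = values
    return image


# Four samples on a 2x2 grid, one feature value per sample.
xy = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
values = np.array([1.0, 2.0, 3.0, 4.0])
image = feature_image(values, xy)
```

Pixels that no sample maps to keep the fill value, so an irregular tissue outline still yields a rectangular array.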
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
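The masking of names can be pictured with plain NumPy. This is only a sketch of the behaviour described above: masked_feature_names is a hypothetical helper, and the feature names in the usage lines are made up.

```python
import numpy as np


def masked_feature_names(support, input_features=None):
    """Return names of the selected features (illustrative helper only)."""
    support = np.asarray(support, dtype=bool)
    if input_features is None:
        # Generated names run from x0 to x(n_features - 1).
        input_features = [f"x{i}" for i in range(support.size)]
    names = np.asarray(input_features, dtype=object)
    return names[support]


support = [True, False, True, True]
generated = masked_feature_names(support)
named = masked_feature_names(support, ["mz_100", "mz_200", "mz_300", "mz_400"])
```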
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
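The two return forms carry the same information. A short NumPy sketch, using a made-up mask rather than one produced by a fitted selector, shows how they relate:

```python
import numpy as np

# Boolean mask over 5 input features; True marks a retained feature.
mask = np.array([True, False, False, True, True])

# Integer form: positions of the retained features.
indices = np.flatnonzero(mask)

# Either form selects the same columns of a (n_samples, n_features) array.
X = np.arange(10).reshape(2, 5)
by_mask = X[:, mask]
by_index = X[:, indices]
```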
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
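The <component>__<parameter> naming can be illustrated without any library. The class below is a hypothetical toy, not the scikit-learn or divik implementation; it only shows how a double-underscore key can be routed to a nested object.

```python
class ToyEstimator:
    """Hypothetical stand-in for an estimator with nested parameters."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Delegate the remainder to the nested object.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self


pipeline = ToyEstimator(selector=ToyEstimator(p=0.2, keep_top=True), verbose=False)
result = pipeline.set_params(selector__p=0.5, verbose=True)
```

Returning self is what allows calls such as set_params(...).fit(X) to be chained.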
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
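Taken together, transform and inverse_transform amount to column masking and zero-filled restoration. A sketch with a hand-written mask (no selector is fitted here) might look like this:

```python
import numpy as np

mask = np.array([True, False, True, False])
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])

# transform: keep only the selected columns.
X_reduced = X[:, mask]

# inverse_transform: restore the original width, zero-filling dropped columns.
X_restored = np.zeros((X_reduced.shape[0], mask.size), dtype=X_reduced.dtype)
X_restored[:, mask] = X_reduced
```

The round trip is lossy: values in the removed columns come back as zeros, not as their original values.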
- class divik.feature_selection.GMMSelector(stat, neutral=None, use_log=False, n_candidates=None, min_features=1, min_features_rate=0.0, preserve_high=True, max_components=10)[source]¶
Feature selector that removes features with low or high mean or variance
Gaussian Mixture Modeling is applied to the features’ characteristics and components are obtained. Crossing points of the components are considered candidate thresholds. Out of these, up to n_candidates components are removed in such a way that at least min_features or min_features_rate features are retained.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
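A crossing point of two mixture components is a value where their weighted densities are equal. The sketch below solves for such points in closed form for two Gaussian components. It is an illustration of the idea only, not divik's code: the helper gaussian_crossings and all numbers are assumptions, and the mixture itself is taken as given rather than fitted.

```python
import numpy as np


def gaussian_crossings(w1, mu1, sd1, w2, mu2, sd2):
    """Points where w1*N(x; mu1, sd1) equals w2*N(x; mu2, sd2).

    Illustrative helper: equating the log-densities gives a quadratic
    a*x**2 + b*x + c = 0 in x.
    """
    a = 1.0 / (2 * sd2**2) - 1.0 / (2 * sd1**2)
    b = mu1 / sd1**2 - mu2 / sd2**2
    c = (mu2**2 / (2 * sd2**2) - mu1**2 / (2 * sd1**2)
         + np.log(w1 / sd1) - np.log(w2 / sd2))
    if np.isclose(a, 0.0):
        if np.isclose(b, 0.0):
            return np.array([])  # same mean and spread: no single crossing
        # Equal variances: the equation is linear, one crossing.
        return np.array([-c / b])
    disc = b**2 - 4 * a * c
    if disc < 0:
        return np.array([])
    root = np.sqrt(disc)
    return np.sort(np.array([(-b - root) / (2 * a), (-b + root) / (2 * a)]))


# Two equally weighted components with equal spread cross midway.
threshold = gaussian_crossings(0.5, 0.0, 1.0, 0.5, 10.0, 1.0)[0]

# Toy feature characteristic (e.g. per-feature means) filtered by it.
vals = np.array([-0.3, 0.8, 9.5, 10.4, 11.0])
selected = vals > threshold
```

With unequal variances the quadratic can have two roots; which of them is a useful threshold depends on where the component means lie.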
- Parameters
- stat: {‘mean’, ‘var’, ‘cv’}
Kind of statistic to be computed out of the feature.
- neutral: float, optional, default: None
This element will be omitted from the computation of the statistic.
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- n_candidates: int, optional, default: None
How many candidate thresholds to use at most.
0 preserves all the features (all candidate thresholds are discarded), None allows removing all but one component (all candidate thresholds are retained). A negative value means to discard up to all but -n_candidates candidates, e.g. -1 will retain at least two components (one candidate threshold is removed).
- min_features: int, optional, default: 1
How many features must be preserved. Candidate thresholds are tested against this value, and if they retain fewer features, a threshold that retains more features is selected instead.
- min_features_rate: float, optional, default: 0.0
Similar to min_features but relative to the number of input features.
- preserve_high: bool, optional, default: True
Whether to preserve the high-characteristic features or low-characteristic ones.
- max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
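One plausible reading of how min_features, min_features_rate and preserve_high interact with the candidate thresholds is sketched below. The function pick_threshold is hypothetical and the rounding of the rate is an assumption; divik's actual selection logic may differ in detail.

```python
import numpy as np


def pick_threshold(vals, candidates, min_features=1, min_features_rate=0.0,
                   preserve_high=True):
    """Most aggressive candidate that still keeps enough features.

    Hypothetical helper; returns None when no candidate is acceptable.
    """
    vals = np.asarray(vals, dtype=float)
    required = max(min_features, int(np.ceil(min_features_rate * vals.size)))
    # Order candidates from most to least aggressive.
    ordered = sorted(candidates, reverse=preserve_high)
    for threshold in ordered:
        kept = vals >= threshold if preserve_high else vals <= threshold
        if kept.sum() >= required:
            return threshold
    return None


vals = [0.1, 0.2, 5.0, 5.2, 9.8, 10.1]
candidates = [2.5, 7.5]
strict = pick_threshold(vals, candidates, min_features=1)
relaxed = pick_threshold(vals, candidates, min_features=3)
impossible = pick_threshold(vals, candidates, min_features_rate=0.9)
```

Here the highest candidate keeps two features, so it is accepted for min_features=1 but rejected for min_features=3, where the lower candidate is used instead.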
Examples
>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = labels * 5 + np.random.randn(*labels.shape)
>>> fs.GMMSelector('mean').fit_transform(data)
array([[14.78032811 15.35711257 ... 15.75193303]])
>>> fs.GMMSelector('mean', preserve_high=False).fit_transform(data)
array([[ 0.49671415 -0.1382643 ... -0.29169375]])
>>> fs.GMMSelector('mean', n_candidates=-1).fit_transform(data)
array([[10.32408397 9.61491772 ... 15.75193303]])
- Attributes
- vals_: array, shape (n_features,)
Computed characteristic of each feature.
- threshold_: float
Threshold value to filter the features by the characteristic.
- raw_threshold_: float
Threshold value mapped back to characteristic space (no logarithm, etc.)
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None)[source]¶
Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- class divik.feature_selection.HighAbundanceAndVarianceSelector(use_log=False, min_features=1, min_features_rate=0.0, max_components=10)[source]¶
Feature selector that removes low-mean and low-variance features
Uses GMMSelector to filter out the low-abundance noise features and select high-variance informative features.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- min_features: int, optional, default: 1
How many features must be preserved.
- min_features_rate: float, optional, default: 0.0
Similar to min_features but relative to the number of input features.
- max_components: int, optional, default: 10
The maximum number of components used in the GMM decomposition.
Examples
>>> import numpy as np
>>> import divik.feature_selection as fs
>>> np.random.seed(42)
>>> # Data in this case must be carefully crafted
>>> labels = np.concatenate([30 * [0] + 20 * [1] + 30 * [2] + 40 * [3]])
>>> data = np.vstack(100 * [labels * 10.])
>>> data += np.random.randn(*data.shape)
>>> sub = data[:, :-40]
>>> sub += 5 * np.random.randn(*sub.shape)
>>> # Label 0 has low abundance but high variance
>>> # Label 3 has low variance but high abundance
>>> # Labels 1 and 2 have not-lowest abundance and high variance
>>> selector = fs.HighAbundanceAndVarianceSelector().fit(data)
>>> selector.transform(labels.reshape(1,-1))
array([[1 1 1 1 1 ...2 2 2]])
- Attributes
- abundance_selector_: GMMSelector
Selector used to filter out the noise component.
- variance_selector_: GMMSelector
Selector used to filter out the non-informative features.
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
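The two fitted selectors suggest a two-stage filter whose results are merged into one mask. The sketch below shows one way such a composition could work with plain NumPy. Fixed cut-offs stand in for the GMM-derived thresholds, and the order of the stages is an assumption, so treat it as an illustration rather than the library's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples, 6 features: columns differ in mean (abundance) and spread.
means = np.array([0.1, 0.1, 10.0, 10.0, 20.0, 20.0])
scales = np.array([0.5, 5.0, 0.5, 5.0, 0.5, 5.0])
X = means + scales * rng.standard_normal((200, 6))

# Stage 1: drop low-abundance features (fixed cut-off stands in for the GMM).
abundant = X.mean(axis=0) > 5.0

# Stage 2: among the abundant features only, keep the high-variance ones.
variable = X[:, abundant].var(axis=0) > 4.0

# Compose both stages into one mask over the original features.
selected = np.zeros(X.shape[1], dtype=bool)
selected[np.flatnonzero(abundant)[variable]] = True
```

Only features that pass both stages end up selected, which is why a noisy but low-abundance column is still dropped.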
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None)[source]¶
Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- class divik.feature_selection.NoSelector[source]¶
Dummy selector to use when no selection is supposed to be made.
Methods
fit
(X[, y])Pass data forward
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None)[source]¶
Pass data forward
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors to pass.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- class divik.feature_selection.OutlierAbundanceAndVarianceSelector(use_log=False, min_features_rate=0.01, p=0.2)[source]¶
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None)[source]¶
Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- class divik.feature_selection.OutlierSelector(stat, use_log=False, keep_outliers=False)[source]¶
Feature selector that removes outlier features w.r.t. mean or variance
Hubert’s outlier detection is applied to the features’ characteristics and the outlying features are removed.
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- keep_outliers: bool, optional, default: False
When True, keeps outliers instead of inlier features.
- Attributes
- vals_: array, shape (n_features,)
Computed characteristic of each feature.
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
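The effect of stat, use_log and keep_outliers can be sketched with a simpler rule. Note the substitution: the code below uses classic Tukey fences (1.5 times the interquartile range), not the outlier detection method named above, and both helper functions are hypothetical.

```python
import numpy as np


def tukey_outlier_mask(vals, k=1.5):
    """True where a value lies outside the Tukey fences (illustrative)."""
    q1, q3 = np.percentile(vals, [25, 75])
    iqr = q3 - q1
    return (vals < q1 - k * iqr) | (vals > q3 + k * iqr)


def select_features(X, stat="mean", use_log=False, keep_outliers=False):
    """Hypothetical stand-in mirroring the documented parameters."""
    vals = X.mean(axis=0) if stat == "mean" else X.var(axis=0)
    if use_log:
        vals = np.log(vals)  # requires strictly positive characteristics
    outliers = tukey_outlier_mask(vals)
    return outliers if keep_outliers else ~outliers


# Nine features with similar means and one far-off feature.
row = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 50.0])
X = np.tile(row, (20, 1))
selected = select_features(X, stat="mean")
only_outliers = select_features(X, stat="mean", keep_outliers=True)
```

Tukey fences assume a roughly symmetric distribution of the characteristic; a skew-aware rule would place the fences asymmetrically.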
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None)[source]¶
Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- class divik.feature_selection.PercentageSelector(stat, use_log=False, keep_top=True, p=0.2)[source]¶
Feature selector that removes or preserves a top percentage of features
This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters
- stat: {‘mean’, ‘var’}
Kind of statistic to be computed out of the feature.
- use_log: bool, optional, default: False
Whether to use the logarithm of feature characteristic instead of the characteristic itself. This may improve feature filtering performance, depending on the distribution of features, however all the characteristics (mean, variance) have to be positive for that - filtering will fail otherwise. This is useful for specific cases in biology where the distribution of data may actually require this option for any efficient filtering.
- keep_top: bool, optional, default: True
When True, keeps features with highest value of the characteristic.
- p: float, optional, default: 0.2
Rate of features to keep.
- Attributes
- vals_: array, shape (n_features,)
Computed characteristic of each feature.
- threshold_: float
Value of the threshold used for filtering
- selected_: array, shape (n_features,)
Vector of binary selections of the informative features.
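A quantile of the characteristic is a natural way to realise threshold_ for a given p. The sketch below assumes that keep_top=False keeps the lowest-valued fraction p; the helper percentage_mask, the tie handling and the quantile interpolation are assumptions, not taken from divik.

```python
import numpy as np


def percentage_mask(vals, p=0.2, keep_top=True):
    """Keep roughly a fraction p of features by their characteristic.

    Illustrative helper; tie handling and rounding are assumptions.
    """
    vals = np.asarray(vals, dtype=float)
    if keep_top:
        threshold = np.quantile(vals, 1 - p)
        return vals >= threshold, threshold
    threshold = np.quantile(vals, p)
    return vals <= threshold, threshold


vals = np.arange(1.0, 11.0)  # characteristics 1, 2, ..., 10
top_mask, top_threshold = percentage_mask(vals, p=0.2, keep_top=True)
low_mask, low_threshold = percentage_mask(vals, p=0.2, keep_top=False)
```

With ten features and p=0.2, each call keeps two features: the two largest for keep_top=True and the two smallest otherwise.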
Methods
fit
(X[, y])Learn data-driven feature thresholds from X.
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_params
([deep])Get parameters for this estimator.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
set_params
(**params)Set the parameters of this estimator.
transform
(X)Reduce X to the selected features.
- fit(X, y=None)[source]¶
Learn data-driven feature thresholds from X.
- Parameters
- X{array-like, sparse matrix}, shape (n_samples, n_features)
Sample vectors from which to compute feature characteristic.
- yany
Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
- Returns
- self
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
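As an illustration of the masking described above, here is a small NumPy sketch. The function `mask_feature_names` is hypothetical and only mimics the documented behaviour; it is not how the estimator is implemented.

```python
import numpy as np

def mask_feature_names(support, input_features=None):
    """Hypothetical helper: return the names of the selected features."""
    support = np.asarray(support, dtype=bool)
    if input_features is None:
        # Generated names run from x0 to x(n_features - 1).
        input_features = [f"x{i}" for i in range(support.size)]
    names = np.asarray(input_features, dtype=object)
    if names.size != support.size:
        raise ValueError("input_features must have one name per input feature")
    return names[support]

support = [True, False, True]
print(mask_feature_names(support))
print(mask_feature_names(support, ["mz_100", "mz_200", "mz_300"]))
```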
- get_params(deep=True)¶
Get parameters for this estimator.
- Parameters
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- paramsdict
Parameter names mapped to their values.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
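The two return forms can be compared with plain NumPy. This sketch assumes a fitted selector whose boolean mask is the array `selected` below; the values are made up for illustration.

```python
import numpy as np

selected = np.array([True, False, False, True, True])  # assumed fitted mask

mask = selected                  # what indices=False would return
idx = np.flatnonzero(selected)   # what indices=True would return

features = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
# Both forms pick the same entries from a feature vector.
same = bool(np.array_equal(features[mask], features[idx]))
print(mask.shape, idx.shape, same)
```

The mask has one entry per input feature, while the index array has one entry per retained feature.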
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
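The zero-filling behaviour can be sketched in a few lines of NumPy. The helper `zero_fill_inverse` is a hypothetical name used only for this example, and the mask is assumed to come from a fitted selector.

```python
import numpy as np

def zero_fill_inverse(X_reduced, support):
    """Hypothetical helper: put selected columns back, zeros elsewhere."""
    support = np.asarray(support, dtype=bool)
    X_reduced = np.asarray(X_reduced)
    if X_reduced.shape[1] != support.sum():
        raise ValueError("X has a different number of columns than selected features")
    X_full = np.zeros((X_reduced.shape[0], support.size), dtype=X_reduced.dtype)
    X_full[:, support] = X_reduced
    return X_full

support = np.array([True, False, True, False])
X = np.arange(8).reshape(2, 4)
X_reduced = X[:, support]                      # what transform would keep
X_back = zero_fill_inverse(X_reduced, support)
print(X_back)
```

Note that the round trip is lossy: the removed columns come back as zeros, not as their original values.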
- set_params(**params)¶
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **paramsdict
Estimator parameters.
- Returns
- selfestimator instance
Estimator instance.
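To make the `<component>__<parameter>` convention concrete, the sketch below splits a flat parameter dictionary into the estimator's own parameters and those destined for nested components. The function `group_nested_params` and the names `selector` and `verbose` are hypothetical; this only illustrates the naming scheme, not the actual `set_params` code.

```python
def group_nested_params(params):
    """Hypothetical helper: separate own parameters from nested ones."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only, so deeper nesting
        # (a__b__c) is passed down to component "a" as "b__c".
        name, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested

own, nested = group_nested_params(
    {"verbose": True, "selector__p": 0.1, "selector__use_log": True}
)
print(own)
print(nested)
```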
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- class divik.feature_selection.SelectorMixin[source]¶
Transformer mixin that performs feature selection given a support mask
This mixin provides a feature selector implementation with transform and inverse_transform functionality given an implementation of _get_support_mask.
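The pattern can be illustrated with a toy stand-in. Both classes below are hypothetical and written only for this example: `ToySelectorMixin` imitates the idea of deriving `get_support` and `transform` from a single `_get_support_mask` hook, and `NonConstantSelector` is an invented subclass that supplies the hook.

```python
import numpy as np

class ToySelectorMixin:
    """Hypothetical stand-in for the mixin pattern, not the library class."""

    def get_support(self, indices=False):
        mask = self._get_support_mask()
        return np.flatnonzero(mask) if indices else mask

    def transform(self, X):
        return np.asarray(X)[:, self.get_support()]

class NonConstantSelector(ToySelectorMixin):
    """Hypothetical selector keeping features that are not constant."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.selected_ = X.max(axis=0) > X.min(axis=0)
        return self

    def _get_support_mask(self):
        # The only method the subclass must provide for the mixin to work.
        return self.selected_

X = np.array([[1.0, 5.0, 0.0], [2.0, 5.0, 0.0], [3.0, 5.0, 1.0]])
sel = NonConstantSelector().fit(X)
X_new = sel.transform(X)
print(sel.get_support(indices=True), X_new.shape)
```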
Methods
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
transform
(X)Reduce X to the selected features.
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)[source]¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_support(indices=False)[source]¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)[source]¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- class divik.feature_selection.StatSelectorMixin[source]¶
Transformer mixin that performs feature selection given a support mask
This mixin provides a feature selector implementation with transform and inverse_transform functionality given that selected_ is specified during fit.
Additionally, provides _to_characteristics and _to_raw implementations given stat, optionally use_log and preserve_high.
Methods
fit_transform
(X[, y])Fit to data, then transform it.
get_feature_names_out
([input_features])Mask feature names according to selected features.
get_support
([indices])Get a mask, or integer index, of the features selected.
inverse_transform
(X)Reverse the transformation operation.
transform
(X)Reduce X to the selected features.
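The class description mentions helpers that map data to a per-feature characteristic and map a threshold back to raw units, driven by `stat`, `use_log` and `preserve_high`. One plausible reading is sketched below. The exact semantics are an assumption: the free functions `to_characteristics` and `to_raw`, the `STATS` table, and in particular the sign flip tied to `preserve_high` are guesses for illustration and may differ from the library.

```python
import numpy as np

STATS = {"mean": np.mean, "var": np.var}  # assumed set of statistics

def to_characteristics(X, stat="mean", use_log=False, preserve_high=True):
    """Hypothetical: per-feature statistic, optionally logged and sign-flipped."""
    vals = STATS[stat](np.asarray(X, dtype=float), axis=0)
    if use_log:
        vals = np.log(vals)  # requires strictly positive statistics
    # Assumed convention: flip the sign when low values should be preserved.
    return vals if preserve_high else -vals

def to_raw(threshold, use_log=False, preserve_high=True):
    """Hypothetical: undo the steps above in reverse order."""
    if not preserve_high:
        threshold = -threshold
    return np.exp(threshold) if use_log else threshold

X = np.array([[1.0, 4.0], [3.0, 16.0]])
vals = to_characteristics(X, "mean", use_log=True, preserve_high=False)
raw = to_raw(vals, use_log=True, preserve_high=False)
print(raw)  # approximately the column means
```

Under these assumptions, a threshold found on the transformed scale can always be reported in the units of the original statistic.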
- fit_transform(X, y=None, **fit_params)¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
- Xarray-like of shape (n_samples, n_features)
Input samples.
- yarray-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_paramsdict
Additional fit parameters.
- Returns
- X_newndarray array of shape (n_samples, n_features_new)
Transformed array.
- get_feature_names_out(input_features=None)¶
Mask feature names according to selected features.
- Parameters
- input_featuresarray-like of str or None, default=None
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
- feature_names_outndarray of str objects
Transformed feature names.
- get_support(indices=False)¶
Get a mask, or integer index, of the features selected.
- Parameters
- indicesbool, default=False
If True, the return value will be an array of integers, rather than a boolean mask.
- Returns
- supportarray
An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
- inverse_transform(X)¶
Reverse the transformation operation.
- Parameters
- Xarray of shape [n_samples, n_selected_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_original_features]
X with columns of zeros inserted where features would have been removed by transform().
- transform(X)¶
Reduce X to the selected features.
- Parameters
- Xarray of shape [n_samples, n_features]
The input samples.
- Returns
- X_rarray of shape [n_samples, n_selected_features]
The input samples with only the selected features.
- divik.feature_selection.huberta_outliers(v)[source]¶
Outlier detection method based on medcouple statistic.
- Parameters
- v: array-like
An array to filter outliers from.
- Returns
- Binary vector indicating all the outliers.
References
M. Hubert, E. Vandervieren (2008) An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186–5201
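For orientation, the adjusted boxplot rule from the cited paper can be sketched as follows. This is an independent illustration, not the code behind `huberta_outliers`: the function names are hypothetical, the medcouple is computed with a naive O(n²) pairwise scheme, and quartiles use NumPy's default interpolation, so results may differ in detail from the library.

```python
import numpy as np

def medcouple_naive(v):
    """Naive O(n^2) medcouple, a robust skewness measure in [-1, 1]."""
    v = np.sort(np.asarray(v, dtype=float))
    med = np.median(v)
    upper, lower = v[v >= med], v[v <= med]
    xi, xj = np.meshgrid(upper, lower, indexing="ij")
    diff = xi - xj
    nz = diff > 0
    h = np.zeros_like(diff)
    h[nz] = ((xi - med) - (med - xj))[nz] / diff[nz]
    # Pairs where both points equal the median use a sign kernel instead.
    k = np.count_nonzero(v == med)
    i, j = np.meshgrid(np.arange(1, k + 1), np.arange(1, k + 1), indexing="ij")
    h[~nz] = np.sign(i + j - 1 - k).ravel()
    return float(np.median(h))

def adjusted_boxplot_outliers(v):
    """Hypothetical helper: flag points outside skewness-adjusted fences."""
    v = np.asarray(v, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    mc = medcouple_naive(v)
    if mc >= 0:
        low = q1 - 1.5 * np.exp(-4.0 * mc) * iqr
        high = q3 + 1.5 * np.exp(3.0 * mc) * iqr
    else:
        low = q1 - 1.5 * np.exp(-3.0 * mc) * iqr
        high = q3 + 1.5 * np.exp(4.0 * mc) * iqr
    return (v < low) | (v > high)

v = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
flags = adjusted_boxplot_outliers(v)
print(np.flatnonzero(flags))  # positions flagged as outliers
```

When the medcouple is zero, the fences reduce to the classic 1.5 × IQR boxplot rule; positive skewness widens the upper fence and tightens the lower one.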