fatf.utils.data.feature_selection.sklearn.lasso_path

fatf.utils.data.feature_selection.sklearn.lasso_path(dataset: numpy.ndarray, target: numpy.ndarray, weights: Optional[numpy.ndarray] = None, features_number: Optional[int] = None, features_percentage: int = 100) → numpy.ndarray

Selects the specified number of features based on Lasso path coefficients.

New in version 0.0.2.

It may not be possible to select exactly the specified number of features because no step of the Lasso path has that many non-zero coefficients. In that case the largest available subset of features that is smaller than the specified number is returned. If all of the features are assigned a weight of 0, or if every step of the path has more non-zero coefficients than the specified number, all of the features are selected. Whenever the exact number of features requested by the user cannot be selected, an appropriate message is logged. Also, if the value of features_percentage results in selecting 0 features, 1 feature is selected and a warning is logged.
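The fallback rules above can be sketched as follows. This is only an illustration under stated assumptions, not the library's implementation: the helper name select_from_path is hypothetical, and the coefficient matrix is assumed to have one row per feature and one column per step of the path.

```python
import numpy as np


def select_from_path(coefs, features_number):
    """Pick feature indices from path coefficients (hypothetical helper).

    coefs is assumed to have shape (n_features, n_steps): column j holds
    the coefficients of every feature at step j of the path.
    """
    n_features = coefs.shape[0]
    best = np.array([], dtype=int)

    # Scan every step and keep the largest non-empty set of non-zero
    # coefficients that does not exceed the requested number.
    for step in range(coefs.shape[1] - 1, -1, -1):
        nonzero = np.nonzero(coefs[:, step])[0]
        if 0 < nonzero.size <= features_number and nonzero.size > best.size:
            best = nonzero

    # No usable step (all coefficients are zero, or every step has too
    # many non-zero coefficients): fall back to all of the features.
    if best.size == 0:
        return np.arange(n_features)
    return best


# Toy path: the steps have 0, 1, 3 and 3 non-zero coefficients.
toy_coefs = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.2, 0.3],
    [0.0, 0.0, 0.0, 0.0],
])
two_requested = select_from_path(toy_coefs, 2)
three_requested = select_from_path(toy_coefs, 3)
```

With this toy path, asking for 2 features yields only 1 because no step has exactly 2 non-zero coefficients, whereas asking for 3 can be satisfied exactly.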

The weights provided as the input parameter are incorporated into the feature selection process by centering the dataset around their weighted average (if no weights are provided, the average is simply not weighted) and scaling by the square root of the weights. The target array is treated in the same way.
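A minimal NumPy sketch of the transformation described above may make it more concrete. The function name weight_and_centre is hypothetical and the code only mirrors the prose (weighted centering followed by scaling with the square root of the weights); it is not taken from the library source.

```python
import numpy as np


def weight_and_centre(dataset, target, weights=None):
    """Centre and scale data as described above (hypothetical helper)."""
    if weights is None:
        # Equal weights make the weighted average a plain average.
        weights = np.ones(dataset.shape[0])
    sqrt_weights = np.sqrt(weights)

    # Weighted average of every feature (column) and of the target.
    data_mean = np.average(dataset, axis=0, weights=weights)
    target_mean = np.average(target, weights=weights)

    # Centre around the weighted average, then scale each row by the
    # square root of its weight.
    weighted_data = (dataset - data_mean) * sqrt_weights[:, np.newaxis]
    weighted_target = (target - target_mean) * sqrt_weights
    return weighted_data, weighted_target


example_data = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]])
example_target = np.array([0.0, 1.0, 2.0])
example_weights = np.array([1.0, 1.0, 2.0])

plain_data, plain_target = weight_and_centre(example_data, example_target)
scaled_data, scaled_target = weight_and_centre(
    example_data, example_target, example_weights)
```

Scaling by the square root of the weights means that an ordinary least-squares fit on the transformed arrays minimises the weighted squared error on the centred data.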

This feature selection method is based on LIME (Local Interpretable Model-agnostic Explanations). The original implementation can be found in the lime.lime_base.LimeBase.feature_selection method in the official LIME package.

Parameters
dataset : numpy.ndarray

A 2-dimensional numpy array holding a data set.

target : numpy.ndarray

The class/probability/regression values of each row in the input data set.

weights : numpy.ndarray, optional (default=None)

An array of (importance) weights for each data point in the input data set. If None, all of the data points are equally important when computing the Lasso path.

features_number : integer, optional (default=None)

The number of (top) features to be selected. If None, the top x% of the features are selected, where x is given by the features_percentage parameter. If the specified number of features cannot be extracted, a warning is logged and the next largest subset of features (smaller than the specified number) is selected.

features_percentage : integer, optional (default=100)

The percentage of (top) features to be selected. By default all of the features are returned if features_number is None.
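The interplay between features_number and features_percentage can be sketched as below. The helper resolve_features_number is hypothetical, and the rounding rule is an assumption: this documentation does not state how the percentage is converted into a count, so rounding down is used purely for illustration.

```python
import logging
import warnings

logger = logging.getLogger(__name__)


def resolve_features_number(n_features, features_number=None,
                            features_percentage=100):
    """Work out how many features to request (hypothetical helper)."""
    if features_number is None:
        # Assumption: the percentage is converted by rounding down.
        features_number = int(n_features * features_percentage / 100)
        if features_number == 0:
            # The percentage selects nothing: fall back to 1 feature.
            logger.warning('The percentage selects 0 features; '
                           'selecting 1 feature instead.')
            features_number = 1
    elif features_number > n_features:
        # More features requested than available: select all of them.
        warnings.warn('features_number exceeds the number of features; '
                      'all of the features are selected.', UserWarning)
        features_number = n_features
    return features_number


half = resolve_features_number(10, features_percentage=50)
at_least_one = resolve_features_number(10, features_percentage=5)
explicit = resolve_features_number(10, features_number=3)
```

An explicit features_number takes precedence here, so features_percentage is only consulted when features_number is None.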

Returns
feature_indices : numpy.ndarray

Array with indices of features selected by the Lasso path.

Raises
IncorrectShapeError

The dataset array is not 2-dimensional. The target array is not 1-dimensional. The number of elements in the target array differs from the number of samples in the dataset array. The weights array is not 1-dimensional. The number of weights in the weights array differs from the number of samples in the dataset array.

TypeError

One of the dataset, target or weights array is not purely numerical. The features_number parameter is not an integer. The features_percentage parameter is not an integer.

ValueError

The features_number parameter is not a positive integer. The features_percentage parameter is outside of the allowed range 0–100 (inclusive).

Warns
UserWarning

The specified features_number is larger than the number of features in the dataset array; all of the features are selected.