fatf.utils.data.discretisation.Discretiser

class fatf.utils.data.discretisation.Discretiser(dataset: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, feature_names: Optional[List[str]] = None)

An abstract class that all discretiser implementations should inherit from.

New in version 0.0.2.

The initialiser input parameters are validated with the fatf.utils.data.discretisation._validate_input_discretiser function. This abstract class also contains an abstract discretise method and its input validator, _validate_input_discretise. Child classes should overwrite the discretise method and call _validate_input_discretise inside it to validate its input.

If you need extra initialisation capabilities, you may overwrite the __init__ method. In that case, remember to call super().__init__() at its top to make sure that all of the abstract class attributes are validated and initialised.

Warning

The feature_value_names and feature_bin_boundaries class attributes must be overwritten by every child class. The first attribute is of Dictionary[Column Index, Dictionary[integer, string]] type: the outer dictionary maps a column (feature) index of the input dataset to an inner dictionary whose keys are the discretised bin ids for that feature and whose values are (string) descriptions of these bins. For example, if we discretised a feature into quartiles, the inner dictionary would be {0: 'feature <= q1', 1: 'q1 < feature <= q2', 2: 'q2 < feature <= q3', 3: 'q3 < feature'}, where q1, q2 and q3 are the quartile boundaries.

The feature_bin_boundaries attribute should be overwritten with a dictionary whose keys are column (feature) indices and whose values are numpy arrays holding the bin boundaries for each feature. Using the above example, the value for that feature would be numpy.array([q1, q2, q3]). (By default the upper bin boundary should be inclusive.)
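As an illustration of the two attributes described above, the following minimal sketch builds them for a single numerical feature split into quartiles, with the upper boundary of each bin inclusive. It uses plain numpy only; the sample values, the column index 0 and the description strings are assumptions made for this example, not output produced by fatf.

```python
import numpy as np

# Hypothetical single-feature sample; in practice this is one dataset column.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Quartile boundaries q1, q2 and q3.
q1, q2, q3 = np.percentile(feature, [25, 50, 75])

# Column index -> numpy array of bin boundaries.
feature_bin_boundaries = {0: np.array([q1, q2, q3])}

# Column index -> {bin id -> human-readable bin description}.
feature_value_names = {
    0: {
        0: 'feature <= {:.2f}'.format(q1),
        1: '{:.2f} < feature <= {:.2f}'.format(q1, q2),
        2: '{:.2f} < feature <= {:.2f}'.format(q2, q3),
        3: '{:.2f} < feature'.format(q3),
    }
}

# right=True makes the upper boundary of every bin inclusive.
bin_ids = np.digitize(feature, feature_bin_boundaries[0], right=True)
```

Every bin id in bin_ids is a key of the inner feature_value_names dictionary, so a discretised value can always be mapped back to its description.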

Parameters

dataset (numpy.ndarray): A 2-dimensional numpy array with a dataset to be discretised.
categorical_indices (List[column indices], optional (default=None)): A list of column indices that should be treated as categorical features. If None is given, this will be inferred from the dataset array: string-based columns will be treated as categorical features and numerical columns will be treated as numerical features.
feature_names (List[strings], optional (default=None)): A list of feature names in the order they appear in the dataset array. If None, this will be extracted from the dataset array. For structured arrays these will be the column names extracted from the dtype; for classic arrays these will be numbers indicating the column index in the array.
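The inference rule described for categorical_indices (string-based columns are categorical, numerical columns are numerical) can be sketched with plain numpy as follows. The infer_column_types helper is hypothetical and written only for this illustration; it is not part of fatf and the library's own implementation may differ.

```python
import numpy as np


def infer_column_types(dataset):
    """Hypothetical helper: split column indices by their dtype kind."""
    if dataset.dtype.names is not None:
        # Structured array: columns are indexed by field name.
        indices = list(dataset.dtype.names)
        kinds = [dataset.dtype[name].kind for name in indices]
    else:
        # Classic array: columns are indexed by position and share one dtype.
        indices = list(range(dataset.shape[1]))
        kinds = [dataset.dtype.kind] * len(indices)

    # 'S' (bytes) and 'U' (unicode) are numpy's string dtype kinds.
    categorical = [i for i, kind in zip(indices, kinds) if kind in 'SU']
    numerical = [i for i, kind in zip(indices, kinds) if kind not in 'SU']
    return categorical, numerical


structured = np.array(
    [(1.5, 'a'), (2.5, 'b')], dtype=[('height', 'f8'), ('group', 'U1')])
classic = np.array([[1.0, 2.0], [3.0, 4.0]])

structured_split = infer_column_types(structured)
classic_split = infer_column_types(classic)
```

For the structured array the indices are the field names taken from the dtype, while for the classic array they are integer column positions, mirroring how feature names are extracted when feature_names is None.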

Attributes

dataset_dtype (numpy.dtype): The dtype of the input dataset.
is_structured (boolean): True if the input dataset is a structured numpy array, False otherwise.
features_number (integer): The number of features (columns) in the input dataset.
categorical_indices (List[Column Indices]): A list of column indices that should be treated as categorical features.
numerical_indices (List[Column Indices]): A list of column indices that should be treated as numerical features.
feature_names_map (Dict[Column Index, String]): A dictionary that holds a mapping of column (feature) indices to their names (feature names). If the feature_names parameter was not given (None), the feature names are inferred from the dataset.
feature_value_names (Dictionary[Index, Dictionary[Integer, String]]): A dictionary mapping dataset column (feature) indices to dictionaries holding the description (value) of each discrete value (key) for that feature.
feature_bin_boundaries (Dictionary[Index, numpy.ndarray]): A dictionary mapping dataset column (feature) indices to numpy arrays holding bin boundaries (with the upper threshold inclusive) for each feature.

Raises

IncorrectShapeError: The input dataset is not a 2-dimensional numpy array.
IndexError: Some of the column indices given in the categorical_indices list are invalid for the input dataset.
TypeError: The dataset is not of a base (numerical and/or string) type. The categorical_indices is neither a Python list nor None. The feature_names is neither a Python list nor None or one of its elements (if it is a list) is not a string.
ValueError: The length of the feature_names list is different than the number of columns (features) in the input dataset.

Warns

UserWarning: If some of the string-based columns in the input data array were not indicated to be categorical features by the user (via the categorical_indices parameter), the user is warned that they will be added to the list of categorical features.

Methods

discretise(dataset)

Discretises non-categorical (numerical) features in the dataset.

discretise(dataset: Union[numpy.ndarray, numpy.void]) → numpy.ndarray

Discretises non-categorical (numerical) features in the dataset.

This is an abstract method that must be implemented for each discretiser object that inherits from Discretiser. This method should return a numpy.ndarray with all non-categorical columns (features) of the input dataset being discretised.

Warning

When implementing this method please remember to call assert self._validate_input_discretise(dataset) to validate the input parameters.

Parameters

dataset (Union[numpy.ndarray, numpy.void]): A data point (1-D) or an array (2-D) of data points to be discretised.

Returns

discretised_data (numpy.ndarray): A discretised data array.

Raises

NotImplementedError: This is an abstract method and has not been implemented.
IncorrectShapeError: The input dataset is neither a 1- nor a 2-dimensional numpy array. The number of features (columns) in the input dataset is different than the number of features in the dataset used to initialise this object.
TypeError: The dtype of the input dataset is too different from the dtype of the dataset used to initialise this object.
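The subclassing contract described on this page (call super().__init__ first, overwrite feature_value_names and feature_bin_boundaries, validate the input inside discretise) can be sketched as below. To keep the example runnable without fatf installed, _StandInDiscretiser is a hypothetical, heavily simplified stand-in for the abstract class: it supports classic (unstructured) numerical arrays only and raises ValueError where the library raises its own exceptions. MedianDiscretiser is likewise an invented example class. With fatf available, the child class would inherit from fatf.utils.data.discretisation.Discretiser instead.

```python
import numpy as np


class _StandInDiscretiser:
    """Hypothetical stand-in for the abstract class (classic arrays only)."""

    def __init__(self, dataset, categorical_indices=None, feature_names=None):
        self.features_number = dataset.shape[1]
        self.categorical_indices = sorted(categorical_indices or [])
        self.numerical_indices = [
            i for i in range(self.features_number)
            if i not in self.categorical_indices
        ]
        self.feature_value_names = {}
        self.feature_bin_boundaries = {}

    def _validate_input_discretise(self, dataset):
        # Simplified check: 1-D or 2-D input with the expected column count.
        if dataset.ndim not in (1, 2):
            raise ValueError('The dataset must be 1- or 2-dimensional.')
        if dataset.shape[-1] != self.features_number:
            raise ValueError('Unexpected number of features.')
        return True


class MedianDiscretiser(_StandInDiscretiser):
    """Hypothetical child: splits each numerical feature at its median."""

    def __init__(self, dataset, categorical_indices=None, feature_names=None):
        # Initialise the parent first so that its attributes exist.
        super().__init__(dataset, categorical_indices, feature_names)
        for idx in self.numerical_indices:
            median = np.median(dataset[:, idx])
            self.feature_bin_boundaries[idx] = np.array([median])
            self.feature_value_names[idx] = {
                0: 'feature <= {:.2f}'.format(median),
                1: '{:.2f} < feature'.format(median),
            }

    def discretise(self, dataset):
        # Validate the input before doing any work.
        assert self._validate_input_discretise(dataset)
        discretised = dataset.astype(float)  # astype returns a copy
        for idx, boundaries in self.feature_bin_boundaries.items():
            # right=True keeps the upper bin boundary inclusive.
            discretised[..., idx] = np.digitize(
                dataset[..., idx], boundaries, right=True)
        return discretised


data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
discretiser = MedianDiscretiser(data)
binned = discretiser.discretise(data)       # 2-D array of data points
single = discretiser.discretise(data[0])    # 1-D single data point
```

Indexing with `[..., idx]` lets the same loop handle both a single data point and a 2-D array, which matches the two input shapes that discretise is documented to accept.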
