fatf.utils.data.discretisation.Discretiser

class fatf.utils.data.discretisation.Discretiser(dataset: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, feature_names: Optional[List[str]] = None)

An abstract class that all discretiser implementations should inherit from.

New in version 0.0.2.

The initialiser's input parameters are validated by the fatf.utils.data.discretisation._validate_input_discretiser function. This abstract class also defines an abstract discretise method and its input validator, _validate_input_discretise. Child classes should override the discretise method and call _validate_input_discretise therein to validate its input.

If you need extra initialisation capabilities, you may override the __init__ method. In that case, remember to call super().__init__() at its top so that all of the abstract class attributes are validated and initialised.
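The override pattern described above can be sketched without the library. This is a minimal illustration only: StandInDiscretiser is a hypothetical, dependency-free stand-in for the abstract base class (it performs no real validation and takes plain lists rather than numpy arrays), and MedianDiscretiser is a hypothetical child class.

```python
from abc import ABC, abstractmethod


class StandInDiscretiser(ABC):
    """Hypothetical stand-in for the abstract base class (not fatf code)."""

    def __init__(self, dataset, categorical_indices=None):
        # The real initialiser validates its inputs and sets more attributes;
        # this stand-in only sets the few that the child below relies on.
        self.features_number = len(dataset[0])
        self.categorical_indices = sorted(categorical_indices or [])
        self.numerical_indices = [
            index for index in range(self.features_number)
            if index not in self.categorical_indices
        ]
        # Placeholders that every child class is expected to fill in.
        self.feature_value_names = {}
        self.feature_bin_boundaries = {}

    @abstractmethod
    def discretise(self, dataset):
        """Must be implemented by every child class."""


class MedianDiscretiser(StandInDiscretiser):
    """Hypothetical child that splits each numerical feature at its median."""

    def __init__(self, dataset, categorical_indices=None):
        # Parent initialiser first, so the shared attributes already exist.
        super().__init__(dataset, categorical_indices)
        for index in self.numerical_indices:
            column = sorted(row[index] for row in dataset)
            middle = len(column) // 2
            if len(column) % 2:
                median = column[middle]
            else:
                median = (column[middle - 1] + column[middle]) / 2
            self.feature_bin_boundaries[index] = [median]
            self.feature_value_names[index] = {
                0: 'feature <= {}'.format(median),
                1: 'feature > {}'.format(median),
            }

    def discretise(self, dataset):
        # Upper boundary inclusive: values equal to the median go to bin 0.
        return [
            [
                int(value > self.feature_bin_boundaries[index][0])
                if index in self.feature_bin_boundaries else value
                for index, value in enumerate(row)
            ]
            for row in dataset
        ]


data = [[1.0, 'a'], [2.0, 'b'], [3.0, 'a'], [10.0, 'b']]
discretiser = MedianDiscretiser(data, categorical_indices=[1])
binned = discretiser.discretise(data)
```

Calling the parent initialiser before anything else means that numerical_indices is already available when the child computes its bin boundaries.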

Warning

The feature_value_names and feature_bin_boundaries class attributes must be overridden by every child class. The first attribute is of Dictionary[Column Index, Dictionary[integer, string]] type: the outer dictionary maps a column (feature) index of the input dataset to an inner dictionary whose keys are the discretised bin ids for that feature and whose values are the (string) descriptions of these bins. For example, if a feature vector were discretised into quartiles, the inner dictionary would be {0: 'feature <= q1', 1: 'q1 < feature <= q2', 2: 'q2 < feature <= q3', 3: 'feature > q3'}, where q1, q2 and q3 are the quartile boundaries.

The feature_bin_boundaries attribute should be overridden with a dictionary whose keys are column (feature) indices and whose values are numpy arrays holding the bin boundaries for each feature. Using the above example, this would be numpy.array([q1, q2, q3]). (By default the upper bin boundary should be inclusive.)
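As a rough illustration of the two attributes for the quartile example, the sketch below builds both dictionaries for a single feature and looks up a bin id with an inclusive upper boundary. The boundary values and the bin_id helper are hypothetical, the descriptions spell out the inclusive boundary with <=, and plain lists stand in for numpy arrays so that the snippet needs nothing beyond the standard library.

```python
from bisect import bisect_left

# Hypothetical quartile boundaries of the feature stored in column 0.
q1, q2, q3 = 2.0, 4.0, 6.0

# Column index -> sorted bin boundaries. The documentation above asks for
# a numpy array; a list is used here only to keep the sketch dependency-free.
feature_bin_boundaries = {0: [q1, q2, q3]}

# Column index -> {bin id -> description of that bin}.
feature_value_names = {
    0: {
        0: 'feature <= {}'.format(q1),
        1: '{} < feature <= {}'.format(q1, q2),
        2: '{} < feature <= {}'.format(q2, q3),
        3: 'feature > {}'.format(q3),
    }
}


def bin_id(column, value):
    """Returns the bin id of a value, with each upper boundary inclusive."""
    return bisect_left(feature_bin_boundaries[column], value)


for value in (2.0, 2.5, 6.0, 7.5):
    print(value, '->', feature_value_names[0][bin_id(0, value)])
```

With numpy arrays, numpy.digitize(values, boundaries, right=True) assigns the same upper-inclusive bin ids.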

Parameters
dataset : numpy.ndarray

A 2-dimensional numpy array with a dataset to be discretised.

categorical_indices : List[column indices], optional (default=None)

A list of column indices that should be treated as categorical features. If None is given, this will be inferred from the dataset array: string-based columns will be treated as categorical features and numerical columns will be treated as numerical features.

feature_names : List[strings], optional (default=None)

A list of feature names in the order they appear in the dataset array. If None, this will be extracted from the dataset array. For structured arrays these will be the column names extracted from the dtype; for classic arrays these will be numbers indicating the column index in the array.

Attributes
dataset_dtype : numpy.dtype

The dtype of the input dataset.

is_structured : boolean

True if the input dataset is a structured numpy array, False otherwise.

features_number : integer

The number of features (columns) in the input dataset.

categorical_indices : List[Column Indices]

A list of column indices that should be treated as categorical features.

numerical_indices : List[Column Indices]

A list of column indices that should be treated as numerical features.

feature_names_map : Dict[Column Index, String]

A dictionary that maps column (feature) indices to their names (feature names). If the feature_names parameter was not given (None), the feature names are inferred from the dataset.

feature_value_names : Dictionary[Index, Dictionary[Integer, String]]

A dictionary mapping dataset column (feature) indices to dictionaries that hold the description (value) of each discrete value (key) for that feature.

feature_bin_boundaries : Dictionary[Index, numpy.ndarray]

A dictionary mapping dataset column (feature) indices to numpy arrays holding bin boundaries (with the upper threshold inclusive) for each feature.

Raises
IncorrectShapeError

The input dataset is not a 2-dimensional numpy array.

IndexError

Some of the column indices given in the categorical_indices list are invalid for the input dataset.

TypeError

The dataset is not of a base (numerical and/or string) type. The categorical_indices parameter is neither a Python list nor None. The feature_names parameter is neither a Python list nor None, or one of its elements (if it is a list) is not a string.

ValueError

The length of the feature_names list differs from the number of columns (features) in the input dataset.

Warns
UserWarning

If some of the string-based columns in the input data array were not indicated to be categorical features by the user (via the categorical_indices parameter), the user is warned that they will be added to the list of categorical features.
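The documented handling of categorical_indices (inference when None is given, an IndexError for invalid indices, and a UserWarning when string-based columns are added) can be approximated as follows. This is a sketch of the described behaviour and not the library's source: resolve_categorical_indices is a hypothetical helper, it works on lists of rows instead of numpy arrays, and it only handles integer column indices.

```python
import warnings


def resolve_categorical_indices(rows, categorical_indices=None):
    """Hypothetical helper mirroring the documented behaviour (not fatf code)."""
    n_columns = len(rows[0])
    # Columns holding at least one string value are treated as string-based.
    string_columns = {
        index for index in range(n_columns)
        if any(isinstance(row[index], str) for row in rows)
    }
    if categorical_indices is None:
        # Nothing specified: infer the categorical columns from the data.
        categorical = string_columns
    else:
        categorical = set(categorical_indices)
        invalid = sorted(i for i in categorical if not 0 <= i < n_columns)
        if invalid:
            raise IndexError('Invalid column indices: {}.'.format(invalid))
        missing = string_columns - categorical
        if missing:
            # String-based columns cannot be numerical, so they are added.
            warnings.warn(
                'String-based columns {} were added to the categorical '
                'features.'.format(sorted(missing)), UserWarning)
            categorical |= missing
    numerical = [i for i in range(n_columns) if i not in categorical]
    return sorted(categorical), numerical


rows = [[1.0, 'a', 3], [2.0, 'b', 4]]
inferred = resolve_categorical_indices(rows)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    extended = resolve_categorical_indices(rows, [2])
```

In the second call the user marks only the numerical column 2 as categorical, so the string-based column 1 is added and a single UserWarning is recorded.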

Methods

discretise(dataset)

Discretises non-categorical (numerical) features in the dataset.

discretise(dataset: Union[numpy.ndarray, numpy.void]) → numpy.ndarray

Discretises non-categorical (numerical) features in the dataset.

This is an abstract method that must be implemented by each discretiser object that inherits from Discretiser. This method should return a numpy.ndarray with all non-categorical columns (features) of the input dataset discretised.

Warning

When implementing this method please remember to call assert self._validate_input_discretise(dataset) to validate the input parameters.
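The assert-based validation call works because the validator either raises an exception or returns True. The sketch below imitates that contract with hypothetical, dependency-free stand-ins: validate_input_discretise, the local IncorrectShapeError class and the BOUNDARIES dictionary are illustrative only, and lists replace numpy arrays.

```python
from bisect import bisect_left


class IncorrectShapeError(Exception):
    """Local stand-in for the library's exception of the same name."""


# Hypothetical bin boundaries: column 0 is split at 0.0 (upper inclusive).
BOUNDARIES = {0: [0.0]}


def validate_input_discretise(data, features_number):
    """Hypothetical validator: returns True or raises, never returns False."""
    # A flat sequence is a single data point (1-D), otherwise rows (2-D).
    rows = data if isinstance(data[0], (list, tuple)) else [data]
    for row in rows:
        if len(row) != features_number:
            raise IncorrectShapeError(
                'Expected {} features, got {}.'.format(
                    features_number, len(row)))
    return True


def discretise(data, features_number=2):
    # The validator returns True on success, so it can sit inside an assert.
    assert validate_input_discretise(data, features_number), 'Invalid input.'
    is_single_point = not isinstance(data[0], (list, tuple))
    rows = [data] if is_single_point else data
    binned = [
        [
            bisect_left(BOUNDARIES[index], value)
            if index in BOUNDARIES else value
            for index, value in enumerate(row)
        ]
        for row in rows
    ]
    # Return the same dimensionality that was passed in.
    return binned[0] if is_single_point else binned


single = discretise([-1.0, 'a'])
batch = discretise([[0.5, 'a'], [0.0, 'b']])
```

One consequence of wrapping the validator in an assert statement is that running Python with the -O flag strips the assertion and therefore skips the validation altogether.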

Parameters
dataset : Union[numpy.ndarray, numpy.void]

A data point (1-D) or an array (2-D) of data points to be discretised.

Returns
discretised_data : numpy.ndarray

A discretised data array.

Raises
NotImplementedError

This is an abstract method and has not been implemented.

IncorrectShapeError

The input dataset is neither a 1- nor a 2-dimensional numpy array. The number of features (columns) in the input dataset differs from the number of features in the dataset used to initialise this object.

TypeError

The dtype of the input dataset is too different from the dtype of the dataset used to initialise this object.