fatf.utils.data.discretisation.Discretiser
class fatf.utils.data.discretisation.Discretiser(dataset: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, feature_names: Optional[List[str]] = None)
An abstract class that all discretiser implementations should inherit from.
New in version 0.0.2.
The validation of the initialiser input parameters is done via the fatf.utils.data.discretise._validate_input_discretiser function. This abstract class also contains an abstract discretise method and its input validator _validate_input_discretise. The discretise method should be overridden in the child classes, and the _validate_input_discretise method should be called therein to validate its input.
If you need extra initialisation capabilities, you may override the __init__ method, in which case please remember to call super().__init__() at its top to make sure that all of the abstract class attributes are validated and initialised.
Warning
The feature_value_names and feature_bin_boundaries class attributes must be overwritten by every child class. The first attribute is of Dictionary[Column Index, Dictionary[integer, string]] type, where the outer dictionary maps a column (feature) index of the input dataset to a dictionary whose keys are the discretised bin ids for that feature and whose values are string descriptions of these bins. For example, if we discretised a feature vector into quartiles, the inner dictionary would be {0: 'feature < q1', 1: 'q1 < feature < q2', 2: 'q2 < feature < q3', 3: 'feature > q3'}, where q1, q2 and q3 are quartile boundaries.
The feature_bin_boundaries attribute should be overwritten with a dictionary whose keys are column (feature) indices and whose values are numpy arrays holding bin boundaries for each feature. Using the above example, this would be numpy.array([q1, q2, q3]). (By default the upper bin boundary should be inclusive.)
- Parameters
- dataset : numpy.ndarray
A 2-dimensional numpy array with a dataset to be discretised.
- categorical_indices : List[column indices], optional (default=None)
A list of column indices that should be treated as categorical features. If None is given, this will be inferred from the dataset array: string-based columns will be treated as categorical features and numerical columns will be treated as numerical features.
- feature_names : List[strings], optional (default=None)
A list of feature names in the order they appear in the dataset array. If None, this will be extracted from the dataset array. For structured arrays these will be the column names extracted from the dtype; for classic arrays these will be numbers indicating the column index in the array.
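The inference described for categorical_indices and feature_names can be sketched with plain numpy. This is an illustration of the documented behaviour under stated assumptions, not the library's own code: the helper name infer_columns is hypothetical, and rendering classic-array column indices as strings is an assumption.

```python
import numpy as np


def infer_columns(dataset):
    """Return (feature_names, categorical_indices, numerical_indices).

    Hypothetical helper; it only mirrors the behaviour described above.
    """
    if dataset.dtype.names is not None:
        # Structured array: columns are indexed by their field names.
        indices = list(dataset.dtype.names)
        kinds = [dataset.dtype[name].kind for name in indices]
        feature_names = list(indices)
    else:
        # Classic array: a single dtype, columns indexed by position.
        indices = list(range(dataset.shape[1]))
        kinds = [dataset.dtype.kind] * len(indices)
        # Assumption: positional indices are rendered as strings.
        feature_names = [str(i) for i in indices]

    # 'S' (bytes) and 'U' (unicode) are numpy's string dtype kinds.
    categorical = [i for i, k in zip(indices, kinds) if k in "SU"]
    numerical = [i for i, k in zip(indices, kinds) if k not in "SU"]
    return feature_names, categorical, numerical


structured = np.array(
    [(1.5, "a", 3), (2.5, "b", 4)],
    dtype=[("height", "f8"), ("group", "U1"), ("age", "i8")])
classic = np.array([[0.1, 0.2], [0.3, 0.4]])

print(infer_columns(structured))
print(infer_columns(classic))
```

For the structured array, the string-based group column ends up in the categorical list and the other two columns in the numerical list; for the classic float array, every column is numerical.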
- Attributes
- dataset_dtype : numpy.dtype
The dtype of the input dataset.
- is_structured : boolean
True if the input dataset is a structured numpy array, False otherwise.
- features_number : integer
The number of features (columns) in the input dataset.
- categorical_indices : List[Column Indices]
A list of column indices that should be treated as categorical features.
- numerical_indices : List[Column Indices]
A list of column indices that should be treated as numerical features.
- feature_names_map : Dict[Column Index, String]
A dictionary that maps column (feature) indices to their names (feature names). If the feature_names parameter was not given (None), the feature names are inferred from the dataset.
- feature_value_names : Dictionary[Index, Dictionary[Integer, String]]
A dictionary mapping dataset column (feature) indices to dictionaries holding a description (value) of each discrete value (key) for that feature.
- feature_bin_boundaries : Dictionary[Index, numpy.ndarray]
A dictionary mapping dataset column (feature) indices to numpy arrays holding bin boundaries (with the upper threshold inclusive) for each feature.
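To make the shape of the feature_bin_boundaries and feature_value_names attributes concrete, here is a minimal sketch that builds both for quartile bins on a classic numerical array. The function name quartile_descriptions and the exact wording of the bin descriptions are assumptions made for illustration; the descriptions use "<=" on the upper side to reflect the inclusive upper boundary.

```python
import numpy as np


def quartile_descriptions(dataset, numerical_indices):
    """Build both dictionaries for quartile bins (hypothetical helper)."""
    feature_bin_boundaries = {}
    feature_value_names = {}
    for index in numerical_indices:
        column = dataset[:, index]
        # Three boundaries (q1, q2, q3) split the feature into four bins.
        q1, q2, q3 = np.quantile(column, [0.25, 0.5, 0.75])
        feature_bin_boundaries[index] = np.array([q1, q2, q3])
        # Keys are bin ids; values are human-readable bin descriptions.
        feature_value_names[index] = {
            0: f"feature <= {q1:.2f}",
            1: f"{q1:.2f} < feature <= {q2:.2f}",
            2: f"{q2:.2f} < feature <= {q3:.2f}",
            3: f"{q3:.2f} < feature",
        }
    return feature_bin_boundaries, feature_value_names


data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0],
                 [5.0, 50.0]])
boundaries, value_names = quartile_descriptions(data, [0, 1])
print(boundaries[0])
print(value_names[0])
```

Both dictionaries share the same keys (the numerical column indices), so a bin id produced for a column can be looked up directly in that column's inner dictionary.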
- Raises
- IncorrectShapeError
The input dataset is not a 2-dimensional numpy array.
- IndexError
Some of the column indices given in the categorical_indices list are invalid for the input dataset.
- TypeError
The dataset is not of a base (numerical and/or string) type. The categorical_indices is neither a Python list nor None. The feature_names is neither a Python list nor None, or one of its elements (if it is a list) is not a string.
- ValueError
The length of the feature_names list is different from the number of columns (features) in the input dataset.
- Warns
- UserWarning
If some of the string-based columns in the input data array were not indicated to be categorical features by the user (via the categorical_indices parameter), the user is warned that they will be added to the list of categorical features.
Methods
discretise(dataset) Discretises non-categorical (numerical) features in the dataset.
discretise(dataset: Union[numpy.ndarray, numpy.void]) → numpy.ndarray
Discretises non-categorical (numerical) features in the dataset.
This is an abstract method that must be implemented for each discretiser object that inherits from Discretiser. This method should return a numpy.ndarray with all non-categorical columns (features) of the input dataset being discretised.
Warning
When implementing this method please remember to call assert self._validate_input_discretise(dataset) to validate the input parameters.
- Parameters
- dataset : Union[numpy.ndarray, numpy.void]
A data point (1-D) or an array (2-D) of data points to be discretised.
- Returns
- discretised_data : numpy.ndarray
A discretised data array.
- Raises
- NotImplementedError
This is an abstract method and has not been implemented.
- IncorrectShapeError
The input dataset is neither a 1- nor a 2-dimensional numpy array. The number of features (columns) in the input dataset is different from the number of features in the dataset used to initialise this object.
- TypeError
The dtype of the input dataset is too different from the dtype of the dataset used to initialise this object.
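As a rough illustration of what a discretise implementation might do for classic numerical arrays, the sketch below maps values to bin ids with numpy.digitize, using right=True so that each bin's upper boundary is inclusive. The function discretise_sketch is a hypothetical standalone stand-in, not a fatf method: it skips the library's input validation, raises a plain ValueError where the library documents IncorrectShapeError, and returns a float array, all of which are simplifying assumptions.

```python
import numpy as np


def discretise_sketch(dataset, feature_bin_boundaries):
    """Discretise the listed columns of a 1-D data point or a 2-D array."""
    if dataset.ndim not in (1, 2):
        # Stand-in for the library's IncorrectShapeError.
        raise ValueError("The dataset must be 1- or 2-dimensional.")
    # Work on a copy so that the caller's array is left untouched.
    discretised = np.array(dataset, dtype=float)
    for index, boundaries in feature_bin_boundaries.items():
        # `[..., index]` selects the feature for both 1-D and 2-D inputs.
        # right=True means boundaries[i - 1] < x <= boundaries[i] -> bin i.
        discretised[..., index] = np.digitize(
            dataset[..., index], boundaries, right=True)
    return discretised


# Only column 0 is numerical-to-discretise here; column 1 is left as is.
bounds = {0: np.array([2.0, 3.0, 4.0])}
data = np.array([[1.0, 7.0],
                 [2.0, 8.0],
                 [2.5, 9.0],
                 [4.0, 7.0],
                 [5.0, 8.0]])
print(discretise_sketch(data, bounds))
print(discretise_sketch(np.array([3.0, 9.0]), bounds))
```

Columns without an entry in the boundaries dictionary pass through unchanged, which mirrors the requirement that only non-categorical features are discretised.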