fatf.utils.data.discretisation.Discretiser
class fatf.utils.data.discretisation.Discretiser(dataset: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, feature_names: Optional[List[str]] = None)
An abstract class that all discretiser implementations should inherit from.
New in version 0.0.2.
The initialiser input parameters are validated by the fatf.utils.data.discretisation._validate_input_discretiser function. This abstract class also contains an abstract discretise method and its input validator, _validate_input_discretise. The discretise method should be overwritten in the child classes, and the _validate_input_discretise method should be called therein to validate its input.
If you need extra initialisation capabilities, you may overwrite the __init__ method, in which case please remember to call super().__init__() at its top to make sure that all of the abstract class attributes are validated and initialised.
Warning
The feature_value_names and feature_bin_boundaries class attributes must be overwritten by every child class. The first attribute is of type Dictionary[Column Index, Dictionary[integer, string]]: the outer dictionary maps a column (feature) index of the input dataset to an inner dictionary whose keys are the discretised bin ids for that feature and whose values are the (string) descriptions of these bins. For example, if we discretised a feature vector into quartiles, the inner dictionary would be {0: 'feature < q1', 1: 'q1 < feature < q2', 2: 'q2 < feature < q3', 3: 'feature > q3'}, where q1, q2 and q3 are the quartile boundaries.
The feature_bin_boundaries attribute should be overwritten with a dictionary whose keys are column (feature) indices and whose values are numpy arrays holding the bin boundaries for each feature. Using the above example, this would be numpy.array([q1, q2, q3]). (By default the upper bin boundary should be inclusive.)
- Parameters
- dataset : numpy.ndarray
A 2-dimensional numpy array with a dataset to be discretised.
- categorical_indices : List[column indices], optional (default=None)
A list of column indices that should be treated as categorical features. If None is given, this will be inferred from the dataset array: string-based columns will be treated as categorical features and numerical columns will be treated as numerical features.
- feature_names : List[strings], optional (default=None)
A list of feature names in the order they appear in the dataset array. If None, these will be extracted from the dataset array. For structured arrays these will be the column names extracted from the dtype; for classic arrays these will be numbers indicating the column index in the array.
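The inference rules described for categorical_indices and feature_names can be illustrated with plain NumPy. This is a minimal sketch and not fatf's own implementation: the helper name infer_columns is hypothetical, and the exact format of the default feature names for classic arrays (here the column index as a string) is an assumption.

```python
import numpy as np


def infer_columns(dataset):
    """Hypothetical helper (not part of fatf).

    Returns (categorical_indices, numerical_indices, feature_names).
    """
    if dataset.dtype.names is not None:
        # Structured array: columns are addressed by field name and each
        # field carries its own dtype.
        indices = list(dataset.dtype.names)
        kinds = [dataset.dtype[name].kind for name in indices]
        feature_names = list(indices)
    else:
        # Classic array: columns are addressed by position and all of
        # them share a single dtype.
        indices = list(range(dataset.shape[1]))
        kinds = [dataset.dtype.kind] * len(indices)
        # Assumption: default names are the column indices as strings.
        feature_names = [str(i) for i in indices]

    # 'U' (unicode) and 'S' (bytes) dtype kinds mark string-based columns.
    categorical = [i for i, k in zip(indices, kinds) if k in 'US']
    numerical = [i for i, k in zip(indices, kinds) if k not in 'US']
    return categorical, numerical, feature_names


structured = np.array(
    [(1.5, 'a'), (2.5, 'b')],
    dtype=[('height', 'f8'), ('group', 'U1')])
classic = np.array([[0.1, 0.2], [0.3, 0.4]])

print(infer_columns(structured))
print(infer_columns(classic))
```

In the structured case the string field is picked up as categorical and the float field as numerical; in the classic case every column is numerical because the array has a single float dtype.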
- Attributes
- dataset_dtype : numpy.dtype
The dtype of the input dataset.
- is_structured : boolean
True if the input dataset is a structured numpy array, False otherwise.
- features_number : integer
The number of features (columns) in the input dataset.
- categorical_indices : List[Column Indices]
A list of column indices that should be treated as categorical features.
- numerical_indices : List[Column Indices]
A list of column indices that should be treated as numerical features.
- feature_names_map : Dict[Column Index, String]
A dictionary that maps column (feature) indices to their names (feature names). If the feature_names parameter was not given (None), the feature names are inferred from the dataset.
- feature_value_names : Dictionary[Index, Dictionary[Integer, String]]
A dictionary mapping dataset column (feature) indices to dictionaries holding the description (value) of each discrete value (key) for that feature.
- feature_bin_boundaries : Dictionary[Index, numpy.ndarray]
A dictionary mapping dataset column (feature) indices to numpy arrays holding bin boundaries (with the upper threshold inclusive) for each feature.
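To make the shape of feature_bin_boundaries and feature_value_names concrete, the sketch below builds both for a single quartile-discretised column using only NumPy. The column index, the sample values and the label format are illustrative assumptions; fatf's concrete discretisers may format bin descriptions differently. The labels here use "<=" to spell out the inclusive upper boundary.

```python
import numpy as np

# Illustrative values of one numerical feature (column).
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
column_index = 0  # hypothetical position of this feature in the dataset

# Quartile boundaries q1, q2 and q3.
q1, q2, q3 = np.percentile(feature, [25, 50, 75])

# One array of bin boundaries per column index.
feature_bin_boundaries = {column_index: np.array([q1, q2, q3])}

# One {bin id: description} dictionary per column index.
feature_value_names = {
    column_index: {
        0: 'feature <= {:.2f}'.format(q1),
        1: '{:.2f} < feature <= {:.2f}'.format(q1, q2),
        2: '{:.2f} < feature <= {:.2f}'.format(q2, q3),
        3: 'feature > {:.2f}'.format(q3),
    }
}

# right=True assigns a value equal to a boundary to the lower bin,
# which makes the upper bin boundary inclusive.
bin_ids = np.digitize(
    feature, feature_bin_boundaries[column_index], right=True)

for value, bin_id in zip(feature, bin_ids):
    print(value, feature_value_names[column_index][int(bin_id)])
```

With three boundaries there are four bins, so the inner dictionary always has one more key than the boundary array has elements.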
- Raises
- IncorrectShapeError
The input dataset is not a 2-dimensional numpy array.
- IndexError
Some of the column indices given in the categorical_indices list are invalid for the input dataset.
- TypeError
The dataset is not of a base (numerical and/or string) type. The categorical_indices is neither a Python list nor None. The feature_names is neither a Python list nor None, or one of its elements (if it is a list) is not a string.
- ValueError
The length of the feature_names list is different from the number of columns (features) in the input dataset.
- Warns
- UserWarning
If some of the string-based columns in the input data array were not indicated to be categorical features by the user (via the categorical_indices parameter), the user is warned that they will be added to the list of categorical features.
Methods
discretise(dataset: Union[numpy.ndarray, numpy.void]): Discretises non-categorical (numerical) features in the dataset.
discretise(dataset: Union[numpy.ndarray, numpy.void]) → numpy.ndarray
Discretises non-categorical (numerical) features in the dataset.
This is an abstract method that must be implemented by each discretiser object that inherits from Discretiser. This method should return a numpy.ndarray with all non-categorical columns (features) of the input dataset discretised.
Warning
When implementing this method, please remember to call assert self._validate_input_discretise(dataset) to validate the input parameters.
- Parameters
- dataset : Union[numpy.ndarray, numpy.void]
A data point (1-D) or an array (2-D) of data points to be discretised.
- Returns
- discretised_data : numpy.ndarray
A discretised data array.
- Raises
- NotImplementedError
This is an abstract method and has not been implemented.
- IncorrectShapeError
The input dataset is neither a 1- nor a 2-dimensional numpy array. The number of features (columns) in the input dataset is different from the number of features in the dataset used to initialise this object.
- TypeError
The dtype of the input dataset is too different from the dtype of the dataset used to initialise this object.
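The subclassing contract described on this page (call super().__init__() first, overwrite feature_value_names and feature_bin_boundaries, validate the input at the top of discretise) can be sketched without fatf installed. Everything below is an assumption made for illustration: DiscretiserStandIn is a simplified stand-in for the real abstract class, it handles only classic (unstructured) numerical arrays, and it raises ValueError where fatf documents IncorrectShapeError. QuartileSketch is a hypothetical child class, not one of fatf's discretisers.

```python
import abc

import numpy as np


class DiscretiserStandIn(abc.ABC):
    """Simplified stand-in for the documented abstract class (not fatf)."""

    def __init__(self, dataset, categorical_indices=None):
        if dataset.ndim != 2:
            raise ValueError('The dataset must be 2-dimensional.')
        self.features_number = dataset.shape[1]
        self.categorical_indices = sorted(categorical_indices or [])
        self.numerical_indices = [
            i for i in range(self.features_number)
            if i not in self.categorical_indices]
        # Child classes are expected to overwrite these two attributes.
        self.feature_value_names = {}
        self.feature_bin_boundaries = {}

    def _validate_input_discretise(self, dataset):
        if dataset.ndim not in (1, 2):
            raise ValueError('The dataset must be 1- or 2-dimensional.')
        if dataset.shape[-1] != self.features_number:
            raise ValueError('Unexpected number of features.')
        return True

    @abc.abstractmethod
    def discretise(self, dataset):
        raise NotImplementedError


class QuartileSketch(DiscretiserStandIn):
    """Hypothetical child class that bins numerical columns by quartile."""

    def __init__(self, dataset, categorical_indices=None):
        # Call the parent initialiser first so its attributes exist.
        super().__init__(dataset, categorical_indices)
        for i in self.numerical_indices:
            q1, q2, q3 = np.percentile(dataset[:, i], [25, 50, 75])
            self.feature_bin_boundaries[i] = np.array([q1, q2, q3])
            self.feature_value_names[i] = {
                0: 'x <= {:.2f}'.format(q1),
                1: '{:.2f} < x <= {:.2f}'.format(q1, q2),
                2: '{:.2f} < x <= {:.2f}'.format(q2, q3),
                3: 'x > {:.2f}'.format(q3),
            }

    def discretise(self, dataset):
        assert self._validate_input_discretise(dataset)
        discretised = np.array(dataset, dtype=float)  # work on a copy
        for i, bounds in self.feature_bin_boundaries.items():
            # The ellipsis handles both a 1-D data point and a 2-D array;
            # right=True keeps the upper bin boundary inclusive.
            discretised[..., i] = np.digitize(
                dataset[..., i], bounds, right=True)
        return discretised


dataset = np.array(
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
disc = QuartileSketch(dataset)

all_rows = disc.discretise(dataset)                  # 2-D input
one_point = disc.discretise(np.array([2.4, 35.0]))   # 1-D input
print(all_rows)
print(one_point)
```

Columns listed in categorical_indices get no entry in feature_bin_boundaries, so discretise leaves them unchanged, matching the documented behaviour of discretising only non-categorical features.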