fatf.utils.data.augmentation.NormalClassDiscovery
class fatf.utils.data.augmentation.NormalClassDiscovery(dataset: numpy.ndarray, predictive_function: Callable[[numpy.ndarray], numpy.ndarray], categorical_indices: Optional[numpy.ndarray] = None, int_to_float: bool = True, classes_number: Optional[int] = None, class_proportion_threshold: float = 0.05, standard_deviation_init: float = 1.0, standard_deviation_increment: float = 0.1)
Sampling data to discover instances spanning all the possible classes.
New in version 0.0.2.
This augmenter ensures that the generated sample has at least a predefined proportion (cf. the class_proportion_threshold parameter) of every possible class. For a specific data point, it samples from a normal distribution centred on this point, incrementally increasing the standard deviation until the required proportion of the samples is assigned by the predictive function to a class different from that of the specified data point. Next, one of the data points found to be in another class is used as the centre of the normal distribution sampling to discover another class. These steps are repeated until all of the classes (each with a satisfactory proportion) are present in the sampled data set. If the sample method is called without a data_row, the starting point for the sampling procedure is the mean of the dataset. For categorical features in the dataset, the values are sampled with replacement, with the probability of each unique value based on the frequency of its appearance in the dataset.
Note
The number of classes when using a classifier.
Consider using the classes_number parameter when using a non-probabilistic predictive_function. For more details please see the description of the classes_number parameter.
(When initialising this class without a user-defined number of classes, via the classes_number parameter, it will log the number of discovered target classes when the predictive_function is a classifier.)
For additional parameters, attributes, warnings and exceptions raised by this class please see the documentation of its parent class: fatf.utils.data.augmentation.Augmentation.
This augmentation approach is similar to the Growing Spheres technique introduced by [LAUGEL2018INVERSE].
- LAUGEL2018INVERSE
Laugel, T., Lesot, M.J., Marsala, C., Renard, X. and Detyniecki, M., 2017. Inverse Classification for Comparison-based Interpretability in Machine Learning. arXiv preprint arXiv:1712.08443.
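The class discovery procedure described above can be illustrated with a minimal, standard-library sketch. It is not the fatf implementation: the discover_other_class helper, the toy_classifier and the fixed random seed are all hypothetical, plain lists stand in for numpy arrays, and only a single discovery step (finding one other class around one centre) is shown.

```python
import random


def discover_other_class(centre, predict, threshold=0.05, sd_init=1.0,
                         sd_increment=0.1, samples_number=50, max_iter=1000,
                         rng=None):
    """Sample around `centre`, widening the spread until another class appears.

    Hypothetical helper: a simplified, single-step illustration only.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    centre_class = predict(centre)
    sd = sd_init
    for _ in range(max_iter):
        # Draw a cloud of points from a normal distribution around the centre.
        sample = [[rng.gauss(x, sd) for x in centre]
                  for _ in range(samples_number)]
        # Keep the points that the predictor assigns to a different class.
        other = [row for row in sample if predict(row) != centre_class]
        if len(other) / samples_number >= threshold:
            return sample, other, sd
        # Not enough points of another class: widen the distribution.
        sd += sd_increment
    raise RuntimeError('max_iter reached without meeting the threshold.')


def toy_classifier(row):
    """Hypothetical two-class rule: class 1 when the first feature exceeds 3."""
    return int(row[0] > 3.0)


sample, other, final_sd = discover_other_class([0.0, 0.0], toy_classifier)
```

In the full procedure, one of the points in other would become the next centre, and the loop would repeat until every class is represented in the required proportion.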
- Parameters
- predictive_function Callable[[numpy.ndarray], numpy.ndarray]
A Python callable, e.g., a function, that is either a classifier or a probabilistic predictor. This function is used to compute the class of the sampled data, which is used to ensure meeting the class_proportion_threshold. A probabilistic function is expected to output a 2-dimensional numpy array with the assigned class being the one with maximum probability. A classifier function is expected to output a 1-dimensional numpy array with class assignment. The predictive_function should require exactly one input parameter: a data array to be predicted.
- classes_number integer, optional (default=None)
The number of classes (target values) modelled by the predictive_function. If the predictive_function is probabilistic, the number of classes is inferred from the width of the probabilities output by the predictive_function. If the predictive_function is a classifier, it is applied to the input dataset and the number of classes is computed as the number of unique elements in the resulting predictions array. Since the latter case may result in not all of the classes being discovered, it is advised to specify the number of classes using this parameter.
- class_proportion_threshold float, optional (default=0.05)
The minimum proportion of data points assigned to a different class by the predictive_function when sampling for each data point as per the procedure described above.
Warning
Setting the class_proportion_threshold parameter.
This augmenter samples a cloud of points for each discovered class, with each cloud holding a 1 / classes_number share of the sampled points. This means that the value of the class_proportion_threshold has to be smaller than this number for the sampling to be successful. For example, for 2 classes and 100 sampled points, 2 clouds of 50 data points each will be generated. By setting the class_proportion_threshold parameter to 0.6, at least 60 points of each class are expected, which cannot be achieved.
- standard_deviation_init float, optional (default=1)
The standard deviation of the normal distribution used for initial sampling around each selected data point.
- standard_deviation_increment float, optional (default=0.1)
The increment used to increase the standard deviation every time the sample does not satisfy the specified class_proportion_threshold or does not contain at least one data point of a yet unseen class.
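The interplay between classes_number and class_proportion_threshold can be made concrete with a small sketch. The two helpers below are hypothetical and are not part of fatf; plain Python lists stand in for the numpy arrays returned by a predictive function, and the feasibility rule simply restates the warning above (the threshold must be smaller than 1 / classes_number).

```python
def infer_classes_number(predictions):
    """Guess the number of classes from the output of a predictive function.

    Hypothetical helper: nested lists mimic a 2-dimensional (probabilistic)
    output, flat lists mimic a 1-dimensional (classifier) output.
    """
    first = predictions[0]
    is_probabilistic = isinstance(first, (list, tuple))
    if is_probabilistic:
        # One column per class: the width of the output gives the count.
        return is_probabilistic, len(first)
    # Classifier output: only classes that actually appear can be counted.
    return is_probabilistic, len(set(predictions))


def threshold_is_feasible(class_proportion_threshold, classes_number):
    """Each class cloud holds a 1 / classes_number share of the sample."""
    return class_proportion_threshold < 1 / classes_number


probabilities = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = ['a', 'b', 'a', 'a']  # a third class 'c' never shows up here

prob_result = infer_classes_number(probabilities)
label_result = infer_classes_number(labels)
feasible_default = threshold_is_feasible(0.05, 2)
feasible_large = threshold_is_feasible(0.6, 2)
```

The labels case shows why passing classes_number explicitly is advisable for classifiers: a class that is never predicted on the dataset cannot be counted.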
- Attributes
- predictive_function Callable[[numpy.ndarray], numpy.ndarray]
The predictive function used to initialise this class.
- is_probabilistic boolean
True if the predictive_function is probabilistic, False otherwise. This attribute is set based on the shape of the numpy array output by the predictive_function: if it is a 2-dimensional array, the predictive_function is assumed to be probabilistic; if it is a 1-dimensional array, the predictive_function is assumed to be a classifier.
- classes_number integer
The number of classes modelled by the predictive_function, either defined by the user when initialising this class or inferred from the output of the predictive_function.
- standard_deviation_init float
The initial value of the standard deviation used to initialise this class.
- standard_deviation_increment float
The standard deviation increment value used to initialise this class.
- class_proportion_threshold float
The minimum proportion of data points of a different class required when sampling, as used to initialise this class.
- categorical_sampling_values Dictionary[column index, Tuple[numpy.ndarray, numpy.ndarray]]
Dictionary mapping categorical column feature indices to tuples consisting of two 1-dimensional numpy arrays: one with unique values for that column and the other one with their normalised (summing up to 1) frequencies.
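The structure of a categorical_sampling_values entry, and how it can drive sampling with replacement, is sketched below. This is an illustration using only the standard library, not fatf's code: the helper name, the example column and the random seed are assumptions, and lists are used in place of numpy arrays.

```python
import random
from collections import Counter


def categorical_sampling_values(column):
    """Return the unique values of a column and their normalised frequencies.

    Hypothetical helper mirroring the (values, frequencies) tuple layout.
    """
    counts = Counter(column)
    values = sorted(counts)
    total = len(column)
    frequencies = [counts[value] / total for value in values]
    return values, frequencies


column = ['red', 'blue', 'red', 'green', 'red',
          'blue', 'red', 'red', 'blue', 'red']
values, frequencies = categorical_sampling_values(column)

# Sample with replacement, weighting each value by its observed frequency.
rng = random.Random(42)
drawn = rng.choices(values, weights=frequencies, k=20)
```

Frequent values (here 'red') are drawn more often, so the sampled categorical feature roughly follows the distribution observed in the dataset.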
- Raises
- IncompatibleModelError
The predictive_function does not require exactly one input parameter.
- RuntimeError
The class initialisation was unable to identify the number of classes using the input dataset and the provided predictive_function. The value of the class_proportion_threshold parameter is too large for the given number of classes (please see the warning in the class_proportion_threshold parameter description for more information).
- TypeError
The predictive_function is not a Python callable. The classes_number is neither None nor an integer. The class_proportion_threshold is not a float. Either standard_deviation_init or standard_deviation_increment is not a number.
- ValueError
The classes_number parameter is smaller than 2. The class_proportion_threshold parameter is outside of the (0, 1) range (non-inclusive). The standard_deviation_init or standard_deviation_increment parameter is not a positive number.
Methods
sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, …) Samples data using normal distribution class discovery process.
sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, samples_number: int = 50, max_iter: int = 1000) → numpy.ndarray
Samples data using normal distribution class discovery process.
For the additional documentation of parameters, warnings and errors please see the description of the fatf.utils.data.augmentation.Augmentation.sample method in the parent fatf.utils.data.augmentation.Augmentation class.
- Parameters
- max_iter integer, optional (default=1000)
The maximum number of iterations for the iterative normal sampling procedure. If the limit is reached before the class_proportion_threshold is satisfied and at least one data point of a yet unseen class is discovered, a RuntimeError is raised. If this is the case you may want to consider initialising the class with a smaller class_proportion_threshold parameter or larger standard_deviation_init and standard_deviation_increment parameters. Alternatively, increasing max_iter may help to discover all of the classes with the other parameters fixed.
- Returns
- samples numpy.ndarray
A numpy array of [samples_number, number of features] shape holding the sampled data.
- Raises
- RuntimeError
The maximum number of iterations was reached without discovering samples from every class (with the specified proportion).
- TypeError
The max_iter parameter is not an integer.
- ValueError
The max_iter parameter is not a positive number (greater than 0).