fatf.utils.data.augmentation.NormalClassDiscovery

class fatf.utils.data.augmentation.NormalClassDiscovery(dataset: numpy.ndarray, predictive_function: Callable[[numpy.ndarray], numpy.ndarray], categorical_indices: Optional[numpy.ndarray] = None, int_to_float: bool = True, classes_number: Optional[int] = None, class_proportion_threshold: float = 0.05, standard_deviation_init: float = 1.0, standard_deviation_increment: float = 0.1)[source]

Sampling data to discover instances spanning all the possible classes.

New in version 0.0.2.

This augmenter ensures that the generated sample contains at least a predefined proportion (cf. the class_proportion_threshold parameter) of every possible class. For a specific data point, it samples from a normal distribution centred around that point, incrementally increasing its standard deviation until the required proportion of the sampled points is assigned (by the predictive function) to a class different from the one of the specified data point. Next, one of the data points discovered to belong to another class is used as the centre of the normal distribution sampling to discover yet another class. These steps are repeated until all of the classes are represented in the sampled data set (each with a satisfying proportion). If the sample method is called without a data_row, the mean of the dataset is used as the starting point of the sampling procedure. Values of categorical features in the dataset are sampled with replacement, with the probability of each unique value derived from the frequency of its appearance in the dataset.
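
A minimal usage sketch is given below; the data set and the predict function are toy objects defined purely for illustration and are not part of this package.

>>> import numpy as np
>>> from fatf.utils.data.augmentation import NormalClassDiscovery
>>>
>>> # A toy, purely numerical data set with two well-separated groups.
>>> dataset = np.array(
...     [[0.0, 0.2], [0.5, 0.1], [4.8, 5.1], [5.2, 4.9]])
>>>
>>> # A toy (non-probabilistic) classifier: class 1 when the sum of the two
>>> # features exceeds 5, class 0 otherwise.
>>> def predict(data):
...     return (data.sum(axis=1) > 5).astype(int)
>>>
>>> augmenter = NormalClassDiscovery(
...     dataset, predict, classes_number=2, class_proportion_threshold=0.1)
>>> samples = augmenter.sample(dataset[0, :], samples_number=50)
>>> samples.shape
(50, 2)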

Note

The number of classes when using a classifier.

Consider using the classes_number parameter when using a non-probabilistic predictive_function. For more details please see the description of the classes_number parameter.

(When this class is initialised without a user-defined number of classes, i.e., without the classes_number parameter, and the predictive_function is a classifier, the number of discovered target classes is logged.)

For additional parameters, attributes, warnings and exceptions raised by this class please see the documentation of its parent class: fatf.utils.data.augmentation.Augmentation.

This augmentation approach is similar to the Growing Spheres technique introduced by [LAUGEL2018INVERSE].

LAUGEL2018INVERSE

Laugel, T., Lesot, M.J., Marsala, C., Renard, X. and Detyniecki, M., 2017. Inverse Classification for Comparison-based Interpretability in Machine Learning. arXiv preprint arXiv:1712.08443.

Parameters
predictive_function : Callable[[numpy.ndarray], numpy.ndarray]

A Python callable, e.g., a function, that is either a classifier or a probabilistic predictor. This function is used to predict the class of the sampled data, which is needed to ensure that the class_proportion_threshold is met. A probabilistic function is expected to output a 2-dimensional numpy array with the assigned class being the one with the maximum probability. A classifier function is expected to output a 1-dimensional numpy array with class assignments. The predictive_function should require exactly one input parameter: a data array to be predicted.
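
For illustration, the following toy functions, assuming a 3-class problem, show the two accepted output formats; they are hypothetical examples and are not part of this package.

>>> import numpy as np
>>>
>>> def classifier(data):
...     # 1-dimensional output: one class label per row of the input array.
...     return np.zeros(data.shape[0], dtype=int)
>>>
>>> def probabilistic(data):
...     # 2-dimensional output: one row of (3) class probabilities per input
...     # row; the assigned class is the one with the largest probability.
...     return np.full((data.shape[0], 3), 1 / 3)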

classes_number : integer, optional (default=None)

The number of classes (target values) modelled by the predictive_function. If the predictive_function is probabilistic, the number of classes is inferred from the width of the probabilities array output by the predictive_function. If the predictive_function is a classifier, it is applied to the input dataset and the number of classes is computed as the number of unique elements in the resulting predictions array. Since the latter case may result in not all of the classes being discovered, it is advised to specify the number of classes explicitly using this parameter.

class_proportion_threshold : float, optional (default=0.05)

The minimum proportion of sampled data points that must be assigned to a different class by the predictive_function when sampling around each data point, as per the procedure described above.

Warning

Setting the class_proportion_threshold parameter.

This augmenter samples a cloud of points for each discovered class, with each cloud containing 1 / classes_number of the sampled points. This means that the value of class_proportion_threshold has to be smaller than this fraction for the sampling to be successful. For example, for 2 classes and 100 sampled points, 2 clouds of 50 data points each will be generated. Setting the class_proportion_threshold parameter to 0.6 would require at least 60 points of each class, which cannot be achieved.
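
The arithmetic behind this example can be checked directly (a sketch only; these variables are not part of the class API):

>>> classes_number = 2
>>> samples_number = 100
>>> class_proportion_threshold = 0.6
>>> # Each class cloud holds samples_number / classes_number data points...
>>> points_per_class_cloud = samples_number // classes_number
>>> # ...but the threshold requires at least this many points per class.
>>> required_points_per_class = class_proportion_threshold * samples_number
>>> points_per_class_cloud >= required_points_per_class
False
>>> # Hence class_proportion_threshold must be smaller than 1 / classes_number.
>>> class_proportion_threshold < 1 / classes_number
False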

standard_deviation_init : float, optional (default=1)

The standard deviation of the normal distribution used for initial sampling around each selected data point.

standard_deviation_increment : float, optional (default=0.1)

The increment used to increase the standard deviation every time the sample either does not satisfy the specified class_proportion_threshold or does not contain at least one data point of a yet unseen class.

Attributes
predictive_function : Callable[[numpy.ndarray], numpy.ndarray]

The predictive function used to initialise this class.

is_probabilistic : boolean

True if the predictive_function is probabilistic, False otherwise. This attribute is set based on the shape of the numpy array output by the predictive_function: if it is a 2-dimensional array, the predictive_function is assumed to be probabilistic; if it is a 1-dimensional array, it is assumed to be a classifier.

classes_number : integer

The number of classes modelled by the predictive_function, either defined by the user when initialising this class or inferred from the output of the predictive_function.

standard_deviation_init : float

The initial value of the standard deviation used to initialise this class.

standard_deviation_increment : float

The standard deviation increment value used to initialise this class.

class_proportion_threshold : float

The smallest proportion of data points of a different class required when sampling, used to initialise this class.

categorical_sampling_values : Dictionary[column index, Tuple[numpy.ndarray, numpy.ndarray]]

Dictionary mapping categorical column feature indices to tuples consisting of two 1-dimensional numpy arrays: one with unique values for that column and the other one with their normalised (summing up to 1) frequencies.
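
For illustration, the content of such a tuple for a single categorical column can be reproduced along these lines (a sketch of the idea rather than the exact internal implementation):

>>> import numpy as np
>>> categorical_column = np.array(['a', 'a', 'b', 'c'])
>>> values, counts = np.unique(categorical_column, return_counts=True)
>>> values
array(['a', 'b', 'c'], dtype='<U1')
>>> counts / counts.sum()
array([0.5 , 0.25, 0.25])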

Raises
IncompatibleModelError

The predictive_function does not require exactly one input parameter.

RuntimeError

The class initialisation was unable to identify the number of classes using the input dataset and the provided predictive_function. The value of the class_proportion_threshold parameter is too large for the given number of classes (please see the warning in the class_proportion_threshold parameter description for more information).

TypeError

The predictive_function is not a Python callable. The classes_number is neither None nor an integer. The class_proportion_threshold is not a float. Either standard_deviation_init or standard_deviation_increment is not a number.

ValueError

The classes_number parameter is smaller than 2. The class_proportion_threshold parameter is outside of the (0, 1) range (non-inclusive). The standard_deviation_init or standard_deviation_increment parameter is not a positive number.

Methods

sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, …)

Samples data using the normal distribution class discovery process.

sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, samples_number: int = 50, max_iter: int = 1000) → numpy.ndarray[source]

Samples data using the normal distribution class discovery process.

For the additional documentation of parameters, warnings and errors please see the description of the fatf.utils.data.augmentation.Augmentation.sample method in the parent fatf.utils.data.augmentation.Augmentation class.

Parameters
max_iter : integer, optional (default=1000)

The maximum number of iterations for the iterative normal sampling procedure. If this limit is reached before the class_proportion_threshold is satisfied and at least one data point of a yet unseen class is discovered, a RuntimeError is raised. If this happens, you may want to initialise the class with a smaller class_proportion_threshold parameter or with larger standard_deviation_init and standard_deviation_increment parameters. Alternatively, increasing max_iter may help to discover all of the classes with the other parameters fixed.

Returns
samples : numpy.ndarray

A numpy array of [samples_number, number of features] shape holding the sampled data.
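
For illustration, assuming the augmenter object from the usage sketch above, sample can also be called without a data_row, in which case the sampling starts at the mean of the dataset; max_iter can be increased if the default limit proves insufficient.

>>> samples = augmenter.sample(samples_number=100, max_iter=2500)
>>> samples.shape
(100, 2)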

Raises
RuntimeError

The maximum number of iterations was reached without discovering samples from every class (with the specified proportion).

TypeError

The max_iter parameter is not an integer.

ValueError

The max_iter parameter is not a positive number (greater than 0).