fatf.utils.data.augmentation.NormalClassDiscovery
class fatf.utils.data.augmentation.NormalClassDiscovery(dataset: numpy.ndarray, predictive_function: Callable[[numpy.ndarray], numpy.ndarray], categorical_indices: Optional[numpy.ndarray] = None, int_to_float: bool = True, classes_number: Optional[int] = None, class_proportion_threshold: float = 0.05, standard_deviation_init: float = 1.0, standard_deviation_increment: float = 0.1)
Sampling data to discover instances spanning all the possible classes.
New in version 0.0.2.
This augmenter ensures that the generated sample contains at least a predefined proportion (cf. the class_proportion_threshold parameter) of every possible class. For a specific data point, it samples from a normal distribution centred on this point, incrementally increasing the standard deviation until the proportion of sampled points that the predictive function assigns to a class other than that of the specified data point reaches the threshold. Next, one of the data points found to be in another class is used as the centre of the normal distribution sampling to discover another class. These steps are repeated until all of the classes (with a satisfactory proportion) are present in the sampled data set.
If the sample method is called without a data_row, the starting point for the sampling procedure is the mean of the dataset.
For categorical features in the dataset, the values are sampled with replacement, with the probability of each unique value based on the frequency of its appearance in the dataset.
Note
The number of classes when using a classifier.
Consider using the classes_number parameter when using a non-probabilistic predictive_function. For more details please see the description of the classes_number parameter.
(When this class is initialised without specifying classes_number and the predictive_function is a classifier, it logs the number of discovered target classes.)
For additional parameters, attributes, warnings and exceptions raised by this class please see the documentation of its parent class: fatf.utils.data.augmentation.Augmentation.
This augmentation approach is similar to the Growing Spheres technique introduced by [LAUGEL2018INVERSE].
- LAUGEL2018INVERSE
Laugel, T., Lesot, M.J., Marsala, C., Renard, X. and Detyniecki, M., 2017. Inverse Classification for Comparison-based Interpretability in Machine Learning. arXiv preprint arXiv:1712.08443.
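The procedure described above can be sketched in plain Python. This is a simplified illustration for numerical features only, not the fatf implementation: the function names, the cloud_size argument, the additive widening of the standard deviation and the toy classifier are all assumptions made for this example.

```python
import random


def discover_classes(start, predict, classes_number, cloud_size=50,
                     threshold=0.05, std_init=1.0, std_increment=0.1,
                     max_iter=1000, seed=0):
    """Toy class discovery loop (hypothetical helper, numerical data only)."""
    rng = random.Random(seed)
    centre = list(start)
    centre_class = predict([centre])[0]
    seen = {centre_class}
    samples = []
    while len(seen) < classes_number:
        std = std_init
        for _ in range(max_iter):
            # Sample a cloud from a normal distribution around the centre.
            cloud = [[rng.gauss(value, std) for value in centre]
                     for _ in range(cloud_size)]
            labels = predict(cloud)
            different = [row for row, label in zip(cloud, labels)
                         if label != centre_class]
            unseen = [(row, label) for row, label in zip(cloud, labels)
                      if label not in seen]
            # Accept the cloud once enough points fall in another class and
            # at least one of them belongs to a class not seen so far.
            if len(different) / cloud_size >= threshold and unseen:
                break
            std += std_increment  # widen the search and try again
        else:
            raise RuntimeError('max_iter reached before finding a new class.')
        samples.extend(cloud)
        # Re-centre on a point from a newly discovered class.
        centre, centre_class = unseen[0]
        seen.add(centre_class)
    return samples, seen


def toy_classifier(rows):
    """Assumed 3-class rule on a single feature (illustration only)."""
    labels = []
    for row in rows:
        if row[0] < 2.0:
            labels.append(0)
        elif row[0] < 5.0:
            labels.append(1)
        else:
            labels.append(2)
    return labels


samples, seen = discover_classes([0.0], toy_classifier, classes_number=3)
```

Starting from a point of class 0, the loop accepts one cloud per newly discovered class, so the sketch returns two clouds of cloud_size points and the set of all three class labels.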
- Parameters
- predictive_function : Callable[[numpy.ndarray], numpy.ndarray]
A Python callable, e.g., a function, that is either a classifier or a probabilistic predictor. This function is used to compute the class of the sampled data, which is used to ensure meeting the class_proportion_threshold. A probabilistic function is expected to output a 2-dimensional numpy array with the assigned class being the one with maximum probability. A classifier function is expected to output a 1-dimensional numpy array with class assignment. The predictive_function should require exactly one input parameter: a data array to be predicted.
- classes_number : integer, optional (default=None)
The number of classes (target values) modelled by the predictive_function. If the predictive_function is probabilistic, the number of classes is inferred from the width of the probabilities output by the predictive_function. If the predictive_function is a classifier, it is applied to the input dataset and the number of classes is computed as the number of unique elements in the resulting predictions array. Since the latter case may result in not all of the classes being discovered, it is advised to specify the number of classes using this parameter.
- class_proportion_threshold : float, optional (default=0.05)
The minimum proportion of data points assigned to a different class by the predictive_function when sampling for each data point as per the procedure described above.
Warning
Setting the class_proportion_threshold parameter.
This augmenter samples a cloud of points for each discovered class, with each cloud holding a 1 / classes_number fraction of the sampled points. This means that the value of class_proportion_threshold has to be smaller than 1 / classes_number for the sampling to be successful. For example, for 2 classes and 100 sampled points, 2 clouds of 50 data points each will be generated. Setting the class_proportion_threshold parameter to 0.6 would require at least 60 points of each class, which cannot be achieved.
- standard_deviation_init : float, optional (default=1.0)
The standard deviation of the normal distribution used for initial sampling around each selected data point.
- standard_deviation_increment : float, optional (default=0.1)
The amount by which the standard deviation is increased every time the sample does not satisfy the specified class_proportion_threshold or does not contain at least one data point of a yet unseen class.
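The rules for inferring the number of classes and for choosing a feasible class_proportion_threshold can be illustrated with a few lines of plain Python. This is a minimal sketch that uses nested lists in place of numpy arrays; the helper names are hypothetical and are not part of fatf.

```python
def infer_classes_number(predictions):
    """Return (classes_number, is_probabilistic) for a predictions output."""
    first = predictions[0]
    if isinstance(first, (list, tuple)):
        # 2-dimensional output: one probability column per class.
        return len(first), True
    # 1-dimensional output: count the distinct labels that were predicted.
    return len(set(predictions)), False


def predicted_classes(probabilities):
    """Assign each row to the class with the maximum probability."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in probabilities]


def threshold_is_feasible(class_proportion_threshold, classes_number):
    """Each cloud holds 1 / classes_number of the sample, so the threshold
    has to stay strictly below that fraction."""
    return class_proportion_threshold < 1 / classes_number


probabilistic_output = [[0.2, 0.8], [0.6, 0.4]]
classifier_output = [0, 0, 0]  # the input data happened to cover one class
```

With the classifier output above only one class would be inferred even if the model distinguishes more, which is the situation in which passing classes_number explicitly is advised. Likewise, threshold_is_feasible(0.6, 2) is False, matching the example in the warning.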
- Attributes
- predictive_function : Callable[[numpy.ndarray], numpy.ndarray]
The predictive function used to initialise this class.
- is_probabilistic : boolean
True if the predictive_function is probabilistic, False otherwise. This attribute is set based on the shape of the numpy array output by the predictive_function: if it is a 2-dimensional array, the predictive_function is assumed to be probabilistic; if it is a 1-dimensional array, the predictive_function is assumed to be a classifier.
- classes_number : integer
The number of classes modelled by the predictive_function, either defined by the user when initialising this class or inferred from the output of the predictive_function.
- standard_deviation_init : float
The initial value of the standard deviation used to initialise this class.
- standard_deviation_increment : float
The standard deviation increment value used to initialise this class.
- class_proportion_threshold : float
The minimum proportion of data points of a different class required when sampling, as used to initialise this class.
- categorical_sampling_values : Dictionary[column index, Tuple[numpy.ndarray, numpy.ndarray]]
Dictionary mapping categorical column feature indices to tuples consisting of two 1-dimensional numpy arrays: one with unique values for that column and the other one with their normalised (summing up to 1) frequencies.
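The structure of categorical_sampling_values, and the frequency-based sampling of categorical features it supports, can be sketched as follows. This is an illustration in plain Python with lists instead of numpy arrays; the helper names and the example column are assumptions, not fatf code.

```python
import random
from collections import Counter


def categorical_sampling_values(column):
    """Return the unique values of a column and their normalised frequencies."""
    counts = Counter(column)
    values = sorted(counts)
    total = len(column)
    frequencies = [counts[value] / total for value in values]
    return values, frequencies


def sample_categorical(column, samples_number, seed=0):
    """Sample with replacement, weighting each value by its frequency."""
    values, frequencies = categorical_sampling_values(column)
    rng = random.Random(seed)
    return rng.choices(values, weights=frequencies, k=samples_number)


column = ['a', 'b', 'a', 'c']
values, frequencies = categorical_sampling_values(column)
sampled = sample_categorical(column, samples_number=10)
```

For the example column, the value 'a' appears in half of the rows, so it is drawn with probability 0.5, while 'b' and 'c' are each drawn with probability 0.25.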
- Raises
- IncompatibleModelError
The predictive_function does not require exactly one input parameter.
- RuntimeError
The class initialisation was unable to identify the number of classes using the input dataset and the provided predictive_function. The value of the class_proportion_threshold parameter is too large for the given number of classes (please see the warning in the class_proportion_threshold parameter description for more information).
- TypeError
The predictive_function is not a Python callable. The classes_number is neither None nor an integer. The class_proportion_threshold is not a float. Either standard_deviation_init or standard_deviation_increment is not a number.
- ValueError
The classes_number parameter is smaller than 2. The class_proportion_threshold parameter is outside of the (0, 1) range (non-inclusive). The standard_deviation_init or standard_deviation_increment parameter is not a positive number.
Methods
sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, …): Samples data using normal distribution class discovery process.
sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, samples_number: int = 50, max_iter: int = 1000) → numpy.ndarray
Samples data using normal distribution class discovery process.
For additional documentation of parameters, warnings and errors please see the description of the fatf.utils.data.augmentation.Augmentation.sample method in the parent fatf.utils.data.augmentation.Augmentation class.
- Parameters
- max_iter : integer, optional (default=1000)
The maximum number of iterations for the iterative normal sampling procedure. If the limit is reached before the sample both satisfies the class_proportion_threshold and contains at least one data point of a yet unseen class, a RuntimeError is raised. If this is the case, consider initialising the class with a smaller class_proportion_threshold parameter or larger standard_deviation_init and standard_deviation_increment parameters. Alternatively, increasing max_iter may help to discover all of the classes with the other parameters fixed.
- Returns
- samples : numpy.ndarray
A numpy array of [samples_number, number of features] shape holding the sampled data.
- Raises
- RuntimeError
The maximum number of iterations was reached without discovering samples from every class (with the specified proportion).
- TypeError
The max_iter parameter is not an integer.
- ValueError
The max_iter parameter is not a positive number (greater than 0).
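Putting the pieces together, a call sequence based on the signatures documented above might look like the sketch below. It has not been checked against a specific fatf release; the toy data set, the label_row rule and the run_example wrapper are assumptions made for illustration, and the imports are guarded so that the snippet still loads where fatf is not installed.

```python
try:
    import numpy as np
    import fatf.utils.data.augmentation as fatf_augmentation
except ImportError:  # numpy or fatf is not installed
    np = None
    fatf_augmentation = None


def label_row(row):
    """Toy rule (an assumption): class 1 if the first feature exceeds 1."""
    return 1 if row[0] > 1.0 else 0


def toy_classifier(data):
    """Non-probabilistic predictor returning a 1-dimensional class array."""
    return np.array([label_row(row) for row in data])


def run_example():
    """Initialise the augmenter and sample around the first data point."""
    if fatf_augmentation is None:
        return None
    dataset = np.array([[0.0, 0.5], [0.2, 0.1], [1.5, 0.3], [2.0, 0.9]])
    # classes_number is given explicitly because the predictor is a classifier.
    augmenter = fatf_augmentation.NormalClassDiscovery(
        dataset,
        toy_classifier,
        classes_number=2,
        class_proportion_threshold=0.05)
    try:
        # Expected shape per the documentation: [samples_number, 2 features].
        return augmenter.sample(dataset[0], samples_number=50, max_iter=1000)
    except RuntimeError:
        # Not every class was found within max_iter iterations; a smaller
        # class_proportion_threshold, larger standard deviation parameters
        # or a larger max_iter may help.
        return None
```

Passing classes_number=2 follows the advice in the note above, since a classifier applied to a small data set may not reveal every class on its own.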