fatf.utils.data.augmentation.TruncatedNormalSampling

class fatf.utils.data.augmentation.TruncatedNormalSampling(dataset: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, int_to_float: bool = True)

Sampling data from a truncated normal distribution.

New in version 0.0.2.

This class samples data according to a truncated normal distribution. Sampling can be performed either around a particular data point (by passing the data_row parameter to the sample method) or around the mean of the whole dataset (if data_row is not given when calling the sample method). In both cases, the standard deviation of each numerical feature, calculated over the whole dataset, is used as the spread, and the minimum and maximum of that feature in the dataset serve as the bounds of the truncated normal distribution. Categorical features are sampled with replacement, with the probability of each unique value given by its frequency in the dataset.
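The numerical part of the description above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name sample_truncated_normal is hypothetical, the truncation is done with plain rejection sampling, and the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np


def sample_truncated_normal(centre, std, low, high, size, rng):
    """Draw `size` values from N(centre, std**2) restricted to [low, high].

    Hypothetical helper using rejection sampling: draw from the
    untruncated normal and keep only the values inside the bounds.
    """
    if std == 0:
        # A constant feature has no spread; return the (clamped) centre.
        return np.full(size, float(np.clip(centre, low, high)))
    accepted = np.empty(0)
    while accepted.size < size:
        draws = rng.normal(loc=centre, scale=std, size=2 * size)
        inside = draws[(draws >= low) & (draws <= high)]
        accepted = np.concatenate([accepted, inside])
    return accepted[:size]


rng = np.random.default_rng(42)
feature = np.array([0.0, 1.0, 2.0, 3.0, 10.0])  # one numerical column

# Spread and bounds come from the whole column in both cases.
std, low, high = feature.std(), feature.min(), feature.max()

# No data_row: centre the sample on the column mean.
around_mean = sample_truncated_normal(feature.mean(), std, low, high, 50, rng)
# With a data_row: centre the sample on that row's value for this feature.
around_point = sample_truncated_normal(2.0, std, low, high, 50, rng)
```

Only the centre differs between the two calls; the standard deviation and the bounds are shared, which matches the behaviour described for sampling with and without data_row.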

For additional parameters, attributes, warnings and exceptions raised by this class, please see the documentation of its parent class: fatf.utils.data.augmentation.Augmentation.

Attributes
numerical_sampling_values : Dictionary[column index, Tuple[number, number, number, number]]

Dictionary mapping the indices of numerical features (columns) to tuples of four numbers: the column's mean, standard deviation, minimum and maximum.

categorical_sampling_values : Dictionary[column index, Tuple[numpy.ndarray, numpy.ndarray]]

Dictionary mapping the indices of categorical features (columns) to tuples of two 1-dimensional numpy arrays: one with the unique values of that column and the other with their normalised frequencies (summing to 1).
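As an illustration of the shape of these two attributes, the sketch below builds equivalent dictionaries for a toy array with plain NumPy. The dataset is made up, and the use of the population standard deviation is an assumption; the class may compute its statistics differently.

```python
import numpy as np

# Toy dataset: columns 0 and 1 are numerical, column 2 holds category codes.
dataset = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 30.0, 1.0],
    [4.0, 40.0, 1.0],
])
categorical_indices = [2]
numerical_indices = [
    i for i in range(dataset.shape[1]) if i not in categorical_indices
]

# index -> (mean, standard deviation, minimum, maximum)
numerical_sampling_values = {}
for index in numerical_indices:
    column = dataset[:, index]
    numerical_sampling_values[index] = (
        column.mean(), column.std(), column.min(), column.max())

# index -> (unique values, normalised frequencies)
categorical_sampling_values = {}
for index in categorical_indices:
    values, counts = np.unique(dataset[:, index], return_counts=True)
    categorical_sampling_values[index] = (values, counts / counts.sum())
```

For this toy data, column 0 has a mean of 2.5 with bounds 1.0 and 4.0, and the categorical column maps the values 0.0 and 1.0 to the frequencies 0.25 and 0.75.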

Methods

sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, …)

Samples new data from a truncated normal distribution.

sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, samples_number: int = 50) → numpy.ndarray

Samples new data from a truncated normal distribution.

If the data_row parameter is given, the sample will be centered around that data point. Otherwise, when data_row is None, the sample will be generated around the mean of the dataset used to initialise this class.

Numerical features are sampled around their corresponding values in the data_row parameter, or around the mean of that feature in the dataset when data_row is not given. In both cases the standard deviation, minimum and maximum calculated from the dataset are used. Categorical features are sampled by drawing with replacement from all the possible values of that feature, with the probability of each value equal to its frequency in the dataset. (This means that the value of a categorical feature in data_row is ignored.)
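The categorical behaviour can be sketched with NumPy's random generator. The values and frequencies below are hypothetical, arranged like one entry of categorical_sampling_values; this illustrates the sampling scheme and is not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unique values of one categorical column and their normalised frequencies
# (made-up numbers for illustration).
values = np.array(['a', 'b', 'c'])
frequencies = np.array([0.5, 0.3, 0.2])

# Draw with replacement according to the dataset frequencies. Whatever
# category a data_row holds for this feature plays no role in the draw.
sampled = rng.choice(values, size=1000, replace=True, p=frequencies)

# Empirical proportions, which should be close to the frequencies above.
proportions = {str(v): float(np.mean(sampled == v)) for v in values}
```

Because the draw depends only on the dataset frequencies, samples generated around two different data points share the same categorical distribution.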

For the documentation of parameters, warnings and errors, please see the description of the fatf.utils.data.augmentation.Augmentation.sample method in the parent fatf.utils.data.augmentation.Augmentation class.