fatf.utils.data.augmentation.Augmentation

class fatf.utils.data.augmentation.Augmentation(dataset: numpy.ndarray, ground_truth: Optional[numpy.ndarray] = None, categorical_indices: Optional[numpy.ndarray] = None, int_to_float: bool = True)

An abstract class for implementing data augmentation methods.

An abstract class that all augmentation classes should inherit from. It contains abstract __init__ and sample methods and an input validator – _validate_sample_input – for the sample method. The validation of the input parameters to the initialisation method is done via the fatf.utils.data.augmentation._validate_input function.

Note

The _validate_sample_input method should be called in all implementations of the sample method in the children classes to ensure that all the input parameters of this method are valid.
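The pattern described in this note can be sketched without the library. The snippet below is a minimal, standalone illustration: BaseSampler and RowRepeater are hypothetical stand-ins (not fatf classes), the checks inside _validate_sample_input are assumed for the example, and __init__ is kept concrete for brevity.

```python
import abc

import numpy as np


class BaseSampler(abc.ABC):
    """Hypothetical stand-in for an abstract augmentation class."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.features_number = dataset.shape[1]

    def _validate_sample_input(self, data_row, samples_number):
        # Assumed checks; the real validator may test more than this.
        if data_row is not None and data_row.shape != (self.features_number,):
            raise ValueError('data_row must have one value per feature.')
        if not isinstance(samples_number, int):
            raise TypeError('samples_number must be an integer.')
        if samples_number < 1:
            raise ValueError('samples_number must be positive.')
        return True

    @abc.abstractmethod
    def sample(self, data_row=None, samples_number=50):
        raise NotImplementedError('sample is not implemented.')


class RowRepeater(BaseSampler):
    """Toy child class: validates first, then repeats a single row."""

    def sample(self, data_row=None, samples_number=50):
        # Validate the inputs before doing any sampling work.
        self._validate_sample_input(data_row, samples_number)
        row = self.dataset[0] if data_row is None else data_row
        return np.tile(row, (samples_number, 1))


sampler = RowRepeater(np.array([[0.0, 1.0], [2.0, 3.0]]))
samples = sampler.sample(np.array([5.0, 6.0]), samples_number=3)
```

Because sample is decorated with abc.abstractmethod, instantiating BaseSampler directly raises a TypeError, so only child classes that implement sample can be created.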

Parameters
dataset : numpy.ndarray

A 2-dimensional numpy array with a dataset to be used for sampling.

ground_truth : numpy.ndarray, optional (default=None)

A 1-dimensional numpy array with labels for the supplied dataset.

categorical_indices : List[column indices], optional (default=None)

A list of column indices that should be treated as categorical features. If None is given, this will be inferred from the data array: string-based columns will be treated as categorical features and numerical columns will be treated as numerical features.

int_to_float : boolean, optional (default=True)

If True, all of the integer dtype columns in the dataset will be generalised to numpy.float64 type. Otherwise, integer type columns will remain integer and floating point type columns will remain floating point.
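To make the categorical_indices inference and the int_to_float behaviour concrete, here is a rough numpy-only sketch. Both helper functions are hypothetical names written for this example and are not part of fatf; they only mirror the behaviour described above.

```python
import numpy as np


def infer_categorical_indices(dataset):
    """Returns the indices of string-based columns (hypothetical helper)."""
    if dataset.dtype.names is not None:
        # Structured array: every named field carries its own dtype.
        return [name for name in dataset.dtype.names
                if dataset.dtype[name].kind in 'SU']
    # Classic array: one dtype applies to every column.
    if dataset.dtype.kind in 'SU':
        return list(range(dataset.shape[1]))
    return []


def generalise_ints(dataset, int_to_float=True):
    """Casts an integer classic array to float64 when requested."""
    if int_to_float and dataset.dtype.kind in 'iu':
        return dataset.astype(np.float64)
    return dataset


numerical = np.array([[0, 1], [2, 3]])
structured = np.array([(0, 'a'), (1, 'b')],
                      dtype=[('num', 'i8'), ('cat', 'U1')])

categorical = infer_categorical_indices(structured)
as_float = generalise_ints(numerical)
as_int = generalise_ints(numerical, int_to_float=False)
```

For a structured array the "column indices" are field names, which is why the helper returns strings in that case and integers for a classic array.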

Attributes
dataset : numpy.ndarray

A 2-dimensional numpy array with a dataset to be used for sampling.

data_points_number : integer

The number of data points in the dataset.

is_structured : boolean

True if the dataset is a structured numpy array, False otherwise.

ground_truth : Union[numpy.ndarray, None]

A 1-dimensional numpy array with labels for the supplied dataset.

categorical_indices : List[column indices]

A list of column indices that should be treated as categorical features.

numerical_indices : List[column indices]

A list of column indices that should be treated as numerical features.

features_number : integer

The number of features (columns) in the input dataset.

sample_dtype : Union[numpy.dtype, List[Tuple[string, numpy.dtype]]]

A dtype with numerical dtypes (in case of a structured data array) generalised to support the assignment of sampled values. For example, if the dtype of a numerical feature is int and the sampling generates float, this dtype will generalise the type of that column to float.
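The purpose of sample_dtype is easiest to see with a structured array. The sketch below uses a hypothetical helper, generalise_dtype, written for this example only; it shows one plausible way of widening integer numerical fields so that float samples can be stored without truncation.

```python
import numpy as np


def generalise_dtype(dtype, categorical_names):
    """Widens integer numerical fields to float64 (hypothetical helper)."""
    if dtype.names is None:
        # Classic array: a single dtype covers all columns.
        return np.dtype(np.float64) if dtype.kind in 'iu' else dtype
    fields = []
    for name in dtype.names:
        field = dtype[name]
        if name not in categorical_names and field.kind in 'iu':
            field = np.dtype(np.float64)
        fields.append((name, field))
    return np.dtype(fields)


data = np.array(
    [(1, 'a', 0.5), (2, 'b', 1.5)],
    dtype=[('age', 'i8'), ('group', 'U1'), ('score', 'f8')])

sample_dtype = generalise_dtype(data.dtype, categorical_names=['group'])

# An array allocated with the widened dtype can hold non-integer samples.
samples = np.zeros((2,), dtype=sample_dtype)
samples['age'] = data['age'] + 0.25
```

Had the samples array kept the original integer dtype for the age field, the assignment would have silently dropped the fractional part.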

Raises
IncorrectShapeError

The input dataset is not a 2-dimensional numpy array. The ground_truth array is not a 1-dimensional numpy array. The number of ground truth annotations differs from the number of rows in the data array.

IndexError

Some of the column indices given in the categorical_indices parameter are not valid for the input dataset.

TypeError

The categorical_indices parameter is neither a list nor None. The dataset or the ground_truth array (if not None) is not of a base (numerical and/or string) type. The int_to_float parameter is not a boolean.

Warns
UserWarning

If some of the string-based columns in the input data array were not marked as categorical features by the user (via the categorical_indices parameter), the user is warned that they will be added to the list of categorical features.
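The checks listed under Raises and Warns can be approximated as follows. This is a simplified sketch for classic (unstructured) arrays only: check_inputs is a hypothetical function, and the IncorrectShapeError class defined here is a local stand-in for the library's exception of the same name.

```python
import warnings

import numpy as np


class IncorrectShapeError(Exception):
    """Local stand-in for the library's exception of the same name."""


def check_inputs(dataset, ground_truth=None, categorical_indices=None):
    """Validates the inputs and returns the categorical column indices."""
    if dataset.ndim != 2:
        raise IncorrectShapeError('The dataset must be 2-dimensional.')
    if ground_truth is not None:
        if ground_truth.ndim != 1:
            raise IncorrectShapeError('The ground truth must be 1-dimensional.')
        if ground_truth.shape[0] != dataset.shape[0]:
            raise IncorrectShapeError('One label per data row is required.')
    if categorical_indices is not None:
        if not isinstance(categorical_indices, list):
            raise TypeError('categorical_indices must be a list or None.')
    indices = set(categorical_indices or [])
    invalid = [i for i in indices if not 0 <= i < dataset.shape[1]]
    if invalid:
        raise IndexError('Invalid column indices: {}.'.format(invalid))
    if dataset.dtype.kind in 'SU':
        # Every column of a string-based classic array is string-based.
        missing = set(range(dataset.shape[1])) - indices
        if missing and categorical_indices is not None:
            warnings.warn(
                'String-based columns were added to the categorical '
                'features: {}.'.format(sorted(missing)), UserWarning)
        indices |= missing
    return sorted(indices)


text_data = np.array([['a', 'b'], ['c', 'd']])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    resolved = check_inputs(text_data, categorical_indices=[0])
```

In this run the second column is string-based but was not listed, so it is added to the result and a UserWarning is recorded in caught.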

Methods

sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, …)

Samples a given number of data points based on the initialisation data.

sample(data_row: Union[numpy.ndarray, numpy.void, None] = None, samples_number: int = 50) → numpy.ndarray

Samples a given number of data points based on the initialisation data.

This is an abstract method that must be implemented by each child class. An implementation should provide two modes of operation:

  • if data_row is None, the sample should be from the distribution of the whole dataset that was used to initialise this class; and

  • if data_row is a numpy array with a data point, the sample should be from the vicinity of this data point.
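The two modes above could look like the following for a purely numerical classic array. This is only an illustrative sketch: the abstract method does not prescribe any distribution, so the per-feature Gaussian used here, the free function sample, and its seed parameter are all assumptions made for the example.

```python
import numpy as np


def sample(dataset, data_row=None, samples_number=50, seed=None):
    """Draws Gaussian samples globally or around a single data point."""
    rng = np.random.default_rng(seed)
    if data_row is None:
        # Global mode: centre the samples on the dataset mean.
        centre = dataset.mean(axis=0)
    else:
        # Local mode: centre the samples on the given data point.
        centre = np.asarray(data_row, dtype=float)
    # Assumed choice: reuse the per-feature spread of the whole dataset.
    spread = dataset.std(axis=0)
    return rng.normal(
        loc=centre, scale=spread, size=(samples_number, dataset.shape[1]))


dataset = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]])

global_samples = sample(dataset, samples_number=1000, seed=42)
local_samples = sample(
    dataset, data_row=np.array([10.0, 10.0]), samples_number=1000, seed=42)
```

Both calls return an array with one row per sample and one column per feature; only the centre of the sampled distribution differs between the two modes.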

Parameters
data_row : Union[numpy.ndarray, numpy.void], optional (default=None)

A data point. If given, the sample will be generated around that point.

samples_number : integer, optional (default=50)

The number of samples to be generated.

Returns
samples : numpy.ndarray

Sampled data.

Raises
NotImplementedError

This is an abstract method and has not been implemented.