fatf.utils.data.density.DensityCheck

class fatf.utils.data.density.DensityCheck(data_set: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, neighbours: int = 7, distance_function: Optional[Callable[[Union[numpy.ndarray, numpy.void], Union[numpy.ndarray, numpy.void]], float]] = None, normalise_scores: bool = True)

Checks and scores density in the data_set and for new data points.

A density score for a particular data point is calculated by looking at the distance to its n-th nearest neighbour, where n is set by the neighbours parameter. If this distance is relatively large (in comparison to all the other point-to-point distances in the data_set), the point lies in a low-density region. The scores can be normalised to the [0, 1] range by setting the normalise_scores parameter to True (the default value). Because a new data point is normalised with the minimum and maximum scores found in the data set, its normalised score may fall outside the [0, 1] range. To prevent this, see the clip parameter of the score_data_point method.
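The scoring idea described above can be sketched with plain numpy. This is a minimal illustration, not the library's implementation: the function name knn_density_scores is hypothetical, only numerical data with a Euclidean distance is handled, and the sketch assumes that a point does not count as its own neighbour.

```python
import numpy as np


def knn_density_scores(data, neighbours=2, normalise=True):
    """Sketch: score each row by the distance to its n-th nearest neighbour."""
    # Pairwise Euclidean distances: a square, symmetric matrix.
    diff = data[:, np.newaxis, :] - data[np.newaxis, :, :]
    distance_matrix = np.sqrt((diff ** 2).sum(axis=2))

    # Sort each row; column 0 holds the zero distance of a point to itself
    # (assumption of this sketch: a point is not its own neighbour).
    sorted_distances = np.sort(distance_matrix, axis=1)
    scores = sorted_distances[:, neighbours]

    if normalise:
        # Subtract the minimum, then divide by the new maximum.
        scores = scores - scores.min()
        scores = scores / scores.max()
    return scores


data = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
scores = knn_density_scores(data, neighbours=2)
# The isolated point [10, 10] receives the largest score (lowest density).
print(scores.argmax())
```

Note that a large score marks a sparse region, so the four corner points of the unit square end up with the smallest scores.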

Parameters
data_set : numpy.ndarray

A 2-dimensional numpy array (either classic or structured) of a base type (strings and/or numbers).

categorical_indices : List[column indices], optional (default=None)

A list of column indices that should be treated as categorical features. If None, the categorical column indices will be inferred by checking the type of the data_set for a classic numpy array and the type of every column for a structured numpy array.

neighbours : integer, optional (default=7)

The number of closest neighbours to be considered when calculating the density score.

distance_function : Callable[[data row, data row], number], optional (default=None)

If None, the sum of the Euclidean distance for numerical features and the binary distance (0 when the values are the same and 1 otherwise) for categorical features will be used as a distance function. Alternatively, the user may provide a Python function that will be used to calculate a distance between two data points. This function takes as input two 1-dimensional numpy arrays (for classic numpy arrays) or numpy voids (for structured numpy arrays) of equal length and outputs a number representing the distance between them. The distance function is assumed to be symmetric, i.e., to return the same distance regardless of the order in which the two inputs are given.

normalise_scores : boolean, optional (default=True)

A boolean parameter indicating whether to normalise the density scores (True) or not (False). The scores are normalised by subtracting the minimum value and dividing by the new (after subtracting the minimum) maximum value.
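As an illustration of what a callable passed as distance_function might look like, the sketch below mimics the default behaviour described above (Euclidean distance on numerical features plus a binary mismatch count on categorical ones). The helper names mixed_distance and pair_distance and the field names age, height and colour are hypothetical; this is not the library's own code.

```python
import numpy as np


def mixed_distance(x, y, numerical, categorical):
    """Sketch: Euclidean distance on numerical fields plus binary mismatches."""
    num_x = np.array([x[name] for name in numerical], dtype=float)
    num_y = np.array([y[name] for name in numerical], dtype=float)
    euclidean = np.sqrt(((num_x - num_y) ** 2).sum())
    # Binary distance: 0 when the values match, 1 otherwise.
    mismatches = sum(int(x[name] != y[name]) for name in categorical)
    return float(euclidean + mismatches)


def pair_distance(x, y):
    """Two required parameters, as expected of a distance function."""
    return mixed_distance(
        x, y, numerical=['age', 'height'], categorical=['colour'])


# Rows of a structured array are numpy voids, indexable by field name.
rows = np.array(
    [(1.0, 2.0, 'red'), (4.0, 6.0, 'blue')],
    dtype=[('age', float), ('height', float), ('colour', 'U4')])
print(pair_distance(rows[0], rows[1]))
```

The same helper also works on rows of a classic numpy array if integer column indices are used in place of field names. The result does not depend on the argument order, which matches the symmetry assumption stated for distance_function.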

Attributes
data_set : numpy.ndarray

A data set used to compute the density scores.

neighbours : integer

The number of neighbours used to calculate the density scores.

normalise_scores : boolean

Indicates whether the scores should be normalised to a [0, 1] range.

distance_matrix : numpy.ndarray

A 2-dimensional, square array, symmetric about its diagonal, holding the distances between every pair of rows in the data_set.

scores : numpy.ndarray

A 1-dimensional array with a density score for every row in the data_set.

scores_min : number

The minimum density score (extracted before the normalisation if one is performed).

scores_max : number

The maximum density score (extracted before the normalisation if one is performed).

_samples_number : integer

The number of data points (rows) in the data_set.

_numerical_indices : List[column indices]

An array holding indices of numerical columns in the data_set array.

_categorical_indices : List[column indices]

An array holding indices of categorical columns in the data_set array.

_is_structured : boolean

Indicates whether the input data_set is a structured array (True) or a classic numpy array (False).

_distance_function : Callable[[data row, data row], number]

A Python function used to calculate distances between data points.

Raises
AttributeError

The distance function does not require exactly 2 non-optional parameters.

IncorrectShapeError

The data_set array is not 2-dimensional.

IndexError

Some of the provided categorical column indices are invalid for the data_set array.

TypeError

The data_set array is not of a base type (strings and/or numbers). The neighbours parameter is not an integer. The distance_function is neither None nor Python callable (a function). The normalise_scores parameter is not a boolean. The categorical_indices parameter is not a Python list.

ValueError

The neighbours parameter is smaller than 1 or larger than the number of instances (rows) in the data_set array.

Warns
UserWarning

If some of the string-based columns in the input data array were not indicated to be categorical features by the user (via the categorical_indices parameter) the user is warned that they will be added to the list of categorical features.

Methods

filter_data_set(alpha)

Returns the data points that are in alpha-dense areas.

score_data_point(data_point, clip)

Calculates a density score for the data_point.

filter_data_set(alpha: float = 0.8) → numpy.ndarray

Returns the data points that are in alpha-dense areas.

Data points in an alpha-dense region have a density score larger than or equal to alpha. For normalised scores alpha should be between 0 and 1, whereas for unnormalised scores it must be larger than or equal to 0.

Parameters
alpha : number, optional (default=0.8)

The score above which instances should be kept.

Returns
filtered_data_set : numpy.ndarray

Data points with density score larger than alpha (extracted from the data_set).

Raises
TypeError

The alpha parameter is not a number.

ValueError

The alpha parameter is not between 0 and 1 for normalised scores or is not larger than or equal to 0 for unnormalised scores.

Warns
UserWarning

Chosen alpha parameter is too large and none of the data points were selected.
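The filtering step described for filter_data_set amounts to a boolean mask over the score array. The sketch below is an illustration only: the function name filter_by_score and the score values are hypothetical, and the inclusive comparison (>=) follows the method description, which is an assumption about how the boundary is treated.

```python
import warnings

import numpy as np


def filter_by_score(data_set, scores, alpha=0.8):
    """Sketch: keep the rows whose density score reaches alpha."""
    selected = data_set[scores >= alpha]
    if selected.shape[0] == 0:
        # Mirrors the documented warning for an overly large alpha.
        warnings.warn(
            'alpha is too large; no data points were selected.',
            UserWarning)
    return selected


data_set = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
# Hypothetical normalised density scores, one per row of data_set.
scores = np.array([0.0, 0.1, 0.85, 1.0])

filtered_data_set = filter_by_score(data_set, scores, alpha=0.8)
print(filtered_data_set)
```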

score_data_point(data_point: Union[numpy.ndarray, numpy.void], clip: bool = True) → float

Calculates a density score for the data_point.

Parameters
data_point : Union[numpy.ndarray, numpy.void]

A data row. For classic numpy arrays this will be a numpy ndarray. For structured numpy arrays this will be a numpy void.

clip : boolean, optional (default=True)

If True and the scores are normalised (this class was initialised with the normalise_scores parameter set to True, which is the default option) the score of the provided data point will be clipped to fit the [0, 1] range. If the scores are not normalised this parameter is ignored.

Returns
score : number

A density score for the data_point.

Raises
IncorrectShapeError

The data point is not a 1-dimensional numpy array (either a numpy ndarray for classic numpy arrays or a numpy void for structured numpy arrays). The data point does not have the same number of columns (features) as the data set used to initialise this class.

TypeError

The data point is not of a base type (strings and/or numbers). The dtype of the data point is too different from the dtype of the data set used to initialise this class. The clip parameter is not a boolean.

Warns
UserWarning

The minimum and maximum score values for this class are the same, therefore the score normalisation cannot be performed. In this case the score will be 0 if it is below the min/max, 1 if it is above the min/max and otherwise it stays the same.
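The effect of the clip parameter and of the degenerate case in the warning above can be sketched as follows. The function normalise_new_score is a hypothetical helper written for this illustration, not part of the library; it assumes the raw score of the new point and the data set's scores_min and scores_max are already known.

```python
def normalise_new_score(raw_score, scores_min, scores_max, clip=True):
    """Sketch: rescale a new point's raw score with the data set's min/max."""
    if scores_max == scores_min:
        # Degenerate case: normalisation cannot be performed.
        if raw_score < scores_min:
            return 0.0
        if raw_score > scores_max:
            return 1.0
        return float(raw_score)

    score = (raw_score - scores_min) / (scores_max - scores_min)
    if clip:
        # Force the score back into the [0, 1] range.
        score = min(1.0, max(0.0, score))
    return float(score)


# A new point that is farther from its neighbours than any training point.
unclipped = normalise_new_score(7.0, 1.0, 5.0, clip=False)
clipped = normalise_new_score(7.0, 1.0, 5.0, clip=True)
print(unclipped, clipped)
```

Without clipping, the new point's score exceeds 1 because its raw score is larger than the maximum observed in the data set.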
