fatf.utils.data.density.DensityCheck
class fatf.utils.data.density.DensityCheck(data_set: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, neighbours: int = 7, distance_function: Optional[Callable[[Union[numpy.ndarray, numpy.void], Union[numpy.ndarray, numpy.void]], float]] = None, normalise_scores: bool = True)
Checks and scores density in the data_set and for new data points.
A density score for a particular data point is calculated by looking at the distance to its n-th neighbour, where n is defined by the neighbours parameter. If this distance is relatively large (in comparison to all the other data point-to-data point distances in the data_set), this particular point lies in a low-density region. The scores can be normalised to the [0, 1] range by setting the normalise_scores parameter to True (the default value). Because the minimum and the maximum scores in the data set are used when computing the (normalised) score of a new data point, that score may fall outside the [0, 1] range. To prevent this, see the clip parameter of the score_data_point method.
- Parameters
- data_set : numpy.ndarray
A 2-dimensional numpy array (either classic or structured) of a base type (strings and/or numbers).
- categorical_indices : List[column indices], optional (default=None)
A list of column indices that should be treated as categorical features. If None, the categorical column indices will be inferred by checking the type of the data_set for a classic numpy array and the type of every column for a structured numpy array.
- neighbours : integer, optional (default=7)
The number of closest neighbours to be considered when calculating the density score.
- distance_function : Callable[[data row, data row], number], optional (default=None)
If None, the sum of the Euclidean distance for numerical features and the binary distance (0 when the values are the same and 1 otherwise) for categorical features will be used as the distance function. Alternatively, the user may provide a Python function that will be used to calculate the distance between two data points. This function takes as input two 1-dimensional numpy arrays (for classic numpy arrays) or numpy voids (for structured numpy arrays) of equal length and outputs a number representing the distance between them. The distance function is assumed to be symmetric, i.e. to return the same distance regardless of the order in which the two data points are given.
- normalise_scores : boolean, optional (default=True)
A boolean parameter indicating whether to normalise the density scores (True) or not (False). The scores are normalised by subtracting the minimum value and dividing by the new (after subtracting the minimum) maximum value.
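The default distance described for distance_function, and a user-supplied alternative, can be sketched as below. This is a minimal illustration, not the library's implementation: the helper names mixed_distance and manhattan_distance, the toy data, and the choice to sum the binary distance over all categorical columns are assumptions made for this example.

```python
import numpy as np


def mixed_distance(row_a, row_b, numerical_idx, categorical_idx):
    """Hypothetical helper: Euclidean part plus binary part."""
    # Euclidean distance over the numerical features.
    num_a = np.array([row_a[i] for i in numerical_idx], dtype=float)
    num_b = np.array([row_b[i] for i in numerical_idx], dtype=float)
    euclidean = float(np.sqrt(np.sum((num_a - num_b) ** 2)))
    # Binary distance: 0 when the values match, 1 otherwise.
    # Assumption: the per-column binary distances are summed.
    binary = sum(0 if row_a[i] == row_b[i] else 1 for i in categorical_idx)
    return euclidean + binary


def manhattan_distance(row_a, row_b):
    """Hypothetical custom distance: exactly two required parameters."""
    return float(np.abs(row_a - row_b).sum())


# Rows of a structured array are numpy.void objects indexed by column name.
structured = np.array(
    [(1.0, 2.0, 'red'), (4.0, 6.0, 'blue')],
    dtype=[('x', float), ('y', float), ('colour', 'U5')])
d_ab = mixed_distance(structured[0], structured[1], ['x', 'y'], ['colour'])
d_ba = mixed_distance(structured[1], structured[0], ['x', 'y'], ['colour'])

# Rows of a classic array are 1-dimensional numpy arrays.
classic = np.array([[1.0, 2.0], [4.0, 6.0]])
manhattan = manhattan_distance(classic[0], classic[1])
```

Both functions give the same value when their arguments are swapped, which matches the symmetry assumption stated for distance_function.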
- Attributes
- data_set : numpy.ndarray
A data set used to compute the density scores.
- neighbours : integer
The number of neighbours used to calculate the density scores.
- normalise_scores : boolean
Indicates whether the scores should be normalised to the [0, 1] range.
- distance_matrix : numpy.ndarray
A 2-dimensional, square array, symmetric about its diagonal, with distances between every pair of rows in the data_set.
- scores : numpy.ndarray
A 1-dimensional array with a density score for every row in the data_set.
- scores_min : number
The minimum density score (extracted before the normalisation if one is performed).
- scores_max : number
The maximum density score (extracted before the normalisation if one is performed).
- _samples_number : integer
The number of data points (rows) in the data_set.
- _numerical_indices : List[column indices]
An array holding indices of numerical columns in the data_set array.
- _categorical_indices : List[column indices]
An array holding indices of categorical columns in the data_set array.
- _is_structured : boolean
Indicates whether the input data_set is a structured array (True) or a classic numpy array (False).
- _distance_function : Callable[[data row, data row], number]
A Python function used to calculate distances between data points.
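How the distance_matrix, scores, scores_min and scores_max attributes relate to each other can be sketched with plain numpy. This is a rough sketch under stated assumptions, not the library's code: the function name density_scores is hypothetical, only numerical features with Euclidean distance are handled, and each point's zero self-distance is counted when picking the n-th neighbour.

```python
import numpy as np


def density_scores(data_set, neighbours, normalise=True):
    """Hypothetical sketch of n-th neighbour density scoring."""
    # Pairwise Euclidean distances: square, symmetric, zero diagonal.
    diff = data_set[:, np.newaxis, :] - data_set[np.newaxis, :, :]
    distance_matrix = np.sqrt((diff ** 2).sum(axis=2))
    # Assumption: every sorted row starts with the zero self-distance,
    # so the n-th neighbour is taken from index neighbours - 1.
    scores = np.sort(distance_matrix, axis=1)[:, neighbours - 1]
    # The minimum and maximum are recorded before any normalisation.
    scores_min, scores_max = scores.min(), scores.max()
    if normalise:
        # Subtract the minimum, then divide by the new maximum.
        # This sketch assumes scores_max differs from scores_min.
        scores = (scores - scores_min) / (scores_max - scores_min)
    return distance_matrix, scores, scores_min, scores_max


# Three points close together and one far away from the rest.
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
distance_matrix, scores, scores_min, scores_max = density_scores(
    data, neighbours=2)
```

In this toy data set the isolated point receives the largest score, in line with the description that a large n-th neighbour distance indicates a low-density region.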
- Raises
- AttributeError
The distance function does not require exactly 2 non-optional parameters.
- IncorrectShapeError
The data_set array is not 2-dimensional.
- IndexError
Some of the provided categorical column indices are invalid for the data_set array.
- TypeError
The data_set array is not of a base type (strings and/or numbers). The neighbours parameter is not an integer. The distance_function is neither None nor a Python callable (a function). The normalise_scores parameter is not a boolean. The categorical_indices parameter is not a Python list.
- ValueError
The neighbours parameter is smaller than 1 or larger than the number of instances (rows) in the data_set array.
- Warns
- UserWarning
If some of the string-based columns in the input data array were not marked as categorical features by the user (via the categorical_indices parameter), the user is warned that they will be added to the list of categorical features.
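The inference of categorical columns and the accompanying warning can be illustrated with numpy dtype checks. The helper name infer_categorical_indices and the warning text are hypothetical, and treating only the numpy string kinds ('U' and 'S') as string-based is an assumption of this sketch rather than a statement about the library's internals.

```python
import warnings

import numpy as np


def infer_categorical_indices(data_set, categorical_indices=None):
    """Hypothetical sketch of categorical column inference."""
    names = data_set.dtype.names
    if names is None:
        # Classic array: a single dtype is shared by every column.
        is_string = data_set.dtype.kind in 'US'
        string_based = list(range(data_set.shape[1])) if is_string else []
    else:
        # Structured array: every column carries its own dtype.
        string_based = [n for n in names if data_set.dtype[n].kind in 'US']
    if categorical_indices is None:
        return string_based
    # String-based columns not listed by the user are added with a warning.
    missing = [i for i in string_based if i not in categorical_indices]
    if missing:
        warnings.warn(
            'String-based columns added to categorical features: '
            '{}.'.format(missing), UserWarning)
    return list(categorical_indices) + missing


structured = np.array(
    [(1.0, 'a'), (2.0, 'b')], dtype=[('x', float), ('colour', 'U1')])
inferred = infer_categorical_indices(structured)

# An empty user list leaves out the string column, which triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    completed = infer_categorical_indices(structured, [])
```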
Methods
filter_data_set(alpha): Returns the data points that are in alpha-dense areas.
score_data_point(data_point, clip): Calculates a density score for the data_point.
filter_data_set(alpha: float = 0.8) → numpy.ndarray
Returns the data points that are in alpha-dense areas.
Data points in an alpha-dense region have a density score larger than or equal to alpha. For normalised scores alpha should be between 0 and 1, whereas for unnormalised scores it must be equal to or larger than 0.
- Parameters
- alpha : number, optional (default=0.8)
The score at or above which instances should be kept.
- Returns
- filtered_data_set : numpy.ndarray
Data points with a density score larger than or equal to alpha (extracted from the data_set).
- Raises
- TypeError
The alpha parameter is not a number.
- ValueError
The alpha parameter is not between 0 and 1 for normalised scores or is not larger than or equal to 0 for unnormalised scores.
- Warns
- UserWarning
The chosen alpha parameter is too large and none of the data points were selected.
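The filtering rule described for filter_data_set amounts to a boolean mask over the scores. The sketch below assumes precomputed normalised scores; the function name filter_by_alpha, the toy values and the warning text are illustrative and not taken from the library.

```python
import warnings

import numpy as np


def filter_by_alpha(data_set, scores, alpha=0.8):
    """Hypothetical sketch: keep rows whose score is at least alpha."""
    keep = scores >= alpha
    if not keep.any():
        # Mirrors the documented warning for an alpha that selects nothing.
        warnings.warn('No data points were selected.', UserWarning)
    return data_set[keep]


data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
scores = np.array([0.0, 0.25, 0.8, 1.0])  # assumed normalised scores
filtered = filter_by_alpha(data, scores, alpha=0.8)

# With low scores only, nothing passes the threshold and a warning is issued.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    empty = filter_by_alpha(data[:2], scores[:2], alpha=0.8)
```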
score_data_point(data_point: Union[numpy.ndarray, numpy.void], clip: bool = True) → float
Calculates a density score for the data_point.
- Parameters
- data_point : Union[numpy.ndarray, numpy.void]
A data row. For classic numpy arrays this will be a numpy ndarray. For structured numpy arrays this will be a numpy void.
- clip : boolean, optional (default=True)
If True and the scores are normalised (this class was initialised with the normalise_scores parameter set to True, which is the default option), the score of the provided data point will be clipped to fit the [0, 1] range. If the scores are not normalised, this parameter is ignored.
- Returns
- score : number
A density score for the data_point.
- Raises
- IncorrectShapeError
The data point is not a 1-dimensional numpy array (either a numpy ndarray for classic numpy arrays or a numpy void for structured numpy arrays). The data point does not have the same number of columns (features) as the data set used to initialise this class.
- TypeError
The data point is not of a base type (strings and/or numbers). The dtype of the data point is too different from the dtype of the data set used to initialise this class. The clip parameter is not a boolean.
- Warns
- UserWarning
The minimum and maximum score values for this class are the same, therefore the score normalisation cannot be performed. In this case the score will be 0 if it is below the min/max, 1 if it is above the min/max and otherwise it stays the same.
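Why a new data point can score outside the [0, 1] range, and what clip does about it, can be sketched as follows. The function name score_new_point and the toy data are hypothetical; the sketch assumes Euclidean distance on numerical features, takes the n-th smallest distance from the new point to the data set, and does not handle the case where the minimum and maximum scores are equal.

```python
import numpy as np


def score_new_point(data_set, data_point, neighbours,
                    scores_min, scores_max, clip=True):
    """Hypothetical sketch of scoring a point outside the data set."""
    distances = np.sqrt(((data_set - data_point) ** 2).sum(axis=1))
    # Assumption: the n-th smallest distance to the data set is the raw score.
    score = np.sort(distances)[neighbours - 1]
    # Normalise with the minimum and maximum found in the data set itself.
    score = (score - scores_min) / (scores_max - scores_min)
    if clip:
        score = min(1.0, max(0.0, score))
    return float(score)


data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
neighbours = 2

# Raw data set scores (self-distance counted, as assumed in this sketch).
pairwise = np.sqrt(
    ((data[:, np.newaxis, :] - data[np.newaxis, :, :]) ** 2).sum(axis=2))
raw_scores = np.sort(pairwise, axis=1)[:, neighbours - 1]
s_min, s_max = raw_scores.min(), raw_scores.max()

near = np.array([0.5, 0.5])   # denser than any point in the data set
far = np.array([30.0, 30.0])  # sparser than any point in the data set

near_clipped = score_new_point(data, near, neighbours, s_min, s_max)
near_raw = score_new_point(data, near, neighbours, s_min, s_max, clip=False)
far_clipped = score_new_point(data, far, neighbours, s_min, s_max)
far_raw = score_new_point(data, far, neighbours, s_min, s_max, clip=False)
```

Without clipping, the near point scores below 0 and the far point above 1, because both lie outside the range of n-th neighbour distances observed in the data set.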