fatf.utils.data.density.DensityCheck
class fatf.utils.data.density.DensityCheck(data_set: numpy.ndarray, categorical_indices: Optional[List[Union[str, int]]] = None, neighbours: int = 7, distance_function: Optional[Callable[[Union[numpy.ndarray, numpy.void], Union[numpy.ndarray, numpy.void]], float]] = None, normalise_scores: bool = True)
Checks and scores density in the data_set and for new data points.
A density score for a particular data point is calculated by looking at the distance to its n-th neighbour, with n defined by the neighbours parameter. If this distance is relatively large (in comparison to all the other point-to-point distances in the data_set), the point lies in a low-density region. The scores can be normalised to the [0, 1] range by setting the normalise_scores parameter to True (the default value). Because the minimum and maximum scores found in the data set are used when normalising the score of a new data point, that score may fall outside the [0, 1] range. To prevent this, see the clip parameter of the score_data_point method.
- Parameters
 - data_set : numpy.ndarray
 A 2-dimensional numpy array (either classic or structured) of a base type (strings and/or numbers).
- categorical_indices : List[column indices], optional (default=None)
 A list of column indices that should be treated as categorical features. If None, the categorical column indices will be inferred by checking the type of the data_set for a classic numpy array and the type of every column for a structured numpy array.
- neighbours : integer, optional (default=7)
 The number of closest neighbours to be considered when calculating the density score.
- distance_function : Callable[[data row, data row], number], optional (default=None)
 If None, the sum of the Euclidean distance for numerical features and the binary distance (0 when the values are the same and 1 otherwise) for categorical features will be used as the distance function. Alternatively, the user may provide a Python function that will be used to calculate the distance between two data points. This function takes as input two 1-dimensional numpy arrays (for classic numpy arrays) or numpy voids (for structured numpy arrays) of equal length and outputs a number representing the distance between them. The distance function is assumed to be symmetric, i.e., to return the same distance regardless of the order in which the two data points are given.
- normalise_scores : boolean, optional (default=True)
 A boolean parameter indicating whether to normalise the density scores (True) or not (False). The scores are normalised by subtracting the minimum value and dividing by the new (after subtracting the minimum) maximum value.
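The default distance described for distance_function can be read as the Euclidean distance over the numerical columns plus one unit for every categorical column whose values differ. The sketch below illustrates that reading only; mixed_distance and its categorical_indices argument are hypothetical names introduced here, not part of the library, and the library's own implementation may differ in detail.

```python
import math


def mixed_distance(row_a, row_b, categorical_indices):
    """Euclidean distance on numerical columns plus 0/1 mismatch per categorical column.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if len(row_a) != len(row_b):
        raise ValueError("Both rows must have the same length.")
    categorical = set(categorical_indices)
    squared_sum = 0.0
    mismatches = 0
    for index, (value_a, value_b) in enumerate(zip(row_a, row_b)):
        if index in categorical:
            # Binary distance: 0 when the values match, 1 otherwise.
            mismatches += 0 if value_a == value_b else 1
        else:
            squared_sum += (float(value_a) - float(value_b)) ** 2
    return math.sqrt(squared_sum) + mismatches


# Numerical part: sqrt(3^2 + 4^2) = 5; categorical part: "red" != "blue" adds 1.
distance = mixed_distance((0.0, 0.0, "red"), (3.0, 4.0, "blue"), categorical_indices=[2])
print(distance)
```

A function of this kind is symmetric in its two row arguments, which matches the assumption the class makes about any user-supplied distance function.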
- Attributes
 - data_set : numpy.ndarray
 A data set used to compute the density scores.
- neighbours : integer
 The number of neighbours used to calculate the density scores.
- normalise_scores : boolean
 Indicates whether the scores should be normalised to a [0, 1] range.
- distance_matrix : numpy.ndarray
 A 2-dimensional, square array, symmetric about its diagonal, with distances between every pair of rows in the data_set.
- scores : numpy.ndarray
 A 1-dimensional array with a density score for every row in the data_set.
- scores_min : number
 The minimum density score (extracted before the normalisation if one is performed).
- scores_max : number
 The maximum density score (extracted before the normalisation if one is performed).
- _samples_number : integer
 The number of data points (rows) in the data_set.
- _numerical_indices : List[column indices]
 An array holding indices of numerical columns in the data_set array.
- _categorical_indices : List[column indices]
 An array holding indices of categorical columns in the data_set array.
- _is_structured : boolean
 Indicates whether the input data_set is a structured array (True) or a classic numpy array (False).
- _distance_function : Callable[[data row, data row], number]
 A Python function used to calculate distances between data points.
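To make the relationship between distance_matrix, scores, scores_min and scores_max concrete, here is a minimal numpy sketch for purely numerical data with a Euclidean distance. The function name knn_density_scores is hypothetical, and the sketch assumes that a point is not counted as its own neighbour; whether the library counts it is not stated in this reference.

```python
import numpy as np


def knn_density_scores(data, neighbours=7, normalise=True):
    """Score every row by the distance to its n-th nearest other row (a sketch)."""
    data = np.asarray(data, dtype=float)
    if not 1 <= neighbours < data.shape[0]:
        raise ValueError("neighbours must be between 1 and the number of rows minus 1.")
    # Pairwise Euclidean distances: square, symmetric, zero on the diagonal.
    differences = data[:, np.newaxis, :] - data[np.newaxis, :, :]
    distance_matrix = np.sqrt((differences ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0), so
    # column `neighbours` holds the distance to the n-th other point.
    # Assumption: the point itself is excluded from its neighbours.
    scores = np.sort(distance_matrix, axis=1)[:, neighbours]
    # The minimum and maximum are taken before any normalisation.
    scores_min, scores_max = scores.min(), scores.max()
    if normalise and scores_max > scores_min:
        # Subtract the minimum, then divide by the new maximum.
        scores = (scores - scores_min) / (scores_max - scores_min)
    return distance_matrix, scores, scores_min, scores_max


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]]
distance_matrix, scores, scores_min, scores_max = knn_density_scores(points, neighbours=2)
print(scores)
```

In this toy data set the isolated point (10, 10) has the largest n-th neighbour distance, so it receives the highest score, while the three clustered points score close to 0.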
- Raises
 - AttributeError
 The distance function does not require exactly 2 non-optional parameters.
- IncorrectShapeError
 The data_set array is not 2-dimensional.
- IndexError
 Some of the provided categorical column indices are invalid for the data_set array.
- TypeError
 The data_set array is not of a base type (strings and/or numbers). The neighbours parameter is not an integer. The distance_function is neither None nor a Python callable (a function). The normalise_scores parameter is not a boolean. The categorical_indices parameter is not a Python list.
- ValueError
 The neighbours parameter is smaller than 1 or larger than the number of instances (rows) in the data_set array.
- Warns
 - UserWarning
 If some of the string-based columns in the input data array were not indicated to be categorical features by the user (via the categorical_indices parameter) the user is warned that they will be added to the list of categorical features.
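The AttributeError listed under Raises concerns custom distance functions that do not take exactly two required parameters. How the library performs this check is not shown in this reference; the sketch below is one way to test a candidate function beforehand using the standard inspect module. The names count_required_parameters, manhattan and one_argument are hypothetical.

```python
import inspect


def count_required_parameters(function):
    """Count parameters without a default value, ignoring *args and **kwargs."""
    required = 0
    for parameter in inspect.signature(function).parameters.values():
        if parameter.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            continue
        if parameter.default is inspect.Parameter.empty:
            required += 1
    return required


def manhattan(row_a, row_b, scale=1.0):
    """Two required parameters plus one optional: an acceptable shape."""
    return scale * sum(abs(a - b) for a, b in zip(row_a, row_b))


def one_argument(row_a):
    """Only one required parameter: not usable as a pairwise distance."""
    return 0.0


print(count_required_parameters(manhattan))
print(count_required_parameters(one_argument))
```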
Methods
filter_data_set(alpha): Returns the data points that are in alpha-dense areas.
score_data_point(data_point, clip): Calculates a density score for the data_point.
filter_data_set(alpha: float = 0.8) → numpy.ndarray
Returns the data points that are in alpha-dense areas.
Data points in an alpha-dense region have a density score larger than or equal to alpha. For normalised scores alpha should be between 0 and 1, whereas for unnormalised scores it must be equal to or larger than 0.
- Parameters
 - alpha : number, optional (default=0.8)
 The score above which instances should be kept.
- Returns
 - filtered_data_set : numpy.ndarray
 Data points with density score larger than alpha (extracted from the data_set).
- Raises
 - TypeError
 The alpha parameter is not a number.
- ValueError
 The alpha parameter is not between 0 and 1 for normalised scores, or is smaller than 0 for unnormalised scores.
- Warns
 - UserWarning
 The chosen alpha parameter is too large and none of the data points were selected.
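The filtering step amounts to a boolean mask over the per-row scores. The sketch below follows the "larger than or equal to alpha" wording of the method description; filter_by_score is a hypothetical stand-alone function, not the method itself, and it takes the scores as an explicit argument rather than reading them from a class instance.

```python
import numpy as np


def filter_by_score(data, scores, alpha=0.8, normalised=True):
    """Keep the rows whose score is at least alpha (a sketch of the idea)."""
    data = np.asarray(data)
    scores = np.asarray(scores, dtype=float)
    if data.shape[0] != scores.shape[0]:
        raise ValueError("Exactly one score is needed per data row.")
    if normalised and not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1 for normalised scores.")
    if alpha < 0:
        raise ValueError("alpha must be equal to or larger than 0.")
    # Boolean mask: True for every row whose score reaches the threshold.
    return data[scores >= alpha]


points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
scores = np.array([0.0, 0.03, 0.5, 1.0])
kept = filter_by_score(points, scores, alpha=0.5)
print(kept)
```

When no score reaches alpha the mask is all False and the result is an empty array, which is the situation the UserWarning above refers to.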
score_data_point(data_point: Union[numpy.ndarray, numpy.void], clip: bool = True) → float
Calculates a density score for the data_point.
- Parameters
 - data_point : Union[numpy.ndarray, numpy.void]
 A data row. For classic numpy arrays this will be a numpy ndarray. For structured numpy arrays this will be a numpy void.
- clip : boolean, optional (default=True)
 If True and the scores are normalised (this class was initialised with the normalise_scores parameter set to True, which is the default option) the score of the provided data point will be clipped to fit the [0, 1] range. If the scores are not normalised this parameter is ignored.
- Returns
 - score : number
 A density score for the data_point.
- Raises
 - IncorrectShapeError
 The data point is not a 1-dimensional numpy array (either a numpy ndarray for classic numpy arrays or a numpy void for structured numpy arrays). The data point does not have the same number of columns (features) as the data set used to initialise this class.
- TypeError
 The data point is not of a base type (strings and/or numbers). The dtype of the data point is too different from the dtype of the data set used to initialise this class. The clip parameter is not a boolean.
- Warns
 - UserWarning
 The minimum and maximum score values for this class are the same, therefore the score normalisation cannot be performed. In this case the score will be 0 if it is below the min/max, 1 if it is above the min/max and otherwise it stays the same.
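The normalisation, clipping and the degenerate min/max case described for this method can be summarised in a few lines of arithmetic. The helper below, normalise_new_score, is a hypothetical illustration that takes an already computed raw score together with the data set's minimum and maximum scores; it is not the library's code.

```python
def normalise_new_score(raw_score, scores_min, scores_max, clip=True):
    """Normalise a new point's raw score with the data set's min and max (a sketch)."""
    if scores_max == scores_min:
        # Normalisation is impossible: 0 below the min/max, 1 above it,
        # and the raw value is returned unchanged when it equals the min/max.
        if raw_score < scores_min:
            return 0.0
        if raw_score > scores_max:
            return 1.0
        return float(raw_score)
    score = (raw_score - scores_min) / (scores_max - scores_min)
    if clip:
        # Force the result back into the [0, 1] range.
        score = min(1.0, max(0.0, score))
    return score


# A new point that is farther from its neighbours than any point in the data set.
unclipped = normalise_new_score(15.0, scores_min=1.0, scores_max=11.0, clip=False)
clipped = normalise_new_score(15.0, scores_min=1.0, scores_max=11.0, clip=True)
print(unclipped, clipped)
```

The unclipped value exceeds 1 because the raw score lies above the maximum observed in the data set, which is exactly the case the clip parameter is meant to handle.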