fatf.accountability.data.measures.sampling_bias
fatf.accountability.data.measures.sampling_bias(dataset: numpy.ndarray, column_index: Union[int, str], groupings: Optional[List[Union[float, Tuple[str]]]] = None, numerical_bins_number: int = 5, treat_as_categorical: Optional[bool] = None) → Tuple[List[int], numpy.ndarray, List[str]]
Computes information needed for evaluating and remedying sampling bias.
Computes three things: the number of instances in each sub-population defined by the input parameters, the weights that can be used for cost-sensitive learning to mitigate the sampling bias, and the name of each sub-population (in terms of the selected feature and its values).
Note
To evaluate the sampling bias in terms of a binary True/False answer, please use the fatf.accountability.data.measures.sampling_bias_check function, or the fatf.accountability.data.measures.sampling_bias_grid_check function to see sub-population pairwise sampling bias.
For warnings raised by this function, please see the documentation of the fatf.utils.data.tools.validate_indices_per_bin function.
- Parameters
- dataset, column_index, groupings, numerical_bins_number, and treat_as_categorical
These parameters are described in the documentation of the fatf.utils.data.tools.group_by_column function and are used to define a grouping (i.e. sub-populations). If you have your own index-based grouping and would like to get counts and weights for cost-sensitive learning, please consider using the fatf.accountability.data.measures.sampling_bias_indexed function.
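As an illustration of what an index-based grouping and its per-group counts might look like, here is a minimal NumPy-only sketch. It does not call fatf; the helper name group_indices_by_value and the toy data are hypothetical, and the exact structure that fatf expects for its grouping may differ.

```python
import numpy as np


def group_indices_by_value(column):
    """Map each distinct value in ``column`` to the row indices holding it.

    Hypothetical helper for illustration only; it is not part of fatf.
    """
    values = np.unique(column)  # distinct values, sorted
    return {str(v): np.flatnonzero(column == v).tolist() for v in values}


# Toy categorical feature; the values are made up for this example.
column = np.array(['a', 'b', 'a', 'c', 'a', 'b'])

indices_per_group = group_indices_by_value(column)
# One count per sub-population, in the same order as the groups.
counts = [len(rows) for rows in indices_per_group.values()]

print(indices_per_group)
print(counts)
```

Each sub-population is simply a list of row indices, so the count for a sub-population is the length of its list.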
- Returns
- counts : List[integers]
The number of data points in each sub-population defined by partitioning of the selected feature.
- weights : numpy.ndarray
A weight for every instance (that could be grouped, i.e. assigned to one of the sub-populations) in the input dataset. The weights are useful for training a cost-sensitive classifier to mitigate the sampling bias. Each weight is inversely proportional to the number of instances in the sub-population that the instance belongs to.
- bin_names : List[strings]
The name of every sub-population (binning results) defined by the feature ranges for a numerical feature and feature value sets for a categorical feature.
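To make "inversely proportional" concrete, the following NumPy-only sketch assigns each grouped instance the weight 1 / (size of its sub-population). It is an assumption-laden illustration rather than fatf's implementation: the function name inverse_frequency_weights is hypothetical, the use of NaN for instances that belong to no sub-population is a choice made here, and fatf may scale or normalise its weights differently.

```python
import numpy as np


def inverse_frequency_weights(indices_per_group, n_rows):
    """Give every grouped row the weight 1 / (size of its group).

    Hypothetical helper for illustration only; it is not part of fatf.
    Rows that appear in no group keep NaN (an assumption of this sketch).
    """
    weights = np.full(n_rows, np.nan)
    for rows in indices_per_group:
        if rows:  # skip empty groups to avoid dividing by zero
            weights[rows] = 1.0 / len(rows)
    return weights


# Three sub-populations of sizes 3, 2 and 1; row 6 is left ungrouped.
groups = [[0, 2, 4], [1, 5], [3]]
weights = inverse_frequency_weights(groups, n_rows=7)

# With this scaling the weights within each group sum to one, so every
# sub-population contributes equally regardless of how many rows it has.
group_totals = [float(weights[rows].sum()) for rows in groups]
print(weights)
print(group_totals)
```

Weights of this kind can typically be passed to an estimator that accepts per-sample weights, so that under-represented sub-populations are not dominated by larger ones during training.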