# fatf.accountability.data.measures.sampling_bias

fatf.accountability.data.measures.sampling_bias(dataset: numpy.ndarray, column_index: Union[int, str], groupings: Optional[List[Union[float, Tuple[str]]]] = None, numerical_bins_number: int = 5, treat_as_categorical: Optional[bool] = None) → Tuple[List[int], numpy.ndarray, List[str]]

Computes information needed for evaluating and remedying sampling bias.

Computes three things: the number of instances in each sub-population defined by the input parameters, per-instance weights that can be used for cost-sensitive learning to mitigate the sampling bias, and the name of each sub-population (expressed in terms of the selected feature and its values).

Note

For a binary True/False evaluation of sampling bias, use the fatf.accountability.data.measures.sampling_bias_check function. To check sampling bias between pairs of sub-populations, use the fatf.accountability.data.measures.sampling_bias_grid_check function.

For the warnings raised by this function, see the documentation of the fatf.utils.data.tools.validate_indices_per_bin function.

Parameters
dataset, column_index, groupings, numerical_bins_number, and treat_as_categorical

These parameters are described in the documentation of the fatf.utils.data.tools.group_by_column function and are used to define a grouping (i.e., sub-populations). If you have your own index-based grouping and would like to get counts and weights for cost-sensitive learning, consider using the fatf.accountability.data.measures.sampling_bias_indexed function instead.

Returns
counts : List[integers]

The number of data points in each sub-population defined by partitioning the selected feature.

weights : numpy.ndarray

A weight for every instance in the input dataset that could be grouped (i.e., assigned to one of the sub-populations). The weights are useful for training a cost-sensitive classifier to mitigate the sampling bias. Each instance's weight is inversely proportional to the number of instances in its sub-population.

bin_names : List[strings]

The name of every sub-population (i.e., the binning result), defined by feature ranges for a numerical feature and by sets of feature values for a categorical feature.
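
Examples

The relationship between the three returned values can be illustrated with a minimal NumPy-only sketch for a single categorical column. This is not the library's implementation: the helper name `inverse_frequency_weights` is hypothetical, the normalisation (weights rescaled to sum to 1) is an assumption, and the group names here are simply the raw feature values, which may differ from the bin names that fatf produces.

```python
import numpy as np


def inverse_frequency_weights(column):
    """Return (counts, weights, names) for a 1-D categorical column.

    Hypothetical helper for illustration only; not part of fatf.
    """
    # Unique values, the group index of every instance, and the group sizes.
    names, group_index, counts = np.unique(
        column, return_inverse=True, return_counts=True)
    # Each instance gets a weight proportional to 1 / (size of its group).
    weights = 1.0 / counts[group_index]
    # Assumed normalisation: rescale so that all weights sum to 1.
    weights = weights / weights.sum()
    return counts.tolist(), weights, names.tolist()


column = np.array(['a', 'b', 'a', 'a', 'b', 'c'])
counts, weights, names = inverse_frequency_weights(column)
```

With this weighting, instances from smaller sub-populations receive larger weights, so every sub-population contributes the same total weight. An array of this kind can typically be passed as per-sample weights to a classifier that supports them.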