fatf.accountability.data.measures.sampling_bias(dataset, column_index, groupings=None, numerical_bins_number=5, treat_as_categorical=None)

Computes information needed for evaluating and remedying sampling bias.

Computes three things: the number of instances in each sub-population defined by the input parameters, the weights that can be used for cost-sensitive learning to mitigate the sampling bias, and the name of each sub-population (in terms of the selected feature and its values).


To evaluate the sampling bias as a binary True/False answer, use the fatf.accountability.data.measures.sampling_bias_check function. To see the pairwise sampling bias between sub-populations, use the fatf.accountability.data.measures.sampling_bias_grid_check function.

For warnings raised by this function, please see the documentation of the fatf.utils.data.tools.validate_indices_per_bin function.

dataset, column_index, groupings, numerical_bins_number, and treat_as_categorical

These parameters are described in the documentation of the fatf.utils.data.tools.group_by_column function and are used to define a grouping (i.e., the sub-populations). If you have your own index-based grouping and would like to get counts and weights for cost-sensitive learning, consider using the fatf.accountability.data.measures.sampling_bias_indexed function instead.


The number of data points in each sub-population defined by partitioning the selected feature.


A weight for every instance in the input dataset that could be grouped, i.e., assigned to one of the sub-populations. The weights are useful for training a cost-sensitive classifier to mitigate the sampling bias. Each instance's weight is inversely proportional to the number of instances in its sub-population.
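To make the relationship between counts and weights concrete, here is a minimal NumPy sketch that does not call fatf. The helper name counts_and_weights, the plain 1 / count weighting, and the use of NaN for ungrouped instances are assumptions made for illustration; the library may scale or normalise its weights differently.

```python
import numpy as np


def counts_and_weights(indices_per_bin, n_instances):
    """Hypothetical helper: per-bin counts and inverse-frequency weights."""
    # One count per sub-population (bin).
    counts = [len(indices) for indices in indices_per_bin]
    # NaN marks instances that were not assigned to any sub-population
    # (an assumption of this sketch, not documented library behaviour).
    weights = np.full(n_instances, np.nan)
    for indices, count in zip(indices_per_bin, counts):
        if count:  # skip empty bins to avoid dividing by zero
            weights[indices] = 1.0 / count
    return counts, weights


# Six instances: four in the first sub-population, two in the second.
groups = [[0, 1, 2, 3], [4, 5]]
counts, weights = counts_and_weights(groups, n_instances=6)
```

With this weighting the weights within each sub-population sum to 1, so an under-represented group contributes as much to a weighted loss as an over-represented one.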


The name of every sub-population (binning results) defined by the feature ranges for a numerical feature and feature value sets for a categorical feature.
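As a rough illustration of how a numerical feature could be partitioned into named ranges, the sketch below uses equal-width bins built with NumPy. The helper name bin_numerical, the equal-width strategy, and the exact label format are assumptions chosen for this example rather than the behaviour of fatf.utils.data.tools.group_by_column.

```python
import numpy as np


def bin_numerical(values, bins_number=5):
    """Hypothetical helper: equal-width bins with range-style names."""
    if bins_number < 2:
        raise ValueError('This sketch needs at least two bins.')
    values = np.asarray(values, dtype=float)
    # Interior bin boundaries only; the outer bins are open-ended.
    edges = np.linspace(values.min(), values.max(), bins_number + 1)[1:-1]
    # With right=True, bin i holds values with edges[i-1] < x <= edges[i].
    bin_ids = np.digitize(values, edges, right=True)
    indices_per_bin = [
        np.flatnonzero(bin_ids == i).tolist() for i in range(bins_number)
    ]
    # Illustrative label format; the library's names may differ.
    names = ['x <= {:g}'.format(edges[0])]
    names += [
        '{:g} < x <= {:g}'.format(low, high)
        for low, high in zip(edges[:-1], edges[1:])
    ]
    names.append('{:g} < x'.format(edges[-1]))
    return indices_per_bin, names


feature = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
indices_per_bin, bin_names = bin_numerical(feature, bins_number=5)
```

For a categorical feature the analogous step would group row indices by value (or by a set of values) and use those values as the sub-population names.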
