Using Grouping to Evaluate Robustness of Data and Models

In this tutorial, we show how data grouping can be used to evaluate bias – from the accountability perspective – of a data set (sampling bias) and a predictive model (systematic performance bias). The former can help us determine whether defined sub-populations are well represented in a data set – similar to the data fairness consideration in the previous tutorial. The latter can help us identify sub-populations in a data set for which a predictive model under-performs – similar to the model fairness discussion in the previous tutorial.

First, we need to load numpy:

>>> import numpy as np

Now, let us load and prepare the Iris data set:

>>> import fatf.utils.data.datasets as fatf_datasets

>>> iris_data_dict = fatf_datasets.load_iris()
>>> iris_data = iris_data_dict['data']
>>> iris_target = iris_data_dict['target'].astype(int)

Note

For more information about the Iris data set and its structure, please refer to the Exploring the Grouping Concept – Defining Sub-Populations tutorial or the data set description on the UCI repository website.

Grouping the Data Set

For the purpose of this tutorial we will group the data set based on its third feature:

>>> iris_feature_names = iris_data_dict['feature_names']

>>> selected_feature_index = 2
>>> iris_feature_names[selected_feature_index]
'petal length (cm)'

Now, let us assume that, for some unknown reason, there are two important split values on this feature: 2.5 and 4.75:

>>> import fatf.utils.data.tools as fatf_data_tools

>>> selected_feature_groups = [2.5, 4.75]
>>> selected_feature_grouping = fatf_data_tools.group_by_column(
...     iris_data,
...     selected_feature_index,
...     groupings=selected_feature_groups)
>>> selected_feature_grouping[1]
['x <= 2.5', '2.5 < x <= 4.75', '4.75 < x']
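
The selected_feature_grouping variable used above is a tuple whose first element holds the row indices of the data points falling into each sub-population and whose second element (printed above) holds human-readable names of these bins. As a minimal sketch, the data points belonging to the first bin can therefore be extracted as follows:

>>> # The first element of the grouping tuple holds the row indices of each
>>> # sub-population; here we extract the data points of the first bin.
>>> first_group_indices = selected_feature_grouping[0][0]
>>> first_group_data = iris_data[first_group_indices]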

Sampling Bias

Given these two important splits, we can now inspect the resulting groupings and see whether we have a comparable number of data points in each of them:

>>> len(selected_feature_grouping[0][0])
50
>>> len(selected_feature_grouping[0][1])
45
>>> len(selected_feature_grouping[0][2])
55

The number of data points seems to be (roughly) equal across the sub-populations. The only pair of sub-populations that may indicate a sampling bias is the second and the third one: 2.5 < petal length (cm) <= 4.75 and 4.75 < petal length (cm). For completeness, let us use the fatf.accountability.data.measures.sampling_bias_grid_check function:

>>> import fatf.accountability.data.measures as fatf_accountability_data

>>> counts_per_grouping = [len(i) for i in selected_feature_grouping[0]]
>>> fatf_accountability_data.sampling_bias_grid_check(counts_per_grouping)
array([[False, False, False],
       [False, False,  True],
       [False,  True, False]])

As expected, the only pair of sub-populations violating the sampling bias criterion with the default threshold of 0.8 is the one with indices 1 and 2, i.e., 2.5 < petal length (cm) <= 4.75 and 4.75 < petal length (cm).
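
To build an intuition for what such a grid check compares, we can also compute the pairwise ratios of the group sizes by hand; note that this is only an illustrative sketch and not necessarily the exact criterion implemented by the function:

>>> # Illustrative only: the ratio of the smaller to the larger count for every
>>> # pair of sub-populations (a sketch, not necessarily the library's check).
>>> counts_array = np.array(counts_per_grouping, dtype=float)
>>> pairwise_count_ratios = (np.minimum.outer(counts_array, counts_array)
...                          / np.maximum.outer(counts_array, counts_array))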

Note

Please note that the same result can be achieved without grouping the data manually. To this end, you may use the fatf.accountability.data.measures.sampling_bias function, which internally groups the data based on the specified feature index. The Measuring Robustness of a Data Set – Sampling Bias code example shows how to use it.

Systematic Performance Bias

Before we can evaluate the robustness of a model, we first need one trained on the Iris data set:

>>> import fatf.utils.models as fatf_models
>>> clf = fatf_models.KNN()
>>> clf.fit(iris_data, iris_target)

We also need predictions of this model on a data set that we will use to evaluate its robustness; in this case we will use the training data:

>>> iris_pred = clf.predict(iris_data)
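
Before looking at individual sub-populations, it may also be helpful to record the overall accuracy of the model on this data set as an informal point of reference; a minimal sketch:

>>> # Overall accuracy on the whole data set, used as an informal reference
>>> # point when inspecting the per-group performance below.
>>> overall_accuracy = np.mean(iris_pred == iris_target)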

Before we can compute any performance metric, let us get confusion matrices for each sub-population:

>>> import fatf.utils.metrics.tools as fatf_metrics_tools

>>> grouping_cm = fatf_metrics_tools.confusion_matrix_per_subgroup_indexed(
...     selected_feature_grouping[0],
...     iris_target,
...     iris_pred,
...     labels=np.unique(iris_target).tolist())

Note

UserWarning

The above function call will generate 2 warnings:

UserWarning: Some of the given labels are not present in either of the input arrays: {1, 2}.
UserWarning: Some of the given labels are not present in either of the input arrays: {0}.

These appear because for some of the sub-populations the ground truth (target) and prediction vectors may only hold a single label; the confusion matrix calculator is therefore not aware of the remaining labels and has to resort to the ones specified via the labels parameter. Printing the unique target and prediction values of the first sub-population shows exactly this scenario:

>>> np.unique(iris_target[selected_feature_grouping[0][0]])
array([0])
>>> np.unique(iris_pred[selected_feature_grouping[0][0]])
array([0])

This happens because the selected feature – petal length (cm) – is a very good predictor of the first class. For more details you may want to have a look at the data transparency section of the grouping tutorial, where this feature is examined in relation to the ground truth using the data description functionality of this package.
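
If you would like to verify this yourself, the following sketch collects the ground truth class distribution of every sub-population defined above:

>>> # For each sub-population, the unique ground truth classes it contains
>>> # together with their counts.
>>> class_distribution_per_group = [
...     np.unique(iris_target[indices], return_counts=True)
...     for indices in selected_feature_grouping[0]]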

With a confusion matrix for every sub-population we can compute any performance metric. For the purposes of this tutorial let us look at accuracy:

>>> import fatf.utils.metrics.metrics as fatf_metrics

>>> group_0_acc = fatf_metrics.accuracy(grouping_cm[0])
>>> group_0_acc
1.0
>>> group_1_acc = fatf_metrics.accuracy(grouping_cm[1])
>>> group_1_acc
0.9777777777777777
>>> group_2_acc = fatf_metrics.accuracy(grouping_cm[2])
>>> group_2_acc
0.9090909090909091
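
For reference, accuracy is simply the proportion of correctly classified instances, which for a confusion matrix corresponds to its trace divided by the sum of all of its entries. A minimal sketch reproducing the first of the values above:

>>> # Accuracy computed by hand: correctly classified instances (the diagonal
>>> # of the confusion matrix) divided by all instances in the sub-population.
>>> manual_group_0_acc = np.trace(grouping_cm[0]) / np.sum(grouping_cm[0])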

The accuracy seems to be comparable across the sub-populations. Clearly, none of the sub-populations defined on the petal length feature suffers from a performance bias as measured by accuracy. For completeness, let us test for systematic performance bias with the fatf.accountability.models.measures.systematic_performance_bias_grid function:

>>> import fatf.accountability.models.measures as fatf_accountability_models

>>> fatf_accountability_models.systematic_performance_bias_grid(
...     [group_0_acc, group_1_acc, group_2_acc])
array([[False, False, False],
       [False, False, False],
       [False, False, False]])

As expected, there is no systematic performance bias for these sub-populations given the predictive model at hand.
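
To see roughly what such a grid comparison corresponds to, the pairwise absolute differences between the per-group accuracies can be computed by hand; again, this is an illustrative sketch rather than the exact criterion used by the function:

>>> # Illustrative only: absolute pairwise differences between the per-group
>>> # accuracy values (a sketch, not necessarily the library's exact check).
>>> group_accuracies = np.array([group_0_acc, group_1_acc, group_2_acc])
>>> accuracy_differences = np.abs(
...     group_accuracies[:, np.newaxis] - group_accuracies[np.newaxis, :])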

Note

In this part of the tutorial we used the fatf.utils.metrics.tools.confusion_matrix_per_subgroup_indexed function to get a confusion matrix for each of the sub-populations and used these to compute the corresponding accuracies. All of these steps are combined by the fatf.utils.metrics.subgroup_metrics.performance_per_subgroup function, making the task of evaluating systematic performance bias easier. An example of how to use this function can be found in the Measuring Robustness of a Predictive Model – Systematic Performance Bias code example.


In this tutorial we saw how to use data grouping to evaluate important accountability aspects of data sets and predictive models. It concludes the series of tutorials focused on data grouping. In the following tutorials we move on to the transparency of predictive models (Explaining a Machine Learning Model: ICE and PD) and predictions (Explaining Machine Learning Predictions: LIME and Counterfactuals). For data set transparency please refer to the last section of the Exploring the Grouping Concept – Defining Sub-Populations tutorial.

Relevant FAT Forensics Examples

The following examples provide more structured and code-focused use cases of group-based data and model inspection to evaluate their accountability: