Using Grouping to Evaluate Robustness of Data and Models¶
In this tutorial, we show how data grouping can be used to evaluate bias – from the accountability perspective – of a data set (sampling bias) and of a predictive model (systematic performance bias). The former can help us determine whether defined sub-populations are well represented in a data set – similar to the data fairness consideration in the previous tutorial. The latter can help us identify sub-populations in a data set for which a predictive model under-performs – similar to the model fairness discussion in the previous tutorial.
First, we need to load numpy:
>>> import numpy as np
Now, let us load and prepare the Iris data set:
>>> import fatf.utils.data.datasets as fatf_datasets
>>> iris_data_dict = fatf_datasets.load_iris()
>>> iris_data = iris_data_dict['data']
>>> iris_target = iris_data_dict['target'].astype(int)
Note
For more information about the Iris data set and its structure, please refer to the Exploring the Grouping Concept – Defining Sub-Populations tutorial or the data set description on the UCI repository website.
Grouping the Data Set¶
For the purpose of this tutorial, we will group the data set based on its third feature:
>>> iris_feature_names = iris_data_dict['feature_names']
>>> selected_feature_index = 2
>>> iris_feature_names[selected_feature_index]
'petal length (cm)'
Now, let us assume that, for some unknown reason, there are two important split values on this feature: 2.5 and 4.75:
>>> import fatf.utils.data.tools as fatf_data_tools
>>> selected_feature_groups = [2.5, 4.75]
>>> selected_feature_grouping = fatf_data_tools.group_by_column(
...     iris_data,
...     selected_feature_index,
...     groupings=selected_feature_groups)
>>> selected_feature_grouping[1]
['x <= 2.5', '2.5 < x <= 4.75', '4.75 < x']
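The first element returned by group_by_column holds the row indices assigned to each of the three groups, while the second element holds the human-readable group definitions printed above. As a quick sanity check, the three index lists should together cover all 150 instances of the Iris data set:
>>> len(selected_feature_grouping[0])
3
>>> sum(len(indices) for indices in selected_feature_grouping[0])
150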
Sampling Bias¶
Given these two important splits we can now inspect these groupings and see whether we have a comparable number of data points in each of them:
>>> len(selected_feature_grouping[0][0])
50
>>> len(selected_feature_grouping[0][1])
45
>>> len(selected_feature_grouping[0][2])
55
The number of data points seems to be (roughly) equally distributed across the sub-populations. The only pair of sub-populations which may indicate a sampling bias is the second and the third one: 2.5 < petal length (cm) <= 4.75 and 4.75 < petal length (cm). For completeness, let us use the fatf.accountability.data.measures.sampling_bias_grid_check function:
>>> import fatf.accountability.data.measures as fatf_accountability_data
>>> counts_per_grouping = [len(i) for i in selected_feature_grouping[0]]
>>> fatf_accountability_data.sampling_bias_grid_check(counts_per_grouping)
array([[False, False, False],
       [False, False,  True],
       [False,  True, False]])
As expected, the only pair of sub-populations violating the sampling bias criterion with the default threshold of 0.8 is the one formed by the sub-populations with indices 1 and 2, i.e., 2.5 < petal length (cm) <= 4.75 and 4.75 < petal length (cm).
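To build some intuition for what such a grid check may compute, the sketch below reproduces the grid above directly from the group counts using plain numpy. It assumes, purely for illustration and not necessarily matching the exact rule implemented by sampling_bias_grid_check, that a pair of sub-populations is flagged when the relative difference between their counts (with respect to the smaller count) exceeds 1 minus the threshold; the variable names are ours:
>>> counts = np.array([50, 45, 55])
>>> threshold = 0.8
>>> pairwise_difference = np.abs(counts[:, np.newaxis] - counts[np.newaxis, :])
>>> pairwise_minimum = np.minimum(counts[:, np.newaxis], counts[np.newaxis, :])
>>> pairwise_difference / pairwise_minimum > (1 - threshold)
array([[False, False, False],
       [False, False,  True],
       [False,  True, False]])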
Note
Please note that the same result can be achieved without grouping the data manually. To this end, you may use the fatf.accountability.data.measures.sampling_bias function, which internally groups the data based on the specified feature index. The Measuring Robustness of a Data Set – Sampling Bias code example shows how to use it.
Systematic Performance Bias¶
Before we can evaluate the robustness of a model, we first need one trained on the Iris data set:
>>> import fatf.utils.models as fatf_models
>>> clf = fatf_models.KNN()
>>> clf.fit(iris_data, iris_target)
We also need predictions of this model on a data set that we will use to evaluate its robustness; in this case we will use the training data:
>>> iris_pred = clf.predict(iris_data)
Before we can compute any performance metric, let us get confusion matrices for each sub-population:
>>> import fatf.utils.metrics.tools as fatf_metrics_tools
>>> grouping_cm = fatf_metrics_tools.confusion_matrix_per_subgroup_indexed(
...     selected_feature_grouping[0],
...     iris_target,
...     iris_pred,
...     labels=np.unique(iris_target).tolist())
Note
The above function call will generate 2 warnings:
UserWarning: Some of the given labels are not present in either of the input arrays: {1, 2}.
UserWarning: Some of the given labels are not present in either of the input arrays: {0}.
These appear because, for some of the sub-populations, the ground truth (target) and the prediction vectors may hold only a single label, therefore the confusion matrix calculator is not aware of the remaining classes and has to resort to using the labels specified via the labels parameter. Printing the unique target and prediction values of the first sub-population shows exactly this scenario happening:
>>> np.unique(iris_target[selected_feature_grouping[0][0]])
array([0])
>>> np.unique(iris_pred[selected_feature_grouping[0][0]])
array([0])
This happens as the selected feature – petal length (cm) – is a very good predictor of the first class. For more details you may want to have a look at the data transparency section of the grouping tutorial, where this feature is explained in relation to the ground truth using the data description functionality of this package.
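Since the first sub-population holds 50 instances, all labelled 0, and the Iris data set contains exactly 50 instances of class 0, this group coincides with that class. A quick check (with illustrative variable names of our own) confirms it:
>>> first_group_indices = np.sort(selected_feature_grouping[0][0])
>>> first_class_indices = np.where(iris_target == 0)[0]
>>> np.array_equal(first_group_indices, first_class_indices)
True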
With confusion matrices for every grouping we can generate any performance metric. For the purposes of this tutorial let us look at accuracy:
>>> import fatf.utils.metrics.metrics as fatf_metrics
>>> group_0_acc = fatf_metrics.accuracy(grouping_cm[0])
>>> group_0_acc
1.0
>>> group_1_acc = fatf_metrics.accuracy(grouping_cm[1])
>>> group_1_acc
0.9777777777777777
>>> group_2_acc = fatf_metrics.accuracy(grouping_cm[2])
>>> group_2_acc
0.9090909090909091
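Assuming that each of these confusion matrices is a plain numpy array with true classes along one axis and predicted classes along the other, the accuracy is simply the sum of the diagonal divided by the total number of instances in the group. For example, the accuracy of the third sub-population can be reproduced by hand:
>>> float(np.trace(grouping_cm[2]) / np.sum(grouping_cm[2]))
0.9090909090909091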
The accuracy seems to be comparable across sub-populations. Clearly, none of the sub-populations defined on the petal length feature suffers from a performance bias as measured by accuracy. For completeness, let us test for systematic performance bias with the fatf.accountability.models.measures.systematic_performance_bias_grid function:
>>> import fatf.accountability.models.measures as fatf_accountability_models
>>> fatf_accountability_models.systematic_performance_bias_grid(
...     [group_0_acc, group_1_acc, group_2_acc])
array([[False, False, False],
       [False, False, False],
       [False, False, False]])
As expected, there is no systematic performance bias for these sub-populations given the predictive model at hand.
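For context, we can also compute the largest pairwise accuracy gap by hand. At roughly 0.09 (between the first and the third sub-population) it is small, which is consistent with the all-False grid above; the variable names below are ours and not part of the package:
>>> sub_population_accuracies = [group_0_acc, group_1_acc, group_2_acc]
>>> pairwise_gaps = [abs(i - j)
...                  for i in sub_population_accuracies
...                  for j in sub_population_accuracies]
>>> round(max(pairwise_gaps), 2)
0.09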
Note
In this part of the tutorial we used the fatf.utils.metrics.tools.confusion_matrix_per_subgroup_indexed function to get a confusion matrix for each of the sub-populations and used these to compute the corresponding accuracies. All of these steps are combined by the fatf.utils.metrics.subgroup_metrics.performance_per_subgroup function, making the task of evaluating systematic performance bias easier. An example of how to use this function can be found in the Measuring Robustness of a Predictive Model – Systematic Performance Bias code example.
In this tutorial we saw how to use data grouping to evaluate important accountability aspects of data sets and predictive models. This tutorial concludes the series of tutorials focused on data grouping. In the next ones we move on to the transparency of predictive models (Explaining a Machine Learning Model: ICE and PD) and of individual predictions (Explaining Machine Learning Predictions: LIME and Counterfactuals). For data set transparency, please refer to the last section of the Exploring the Grouping Concept – Defining Sub-Populations tutorial.
Relevant FAT Forensics Examples¶
The following examples provide more structured and code-focused use cases of group-based data and model inspection to evaluate their accountability: