.. title:: Using Grouping to Evaluate Robustness of Data and Models

.. _tutorials_grouping_robustness:

Using Grouping to Evaluate Robustness of Data and Models
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. topic:: Tutorial Contents

   In this tutorial, we show how data grouping can be used to evaluate bias
   -- from the accountability perspective -- of a data set (*sampling bias*)
   and a predictive model (*systematic performance bias*). The former can
   help us determine whether defined sub-populations are well represented in
   a data set -- similar to the :ref:`data fairness ` consideration in the
   :ref:`previous tutorial `. The latter can help us identify sub-populations
   in a data set for which a predictive model under-performs -- similar to
   the :ref:`model fairness ` discussion in the :ref:`previous tutorial `.

First, we need to load numpy::

    >>> import numpy as np

Now, let us load and prepare the Iris data set::

    >>> import fatf.utils.data.datasets as fatf_datasets
    >>> iris_data_dict = fatf_datasets.load_iris()
    >>> iris_data = iris_data_dict['data']
    >>> iris_target = iris_data_dict['target'].astype(int)

.. note::
   For more information about the Iris data set and its structure, please
   refer to the :ref:`tutorials_grouping` tutorial or the
   `data set description`_ on the `UCI repository`_ website.

Grouping the Data Set
=====================

For the purpose of this tutorial we will group the data set based on its
third feature::

    >>> iris_feature_names = iris_data_dict['feature_names']
    >>> selected_feature_index = 2
    >>> iris_feature_names[selected_feature_index]
    'petal length (cm)'

Now, let us assume that, for some unknown reason, there are two important
split values on this feature: ``2.5`` and ``4.75``::

    >>> import fatf.utils.data.tools as fatf_data_tools
    >>> selected_feature_groups = [2.5, 4.75]
    >>> selected_feature_grouping = fatf_data_tools.group_by_column(
    ...     iris_data,
    ...     selected_feature_index,
    ...     groupings=selected_feature_groups)
    >>> selected_feature_grouping[1]
    ['x <= 2.5', '2.5 < x <= 4.75', '4.75 < x']

Sampling Bias
=============

Given these two important splits we can now inspect the groupings and see
whether we have a comparable number of data points in each of them::

    >>> len(selected_feature_grouping[0][0])
    50
    >>> len(selected_feature_grouping[0][1])
    45
    >>> len(selected_feature_grouping[0][2])
    55

The number of data points seems to be (roughly) evenly distributed across
the sub-populations. The only pair of sub-populations which may indicate a
*sampling bias* is the second and the third one:
``2.5 < petal length (cm) <= 4.75`` and ``4.75 < petal length (cm)``.

For completeness, let us use the
:func:`fatf.accountability.data.measures.sampling_bias_grid_check`
function::

    >>> import fatf.accountability.data.measures as fatf_accountability_data
    >>> counts_per_grouping = [len(i) for i in selected_feature_grouping[0]]
    >>> fatf_accountability_data.sampling_bias_grid_check(counts_per_grouping)
    array([[False, False, False],
           [False, False,  True],
           [False,  True, False]])

As expected, the only pair of sub-populations violating the *sampling bias*
criterion with the default threshold of ``0.8`` is the one with indices 1
and 2, i.e., ``2.5 < petal length (cm) <= 4.75`` and
``4.75 < petal length (cm)``.

.. note::
   Please note that the same result can be achieved without doing the data
   grouping manually. To this end, you may use the
   :func:`fatf.accountability.data.measures.sampling_bias` function, which
   internally groups the data based on the specified feature index. The
   :ref:`sphx_glr_sphinx_gallery_auto_accountability_xmpl_accountability_data_measure.py`
   code example shows how to use it.
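If you would like to double-check these group sizes by hand, they can be
reproduced with plain boolean masks over the selected column. The snippet
below is only an illustrative sketch -- it is not part of the package's API
-- and assumes that ``iris_data`` is a plain two-dimensional numerical numpy
array, as loaded above::

    >>> # Count the data points falling into each of the three intervals.
    >>> petal_length = iris_data[:, selected_feature_index]
    >>> int(np.sum(petal_length <= 2.5))
    50
    >>> int(np.sum((petal_length > 2.5) & (petal_length <= 4.75)))
    45
    >>> int(np.sum(petal_length > 4.75))
    55

The interval semantics (``x <= 2.5``, ``2.5 < x <= 4.75`` and ``4.75 < x``)
follow the group names returned by
:func:`fatf.utils.data.tools.group_by_column` earlier in this tutorial, hence
the counts agree with the lengths of the corresponding index lists.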
Systematic Performance Bias
===========================

Before we can evaluate robustness of a model, we first need one trained on
the Iris data set::

    >>> import fatf.utils.models as fatf_models
    >>> clf = fatf_models.KNN()
    >>> clf.fit(iris_data, iris_target)

We also need predictions of this model on a data set that we will use to
evaluate its robustness; in this case we will use the training data::

    >>> iris_pred = clf.predict(iris_data)

Before we can compute any performance metric, let us get confusion matrices
for each sub-population::

    >>> import fatf.utils.metrics.tools as fatf_metrics_tools
    >>> grouping_cm = fatf_metrics_tools.confusion_matrix_per_subgroup_indexed(
    ...     selected_feature_grouping[0],
    ...     iris_target,
    ...     iris_pred,
    ...     labels=np.unique(iris_target).tolist())

.. note:: UserWarning

   The above function call will generate 2 warnings::

       UserWarning: Some of the given labels are not present in either of
       the input arrays: {1, 2}.

       UserWarning: Some of the given labels are not present in either of
       the input arrays: {0}.

   These are raised because for some of the sub-populations the ground truth
   (target) and the prediction vectors may only hold a single label,
   therefore the confusion matrix calculator is not aware of the rest and
   has to resort to using the labels specified in the ``labels`` parameter.

   Printing the unique target and prediction values of the first
   sub-population shows exactly this scenario happening::

       >>> np.unique(iris_target[selected_feature_grouping[0][0]])
       array([0])
       >>> np.unique(iris_pred[selected_feature_grouping[0][0]])
       array([0])

   This happens because the selected feature -- petal length (cm) -- is a
   very good predictor of the first class. For more details you may want to
   have a look at the :ref:`data transparency section ` of the
   :ref:`grouping tutorial ` where this feature is explained in relation to
   the ground truth using the data description functionality of this
   package.

With confusion matrices for every grouping we can compute any performance
metric. For the purposes of this tutorial let us look at *accuracy*::

    >>> import fatf.utils.metrics.metrics as fatf_metrics
    >>> group_0_acc = fatf_metrics.accuracy(grouping_cm[0])
    >>> group_0_acc
    1.0
    >>> group_1_acc = fatf_metrics.accuracy(grouping_cm[1])
    >>> group_1_acc
    0.9777777777777777
    >>> group_2_acc = fatf_metrics.accuracy(grouping_cm[2])
    >>> group_2_acc
    0.9090909090909091

The accuracy seems to be comparable across sub-populations. Clearly, none of
the sub-populations defined on the petal length feature suffers from a
performance bias as measured by accuracy.

For completeness, let us test for the systematic performance bias with the
:func:`fatf.accountability.models.measures.systematic_performance_bias_grid`
function::

    >>> import fatf.accountability.models.measures as fatf_accountability_models
    >>> fatf_accountability_models.systematic_performance_bias_grid(
    ...     [group_0_acc, group_1_acc, group_2_acc])
    array([[False, False, False],
           [False, False, False],
           [False, False, False]])

As expected, there is no systematic performance bias for these
sub-populations given the predictive model at hand.
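As a quick cross-check, the same per-group accuracies can also be computed
directly from the target and prediction vectors, without building confusion
matrices. The snippet below is only an illustrative sketch -- not the
package's recommended workflow -- and uses nothing but numpy and the
variables defined earlier::

    >>> for group_indices in selected_feature_grouping[0]:
    ...     # Fraction of correctly predicted points in this sub-population.
    ...     correct = iris_target[group_indices] == iris_pred[group_indices]
    ...     print('{:.4f}'.format(np.mean(correct)))
    1.0000
    0.9778
    0.9091

The printed values agree (up to rounding) with the confusion-matrix-based
accuracies computed above.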
.. note::
   In this part of the tutorial we used the
   :func:`fatf.utils.metrics.tools.confusion_matrix_per_subgroup_indexed`
   function to get a confusion matrix for each of the sub-populations and
   used these to compute the corresponding accuracies. All of these steps
   are combined by the
   :func:`fatf.utils.metrics.subgroup_metrics.performance_per_subgroup`
   function, therefore making the task of evaluating systematic performance
   bias easier. An example of how to use this function can be found in the
   :ref:`sphx_glr_sphinx_gallery_auto_accountability_xmpl_accountability_models_measure.py`
   code example.

----

In this tutorial we saw how to use data grouping to evaluate important
accountability aspects of data sets and predictive models. This tutorial
concludes the series of tutorials focused on data grouping. In the next
tutorials we move on to the transparency of predictive models
(:ref:`tutorials_model_explainability`) and predictions
(:ref:`tutorials_prediction_explainability`). For data set transparency
please refer to the **last section** of the :ref:`tutorials_grouping`
tutorial.

Relevant FAT Forensics Examples
===============================

The following examples provide more structured and code-focused use cases of
group-based data and model inspection for evaluating their accountability:

* :ref:`sphx_glr_sphinx_gallery_auto_accountability_xmpl_accountability_data_measure.py`,
* :ref:`sphx_glr_sphinx_gallery_auto_accountability_xmpl_accountability_models_measure.py`.

.. _`data set description`: https://archive.ics.uci.edu/ml/datasets/iris
.. _`UCI repository`: https://archive.ics.uci.edu/ml/index.php