.. rst-class:: sphx-glr-example-title

.. _sphx_glr_sphinx_gallery_auto_fairness_xmpl_fairness_data_measure.py:


================================
Measuring Fairness of a Data Set
================================

This example illustrates how to find unfair rows in a data set using the
:func:`fatf.fairness.data.measures.systemic_bias` function and how to check
whether each class is distributed equally between values of a selected feature,
i.e. measuring the Sample Size Disparity (with the
:func:`fatf.utils.data.tools.group_by_column` function).

.. note::
   Please note that this example uses a data set that is represented as a
   structured numpy array, which supports mixed data types among columns, with
   the features (columns) being indexed by the feature name rather than by
   consecutive integers.


.. code-block:: default

    # Author: Kacper Sokol
    # License: new BSD

    from pprint import pprint

    import numpy as np

    import fatf.utils.data.datasets as fatf_datasets

    import fatf.fairness.data.measures as fatf_dfm

    import fatf.utils.data.tools as fatf_data_tools

    print(__doc__)

    # Load data
    hr_data_dict = fatf_datasets.load_health_records()
    hr_X = hr_data_dict['data']
    hr_y = hr_data_dict['target']
    hr_feature_names = hr_data_dict['feature_names']
    hr_class_names = hr_data_dict['target_names']


Systemic Bias
-------------

Before we proceed, we need to select which features are **protected**, i.e.
which ones are illegal to use when generating the prediction. We use them to
see whether the data set contains rows that differ in some of the protected
features and the labels (ground truth) but not in the rest of the features.

The example presented below is rather naive as we do not have access to a more
complicated data set within the FAT Forensics package.
To demonstrate the functionality, we mark all but one feature as protected;
hence we are guaranteed to find quite a few unfair rows in the health records
data set. This means that "unfair" data rows are the ones that have the same
value of the *diagnosis* feature (with the rest of the feature values being
unimportant) and differ in their target (ground truth) value.

Systemic bias is expressed here as a square boolean matrix (numpy array) whose
side length equals the number of rows in the data array. Each element of this
matrix indicates whether the two rows of the data array identified by that
element's row and column indices violate the aforementioned fairness criterion.


.. code-block:: default

    # Select which features should be treated as protected
    protected_features = [
        'name', 'email', 'age', 'weight', 'gender', 'zipcode', 'dob'
    ]

    # Compute the data fairness matrix
    data_fairness_matrix = fatf_dfm.systemic_bias(hr_X, hr_y,
                                                  protected_features)

    # Check if the data set is unfair (at least one unfair pair of data points)
    is_data_unfair = fatf_dfm.systemic_bias_check(data_fairness_matrix)

    # Identify which pairs of indices cause the unfairness
    # (the matrix is symmetric, so keep each unordered pair only once)
    unfair_pairs_tuple = np.where(data_fairness_matrix)
    unfair_pairs = []
    for i, j in zip(*unfair_pairs_tuple):
        pair_a, pair_b = (i, j), (j, i)
        if pair_a not in unfair_pairs and pair_b not in unfair_pairs:
            unfair_pairs.append(pair_a)

    # Print out whether the fairness condition is violated
    if is_data_unfair:
        unfair_n = len(unfair_pairs)
        # Pick verb and noun forms that agree with the number of pairs
        unfair_fill = ('is', '', 's') if unfair_n == 1 else ('are', 's', '')
        print('\nThere {} {} pair{} of data points that violate{} the fairness '
              'criterion.\n'.format(unfair_fill[0], unfair_n, unfair_fill[1],
                                    unfair_fill[2]))

        # Show the first pair of violating rows
        pprint(hr_X[[unfair_pairs[0][0], unfair_pairs[0][1]]])
    else:
        print('The data set is fair.\n')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    There are 26 pairs of data points that violate the fairness criterion.

    array([('Heidi Mitchell', 'uboyd@hotmail.com', 74, 52, 'female', '1121', 'cancer', '03/06/2018'),
           ('Kimberly Kent', 'wilsoncarla@mitchell-gree', 63, 51, 'male', '2003', 'cancer', '16/06/2017')],
          dtype=[('name', '
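
The pairwise criterion used in the Systemic Bias section can also be sketched
in plain NumPy. The snippet below is an illustration only and is not the FAT
Forensics implementation: the helper ``pairwise_bias_matrix`` and the toy data
are hypothetical, and the sketch assumes that a pair of rows is flagged
whenever the rows agree on every unprotected feature and differ in their
label.

.. code-block:: python

    import numpy as np


    def pairwise_bias_matrix(data, labels, protected):
        """Flag row pairs that share all unprotected values but not the label.

        Hypothetical helper written for illustration; ``data`` is assumed to be
        a structured numpy array and ``protected`` a list of its field names.
        """
        unprotected = [name for name in data.dtype.names
                       if name not in protected]
        n_rows = data.shape[0]

        # Start from "every pair matches" and rule pairs out feature by feature.
        same_unprotected = np.ones((n_rows, n_rows), dtype=bool)
        for name in unprotected:
            column = data[name]
            same_unprotected &= column[:, np.newaxis] == column[np.newaxis, :]

        different_label = labels[:, np.newaxis] != labels[np.newaxis, :]
        return same_unprotected & different_label


    # Toy structured array with made-up values (not the health records data).
    toy_X = np.array(
        [(30, 'female', 'flu'), (41, 'male', 'flu'),
         (30, 'male', 'cold'), (55, 'female', 'flu')],
        dtype=[('age', 'i4'), ('gender', 'U6'), ('diagnosis', 'U4')])
    toy_y = np.array(['pass', 'fail', 'fail', 'pass'])

    toy_matrix = pairwise_bias_matrix(toy_X, toy_y, ['age', 'gender'])

    # The matrix is symmetric, so the part above the diagonal lists each
    # unordered pair exactly once.
    toy_pairs = [(int(i), int(j))
                 for i, j in zip(*np.nonzero(np.triu(toy_matrix, k=1)))]
    print(toy_pairs)  # prints [(0, 1), (1, 3)]

Restricting the search to the upper triangle with ``np.triu`` is one way to
avoid reporting both ``(i, j)`` and ``(j, i)`` without an explicit
de-duplication loop.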