.. title:: Defining Sub-Groups in Your Data .. _tutorials_grouping: Exploring the Grouping Concept -- Defining Sub-Populations ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ .. topic:: Tutorial Contents In this tutorial, we introduce the concept of sub-populations that is very useful in a number of :ref:`Fairness ` and :ref:`Accountability `. The idea is to find indices of data points (rows) or the data points themselves that form separate subpopulations, which may be underrepresented in the training data or disparately treated by a predictive model. As always, we start by importing FAT Forensics inside a Python interpreter:: $ python >>> import fatf We also import numpy and pretty print (``pprint``) as we will need these for the tutorial:: >>> import numpy as np >>> from pprint import pprint There are two different types of features that our data can be grouped on: * categorical -- usually string-based or numerical with a finite number of possible values with groups being defined by *disjoint* sub-sets of that feature space, or * numerical -- integer- or floating point-based with the groups defined by ranges on the feature space. Data Grouping ============= Grouping Based on a Numerical Feature ------------------------------------- The very basic grouping that one may desire is based on the unique values of the *target vector* for a classification task. Such grouping can be useful when inspecting the distribution of certain features between different classes. Before we continue, let us load the iris data set, which we will use for this tutorial:: >>> import fatf.utils.data.datasets as fatf_datasets >>> iris_data_dict = fatf_datasets.load_iris() >>> iris_data = iris_data_dict['data'] The *target array* can be accessed via the ``'target'`` key in the ``iris_data_dict`` dictionary:: >>> iris_target = iris_data_dict['target'].astype(int) >>> iris_target.shape (150,) Since the ``iris_target`` vector is 1-dimensional and the grouping function (:func:`fatf.utils.data.tools.group_by_column`) requires a data set, i.e. a 2-dimensional numpy array, we need to reshape it to ``(150, 1)`` first:: >>> iris_target_2d = iris_target.reshape((150, 1)) >>> iris_target_2d.shape (150, 1) Now, that we have the target array in the right shape, let us see what are all the unique values that it holds:: >>> np.unique(iris_target_2d) array([0, 1, 2]) Since the target array is a numerical array, when we use the :func:`fatf.utils.data.tools.group_by_column` function, it will infer that the type of the column is *numerical* and treat it as such, which will lead to binning the target vector rather than grouping it based on its unique values (the second parameter in the function call below indicates that we want to perform the grouping based on the 1st column, which index is ``0``):: >>> import fatf.utils.data.tools as fatf_data_tools >>> target_grouping_num = fatf_data_tools.group_by_column(iris_target_2d, 0) .. module:: fatf.utils.data.tools :noindex: The :func:`group_by_column` function returns a 2-tuple with the first element being the grouping (row indices) and the second being group names. Let us see what groups we got:: >>> target_grouping_num[1] ['x <= 0.4', '0.4 < x <= 0.8', '0.8 < x <= 1.2000000000000002', '1.2000000000000002 < x <= 1.6', '1.6 < x'] As expected, the grouping is numerical with 5 bind (the default). To get the desired binning -- three groups, one with 0s, one with 1s and the last one with 2s -- we can either modify the bin boundaries or force the :func:`group_by_column` function to treat this numerical vector as a categorical one. Let us start with the custom numerical binning:: >>> target_grouping_num_custom = fatf_data_tools.group_by_column( ... iris_target_2d, 0, groupings=[0.5, 1.5]) >>> target_grouping_num_custom[1] ['x <= 0.5', '0.5 < x <= 1.5', '1.5 < x'] This binning should create 3 bins, each with 50 indices; the first one with indices between 0 and 49, the second one with indices between 50 and 99 and the last one with indices between 100 and 149 as this is how the labels are distributed (they are ordered, to see this you can inspect the ``iris_target`` variable):: >>> len(target_grouping_num_custom[0]) 3 >>> target_grouping_num_custom[0][0] [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49] >>> target_grouping_num_custom[0][1] [50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99] >>> target_grouping_num_custom[0][2] [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149] Grouping Based on a Numerical Feature -- Treat as a Categorical --------------------------------------------------------------- Instead of this elaborate workaround we can simply setting the ``treat_as_categorical`` parameter to ``True`` and the numerical vector will be treated as a categorical one, therefore grouping based on the unique values in that vector:: >>> target_grouping_num_cat = fatf_data_tools.group_by_column( ... iris_target_2d, 0, treat_as_categorical=True) >>> target_grouping_num_cat[1] ['(0,)', '(1,)', '(2,)'] Therefore, the first group has indices of the target value equal to 1, etc. Now, let us see whether the grouping is the same as the one that we got in ``target_grouping_num_custom``:: >>> np.array_equal(target_grouping_num_custom[0][0], ... np.sort(target_grouping_num_cat[0][0])) True >>> np.array_equal(target_grouping_num_custom[0][1], ... np.sort(target_grouping_num_cat[0][1])) True >>> np.array_equal(target_grouping_num_custom[0][2], ... np.sort(target_grouping_num_cat[0][2])) True You may wonder why the groups are defined as tuples with just one element, i.e. ``'(0,)'``. This is because the categorical groupings can be customised to include multiple values in a single group. To see how that works, let us group ``0`` and ``1`` together with the other group including just ``2``. To this end, we use the ``groupings`` parameter again but this time in a manner that is recognised by a categorical grouping:: >>> target_grouping_num_cat_custom = fatf_data_tools.group_by_column( ... iris_target_2d, ... 0, ... treat_as_categorical=True, ... groupings=[(0, 1), (2, )]) >>> target_grouping_num_cat_custom[1] ['(0, 1)', '(2,)'] That looks promising, let us see whether the grouping is correct:: >>> np.array_equal(np.sort(target_grouping_num_cat_custom[0][0]), ... list(range(100))) True >>> np.array_equal(np.sort(target_grouping_num_cat_custom[0][1]), ... list(range(100, 150))) True .. _tutorials_grouping_data_transparency: Using Grouping to Inspect a Data Set -- Data Transparency ========================================================= Now that we know how to create groupings of data sets we can use this knowledge to inspect transparency of the Iris data set. First, let us choose one of the features -- ``'petal length (cm)'`` with index ``2`` -- and see how it is distributed for the whole data set:: >>> arbitrary_feature_index = 2 >>> arbitrary_feature_name = \ ... iris_data_dict['feature_names'][arbitrary_feature_index] >>> arbitrary_feature_name 'petal length (cm)' Let us see how this feature is distributed for the whole data set using the :func:`fatf.transparency.data.describe_functions.describe_array` function:: >>> import fatf.transparency.data.describe_functions as fatf_describe_data >>> petal_length_desc = fatf_describe_data.describe_array(iris_data[:, 2]) >>> pprint(petal_length_desc) {'25%': 1.600000023841858, '50%': 4.3500001430511475, '75%': 5.099999904632568, 'count': 150, 'max': 6.9, 'mean': 3.7580001, 'min': 1.0, 'nan_count': 0, 'std': 1.7594041} Therefore, we have 150 data points with values ranging between ``1.0`` and ``6.9``. The mean of this feature for the whole data set is ``3.76`` while the median is ``4.35``. Let us compare it against the distribution of this feature for each of the three classes. To this end, we first need to split the Iris data set using the grouping that we created in the previous section:: >>> petal_length_class_0 = iris_data[target_grouping_num_cat[0][0], ... arbitrary_feature_index] >>> petal_length_class_1 = iris_data[target_grouping_num_cat[0][1], ... arbitrary_feature_index] >>> petal_length_class_2 = iris_data[target_grouping_num_cat[0][2], ... arbitrary_feature_index] Now, we can create a description of the ``'petal length (cm)'`` feature for each class separately:: >>> petal_length_class_0_desc = fatf_describe_data.describe_array( ... petal_length_class_0) >>> print(iris_data_dict['target_names'][0]) setosa >>> pprint(petal_length_class_0_desc) {'25%': 1.399999976158142, '50%': 1.5, '75%': 1.5750000178813934, 'count': 50, 'max': 1.9, 'mean': 1.462, 'min': 1.0, 'nan_count': 0, 'std': 0.1719186} >>> petal_length_class_1_desc = fatf_describe_data.describe_array( ... petal_length_class_1) >>> print(iris_data_dict['target_names'][1]) versicolor >>> pprint(petal_length_class_1_desc) {'25%': 4.0, '50%': 4.3500001430511475, '75%': 4.599999904632568, 'count': 50, 'max': 5.1, 'mean': 4.26, 'min': 3.0, 'nan_count': 0, 'std': 0.46518815} >>> petal_length_class_2_desc = fatf_describe_data.describe_array( ... petal_length_class_2) >>> print(iris_data_dict['target_names'][2]) virginica >>> pprint(petal_length_class_2_desc) {'25%': 5.099999904632568, '50%': 5.549999952316284, '75%': 5.8750001192092896, 'count': 50, 'max': 6.9, 'mean': 5.552, 'min': 4.5, 'nan_count': 0, 'std': 0.54634786} The results displayed above show that the ``'petal length (cm)'`` feature is a good predictor of the iris plant type. The *setosa* type has petals of length between 1 and 1.9 centimeters, whereas the petals of the *versicolor* type are between 3 and 5.1 centimeters and the *virginica* type has petals in the range of 4.5 and 6.9 centimeters. This tells us that petal length can help us to distinguish between the *setosa* type and the other two types, however by just using this feature we cannot tell apart the *versicolor* and *virginica* types. ---- Getting all these insights was possible because of a data grouping based on the target value (iris type). In addition to **data transparency** we can use groupings to inspect *fairness* and *accountability* of data sets and predictive models -- group-based disparity metrics and data sampling issues/\ robustness of a predictive model respectively. The next two tutorials show how to use grouping for *fairness* (:ref:`tutorials_grouping_fairness`) and *accountability* (:ref:`tutorials_grouping_robustness`) of data and models. To learn more about transparency you can have a look at the :ref:`tutorials_model_explainability` or the :ref:`tutorials_prediction_explainability` tutorial or code examples referenced therein. Relevant FAT Forensics Examples =============================== The following examples provide more structured and code-focused use-cases of the data grouping function and the data description functionality: * :ref:`sphx_glr_sphinx_gallery_auto_transparency_xmpl_transparency_data_desc.py`, * :ref:`sphx_glr_sphinx_gallery_auto_fairness_xmpl_fairness_models_measure.py`.