.. note::
    :class: sphx-glr-download-link-note

    Click :ref:`here ` to download the full example code or run this example in your browser via Binder
.. rst-class:: sphx-glr-example-title

.. _sphx_glr_sphinx_gallery_auto_accountability_xmpl_accountability_predictions_density.py:


====================================================
Measuring Robustness of a Prediction -- Data Density
====================================================

This example illustrates how to use data density estimation to "measure" the
extent to which a prediction should be trusted.


.. code-block:: default

    # Author: Kacper Sokol
    # License: new BSD

    import matplotlib.pyplot as plt
    import numpy as np

    import fatf.utils.data.datasets as fatf_datasets
    import fatf.utils.models as fatf_models

    import fatf.utils.data.density as fatf_density

    print(__doc__)

    # Prepare a colourmap
    cmap = np.array([
        plt.get_cmap('Pastel1').colors[0],
        plt.get_cmap('Pastel1').colors[1],
        plt.get_cmap('Pastel1').colors[2]
    ])

    # Load data
    iris_data_dict = fatf_datasets.load_iris()
    iris_X = iris_data_dict['data']
    iris_y = iris_data_dict['target'].astype(int)
    iris_feature_names = iris_data_dict['feature_names']
    iris_class_names = iris_data_dict['target_names']


First, we visualise the data set using two (arbitrarily chosen) features. The
plot helps us choose one data point in a sparse region and another one in a
dense region.


.. code-block:: default

    # Choose 2 features
    chosen_features_index = [1, 2]
    chosen_features_name = iris_feature_names[chosen_features_index]

    # Get a subset of the data set with these two features
    iris_X_chosen_features = iris_X[:, chosen_features_index]

    # Plot the dataset using the 2 selected features
    plt.scatter(
        iris_X_chosen_features[:, 0],
        iris_X_chosen_features[:, 1],
        c=cmap[iris_y])

    plt.title('Iris data set scatter plot')
    plt.xlabel(chosen_features_name[0])
    plt.ylabel(chosen_features_name[1])
    plt.show()


.. image:: /sphinx_gallery_auto/accountability/images/sphx_glr_xmpl_accountability_predictions_density_001.png
    :class: sphx-glr-single-img


Now, we train a predictive model on the whole data set using these two
features. We will use it to evaluate the quality and robustness of the
predictions it provides.


.. code-block:: default

    # Train a model
    clf = fatf_models.KNN()
    clf.fit(iris_X_chosen_features, iris_y)

    # Create two data points: one in a dense region and one in a sparse region
    point_dense = np.array([[3.0, 5.0]])
    point_sparse = np.array([[3.5, 3.25]])

    # Plot the dense and sparse data points
    plt.scatter(
        iris_X_chosen_features[:, 0],
        iris_X_chosen_features[:, 1],
        c=cmap[iris_y])
    plt.scatter(
        point_dense[:, 0],
        point_dense[:, 1],
        c='r',
        s=200,
        marker='X',
        label='dense point')
    plt.scatter(
        point_sparse[:, 0],
        point_sparse[:, 1],
        c='g',
        s=200,
        marker='X',
        label='sparse point')

    plt.title('Dense and sparse data points visualisation')
    plt.xlabel(chosen_features_name[0])
    plt.ylabel(chosen_features_name[1])
    plt.legend(loc='upper right')
    plt.show()


.. image:: /sphinx_gallery_auto/accountability/images/sphx_glr_xmpl_accountability_predictions_density_002.png
    :class: sphx-glr-single-img


Now, let us get the predictions for both of these data points.


.. code-block:: default

    # Predict the two data points
    point_dense_prediction = clf.predict(point_dense)[0]
    point_sparse_prediction = clf.predict(point_sparse)[0]

    # Print out the predictions...
    print('The **dense** data point is of class: {} ({}).'.format(
        iris_class_names[point_dense_prediction], point_dense_prediction))
    print('The **sparse** data point is of class: {} ({}).'.format(
        iris_class_names[point_sparse_prediction], point_sparse_prediction))

    # ...and plot them with the class assignment
    plt.scatter(
        iris_X_chosen_features[:, 0],
        iris_X_chosen_features[:, 1],
        c=cmap[iris_y])
    plt.scatter(
        point_dense[:, 0],
        point_dense[:, 1],
        c=cmap[[int(point_dense_prediction)]],
        s=200,
        marker='X',
        label='dense point -- {}'.format(iris_class_names[point_dense_prediction]))
    plt.scatter(
        point_sparse[:, 0],
        point_sparse[:, 1],
        c=cmap[[int(point_sparse_prediction)]],
        s=200,
        marker='X',
        label='sparse point -- {}'.format(
            iris_class_names[point_sparse_prediction]))

    plt.title('Dense and sparse data points visualisation')
    plt.xlabel(chosen_features_name[0])
    plt.ylabel(chosen_features_name[1])
    plt.legend(loc='upper right')
    plt.show()


.. image:: /sphinx_gallery_auto/accountability/images/sphx_glr_xmpl_accountability_predictions_density_003.png
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    The **dense** data point is of class: virginica (2).
    The **sparse** data point is of class: versicolor (1).


Now that we know where these two data points lie and which class the
classifier assigns to each of them, we can see how a **Data Density** score
can help us.


.. code-block:: default

    # Print the class names for the probability vector
    print('(The probability vector is based on the following class ordering: '
          '{}.)\n'.format(iris_class_names))

    # Initialise a density estimator
    iris_density = fatf_density.DensityCheck(iris_X_chosen_features)

    # Get a density estimation for the dense point
    density_score_dense = iris_density.score_data_point(point_dense[0, :])
    # Get the probability predictions for the dense point
    probabilities_dense = clf.predict_proba(point_dense)[0]

    # Get a density estimation for the sparse point
    density_score_sparse = iris_density.score_data_point(point_sparse[0, :])
    # Get the probability predictions for the sparse point
    probabilities_sparse = clf.predict_proba(point_sparse)[0]

    print('The prediction probabilities for the sparse data point are: {}, with a '
          'density score of {}.'.format(probabilities_sparse, density_score_sparse))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (The probability vector is based on the following class ordering: ['setosa' 'versicolor' 'virginica'].)

    The prediction probabilities for the sparse data point are: [0. 1. 0.], with a density score of 0.9481604611988929.


The result above shows that, despite the high probability (1) of the
*versicolor* class for the **sparse** data point, the prediction should not be
trusted. With this score, a high value corresponds to a low data density: the
score of 0.95 indicates that the point lies in a sparsely populated region.
The classifier has seen few data points there, so the model is simply
overconfident.


.. code-block:: default

    print('The prediction probabilities for the dense data point are: {}, with a '
          'density score of {}.'.format(probabilities_dense, density_score_dense))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    The prediction probabilities for the dense data point are: [0.         0.33333333 0.66666667], with a density score of 0.10186747546117353.

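To build some intuition for why a point far from the data receives a high score and a point inside a cluster receives a low one, the following is a minimal, self-contained sketch of a neighbour-distance score. It is an illustration only and is not the FAT Forensics implementation: the helper names ``kth_neighbour_distances`` and ``toy_density_score``, the choice of the k-th neighbour distance, the min-max scaling and the toy data are all assumptions made for this example.

```python
import numpy as np


def kth_neighbour_distances(data, k):
    """Distance from every row of ``data`` to its k-th nearest *other* row."""
    diffs = data[:, np.newaxis, :] - data[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting, column 0 is the zero distance of a point to itself,
    # so column k holds the distance to the k-th nearest other point.
    return np.sort(dists, axis=1)[:, k]


def toy_density_score(data, point, k=3):
    """Hypothetical score in [0, 1]: 0 means dense region, 1 means sparse."""
    reference = kth_neighbour_distances(data, k)
    dist_to_data = np.sqrt(((data - point) ** 2).sum(axis=1))
    raw = np.sort(dist_to_data)[k - 1]  # distance to the k-th nearest row
    low, high = reference.min(), reference.max()
    # Min-max scale against the scores of the data set itself and clip, so
    # that unseen points cannot fall outside the [0, 1] range.
    return float(np.clip((raw - low) / (high - low), 0.0, 1.0))


# Toy data (an assumption): a tight 5x5 grid plus three isolated points.
grid = np.array([[0.1 * i, 0.1 * j] for i in range(5) for j in range(5)])
isolated = np.array([[3.0, 3.0], [3.5, 3.0], [3.0, 3.6]])
toy_data = np.vstack([grid, isolated])

dense_score = toy_density_score(toy_data, np.array([0.25, 0.25]))
sparse_score = toy_density_score(toy_data, np.array([2.0, 2.0]))

print('dense query score: {:.3f}'.format(dense_score))
print('sparse query score: {:.3f}'.format(sparse_score))
```

In this sketch the query inside the grid scores far lower than the query placed between the grid and the isolated points, which mirrors the qualitative behaviour of the scores reported above for the two iris data points.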
On the other hand, the prediction for the **dense** data point, which the
classifier assigns to the *virginica* class with probability 0.67, can be
trusted given its low density score of 0.1. This data point clearly lies in a
dense region, so the classifier has seen plenty of other data points in that
region during the training procedure.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.993 seconds)


.. _sphx_glr_download_sphinx_gallery_auto_accountability_xmpl_accountability_predictions_density.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: binder-badge

    .. image:: https://mybinder.org/badge_logo.svg
      :target: https://mybinder.org/v2/gh/fat-forensics/fat-forensics-doc/master?filepath=notebooks/sphinx_gallery_auto/accountability/xmpl_accountability_predictions_density.ipynb
      :width: 150 px


  .. container:: sphx-glr-download

     :download:`Download Python source code: xmpl_accountability_predictions_density.py `



  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: xmpl_accountability_predictions_density.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_