fatf.utils.data.tools.group_by_column

fatf.utils.data.tools.group_by_column(dataset: numpy.ndarray, column_index: Union[int, str], groupings: Optional[List[Union[float, Tuple[str]]]] = None, numerical_bins_number: int = 5, treat_as_categorical: Optional[bool] = None) → Tuple[List[List[int]], List[str]]

Groups row indices of an array based on value grouping of a chosen column.

If the selected column is numerical, its values are by default grouped into 5 bins equally distributed between the minimum and the maximum value of the column. The number of bins can be changed with the numerical_bins_number parameter. Alternatively, the exact bin boundaries can be given via the groupings parameter.
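The default numerical binning described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name equal_width_bins is hypothetical, plain Python lists stand in for a numpy column, and treating each boundary as inclusive on its upper side is an assumption borrowed from the description of the groupings parameter.

```python
def equal_width_bins(values, bins_number=5):
    """Group indices of `values` into `bins_number` equal-width bins.

    Hypothetical helper for illustration only; upper boundaries are
    assumed to be inclusive.
    """
    if bins_number < 2:
        raise ValueError('The number of bins must be at least 2.')
    low, high = min(values), max(values)
    width = (high - low) / bins_number
    # Upper boundary of every bin; the last one is the column maximum.
    boundaries = [low + width * (i + 1) for i in range(bins_number - 1)]
    boundaries.append(high)

    groups = [[] for _ in boundaries]
    for index, value in enumerate(values):
        # Place the row in the first bin whose upper boundary covers it.
        for bin_id, upper in enumerate(boundaries):
            if value <= upper:
                groups[bin_id].append(index)
                break
    return groups, boundaries


column = [0.0, 1.0, 2.5, 4.0, 5.0, 7.5, 9.0, 10.0]
groups, boundaries = equal_width_bins(column)
print(boundaries)  # [2.0, 4.0, 6.0, 8.0, 10.0]
print(groups)      # [[0, 1], [2, 3], [4], [5], [6, 7]]
```

Not-a-number values fail every comparison in this sketch and therefore end up in no bin.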

For categorical columns, the default binning is one bin for every unique value in the selected column. This behaviour can be changed by providing the groupings parameter, where multiple values can be selected to create one bin.
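As a rough illustration of the default categorical behaviour, the sketch below creates one bin per unique value. The helper name group_by_unique_values is hypothetical, and ordering the bins by sorted value is an assumption made here for determinism, not something this page states.

```python
def group_by_unique_values(values):
    """Return one list of row indices per unique value in `values`.

    Hypothetical helper for illustration only; bins are ordered by
    sorted value, which is an assumption.
    """
    groups = {}
    for index, value in enumerate(values):
        groups.setdefault(value, []).append(index)
    names = sorted(groups)
    return [groups[name] for name in names], names


column = ['b', 'a', 'c', 'a', 'b', 'd']
indices_per_bin, bin_names = group_by_unique_values(column)
print(bin_names)        # ['a', 'b', 'c', 'd']
print(indices_per_bin)  # [[1, 3], [0, 4], [2], [5]]
```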

Parameters
dataset : numpy.ndarray

A dataset to be used for grouping the row indices.

column_index : Union[string, integer]

A column index (a string for structured numpy arrays or an integer for unstructured arrays) of the column based on which the row indices will be partitioned.

groupings : List[Union[number, Tuple[string]]], optional (default=None)

A list of user-specified groupings for the selected column. The default grouping for categorical (textual) columns is splitting them by all the unique values therein. The numerical columns are, by default, binned into 5 bins (see the numerical_bins_number parameter) uniformly distributed between the minimum and the maximum value of the column. To introduce custom binning for a categorical column, the groupings parameter should be a list of tuples, where every tuple represents a single group. For example, a column with the unique values ['a', 'b', 'c', 'd'] can be split into two groups, ['a', 'd'] and ['b', 'c'], by providing the [('a', 'd'), ('b', 'c')] grouping. For numerical columns, custom grouping should be given as a list of bucket boundaries. Every bucket includes all the values that are less than or equal to the specified bucket boundary and greater than the previous boundary, if one is given.
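The two custom grouping rules can be sketched in plain Python as follows. Both helper names are hypothetical and the code is not taken from the library. This description does not say what happens to values above the last boundary, so the numerical sketch simply reports those rows separately rather than guessing.

```python
def group_categorical(values, groupings):
    """Collect row indices for every tuple of values in `groupings`."""
    groups = []
    for group in groupings:
        members = set(group)
        groups.append([i for i, v in enumerate(values) if v in members])
    return groups


def group_numerical(values, boundaries):
    """Collect row indices per bucket: previous boundary < value <= boundary.

    `boundaries` is assumed to be sorted in increasing order. Rows above
    the last boundary are returned separately (an illustrative choice).
    """
    groups = [[] for _ in boundaries]
    above_last = []
    for index, value in enumerate(values):
        for bin_id, upper in enumerate(boundaries):
            if value <= upper:
                groups[bin_id].append(index)
                break
        else:
            above_last.append(index)
    return groups, above_last


letters = ['a', 'b', 'c', 'd', 'a', 'c']
letter_groups = group_categorical(letters, [('a', 'd'), ('b', 'c')])
print(letter_groups)  # [[0, 3, 4], [1, 2, 5]]

ages = [18, 25, 30, 31, 47, 65, 80]
age_groups, above_last = group_numerical(ages, [30, 65])
print(age_groups)  # [[0, 1, 2], [3, 4, 5]]
print(above_last)  # [6]
```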

numerical_bins_number : integer, optional (default=5)

The number of bins used for default binning of numerical columns.

treat_as_categorical : boolean, optional (default=None)

Whether the selected column should be treated as a categorical or numerical feature. If set to None, the type of the column will be inferred from the data therein. If set to False, the column will be treated as numerical unless it is string-based, in which case a warning will be emitted and the column will be treated as categorical despite this setting. Finally, if set to True, the column will be treated as categorical.
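The decision logic for this parameter might look roughly like the sketch below. The function name resolve_as_categorical is hypothetical, and detecting a string-based column by checking for str values is a simplification of whatever dtype inspection the library performs. The sketch assumes that a string-based column stays categorical when False is requested, since such a column cannot be binned numerically.

```python
import warnings


def resolve_as_categorical(values, treat_as_categorical=None):
    """Decide whether a column is handled as categorical.

    Hypothetical helper for illustration only.
    """
    if treat_as_categorical is not None and not isinstance(
            treat_as_categorical, bool):
        raise TypeError('treat_as_categorical must be a boolean or None.')

    is_textual = any(isinstance(value, str) for value in values)
    if treat_as_categorical is None:
        # Infer the type from the data.
        return is_textual
    if treat_as_categorical is False and is_textual:
        # A string-based column cannot be treated as numerical.
        warnings.warn(
            'The selected column is string-based and will be treated as '
            'categorical.', UserWarning)
        return True
    return treat_as_categorical


print(resolve_as_categorical(['a', 'b']))      # True
print(resolve_as_categorical([1.0, 2.0]))      # False
print(resolve_as_categorical([1, 2, 3], True)) # True
```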

Returns
indices_per_bin : List[List[integer]]

A list of lists, where each inner list holds the row indices of one group.

bin_names : List[string]

A list holding a description of each group.

Raises
IncorrectShapeError

The input dataset is not 2-dimensional.

IndexError

The supplied column_index is not valid for the input dataset.

TypeError

The column index is neither a string nor an integer. The numerical bins number is not an integer. The groupings parameter is neither a list nor None. One of the grouping bin boundaries (for a numerical feature column) is not a number. One of the groupings (for a categorical feature column) is not a tuple. The treat_as_categorical parameter is neither a boolean nor None.

ValueError

The input dataset is not of a base type. The numerical bins number is less than 2. The groupings list is empty. The numbers in the groupings parameter are not monotonically increasing (for a numerical column). There are duplicate values shared among tuples in the groupings parameter, or one of the values does not appear in the selected column (for a categorical column).
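The groupings-related checks listed under TypeError and ValueError can be sketched as two small validators. The function names are hypothetical and the code is not the library's own validation. Reading "monotonically increasing" as strictly increasing is an assumption.

```python
def validate_numerical_groupings(groupings):
    """Check a list of bucket boundaries for a numerical column."""
    if not isinstance(groupings, list):
        raise TypeError('The groupings parameter must be a list.')
    if not groupings:
        raise ValueError('The groupings list is empty.')
    for boundary in groupings:
        if not isinstance(boundary, (int, float)):
            raise TypeError('Every bin boundary must be a number.')
    # Assumption: boundaries must be strictly increasing.
    if any(b <= a for a, b in zip(groupings, groupings[1:])):
        raise ValueError('The boundaries must be increasing.')


def validate_categorical_groupings(groupings, column_values):
    """Check a list of value tuples for a categorical column."""
    if not isinstance(groupings, list):
        raise TypeError('The groupings parameter must be a list.')
    if not groupings:
        raise ValueError('The groupings list is empty.')
    known = set(column_values)
    seen = set()
    for group in groupings:
        if not isinstance(group, tuple):
            raise TypeError('Every grouping must be a tuple.')
        for value in group:
            if value in seen:
                raise ValueError('Value shared among groupings.')
            if value not in known:
                raise ValueError('Value missing from the column.')
            seen.add(value)


# Both calls pass silently for well-formed input.
validate_numerical_groupings([1, 2.5, 10])
validate_categorical_groupings(
    [('a', 'd'), ('b', 'c')], ['a', 'b', 'c', 'd'])
```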

Warns
UserWarning

When grouping is done on a categorical column, a warning is emitted if some of the values in that column are not accounted for, i.e. they are not included in the groupings parameter. A warning is also shown if some of the rows are not included in any of the groupings. Missing row indices may result from not-a-number values in a numerical column, or from unique values that are missing from the groupings for a categorical column. Finally, a warning is emitted when the treat_as_categorical parameter is set to False but the selected feature is string-based (i.e. categorical) and therefore cannot be treated as numerical.
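The check for unaccounted categorical values could be approximated as below. The helper name and the warning text are hypothetical; only the use of UserWarning comes from this page.

```python
import warnings


def warn_about_uncovered_values(column_values, groupings):
    """Warn when some unique column values appear in no grouping tuple.

    Hypothetical helper for illustration only; returns the uncovered
    values so that the caller can inspect them.
    """
    covered = {value for group in groupings for value in group}
    uncovered = sorted(set(column_values) - covered)
    if uncovered:
        warnings.warn(
            'Values not included in any grouping: {}.'.format(uncovered),
            UserWarning)
    return uncovered


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    uncovered = warn_about_uncovered_values(
        ['a', 'b', 'c', 'd', 'a'], [('a', 'd'), ('b',)])

print(uncovered)    # ['c']
print(len(caught))  # 1
```

Rows holding an uncovered value belong to no group, which is why their indices are absent from the returned bins.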
