fatf.utils.data.tools.group_by_column
fatf.utils.data.tools.group_by_column(dataset: numpy.ndarray, column_index: Union[int, str], groupings: Optional[List[Union[float, Tuple[str]]]] = None, numerical_bins_number: int = 5, treat_as_categorical: Optional[bool] = None) → Tuple[List[List[int]], List[str]]
Groups row indices of an array based on value grouping of a chosen column.
If the selected column is numerical, by default the values are grouped into 5 bins equally distributed between the minimum and the maximum value of the column. The number of bins can be changed with the numerical_bins_number parameter if desired. Alternatively, the exact bin boundaries can be given via the groupings parameter.
For categorical columns, the default binning is one bin for every unique value in the selected column. This behaviour can be changed by providing the groupings parameter, where multiple values can be selected to create one bin.
- Parameters
- dataset : numpy.ndarray
A dataset to be used for grouping the row indices.
- column_index : Union[string, integer]
A column index (a string for structured numpy arrays or an integer for unstructured arrays) of the column based on which the row indices will be partitioned.
- groupings : List[Union[number, Tuple[string]]], optional (default=None)
A list of user-specified groupings for the selected column. The default grouping for categorical (textual) columns is splitting them by all the unique values therein. The numerical columns are, by default, binned into 5 bins (see the numerical_bins_number parameter) uniformly distributed between the minimum and the maximum value of the column. To introduce custom binning for a categorical column, the groupings parameter should be a list of tuples, where every tuple represents a single group. For example, a column with the following unique values ['a', 'b', 'c', 'd'] can be split into two groups: ['a', 'd'] and ['b', 'c'] by providing the [('a', 'd'), ('b', 'c')] grouping. For numerical columns, custom grouping should be introduced as a list of bucket boundaries. Every bucket includes all the values that are less than or equal to the specified bucket boundary and greater than the previous boundary if one is given.
- numerical_bins_number : integer, optional (default=5)
The number of bins used for default binning of numerical columns.
- treat_as_categorical : boolean, optional (default=None)
Whether the selected column should be treated as a categorical or numerical feature. If set to None, the type of the column will be inferred from the data therein. If set to False, the column will be treated as numerical unless it is string-based, in which case a warning will be emitted and the column will be treated as categorical despite this setting. Finally, if set to True, the column will be treated as categorical.
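The numerical binning rules described for groupings and numerical_bins_number can be illustrated with a minimal, standard-library sketch. This is not the library's implementation: the helper names equal_width_boundaries and group_numerical are hypothetical, "equally distributed" is read here as equal-width bins, and the trailing bucket for values above the last boundary is an assumption of this sketch.

```python
from bisect import bisect_left
from typing import List, Sequence


def equal_width_boundaries(values: Sequence[float], bins: int = 5) -> List[float]:
    """Inner boundaries splitting [min, max] into `bins` equal-width bins."""
    low, high = min(values), max(values)
    step = (high - low) / bins
    return [low + step * i for i in range(1, bins)]


def group_numerical(values: Sequence[float],
                    boundaries: Sequence[float]) -> List[List[int]]:
    """Group row indices by sorted bucket boundaries (illustrative only)."""
    # One bucket per boundary, plus a trailing bucket for values above
    # the last boundary (an assumption of this sketch).
    buckets: List[List[int]] = [[] for _ in range(len(boundaries) + 1)]
    for row, value in enumerate(values):
        # bisect_left returns the first i with value <= boundaries[i],
        # i.e. boundaries[i - 1] < value <= boundaries[i].
        buckets[bisect_left(boundaries, value)].append(row)
    return buckets


column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Custom boundaries: x <= 2, 2 < x <= 4, and everything above 4.
custom = group_numerical(column, [2.0, 4.0])

# Default-style binning: 5 equal-width bins between min and max.
default = group_numerical(column, equal_width_boundaries(column, bins=5))
```

With this toy column, a value equal to a boundary falls into the bucket that the boundary closes, which matches the "less than or equal to" rule stated above.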
- Returns
- indices_per_bin : List[List[integer]]
A list of lists, where each inner list holds the row indices of a particular group.
- bin_names : List[string]
A list holding a description of each group.
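The two returned lists are parallel: the n-th description refers to the n-th list of row indices. The sketch below mimics this return shape for a categorical column using only the standard library. The function group_categorical and its label format are hypothetical and are not taken from the library.

```python
from typing import List, Optional, Sequence, Tuple


def group_categorical(
    values: Sequence[str],
    groupings: Optional[List[Tuple[str, ...]]] = None,
) -> Tuple[List[List[int]], List[str]]:
    """Return row indices per group and a parallel list of group labels."""
    if groupings is None:
        # Default: one group per unique value (sorted for determinism).
        groupings = [(value,) for value in sorted(set(values))]
    indices_per_bin = [
        [row for row, value in enumerate(values) if value in group]
        for group in groupings
    ]
    # Hypothetical label format; the library's descriptions may differ.
    bin_names = [" or ".join(group) for group in groupings]
    return indices_per_bin, bin_names


column = ["a", "b", "c", "d", "a", "c"]
indices, names = group_categorical(column, [("a", "d"), ("b", "c")])

# Walk both lists together to pair each description with its rows.
for name, rows in zip(names, indices):
    print(name, rows)
```

Iterating with zip keeps each description aligned with its row indices, which is the usual way to consume a pair of parallel lists.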
- Raises
- IncorrectShapeError
The input dataset is not 2-dimensional.
- IndexError
The supplied column_index is not valid for the input dataset.
- TypeError
The column index is neither a string nor an integer. The numerical bins number is not an integer. The groupings parameter is neither a list nor None. One of the grouping bin boundaries (for a numerical feature column) is not a number. One of the groupings (for a categorical feature column) is not a tuple. The treat_as_categorical parameter is neither a boolean nor None.
- ValueError
The input dataset is not of a base type. The numerical bins number is less than 2. The groupings list is empty. The numbers in the groupings parameter are not monotonically increasing (for a numerical column). There are duplicate values shared among tuples in the groupings parameter or one of the values does not appear in the selected column (for a categorical column).
- Warns
- UserWarning
When grouping is done on a categorical column, a warning is emitted when some of the values in that column are not accounted for, i.e. they are not included in the groupings parameter. Also, if some of the rows are not included in any of the groupings, a warning is shown. Missing row indices may be a result of some of the values being not-a-number for a numerical column, or of some of the unique values not being covered for a categorical column. A warning is also emitted when the treat_as_categorical parameter is set to False but the selected feature is string-based (i.e. categorical) and therefore cannot be treated as a numerical one.
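The "rows not included in any of the groupings" case can be checked by comparing the grouped indices against the full range of rows. The following is a minimal sketch of such a check, assuming a hypothetical helper unassigned_rows. It shows why a not-a-number value ends up in no bin, and it does not reproduce the library's warning messages.

```python
import warnings
from typing import List, Sequence


def unassigned_rows(row_count: int,
                    indices_per_bin: Sequence[Sequence[int]]) -> List[int]:
    """Return row indices missing from every bin, warning if any exist."""
    assigned = {row for rows in indices_per_bin for row in rows}
    missing = [row for row in range(row_count) if row not in assigned]
    if missing:
        warnings.warn(
            "{} row(s) were not assigned to any group.".format(len(missing)),
            UserWarning,
        )
    return missing


column = [1.0, float("nan"), 3.0]

# NaN compares False against every boundary, so it lands in no bin here.
bins = [
    [i for i, v in enumerate(column) if v <= 2.0],
    [i for i, v in enumerate(column) if v > 2.0],
]

# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = unassigned_rows(len(column), bins)
```

The same check applies to a categorical column whose groupings omit some of its unique values: rows holding the omitted values appear in no bin.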