fatf.utils.data.datasets.load_data

fatf.utils.data.datasets.load_data(file_path, dtype=None, feature_names=None)

Loads a dataset from a file.

The dataset file must be formatted in the comma separated value (csv) standard with a comma (,) used as the delimiter. The first row of the file must be a header formatted as follows: n_samples,n_features,class_name_1,class_name_2,.... For example, 150,5,red,green,blue,black indicates that there are 150 data points with 5 features and 4 possible classes: red, green, blue and black.

The class names should be given in an order that matches the lexicographical ordering of the unique class values. For example, if the class values in the data are 3, 2, 4 and 1, the assignment would be: 1 to red, 2 to green, 3 to blue and 4 to black.

The rest of the csv file is treated as a data array, with the last column treated as the target (class) variable. The type of each column is inferred if the dtype parameter is set to None; otherwise the array is cast into the provided dtype. If the columns in the data are of different types, or the user-provided dtype defines columns of multiple types, a structured numpy array is used to represent the data.
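The header layout and the class-name ordering described above can be sketched with the standard library alone. The toy file, its values and the variable names below are hypothetical, and the parsing is only an illustration of the documented format (assuming each data row holds n_features values followed by the target), not the library's own reader.

```python
import csv
import os
import tempfile

# Hypothetical toy dataset: 4 samples, 2 features, 2 classes.
rows = [
    ["4", "2", "fail", "pass"],  # header: n_samples,n_features,class names
    ["0.5", "1.0", "1"],         # two feature values, then the target
    ["1.5", "0.0", "0"],
    ["2.5", "3.0", "1"],
    ["3.5", "2.0", "0"],
]

# Write the file in the comma-delimited layout described above.
file_path = os.path.join(tempfile.mkdtemp(), "toy_dataset.csv")
with open(file_path, "w", newline="") as handle:
    csv.writer(handle).writerows(rows)

# Read it back: the first row is the header, the rest is the data array.
with open(file_path, newline="") as handle:
    reader = csv.reader(handle)
    header = next(reader)
    body = list(reader)

n_samples, n_features = int(header[0]), int(header[1])
class_names = header[2:]

# The last column is the target; class names are matched to the sorted
# (lexicographically ordered) unique class values.
targets = [row[-1] for row in body]
class_values = sorted(set(targets))
value_to_name = dict(zip(class_values, class_names))

print(n_samples, n_features, value_to_name)
```

A file written this way is the kind of input that file_path is expected to point to.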

Parameters
file_path : string

Path to the csv data file.

dtype : Union[type, numpy.dtype, string, List[Tuple[string, string]], List[Tuple[string, type]], List[Tuple[string, numpy.dtype]]], optional (default=None)

The dtype(s) used to read the csv data. Defaults to None, in which case the types will be inferred. The user can provide either a single type for the whole array (as a built-in Python type, a numpy dtype or a string representation of a numpy dtype) or a list of tuples giving the name (string) and type (see above) of every column in the data array. In the latter case the user may choose to provide the list of types for the whole dataset, including the target column, or just for the columns representing features.
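The accepted forms of the dtype parameter map onto ordinary numpy dtype definitions. The sketch below uses only numpy itself to show each form; the column names (height, weight, colour) are made up for illustration.

```python
import numpy as np

# A single type for the whole array, in each of the three accepted forms.
single_builtin = np.dtype(float)       # built-in Python type
single_numpy = np.dtype(np.float32)    # numpy type
single_string = np.dtype("int64")      # string representation of a dtype

# A list of (name, type) tuples, one per column; the type of each column
# may again be a built-in type, a string or a numpy dtype.
per_column = [
    ("height", float),
    ("weight", "float32"),
    ("colour", np.dtype("U10")),
]
structured = np.dtype(per_column)

# Columns of different types are held in a structured array, where every
# column is addressed by its name.
table = np.array([(1.8, 70.0, "red"), (1.6, 55.5, "blue")], dtype=structured)

print(single_builtin.names)  # a plain dtype has no field names
print(structured.names)      # a structured dtype carries the column names
print(table["colour"])
```

A list such as per_column is the kind of value that can be passed as dtype when the columns are of mixed types.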

feature_names : List[string], optional (default=None)

List of strings representing the feature names. Defaults to None, in which case the features are given default names (‘feature_0’, etc.) or, if a structured dtype parameter is provided, the names given in that dtype are used.
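The naming rules can be summarised in a short sketch. The helper resolve_feature_names and its dtype_names argument are hypothetical and not part of the library; the code only mirrors the behaviour documented on this page, assumes the default names continue as feature_1, feature_2 and so on, and assumes dtype_names covers the feature columns only.

```python
def resolve_feature_names(n_features, feature_names=None, dtype_names=None):
    """Hypothetical helper illustrating the documented naming rules."""
    # feature_names must be either None or a list.
    if feature_names is not None and not isinstance(feature_names, list):
        raise TypeError("feature_names must be None or a list.")
    # Names may come from feature_names or from a structured dtype, not both.
    if feature_names is not None and dtype_names is not None:
        raise ValueError("Feature names given in both feature_names and dtype.")

    names = feature_names if feature_names is not None else dtype_names
    if names is None:
        # No names supplied anywhere: fall back to feature_0, feature_1, ...
        return ["feature_{}".format(i) for i in range(n_features)]

    names = list(names)
    if not all(isinstance(name, str) for name in names):
        raise TypeError("Every feature name must be a string.")
    if len(names) != n_features:
        raise ValueError("Number of feature names does not match the header.")
    return names


print(resolve_feature_names(3))
print(resolve_feature_names(2, feature_names=["height", "weight"]))
print(resolve_feature_names(2, dtype_names=("height", "weight")))
```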

Returns
data : Dict[string, numpy.ndarray]

A dictionary representation of the dataset storing all the relevant information under the following keys: ‘data’, ‘target’, ‘target_names’, ‘feature_names’.

Raises
TypeError

Raised if any of the following holds: one of the feature names in the feature_names parameter (when provided) is not a string; the feature_names parameter is not of an allowed type (None or a list); the first element of one of the dtype tuples is not a string; or the dtype parameter is not of an allowed type (None, a list of tuples, a built-in Python type, a numpy dtype or a string representation of a numpy dtype).

ValueError

Raised if any of the following holds: the number of feature names is inconsistent with the data header; the feature names are provided in both the feature_names and dtype parameters; a tuple in the list of complex dtypes is malformed; or the number of type definitions in the dtype parameter is inconsistent with the number of features in the dataset.