fatf.utils.data.datasets.load_data

fatf.utils.data.datasets.load_data(file_path: str, dtype: Union[None, type, numpy.dtype, str, List[Tuple[str, str]], List[Tuple[str, numpy.dtype]]] = None, feature_names: List[str] = None) → Dict[str, numpy.ndarray]

Loads a dataset from a file.
The dataset file must be formatted in the comma-separated values (csv) standard with a comma (,) used as the delimiter. The first row of the file must be a header formatted as follows: n_samples,n_features,class_name_1,class_name_2,... For example, 150,5,red,green,blue,black indicates that there are 150 data points, with 5 features and 4 possible classes: red, green, blue and black. The classes should be given in an order that matches the lexicographical ordering of the unique class values. For example, given that the class values in the data are 3, 2, 4 and 1, the assignment would be: 1–red, 2–green, 3–blue and 4–black. The rest of the csv file will be treated as a data array, with the last column being treated as the target (class) variable. The type of each column will be inferred if the dtype parameter is set to None; otherwise the array will be cast into the provided dtype. If the columns in the data are of different types, or the user-provided dtype defines the columns to be of multiple types, a structured numpy array is used to represent the data.
- Parameters
- file_path : string
Path to the csv data file.
- dtype : Union[type, numpy.dtype, string, List[Tuple[string, string]], List[Tuple[string, type]], List[Tuple[string, numpy.dtype]]], optional (default=None)
The dtypes used to read the csv data. Defaults to None, in which case the types will be inferred. The user can provide either a single type for the whole array (as a built-in Python type, a numpy dtype or a string representation of a numpy dtype) or a list of tuples representing the name (string) and type (see above) of every column in the data array. In the latter case the user may choose to provide the list of types for the whole dataset, including the target column, or just for the columns representing features.
- feature_names : List[string]
List of strings representing the feature names. Defaults to None, in which case features are given default names (‘feature_0’, etc.) or, if a structured dtype parameter is provided, the names given in the dtype parameter are used.
- Returns
- data : Dict[string, numpy.ndarray]
A dictionary representation of the dataset storing all the relevant information under the following keys: ‘data’, ‘target’, ‘target_names’, ‘feature_names’.
- Raises
- TypeError
One of the feature names in the feature_names parameter (if provided) is not a string; the feature_names parameter is neither of the allowed types (None or a list); the first element of one of the dtype tuples is not a string; or the dtype parameter is not one of the allowed types (None, a list of tuples, a built-in Python type, a numpy dtype or a string representation of a numpy dtype).
- ValueError
The number of feature names is inconsistent with the data header; the feature names are provided in both the feature_names and dtype parameters; a tuple in the list of complex dtypes is malformed; or the number of type definitions in the dtype parameter is inconsistent with the number of features in the dataset.
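
The header convention described above can be illustrated with a small standard-library sketch. It writes a toy csv file in the expected layout and then interprets the header by hand. The dataset, the file name toy.csv and the variable names are hypothetical, and the parsing shown here only mirrors the documented format; it is not the library's own implementation.

```python
import csv
import os
import tempfile

# Hypothetical toy dataset: 4 samples, 2 features, last column is the class.
rows = [
    [5.1, 3.5, 2],
    [4.9, 3.0, 1],
    [6.2, 2.9, 3],
    [5.9, 3.2, 1],
]
# Names listed in the order of the sorted class values: 1-red, 2-green, 3-blue.
class_names = ['red', 'green', 'blue']

path = os.path.join(tempfile.mkdtemp(), 'toy.csv')
with open(path, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    # Header row: n_samples,n_features,class_name_1,class_name_2,...
    writer.writerow([len(rows), len(rows[0]) - 1] + class_names)
    writer.writerows(rows)

# Read the file back and interpret the header manually.
with open(path, newline='') as csv_file:
    reader = csv.reader(csv_file)
    header = next(reader)
    body = list(reader)

n_samples, n_features = int(header[0]), int(header[1])
target_names = header[2:]
targets = [row[-1] for row in body]  # the last column is the target

# Pair the lexicographically sorted unique class values with the header names.
class_map = dict(zip(sorted(set(targets)), target_names))
```

A file written this way should be suitable for passing as file_path to load_data, which, per the Returns section, gives back a dictionary with the ‘data’, ‘target’, ‘target_names’ and ‘feature_names’ keys.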