etl_data_reader.py¶
-
class etl_data_reader.ETLDataReader(path: str)¶
A class which contains helper functions to load, process and filter the data from the ETL data set.
-
dataset_types¶ A dict which maps the data set parts to their type.
- Type
dict
-
path¶ The path to the folder with the data set (should also contain ‘euc_c059.dat’).
- Type
str
-
data_set_parts_with_dummy¶ A list of the data set parts which have a dummy entry at the beginning.
- Type
List[ETLDataNames]
-
__read_dataset_part_parallel
(data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]¶ Read, process and filter one part (ex.: ETL1) of the ETL data set in parallel.
This method is the actual parallel implementation of the ‘read_dataset_part’ method. It is run in as many subprocesses as specified by ‘processes’.
Note
The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_part’ method.
Warning
Will throw an error if not all parts of the data set can be found in ‘self.path/data_set’, or if the images are not all resized to the same size.
- Parameters
data_set – The data set part which should be loaded.
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
processes – The number of processes which should be used for loading the data. Every process runs on a separate CPU core, so it is recommended not to use more processes than there are (virtual) processor cores available.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
- Returns
The loaded and filtered data set entries in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
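The splitting described above can be sketched with a process pool. This is an illustrative reimplementation, not the library's code: `_load_part_file` is a hypothetical stand-in that fabricates two blank float16 images per file instead of parsing the binary ETL record format, and the fork start method is requested explicitly so the sketch stays self-contained.

```python
from multiprocessing import get_context

import numpy as np


def _load_part_file(part_number: int):
    """Hypothetical stand-in for loading one file of a data set part.

    The real reader parses the binary ETL record format; here we just
    fabricate two 64x64 float16 images and matching labels per file.
    """
    images = np.zeros((2, 64, 64), dtype=np.float16)
    labels = np.array([f"label_{part_number}_{i}" for i in range(2)])
    return images, labels


def read_part_parallel(file_numbers, processes: int = 2):
    # Split the file numbers across a pool of worker processes, then
    # concatenate the partial results into one (images, labels) pair,
    # mirroring what the parallel reader does with its subprocesses.
    with get_context("fork").Pool(processes) as pool:
        results = pool.map(_load_part_file, file_numbers)
    images = np.concatenate([r[0] for r in results])
    labels = np.concatenate([r[1] for r in results])
    return images, labels
```

Because each worker only returns its own slice, the per-process peak memory stays bounded by the largest file, while the final concatenation happens once in the parent process.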
-
__read_dataset_part_sequential
(data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]¶ Read, process and filter one part (ex.: ETL1) of the ETL data set sequentially.
This method is the actual sequential implementation of the ‘read_dataset_part’ method. It is run completely in the main process.
Note
The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_part’ method.
Warning
Will throw an error if not all parts of the data set can be found in ‘self.path/data_set’, or if the images are not all resized to the same size.
- Parameters
data_set – The data set part which should be loaded.
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
- Returns
The loaded and filtered data set entries in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
-
__read_dataset_whole_parallel
(include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]¶ Read, process and filter the whole ETL data set (ETL1 - ETL9G) in multiple processes.
This method is the actual parallel implementation of the ‘read_dataset_whole’ method. It is run in as many subprocesses as specified by ‘processes’.
Note
The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_whole’ method.
Caution
Reading the whole dataset with all available entries will use up a lot of memory.
Warning
Will throw an error if not all parts and files of the data set can be found in ‘self.path’, or if the images are not all resized to the same size.
- Parameters
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
processes – The number of processes which should be used for loading the data. Every process runs on a separate CPU core, so it is recommended not to use more processes than there are (virtual) processor cores available.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
- Returns
The loaded and filtered data set entries in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
-
__read_dataset_whole_sequential
(include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]¶ Read, process and filter the whole ETL data set (ETL1 - ETL9G) sequentially.
This method is the actual sequential implementation of the ‘read_dataset_whole’ method. It is run completely in the main process.
Note
The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_whole’ method.
Caution
Reading the whole dataset with all available entries will use up a lot of memory.
Warning
Will throw an error if not all parts and files of the data set can be found in ‘self.path’, or if the images are not all resized to the same size.
- Parameters
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
- Returns
The loaded and filtered data set entries in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
-
init_dataset_types
()¶ Initialize the dictionary of dataset_types and their codes.
-
process_image
(imageF: PIL.Image.Image, img_size: Tuple[int, int], img_depth: int) → numpy.array¶ Processes the given ETL-image.
The image will be resized to ‘img_size’ and its gray values will be normalized by the color depth ‘img_depth’.
- Parameters
imageF – The image which should be processed.
img_size – The size which the image should be resized to (no resizing if any component < 1).
img_depth – The gray scale depth of the image (no normalization when set to < 1).
- Returns
The processed image.
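The resize-then-normalize step can be sketched as a standalone function. This is an illustrative approximation under stated assumptions, not the library's implementation: the function name is hypothetical, and it assumes PIL's `Image.resize` plus a division by ‘img_depth’ is a faithful model of the processing described above.

```python
import numpy as np
from PIL import Image


def process_image_sketch(image: Image.Image,
                         img_size=(64, 64),
                         img_depth: int = 255) -> np.ndarray:
    """Hypothetical sketch of the processing step: resize the image,
    then normalize its gray values by the color depth, returning a
    float16 array as the reader's docs describe."""
    if img_size[0] >= 1 and img_size[1] >= 1:   # no resizing if any component < 1
        image = image.resize(img_size)
    arr = np.asarray(image, dtype=np.float16)
    if img_depth >= 1:                          # no normalization when img_depth < 1
        arr = arr / np.float16(img_depth)
    return arr
```

With the defaults, an 8-bit grayscale image comes out as a 64x64 float16 array with values in [0.0, 1.0].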
-
read_dataset_file
(part: int, data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], resize: Tuple[int, int] = (64, 64), normalize: bool = True, show_loading_bar: bool = True) → Tuple[numpy.array, numpy.array]¶ Reads, processes and filters all entries from the given ETL data set file.
Note
The loaded images will be a numpy array with dtype=float16.
- Parameters
part – The number of the file which should be loaded from the given data set part.
data_set – The data set part which should be loaded (ex.: ‘ETL1’).
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
- Returns
The loaded and filtered data set entries from the given file in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
-
read_dataset_part
(data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True, save_to: str = '') → Tuple[numpy.array, numpy.array]¶ Read, process and filter one part (ex.: ETL1) of the ETL data set.
Note
The loaded images will be a numpy array with dtype=float16.
Warning
Will throw an error if not all parts of the data set can be found in ‘self.path/data_set’, or if the images are not all resized to the same size. Throws a FileNotFoundError if the path to save the images to is not valid.
- Parameters
data_set – The data set part which should be loaded.
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
processes – The number of processes which should be used for loading the data. Every process runs on a separate CPU core, so it is recommended not to use more processes than there are (virtual) processor cores available.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
save_to – If set to a valid directory path, all images will be saved there as jpg images.
- Returns
The loaded and filtered data set entries in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
-
read_dataset_whole
(include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True, save_to: str = '') → Tuple[numpy.array, numpy.array]¶ Read, process and filter the whole ETL data set (ETL1 - ETL9G).
Note
The loaded images will be a numpy array with dtype=float16.
Caution
Reading the whole dataset with all available entries will use up a lot of memory.
Warning
Will throw an error if not all parts and files of the data set can be found in ‘self.path’, or if the images are not all resized to the same size. Throws a FileNotFoundError if the path to save the images to is not valid.
- Parameters
include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.
processes – The number of processes which should be used for loading the data. Every process runs on a separate CPU core, so it is recommended not to use more processes than there are (virtual) processor cores available.
resize – The size to which the images should be resized (no resizing if any component of resize is < 1). Defaults to (64, 64).
normalize – Whether the gray values should be normalized to [0.0, 1.0]. Defaults to True.
save_to – If set to a valid directory path, all images will be saved there as jpg images.
- Returns
The loaded and filtered data set entries in the form (images, labels).
- Return type
Tuple[numpy.array, numpy.array]
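The Caution above can be put into rough numbers. This is a back-of-the-envelope sketch, not a statement about the actual size of the ETL data set: it only accounts for the float16 image array itself (2 bytes per pixel at the default 64x64 resize), and the one-million-image figure in the example is an arbitrary assumption.

```python
def estimate_memory_gib(num_images: int,
                        height: int = 64,
                        width: int = 64,
                        bytes_per_value: int = 2) -> float:
    """Rough memory footprint of the loaded image array alone.

    float16 images take 2 bytes per pixel; labels, Python overhead and
    temporary copies made during loading come on top of this estimate.
    """
    return num_images * height * width * bytes_per_value / 2**30


# Hypothetical example: one million 64x64 float16 images alone need
# roughly 7.6 GiB, before labels and intermediate copies.
print(estimate_memory_gib(1_000_000))
```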
-
save_to_file
(x: numpy.ndarray, y: numpy.ndarray, save_to: str, name: int = 1)¶ Saves all images and labels to file.
Creates a folder structure in which all images for one label are stored in one folder. The names of these folders are the labels encoded as an int. Additionally, a file “encoding.txt” is saved. This file contains a string representation of a dict for converting from the int encoding to the matching string label. It can be restored by loading the string from disk and then calling eval() or ast.literal_eval() on it.
Warning
Throws a FileNotFoundError if the path to save the images to is not valid.
- Parameters
x – A numpy array containing all images.
y – A numpy array containing all labels.
save_to – The path to the folder where the images and labels should be saved.
name – An integer from which the names should start. Defaults to 1.
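The round trip for “encoding.txt” described above can be sketched as follows. The exact contents of the file depend on save_to_file; this snippet only illustrates the restore step with ast.literal_eval, and the example labels are made up.

```python
import ast
import os
import tempfile

# Hypothetical int -> label encoding, standing in for the dict
# save_to_file writes out as a string.
encoding = {0: "あ", 1: "い", 2: "う"}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "encoding.txt")

    # Write the string representation of the dict, as the docs describe.
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(encoding))

    # Restore it: ast.literal_eval only evaluates literals, so it is
    # a safer choice than plain eval() on file contents.
    with open(path, encoding="utf-8") as f:
        restored = ast.literal_eval(f.read())
```

After the round trip, `restored` is an ordinary dict again and can map the integer folder names back to their string labels.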
-
select_entries
(label: str, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>]) → bool¶ Checks whether the entry given by ‘label’ should be included in the loaded data set.
- Parameters
label – The label which should be checked if it should be included.
include – All character types which should be included.
- Returns
True if the entry should be included, False otherwise.
- Return type
bool
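Since the default include group is the regular expression '.*' (shown in the signature above), the filtering can be sketched as matching the label against a list of patterns. This is an illustrative approximation with a hypothetical function name; only the '.*' default comes from the source, the other patterns in the usage lines are invented for the example.

```python
import re
from typing import List, Optional


def select_entries_sketch(label: str,
                          include: Optional[List[str]] = None) -> bool:
    """Hypothetical sketch: keep an entry if its label fully matches
    any of the included regular-expression patterns."""
    patterns = include if include else [".*"]   # default group matches everything
    return any(re.fullmatch(p, label) is not None for p in patterns)
```

For example, the default includes every label, while a pattern list like `["[0-9]"]` would keep only single-digit labels.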
-