etl_data_reader.py

class etl_data_reader.ETLDataReader(path: str)

A class which contains helper functions to load, process and filter the data from the ETL data set.

codes

ETLCodes instance for decoding the ETL data set labels.

Type

ETLCodes

dataset_types

A dict which maps the data set parts to their type.

Type

dict

path

The path to the folder with the data set (should also contain ‘euc_co59.dat’).

Type

str

data_set_parts_with_dummy

A list of the data set parts which have a dummy entry at the beginning.

Type

List[ETLDataNames]
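A minimal construction sketch (the import path follows the module naming used in this reference and the data set path is a placeholder; the actual package layout may differ):

    # Hypothetical usage; "/path/to/etl_data_set" is a placeholder.
    from etldr.etl_data_reader import ETLDataReader

    # The folder should contain the unpacked data set parts (ETL1, ETL2, ...)
    # as well as the 'euc_co59.dat' decoding table mentioned above.
    reader = ETLDataReader("/path/to/etl_data_set")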

__read_dataset_part_parallel(data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]

Read, process and filter one part (ex.: ETL1) of the ETL data set in parallel.

This method is the actual parallel implementation of the ‘read_dataset_part’ method. It is run in ‘processes’ subprocesses.

Note

The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_part’ method.

Warning

Will throw an error if not all parts of the data set can be found in ‘self.path/data_set’. An error is also raised if the images are not all resized to the same size.

Parameters
  • data_set – The data set part which should be loaded.

  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • processes – The number of processes which should be used for loading the data. Every process will run on a separate CPU core, so it is recommended not to use more processes than (virtual) processor cores are available.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

Returns

The loaded and filtered data set entries in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]

__read_dataset_part_sequential(data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]

Read, process and filter one part (ex.: ETL1) of the ETL data set sequentially.

This method is the actual sequential implementation of the ‘read_dataset_part’ method. It is run completely in the main process.

Note

The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_part’ method.

Warning

Will throw an error if not all parts of the data set can be found in ‘self.path/data_set’. An error is also raised if the images are not all resized to the same size.

Parameters
  • data_set – The data set part which should be loaded.

  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

Returns

The loaded and filtered data set entries in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]

__read_dataset_whole_parallel(include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]

Read, process and filter the whole ETL data set (ETL1 - ETL9G) in multiple processes.

This method is the actual parallel implementation of the ‘read_dataset_whole’ method. It is run in ‘processes’ subprocesses.

Note

The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_whole’ method.

Caution

Reading the whole dataset with all available entries will use up a lot of memory.

Warning

Will throw an error if not all parts and files of the data set can be found in ‘self.path’. An error is also raised if the images are not all resized to the same size.

Parameters
  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • processes – The number of processes which should be used for loading the data. Every process will run on a separate CPU core, so it is recommended not to use more processes than (virtual) processor cores are available.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

Returns

The loaded and filtered data set entries in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]

__read_dataset_whole_sequential(include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], resize: Tuple[int, int] = (64, 64), normalize: bool = True) → Tuple[numpy.array, numpy.array]

Read, process and filter the whole ETL data set (ETL1 - ETL9G) sequentially.

This method is the actual sequential implementation of the ‘read_dataset_whole’ method. It is run completely in the main process.

Note

The loaded images will be a numpy array with dtype=float16. This method should only be called through the ‘read_dataset_whole’ method.

Caution

Reading the whole dataset with all available entries will use up a lot of memory.

Warning

Will throw an error if not all parts and files of the data set can be found in ‘self.path’. An error is also raised if the images are not all resized to the same size.

Parameters
  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

Returns

The loaded and filtered data set entries in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]

init_dataset_types()

Initialize the dictionary of dataset_types and their codes.

process_image(imageF: PIL.Image.Image, img_size: Tuple[int, int], img_depth: int) → numpy.array

Processes the given ETL image.

The image will be resized to ‘img_size’ and its gray values will be normalized using the color channel depth ‘img_depth’.

Parameters
  • imageF – The image which should be processed.

  • img_size – The size which the image should be resized to (no resizing if any component < 1).

  • img_depth – The gray scale depth of the image (no normalization when set to < 1).

Returns

The processed image.
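A short sketch of calling process_image directly, continuing the construction example above (the input file name and the depth value of 15, which assumes a 4-bit gray scale image, are illustrative assumptions):

    # Hypothetical example; "scan.png" is a placeholder input file and the
    # depth value 15 assumes a 4-bit gray scale ETL image.
    from PIL import Image

    img = Image.open("scan.png")
    processed = reader.process_image(img, img_size=(64, 64), img_depth=15)
    # processed is a numpy array resized to 64x64 with normalized gray values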

read_dataset_file(part: int, data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], resize: Tuple[int, int] = (64, 64), normalize: bool = True, show_loading_bar: bool = True) → Tuple[numpy.array, numpy.array]

Reads, processes and filters all entries from one ETL data set file.

Note

The loaded images will be a numpy array with dtype=float16.

Parameters
  • part – The number of the file which should be loaded from the given data set part (only the number).

  • data_set – The data set part which should be loaded (ex.: ‘ETL1’).

  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

Returns

The loaded and filtered data set entries in the given file in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]
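A usage sketch for loading a single file (ETLDataNames.ETL1 is an assumed enum member name; only the enum type itself appears in the signatures above):

    from etldr.etl_data_names import ETLDataNames

    # Hypothetical example: load file number 3 of the ETL1 part,
    # keeping every character type (the default include filter).
    imgs, labels = reader.read_dataset_file(3, ETLDataNames.ETL1)
    print(imgs.shape, labels.shape)  # e.g. (N, 64, 64) and (N,)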

read_dataset_part(data_set: etldr.etl_data_names.ETLDataNames, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True, save_to: str = '') → Tuple[numpy.array, numpy.array]

Read, process and filter one part (ex.: ETL1) of the ETL data set.

Note

The loaded images will be a numpy array with dtype=float16.

Warning

Will throw an error if not all parts of the data set can be found in ‘self.path/data_set’, or if the images are not all resized to the same size. Throws a FileNotFoundError if the path to save the images to is not valid.

Parameters
  • data_set – The data set part which should be loaded.

  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • processes – The number of processes which should be used for loading the data. Every process will run on a separate CPU core, so it is recommended not to use more processes than (virtual) processor cores are available.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

  • save_to – If set to a path to a directory, all images will be saved there as .jpg images.

Returns

The loaded and filtered data set entries in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]
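A usage sketch for loading one part with a character filter (the group member name ‘katakana’ is an assumption; only ETLCharacterGroups.all is confirmed by the signatures above):

    from etldr.etl_character_groups import ETLCharacterGroups
    from etldr.etl_data_names import ETLDataNames

    # Hypothetical example: load ETL1 restricted to katakana characters,
    # resized to 64x64 and distributed over 4 worker processes.
    imgs, labels = reader.read_dataset_part(
        ETLDataNames.ETL1,
        include=[ETLCharacterGroups.katakana],
        processes=4,
    )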

read_dataset_whole(include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>], processes: int = 1, resize: Tuple[int, int] = (64, 64), normalize: bool = True, save_to: str = '') → Tuple[numpy.array, numpy.array]

Read, process and filter the whole ETL data set (ETL1 - ETL9G).

Note

The loaded images will be a numpy array with dtype=float16.

Caution

Reading the whole dataset with all available entries will use up a lot of memory.

Warning

Will throw an error if not all parts and files of the data set can be found in ‘self.path’, or if the images are not all resized to the same size. Throws a FileNotFoundError if the path to save the images to is not valid.

Parameters
  • include – All character types (Kanji, Hiragana, Symbols, etc.) which should be included.

  • processes – The number of processes which should be used for loading the data. Every process will run on a separate CPU core, so it is recommended not to use more processes than (virtual) processor cores are available.

  • resize – The size to which the images should be resized (if either component of resize is < 1 the images will not be resized). Defaults to (64, 64).

  • normalize – Whether the gray values should be normalized to the range [0.0, 1.0]. Defaults to True.

  • save_to – If set to a path to a directory, all images will be saved there as .jpg images.

Returns

The loaded and filtered data set entries in the form (images, labels).

Return type

Tuple[numpy.array, numpy.array]
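A sketch of reading everything at once while also writing the images to disk (mind the Caution above about memory usage; the output path is a placeholder):

    # Hypothetical example: load ETL1 - ETL9G with 8 processes and save
    # all images as .jpg files to the given directory.
    imgs, labels = reader.read_dataset_whole(
        include=[ETLCharacterGroups.all],
        processes=8,
        save_to="/path/to/output",
    )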

save_to_file(x: numpy.ndarray, y: numpy.ndarray, save_to: str, name: int = 1)

Saves all images and labels to file.

Creates a folder structure in which all images for one label are stored in a folder. The names of these folders are the labels encoded as an int. Additionally a file “encoding.txt” is saved. This file contains a string representation of a dict to convert from the int encoding to the matching string label. It can be restored by loading the string from disk and then calling eval() or ast.literal_eval() on it.

Warning

Throws a FileNotFoundError if the path to save the images to is not valid.

Parameters
  • x – A numpy array containing all images.

  • y – A numpy array containing all labels.

  • save_to – The path to the folder where the images and labels should be saved.

  • name – An integer from which the names should start. Defaults to 1.
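A sketch of restoring the label encoding written by save_to_file, following the description above (the file location is a placeholder, and reading with UTF-8 is an assumption):

    import ast

    # "encoding.txt" contains the string representation of a dict mapping
    # the int folder names back to the original string labels; assuming
    # the file is UTF-8 encoded.
    with open("/path/to/output/encoding.txt", encoding="utf-8") as f:
        int_to_label = ast.literal_eval(f.read())

    print(int_to_label[1])  # the string label stored in folder "1"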

select_entries(label: str, include: List[etldr.etl_character_groups.ETLCharacterGroups] = [<ETLCharacterGroups.all: '.*'>]) → bool

Checks whether the entry given by ‘label’ should be included in the loaded data set.

Parameters
  • label – The label which should be checked.

  • include – All character types which should be included.

Returns

True if the entry should be included, False otherwise.

Return type

bool
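A sketch of the filter check itself, continuing the earlier examples (the group member name ‘kanji’ is an assumption; the default value ‘.*’ for ETLCharacterGroups.all suggests the groups are regular expressions matched against the label):

    # Hypothetical example: a hiragana label does not pass a Kanji-only filter.
    keep = reader.select_entries("あ", include=[ETLCharacterGroups.kanji])
    # keep should be False here, and True with include=[ETLCharacterGroups.all]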