Getting started

ETLCDB_data_reader

A Python package for conveniently loading the ETLCDB. The complete documentation, including the API reference, can be found here.

Intro

The ETLCDB is a collection of roughly 1,600,000 handwritten characters. Notably, it includes Japanese Kanji, Hiragana, and Katakana. The data set can be found on the ETLCDB website (a registration is needed to download it).
Because the data set is stored in a custom binary format, it can be hard to load. This Python package provides an easy way to load the data set and filter its entries.
An example of using this package can be found in my application DaKanji, where it was used to train a CNN to recognize handwritten Japanese characters, numbers, and roman letters.
General information about the data set can be found in the table below.

| name | type | content | res | bit depth | code | samples per label | total samples |
|------|------|---------|-----|-----------|------|-------------------|---------------|
| ETL1 | M-Type | Numbers, Roman, Symbols, Katakana | 64x63 | 4 | JIS X 0201 | ~1400 | 141319 |
| ETL2 | K-Type | Hiragana, Katakana, Kanji, Roman, Symbols | 60x60 | 6 | CO59 | ~24 | 52796 |
| ETL3 | C-Type | Numeric, Capital Roman, Symbols | 72x76 | 4 | JIS X 0201 | 200 | 9600 |
| ETL4 | C-Type | Hiragana | 72x76 | 4 | JIS X 0201 | 120 | 6120 |
| ETL5 | C-Type | Katakana | 72x76 | 4 | JIS X 0201 | ~200 | 10608 |
| ETL6 | M-Type | Katakana, Symbols | 64x63 | 4 | JIS X 0201 | 1383 | 157662 |
| ETL7 | M-Type | Hiragana, Symbols | 64x63 | 4 | JIS X 0201 | 160 | 16800 |
| ETL8 (8B) | 8B-Type | Hiragana, Kanji | 64x63 | 1 | JIS X 0208 | 160 | 157662 |
| ETL9 (8G) | 8G-Type | Hiragana, Kanji | 128x127 | 4 | JIS X 0208 | 200 | 607200 |
| ETL10 (9B) | 9B-Type | Hiragana, Kanji | 64x63 | 1 | JIS X 0208 | 160 | 152960 |
| ETL11 (9G) | 9G-Type | Hiragana, Kanji | 128x127 | 4 | JIS X 0208 | 200 | 607200 |

Note:
The ETL6 and ETL7 parts include half-width katakana, which are stored as romanized letters; for example, “ケ” is stored as “ke”. These are automatically converted back by this package. Full-width numbers and letters are also converted when using the package, e.g. ０ -> 0 and Ａ -> A.
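The full-width folding mentioned above corresponds to Unicode NFKC normalization, which can be reproduced with Python's standard library (a sketch for illustration; the package's actual conversion code may differ):

```python
import unicodedata

# NFKC normalization folds full-width digits/letters to their ASCII
# equivalents and half-width katakana to full-width katakana.
print(unicodedata.normalize("NFKC", "０"))  # full-width zero -> "0"
print(unicodedata.normalize("NFKC", "Ａ"))  # full-width A    -> "A"
print(unicodedata.normalize("NFKC", "ｹ"))   # half-width ke   -> "ケ"
```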

Setup

First, download the wheel from the releases page, then install it with:

pip install .\path\to\etl_data_reader_CaptainDario-2.0-py3-none-any.whl

Or install it directly via the release URL:

pip install https://github.com/CaptainDario/ETLCDB_data_reader/releases/download/v2.1.3/etl_data_reader_CaptainDario-2.1.3-py3-none-any.whl

Assuming you have already downloaded the ETLCDB, you have to rename some of the data set folders and files. First, rename the folders like this:

  • ETL8B -> ETL8

  • ETL8G -> ETL9

  • ETL9B -> ETL10

  • ETL9G -> ETL11

Finally, rename all files in the folders to follow a naming scheme like:

  • ETL_data_set\ETLX\ETLX_Y
    (X and Y are numbers)

The ETLCDB website also provides a file called “euc_co59.dat”. This file should be placed in the data set folder at the same level as the data set part folders.
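The folder renames above can be scripted. The helper below is hypothetical (not part of etldr) and assumes the mapping ETL8B -> ETL8, ETL8G -> ETL9, ETL9B -> ETL10, ETL9G -> ETL11:

```python
from pathlib import Path

# Hypothetical helper (not part of etldr): applies the folder renames
# needed by the reader to a freshly downloaded ETLCDB copy.
FOLDER_RENAMES = {
    "ETL8B": "ETL8",
    "ETL8G": "ETL9",
    "ETL9B": "ETL10",
    "ETL9G": "ETL11",
}

def rename_etl_folders(root: str) -> None:
    """Rename the downloaded ETLCDB folders under `root` in place."""
    root_path = Path(root)
    for old, new in FOLDER_RENAMES.items():
        src = root_path / old
        if src.is_dir():
            src.rename(root_path / new)
```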

The folder structure should look like this now:

ETL_data_set_folder (main folder)
|   euc_co59.dat
|
|---ETL1
|       ETL1_1
|         ...
|       ETL1_13
|---ETL2
|       ETL2_1
|         ...
|       ETL2_5
|
|--- ...
|
|---ETL10
|       ETL10_1
|         ...
|       ETL10_5
|---ETL11
        ETL11_1
          ...
        ETL11_50
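After renaming, the layout can be sanity-checked with a small script. This is a sketch, not part of the package; it only checks for the top-level folders and euc_co59.dat shown above:

```python
from pathlib import Path

def find_missing(root: str) -> list:
    """Return the expected top-level data set items missing under `root`."""
    root_path = Path(root)
    missing = []
    if not (root_path / "euc_co59.dat").is_file():
        missing.append("euc_co59.dat")
    for i in range(1, 12):  # ETL1 .. ETL11
        if not (root_path / f"ETL{i}").is_dir():
            missing.append(f"ETL{i}")
    return missing
```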

Usage

Now you can import the package with:

import etldr

To load the data set you need an ETLDataReader-instance.

path_to_data_set = "the/path/to/the/data/set"

reader = etldr.ETLDataReader(path_to_data_set)

where path_to_data_set should be the path to the main folder of your copy of the data set, e.g. “E:/data/ETL_data_set/”.

There are three ways to load data.

Load one data set file

from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.katakana, ETLCharacterGroups.number]

imgs, labels = reader.read_dataset_file(2, ETLDataNames.ETL7, include)

This will load “…ETL_data_set_folder/ETL7/ETL7_2” and store the images and labels that are either katakana or numbers in the variables imgs and labels.

Load one data set part

from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.kanji, ETLCharacterGroups.hiragana]

imgs, labels = reader.read_dataset_part(ETLDataNames.ETL2, include)

This will load all files in the folder “…ETL_data_set_folder/ETL2”, namely …ETL2/ETL2_1, …ETL2/ETL2_2, …, …ETL2/ETL2_5, and store the images and labels that are either kanji or hiragana in the variables imgs and labels.

Load the whole data set

Warning: This will use a lot of memory.

from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]

imgs, labels = reader.read_dataset_whole(include)

This will load all roman and symbol characters from the whole ETLCDB.

Load the whole data set using multiple processes

Warning: This will use a lot of memory.

from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]

imgs, labels = reader.read_dataset_whole(include, 16)

This will load all roman and symbol characters from the whole ETLCDB using 16 processes.
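The process count (16 above) is just an example value. A common heuristic is to leave one core free; a sketch of how one might pick it:

```python
import os

# Use all but one of the available cores; os.cpu_count() can return
# None, so fall back to a safe default before subtracting.
processes = max(1, (os.cpu_count() or 2) - 1)

# The value would then be passed as the second argument:
# imgs, labels = reader.read_dataset_whole(include, processes)
```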

Note: filtering data set entries

As the examples above show, the loading of data set entries can be restricted to certain character groups. The available groups are defined in etl_character_groups.py.

Note: processing the images while loading

All of the above methods have the optional parameters:
resize : Tuple[int, int] = (64, 64)
and
normalize : bool = True
The resize parameter resizes all images to the given size.
The normalize parameter normalizes the grayscale values of the images to the range [0.0, 1.0].

Warning: If those parameters are set to negative values no resizing/normalization will be done.
This will lead to an error if the data set is read with ``read_dataset_whole()``!
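For illustration, normalization presumably maps 8-bit grayscale values into [0.0, 1.0] by dividing by 255. This is a guess at the behavior, not the package's actual code:

```python
# A 1x3 "image" with 8-bit grayscale values.
img = [[0, 128, 255]]

# Divide by the maximum 8-bit value to get floats in [0.0, 1.0];
# 0 maps to 0.0 and 255 maps to 1.0.
normalized = [[pixel / 255.0 for pixel in row] for row in img]
```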

Limitations

This implementation does not provide access to all of the stored data. Currently, one can load:

  • image

  • label of the image

of every ETLCDB entry.

However, this package should be easy to extend to support accessing the other data.

Development notes

Python 3.9 was used for development.

documentation

The documentation was made with Sphinx and m2r. m2r automatically converts this README.md to .rst when the sphinx-build command is invoked in the ‘docs’ folder.
Build the docs (run in the docs folder):

sphinx-build source build

packages

A list of all packages needed for development can be found in ‘requirements.txt’.

testing

Some simple test cases are defined in the tests folder. Testing was only performed on Windows 10.
All tests can be executed with:

python tests\test_etldr.py

Specific tests can be run with:

python tests\test_etldr.py etldr.test_read_dataset_part_parallel

These commands should be executed at the top level of this package.

building the wheel

The wheel can be built with:

python setup.py sdist bdist_wheel

Additional Notes

Pull requests and issues are welcome.

If you open a pull request, make sure to run the tests beforehand.