Getting started¶
ETLCDB_data_reader¶
A Python package for conveniently loading the ETLCDB. The complete documentation, including the API reference, can be found here.
Intro¶
The ETLCDB is a collection of roughly 1,600,000 handwritten characters.
Notably it includes Japanese Kanji, Hiragana and Katakana.
The data set can be found on the ETLCDB website (a registration is needed to download the data set).
Because the data set is stored in a custom data format, it can be hard to load.
This Python package provides an easy way to load the data set and filter its entries.
An example of using this package can be found in my application DaKanji, where it was used for training a CNN to recognize handwritten Japanese characters, numbers, and roman letters.
General information about the data set can be found in the table below.
| name | type | content | res | bit depth | code | samples per label | total samples |
|---|---|---|---|---|---|---|---|
| ETL1 | M-Type | Numbers | 64x63 | 4 | JIS X 0201 | ~1400 | 141319 |
| ETL2 | K-Type | Hiragana | 60x60 | 6 | CO59 | ~24 | 52796 |
| ETL3 | C-Type | Numeric | 72x76 | 4 | JIS X 0201 | 200 | 9600 |
| ETL4 | C-Type | Hiragana | 72x76 | 4 | JIS X 0201 | 120 | 6120 |
| ETL5 | C-Type | Katakana | 72x76 | 4 | JIS X 0201 | ~200 | 10608 |
| ETL6 | M-Type | Katakana | 64x63 | 4 | JIS X 0201 | 1383 | 157662 |
| ETL7 | M-Type | Hiragana | 64x63 | 4 | JIS X 0201 | 160 | 16800 |
| ETL8 (8B) | 8B-Type | Hiragana | 64x63 | 1 | JIS X 0208 | 160 | 157662 |
| ETL9 (8G) | 8G-Type | Hiragana | 128x127 | 4 | JIS X 0208 | 200 | 607200 |
| ETL10 (9B) | 9B-Type | Hiragana | 64x63 | 1 | JIS X 0208 | 160 | 152960 |
| ETL11 (9G) | 9G-Type | Hiragana | 128x127 | 4 | JIS X 0208 | 200 | 607200 |
Note:
The ETL6 and ETL7 parts include half-width katakana, which are stored as roman letters; for example, “ケ” is stored as “ke”. These are automatically converted back by this package.
Full-width numbers and letters are also converted when using the package.
Example: “０” -> “0” and “Ａ” -> “A”
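The package performs this conversion internally; the mapping itself can be illustrated with Unicode NFKC normalization (shown here only as an illustration, not the package's actual code):

```python
import unicodedata

def to_halfwidth(text: str) -> str:
    # NFKC normalization folds full-width (zenkaku) digits and letters
    # into their ASCII equivalents, e.g. "０" -> "0" and "Ａ" -> "A".
    return unicodedata.normalize("NFKC", text)
```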
Setup¶
First download the wheel from the releases page. Then install the wheel with:
pip install .\path\to\etl_data_reader_CaptainDario-2.0-py3-none-any.whl
Or install it directly over HTTPS:
pip install https://github.com/CaptainDario/ETLCDB_data_reader/releases/download/v2.1.3/etl_data_reader_CaptainDario-2.1.3-py3-none-any.whl
Assuming you have already downloaded the ETLCDB, you have to do some renaming of the data set folders and files. First rename the folders like this:

ETL8B -> ETL8
ETL8G -> ETL9
ETL9B -> ETL10
ETL9G -> ETL11

Finally, rename all files in the folders to follow the naming scheme:

ETL_data_set/ETLX/ETLX_Y

(X and Y are numbers)
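The folder renaming can be sketched with `pathlib`, assuming all downloaded parts sit directly in one directory (the mapping follows the part names in the table above; `rename_parts` is a hypothetical helper, not part of the package):

```python
from pathlib import Path

# Downloaded ETLCDB folder names -> names this package expects
# (per the "(8B)"/"(8G)"/"(9B)"/"(9G)" labels in the table above).
RENAMES = {"ETL8B": "ETL8", "ETL8G": "ETL9", "ETL9B": "ETL10", "ETL9G": "ETL11"}

def rename_parts(data_set_dir: str) -> None:
    root = Path(data_set_dir)
    for old, new in RENAMES.items():
        src = root / old
        if src.is_dir():
            src.rename(root / new)
```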
The ETLCDB website also provides a file called “euc_co59.dat”. This file should be placed in the data set folder at the same level as the data set part folders.
The folder structure should look like this now:
```
ETL_data_set_folder (main folder)
|   euc_co59.dat
|
|---ETL1
|       ETL1_1
|       ...
|       ETL1_13
|---ETL2
|       ETL2_1
|       ...
|       ETL2_5
|
|--- ...
|
|---ETL10
|       ETL10_1
|       ...
|       ETL10_5
|---ETL11
        ETL11_1
        ...
        ETL11_50
```
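Before loading, it can save time to verify that the layout matches. A small, hypothetical sanity check (not part of the package):

```python
from pathlib import Path

# Entries expected at the top level of the data set folder.
EXPECTED = ["euc_co59.dat"] + [f"ETL{i}" for i in range(1, 12)]

def missing_entries(data_set_dir: str) -> list:
    # Return the expected top-level entries that are absent.
    root = Path(data_set_dir)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty result means the top-level structure matches the tree above.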
Usage¶
Now you can import the package with:
```python
import etldr
```
To load the data set you need an `ETLDataReader` instance:

```python
path_to_data_set = r"the\path\to\the\data\set"
reader = etldr.ETLDataReader(path_to_data_set)
```

Here `path_to_data_set` should be the path to the main folder of your data set copy.
Example: “E:/data/ETL_data_set/”
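One Windows-specific pitfall when writing the path: in a normal Python string literal, sequences like `\t` and `\n` are escape characters, so use a raw string or forward slashes (illustrative snippet only; the paths are made up):

```python
broken = "E:\ten\numbers"    # "\t" and "\n" silently became tab and newline
ok_raw = r"E:\ten\numbers"   # raw string: backslashes stay literal
ok_fwd = "E:/ten/numbers"    # forward slashes also work on Windows

assert "\t" in broken
assert "\t" not in ok_raw
```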
Now there are basically three ways to load data.
Load one data set file¶
```python
from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.katakana, ETLCharacterGroups.number]
imgs, labels = reader.read_dataset_file(2, ETLDataNames.ETL7, include)
```

This will load “…/ETL_data_set_folder/ETL7/ETL7_2” and store the images and labels that are either katakana or numbers in the variables `imgs` and `labels`.
Load one data set part¶
```python
from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.kanji, ETLCharacterGroups.hiragana]
imgs, labels = reader.read_dataset_part(ETLDataNames.ETL2, include)
```

This will load all files in the folder “…/ETL_data_set_folder/ETL2”, namely …/ETL2/ETL2_1, …/ETL2/ETL2_2, …, …/ETL2/ETL2_5, and store the images and labels that are either kanji or hiragana in the variables `imgs` and `labels`.
Load the whole data set¶
Warning: This will use a lot of memory.
```python
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]
imgs, labels = reader.read_dataset_whole(include)
```
This will load all roman and symbol characters from the whole ETLCDB.
Load the whole data set using multiple processes¶
Warning: This will use a lot of memory.
```python
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]
imgs, labels = reader.read_dataset_whole(include, 16)
```
This will load all roman and symbol characters from the whole ETLCDB using 16 processes.
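The second argument is the number of worker processes. One reasonable way to pick it (a sketch, not guidance from the package itself) is to cap the count at the machine's core count:

```python
import os

# Cap the worker count at 16, but never exceed the available cores.
workers = min(16, os.cpu_count() or 1)
```

The result can then be passed as the second argument: `reader.read_dataset_whole(include, workers)`.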
Note: filtering data set entries¶
As the examples above show, loading of data set entries can be restricted to certain character groups. The available groups are defined in: etl_character_groups.py
Note: processing the images while loading¶
All of the above methods have two optional parameters: `resize : Tuple[int, int] = (64, 64)` and `normalize : bool = True`.
The `resize` parameter resizes all images to the given size, and the `normalize` parameter normalizes the grayscale values of the images to $[0.0, 1.0]$.
Warning:
If those parameters are set to negative values, no resizing/normalization will be done.
Reading the data set with `read_dataset_whole()` will then fail, because the ETLCDB parts have different image resolutions (see the table above) and cannot be combined without resizing.
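Conceptually, the normalization step just rescales 8-bit grayscale values into $[0.0, 1.0]$. A minimal pure-Python sketch (the package itself operates on whole image arrays; `normalize_pixels` is only an illustration):

```python
def normalize_pixels(pixels):
    """Rescale 8-bit grayscale values (0-255) to floats in [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]
```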
Limitations¶
This implementation does not provide access to all of the stored data. Currently, for every ETLCDB entry one can load:

- the image
- the label of the image

However, this package should be easily extendable to support accessing the other data.
Development notes¶
Python 3.9 was used for development.
documentation¶
The documentation was made with Sphinx and m2r.
m2r is used to automatically convert this README.md to .rst. This happens when the `sphinx-build` command is invoked in the ‘docs’ folder.
Build the docs (run from the docs folder):

```shell
sphinx-build source build
```
packages¶
A list of all packages needed for development can be found in ‘requirements.txt’.
testing¶
Some simple test cases are defined in the tests folder.
Testing was only performed on Windows 10.
All tests can be executed with:

```shell
python tests\test_etldr.py
```

Specific tests can be run with:

```shell
python tests\test_etldr.py etldr.test_read_dataset_part_parallel
```
Those commands should be executed on the top level of this package.
Additional Notes¶
Pull requests and issues are welcome.
If you open a pull request, please make sure to run the tests first.