Akshara-Malayalam
Language
Malayalam
Modality
Printed
Description
We create the Akshara-Malayalam dataset by randomly cropping word images from 1,000 document page images. The pages are taken from multiple books and scanned on a flatbed scanner at 600 DPI. We manually annotate the ground truth transcription of each cropped word image. The dataset consists of 100,019 word images and their corresponding ground truth transcriptions. We divide it into Training, Validation, and Test sets of 80,146, 9,893, and 9,980 word images, respectively, each with its corresponding ground truth transcriptions. The training set contains 41,961 unique Malayalam words.
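As a quick arithmetic check on the split sizes quoted in the description, the short Python sketch below sums them and prints each split's share of the whole. The dictionary keys (`train`, `val`, `test`) are labels chosen here for illustration; only the three counts come from the description.

```python
# Split sizes as stated in the description (word images per split).
# The keys are illustrative labels, not names used by the dataset.
split_sizes = {"train": 80_146, "val": 9_893, "test": 9_980}

# Total number of word images across the three splits.
total = sum(split_sizes.values())

# Fraction of the dataset that falls into each split.
fractions = {name: size / total for name, size in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} images ({frac:.1%})")
print("total:", total)
```

The three counts add up to the stated total of 100,019, which corresponds to roughly an 80/10/10 split.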
Training Set:
train.zip contains a folder named “images” with 80,113 word-level images, a “train_gt.txt” file listing each image name and its ground truth transcription separated by a tab character, and a “vocabulary.txt” file listing the 41,961 unique words in the Training set.
Validation Set:
val.zip contains a folder named “images” with 9,787 word-level images and a “val_gt.txt” file listing each image name and its ground truth text separated by a tab character.
Test Set:
test.zip contains a folder named “images” with 10,113 word-level images and a “test_gt.txt” file listing each image name and its ground truth text separated by a tab character.
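The ground truth files described above can be read with a few lines of Python. The sketch below is a minimal example under several assumptions that the description does not confirm: the archives have already been extracted, the ground truth files are UTF-8 encoded, each line holds exactly one image name followed by a tab and the transcription, and the image name is a file name relative to the “images” folder. The function names (`parse_gt_lines`, `load_split`, `build_vocabulary`) are hypothetical helpers, not part of the dataset.

```python
from pathlib import Path


def parse_gt_lines(lines):
    """Parse 'image_name<TAB>transcription' lines into (name, text) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so the transcription is kept whole.
        name, _, text = line.partition("\t")
        pairs.append((name.strip(), text.strip()))
    return pairs


def load_split(split_dir, gt_filename):
    """Return (image_path, transcription) pairs for one extracted split.

    Assumes split_dir contains the 'images' folder and the ground truth
    file, and that the file is UTF-8 encoded (both are assumptions).
    """
    split_dir = Path(split_dir)
    with open(split_dir / gt_filename, encoding="utf-8") as f:
        pairs = parse_gt_lines(f)
    return [(split_dir / "images" / name, text) for name, text in pairs]


def build_vocabulary(pairs):
    """Return the sorted list of unique transcriptions in a list of pairs."""
    return sorted({text for _, text in pairs})
```

For example, if train.zip was extracted into a directory named `train` (a hypothetical path), `load_split("train", "train_gt.txt")` would return the image paths paired with their transcriptions, and `build_vocabulary` applied to that result could be compared against the contents of “vocabulary.txt”.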
Sample Word Level Images from Training Set (image and ground truth pairs; images not shown)
License
This dataset is released under the CC BY 4.0 license. For more details, please see the data_license.doc file.