Dataset

For this competition, our benchmark dataset, IIIT-INDIC-HW, is used for training purposes only. The dataset contains word images in ten Indic scripts: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Odia, Malayalam, Tamil, Telugu, and Urdu. Participants may use additional data for training, but they must provide proper information about that data.

Script / Language   Training Set   Validation Set   Test Set
Bengali             82554          1000             5000
Devanagari          69583          1000             5000
Gujarati            82563          1000             5000
Gurumukhi           81042          1000             5000
Kannada             73517          1000             5000
Malayalam           85270          1000             5000
Odia                73400          1000             5000
Tamil               75736          1000             5000
Telugu              80637          1000             5000
Urdu                71207          1000             5000

Statistics of Training and Validation Datasets

Input and Output Format Specifications

The training and validation sets contain word images (in ‘.jpg’ format); the corresponding ground truth transcriptions are provided in ‘train.txt’ and ‘val.txt’, respectively. Each line of ‘train.txt’ and ‘val.txt’ contains the path of an image and the ground truth transcription of that word image, separated by white space. The output should be saved as ‘script name_result.txt’ (e.g., bengali_result.txt), where each line contains the name of a test word image and the corresponding prediction, separated by white space.
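The file formats described above can be handled with a few lines of Python. The following is a minimal sketch, not an official loader: the function names `read_annotations` and `write_results` are hypothetical, and it assumes the files are UTF-8 encoded, that image paths contain no white space, and that a single space is an acceptable separator in the result file.

```python
from pathlib import Path


def read_annotations(list_file):
    """Parse a 'train.txt' / 'val.txt' style file into (image_path, transcription) pairs."""
    pairs = []
    # Assumption: the annotation files are UTF-8 encoded.
    with open(list_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first run of white space only; the rest is the transcription.
            parts = line.split(maxsplit=1)
            image_path = parts[0]
            text = parts[1] if len(parts) > 1 else ""
            pairs.append((image_path, text))
    return pairs


def write_results(script_name, predictions, out_dir="."):
    """Write '<script name>_result.txt' with one 'image_name prediction' pair per line."""
    out_path = Path(out_dir) / f"{script_name}_result.txt"
    with open(out_path, "w", encoding="utf-8") as handle:
        for image_name, predicted_text in predictions:
            # Assumption: a single space separates the image name from the prediction.
            handle.write(f"{image_name} {predicted_text}\n")
    return out_path
```

For example, `write_results("bengali", predictions)` would produce `bengali_result.txt` in the current directory, where `predictions` is a list of `(image_name, predicted_text)` tuples produced by a recognizer.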


Training Dataset

The training dataset can be downloaded separately for each script: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Odia, Tamil, Telugu, and Urdu.


Validation Dataset

The validation dataset is available for download.


Test Dataset

The test dataset can be downloaded separately for each script: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Odia, Tamil, Telugu, and Urdu.