IHTR 2023

Dataset

Our benchmark dataset, IIIT-INDIC-HW, is provided for this competition for training purposes only. The dataset contains word images in ten Indic scripts: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Odia, Malayalam, Tamil, Telugu, and Urdu. Participants may use additional data for training, but they must provide proper information about any such data.

Script / Language	Training Set	Validation Set	Test
Bengali	82554	1000	5000
Devanagari	69583	1000	5000
Gujarati	82563	1000	5000
Gurumukhi	81042	1000	5000
Kannada	73517	1000	5000
Malayalam	85270	1000	5000
Odia	73400	1000	5000
Tamil	75736	1000	5000
Telugu	80637	1000	5000
Urdu	71207	1000	5000

Statistics of the training, validation, and test sets (number of word images per script)

Input and Output Format Specifications

The training and validation sets contain word images (in ‘.jpg’ format). The corresponding ground truth transcriptions are provided in ‘train.txt’ and ‘val.txt’, respectively. Each line of these files gives the path of a word image followed by its ground truth transcription, separated by white space. The output should be saved as ‘script name_result.txt’, with the script name as the prefix (e.g., bengali_result.txt). Each line of the output file should contain the name of a test word image and the corresponding prediction, separated by white space.
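The file formats described above can be handled with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes the text files are UTF-8 encoded, that each line holds exactly one sample, and that image paths contain no spaces. The function names `read_annotations` and `write_results` are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def read_annotations(list_file):
    """Parse a 'train.txt' / 'val.txt' style file into (image_path, transcription) pairs.

    Assumption: UTF-8 text, one sample per line, image path first.
    """
    pairs = []
    with open(list_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split only on the first run of white space, so the path is
            # separated from the transcription that follows it.
            parts = line.split(maxsplit=1)
            image_path = parts[0]
            text = parts[1] if len(parts) > 1 else ""
            pairs.append((image_path, text))
    return pairs


def write_results(script_name, predictions, out_dir="."):
    """Write predictions to '<script_name>_result.txt', one 'image_name prediction' per line."""
    out_path = Path(out_dir) / f"{script_name}_result.txt"
    with open(out_path, "w", encoding="utf-8") as f:
        for image_name, text in predictions:
            f.write(f"{image_name} {text}\n")
    return out_path
```

For example, `write_results("bengali", predictions)` would produce bengali_result.txt in the current directory, where `predictions` is a list of (image name, predicted text) pairs produced by a recognizer.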

Training Dataset

The training dataset can be downloaded for each script: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Odia, Tamil, Telugu, and Urdu.

Validation Dataset

The validation dataset is available for download.

Test Dataset

The test dataset can be downloaded for each script: Bengali, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Odia, Tamil, Telugu, and Urdu.