Dataset

Existing Non-Indic Handwritten Datasets

IAM Dataset: IAM
GNHK Dataset: GNHK
SCUT-HCCDoc: SCUT-HCCDoc
All these non-Indic handwritten datasets can be used to pre-train word detection models.



Existing Indic Handwritten Datasets

IIIT-Indic-HW-UC Dataset: IIIT-Indic-HW-UC
IHWWR-1.0 Dataset: IHWWR-1.0
PLHWTR-1.0 Dataset: PLHWTR-1.0
IHTR 2023 Dataset: IHTR-2023
IHTR 2022 Dataset: IHTR-2022
All these Indic handwritten datasets can be used to pre-train word recognition models.



Competition Dataset: IHDR-2025 Dataset


Sample Images

IMG IMG IMG

IMG IMG IMG

Fig.1 Presents examples of handwritten document images in several Indic languages, captured under uncontrolled conditions. These camera-captured images display a range of challenging characteristics, including blurred text, overexposure, perspective distortion, varied illumination, unwanted background elements, low-resolution text, shadowed text, and oriented text, among other complexities.



Character List of Script/Language

Character set can be downloaded from these links Bengali Gujarati Hindi Kannada Malayalam Marathi Odia Punjabi Tamil Telugu Special Charlist .

Please note that the special character list provided is not exhaustive and may not include all characters present in word-images. It's common practice to designate a special character, not part of the target language script, as a "don't care" character. This helps the model ignore non-essential characters, which may improve accuracy by reducing the character set handled by the OCR model.



Dataset for Task A: Handwritten Text Word Detection

Validation Set

Validation sets can be downloaded from these links Bengali Gujarati Hindi Kannada Malayalam Marathi Odia Punjabi Tamil Telugu . The validation set includes handwritten page images (e.g., 120_9135.jpg) along with their corresponding ground truth files (e.g., 120_9135_wbb_gt.txt). Each ground truth file contains word-level bounding box information in the format: x, y, width, height — with values separated by tabs, and listed in reading order. Participants may use the validation set for fine-tuning their models.

Test Set

Test sets can be downloaded from these links Bengali Gujarati Hindi Kannada Malayalam Marathi Odia Punjabi Tamil Telugu . The test set includes only handwritten page images (e.g., 328_39691.jpg).

Output Format Specification

The output for Task-A should be saved in a file named as image name.txt (e.g., 328_39691.txt). Each file must include word-level bounding box information in the format: x, y, width, height — with values separated by tabs and arranged in reading order. The output format must exactly match that of the validation set.



Dataset for Task B: Page Recognition and Reading

Validation Set

Validation sets can be downloaded from these links Bengali Gujarati Hindi Kannada Malayalam Marathi Odia Punjabi Tamil Telugu . The validation set consists of handwritten page images (e.g., 120_9135.jpg) and their corresponding ground truth files (e.g., 120_9135.txt). Each ground truth file provides text transcriptions arranged in reading order. Participants are encouraged to use the validation set for model fine-tuning.

Test Set

Test sets can be downloaded from these links Bengali Gujarati Hindi Kannada Malayalam Marathi Odia Punjabi Tamil Telugu . The test set includes only handwritten page images (e.g., 328_39691.jpg).

Output Format Specification

The output for Task-B should be saved in a file named as image name.txt (e.g., 328_39691.txt). Each file must contain the recognized text arranged in reading order.

Participants may use any other public datasets for training purposes. They must include the names of these additional datasets in their report.

The dataset is freely available for academic and research purposes.