NLTM OCR

Call for Benchmarking

Call for Benchmarking Indic Dataset for Printed, Handwritten and Scene Text

India is a linguistically diverse country, with a multitude of languages spoken across the nation. It officially recognizes 22 of them as Scheduled Languages, which hold special status for various administrative and educational purposes. The 22 officially recognized languages are: Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu.

The 13 major Indian languages are Assamese, Bengali, Hindi, Kannada, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Tamil, Telugu and Urdu. Together with MeitY and Bhashini, our team is building an Indic language dataset for research purposes. It covers handwritten, printed and scene text images drawn from books, newspapers, online articles, government documents, handwritten samples and real-time scene board images.

Benchmarking is a critical process in the development of natural language processing (NLP) and text recognition systems. In the context of Indic languages, benchmarking plays a vital role in assessing and improving the accuracy and efficacy of these systems. With the linguistic diversity and unique characteristics of Indic languages, creating robust benchmarks is essential.

Contributors are encouraged to upload annotated handwritten, printed and scene text data for Indic languages.
Ethical Considerations: Contributors should upload only documents that are free of copyright issues, and should obtain consent where necessary.

For feedback and support, contributors and users of the dataset can contact nltmocriiith@gmail.com.