Introduction
Rich textual content found in natural settings contains valuable information that greatly enhances understanding of the environment in contemporary times. Such text is utilized for tasks like image search, translation, transliteration, assistive technologies (especially for the visually impaired), autonomous navigation, and more.
Scene text recognition is increasingly crucial, with its solution promising advancements in various downstream tasks like above. Despite significant progress in this field, there are areas for further enhancement. These include developing models that can handle diverse languages, fonts, layouts, and styles, as well as creating solutions resilient to text-related image imperfections such as blurriness, occlusion, and uneven illumination. Researchers have sought to tackle these challenges by curating datasets tailored to specific problems, each highlighting distinct features and representing subsets of real-world challenges.
Because Indian language scripts are visually more complex, and their output space is much larger than English languages, not all Latin STR models can mimic performance in Indian STR solutions. STR solutions have not progressed in the case of Indian languages, which are spoken by 17% of the world population, due to a lack of real datasets and models that are better equipped to handle the inherent complexities of the languages. Non-Latin languages have made less progress, and existing Latin STR models need to generalize better to different languages. Through this competition we aim to tackle this challenge within the scene text domain of Document Analysis and Recognition community.
In line with the goal of achieving accuracy in scene text recognition across diverse languages, this competition targets nearly all Indian languages. These languages, constituting approximately 17% of the world’s population, share syntactic and semantic similarities and serve as crucial means of communication.
The challenge comprises only one task - cropped word image recognition. The participants have to predict the text in all the cropped word images of the test set in a total of 10 targeted Indian languages.