Dataset Details

IndicSTR12-Kannada

Language

Kannada

Modality

Scene Text

Description

The images in this dataset were crawled from Google Images using keyword-based searches chosen to cover the everyday settings in which Indic-language text appears in natural scenes, for example wall paintings, railway stations, signboards, shop/temple/mosque/gurudwara name-boards, advertisement banners, political protests, and house plates. Because the images were crawled from a search engine, they come from many sources and were captured under varied conditions. The curated images exhibit blur, iconic and non-iconic text, low resolution, occlusion, curved text, and perspective distortion due to non-frontal viewpoints. Words within the images are segmented automatically to obtain individual word images, and the ground-truth transcription of each word is then created manually. This dataset consists of 1080 word images and their corresponding ground-truth transcriptions. It is divided into training and test sets comprising 810 and 270 word images, respectively, each with its ground-truth transcriptions. There are 631 unique Kannada words in the training set.

Training Set:

train.zip contains a folder named “images” with 1575 word-level images, a file “train_gt.txt” that lists each image name and its ground-truth transcription separated by a tab character, and a file “list_of_unique_words.txt” that lists the 998 unique words in the training set.
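The tab-separated layout of “train_gt.txt” can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names are hypothetical, and the UTF-8 encoding (with an optional byte-order mark) is an assumption, since the page does not state the file encoding. Lines are split on the first tab only, because a transcription may itself contain a space.

```python
def parse_gt_lines(lines):
    """Parse 'image_name<TAB>transcription' lines into (name, text) pairs."""
    pairs = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only; the transcription may contain spaces.
        name, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        pairs.append((name.strip(), text.strip()))
    return pairs


def load_gt(path, encoding="utf-8-sig"):
    """Read a ground-truth file from disk (encoding is an assumption)."""
    with open(path, encoding=encoding) as f:
        return parse_gt_lines(f)


def unique_words(pairs):
    """Return the sorted set of distinct transcriptions."""
    return sorted({text for _, text in pairs})
```

Once train.zip has been extracted, `unique_words(load_gt("train_gt.txt"))` gives a word list that can be compared against “list_of_unique_words.txt”; the exact path depends on where the archive was unpacked.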

Test Set:

test.zip contains a folder named “images” with 526 word-level images and a file “test_gt.txt” that lists each image name and its ground-truth text separated by a tab character.
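The archive can also be inspected without extracting it. The sketch below uses Python's standard `zipfile` module to read the ground-truth file and to list any image named there that has no file under an “images” folder. The function names are hypothetical, the internal folder layout of the archive is not documented on this page (so members are matched by base name), and UTF-8 encoding is assumed.

```python
import io
import zipfile


def read_gt_from_zip(zf, gt_name="test_gt.txt"):
    """Return (image_name, text) pairs from the ground-truth file in an open ZipFile."""
    # Match by base name because the archive's folder layout is an unknown.
    members = [n for n in zf.namelist() if n.rsplit("/", 1)[-1] == gt_name]
    if not members:
        raise FileNotFoundError(f"{gt_name} not found in archive")
    text = zf.read(members[0]).decode("utf-8-sig")  # encoding is an assumption
    pairs = []
    for line in text.splitlines():
        name, sep, label = line.partition("\t")
        if sep:  # ignore lines without a tab separator
            pairs.append((name.strip(), label.strip()))
    return pairs


def missing_images(zf, pairs):
    """List ground-truth image names with no matching file in an 'images' folder."""
    present = {
        n.rsplit("/", 1)[-1]
        for n in zf.namelist()
        if "images/" in n and not n.endswith("/")
    }
    return [name for name, _ in pairs if name.rsplit("/", 1)[-1] not in present]
```

A typical call would be `with zipfile.ZipFile("test.zip") as zf: pairs = read_gt_from_zip(zf)`, followed by `missing_images(zf, pairs)` inside the same `with` block; an empty list means every transcription has an image.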

Downloads

Train Test

Sample Word Level Images from Training Set

Image	Ground Truth
	ಹೊಸಪೇಟೆ
	ಕೃಷ್ಣಾ
	ಹೆಜ್ಜೆ
	ಅಂಜನಾದ್ರಿ
	ಪ್ರಿಂಟರ್ರ್ಸ್
	ತುಳುನಾಡಿನ
	ನಿಲ್ಲಾಣ
	ಸುಸ್ವಾಗತ
	ದ್ವಿತೀಯ
	ನಮ್ಮ
	ಸ್ವಚ್ಛತೆಯನ್ನು
	ಕೊಳ್ಳುವ
	ನಗರ
	ಮೈಸೂರು
	ಆಪ್ಟಿಕ್ಸ್
	ಕರಿಬಸಪ್ಪ
	ಅಧ್ಯಕ್ಷರು
	ಮೈಸೂರು
	ಸರ್.ಎಂ.ವಿಶ್ವೇಶ್ವರಯ್ಯ
	ಚಂದ್ರಕಲಾ
	ಮನದ
	ಆ
	ದೇವನು
	ಪ್ರತಿಬಿಂಬ
	ಪಂಚಾಯತ್
	ವರ
	ಇಲ್ಲಿಯವರೆಗೆ
	ಬಡ
	ಧನ್ಯನಾದೆ
	ಕೊಡಲಾಗುತ್ತದೆ
	ಟೈಲರ್
	ಕೋವಿಡ್
	ಕಲ್ಪತರು
	ನಿರ್ಗಮನ
	ಸಮೃದ್ಧಿ
	ಇವರಿಗೆ
	ಕೆ.ರಾಕೇಶ್
	ನೂತನ
	ಮತ
	ಕಡೆಗೆ
	ವಿರೋಧಿ
	ವಿಕ್ರಮ್
	ರಾಷ್ಟ್ರ
	ನವೆಂಬರ್
	ಅನಿರ್ಧಿಷ್ಟಾವಧಿ
	ಕರ್ನಾಟಕ
	ಕಾರ್
	ಅಕ್ಸೆಸರೀಸ್
	ಹಳಿಯಾಳ
	ಭಾರತೀಯ
	ಸರ್ವಿಸ್
	ಸ್ಟಿಚ್
	ನಾಡ
	ಪ್ರಭು
	ಬೆಂಗಳೂರು
	ತರಹದ
	ಮತ್ತೆ
	ಪ್ರಜಾಪ್ರಭುತ್ವ
	ಎಸ್ಎಲ್ಎಸ್
	ವಿಸ್ತರಣೆ
	ಆರ್ಟ್ಸ್
	ಕೆನರಾ
	ರಕ್ಷಣಾ
	ನಿನ್ನ
	ಆಸ್ತಿಯೂ
	ರಾಮಚಂದ್ರ
	ರುಚಿ
	ಮಂಜುಳ
	ಚಿoಪಾಂಜಿ
	ಮಂಗಳೂರು
	ಕರ್ನಾಟಕ
	ಅರಣ್ಯ
	ಧರಣಿ
	ನಂಜೇಗೌಡ
	ಸಂಚಿತ
	ಬೆಂಗಳೂರು
	ಪ್ರಾಧಿಕಾರ
	ವಿಮಾನ
	ಬೆಂಗಳೂರು
	ಪೊಲೀಸ್
	ಮುಲಾನಿ
	ಎಚ್ಚರ!
	ಹೋಟೆಲ್
	ನಮ್ಮೊಂದಿಗೆ
	ಉಮಾ
	ಪಾದಚಾರಿ
	ಶ್ರೀ ಕೆಂಪರಾಜು
	ಮುಕ್ತಾಯ
	ಕನ್ನಡಿಗರ
	ವೀಕ್ಷಿಸಲು
	ಭಾವಪೂರ್ಣ
	ದಿ:-
	ಕಾವೇರಿ
	ಹಿಂದು
	ಚಂದ್ರೋದಯ
	ಇ.ಡಬ್ಲ್ಯೂ.ಎಸ್
	ಫ್ಯಾಷನ್
	ಫ್ಯಾಷನ್
	ನಂದಿ
	ತಾಲೂಕಿನ

Citation

If you use this dataset, please cite the following papers:

@inproceedings{mathew2017benchmarking, 
  title={Benchmarking scene text recognition in Devanagari, Telugu and Malayalam}, 
  author={Mathew, Minesh and Jain, Mohit and Jawahar, CV}, 
  booktitle={2017 14th IAPR international conference on document analysis and recognition (ICDAR)}, 
  volume={7}, 
  pages={42--46}, 
  year={2017}, 
  organization={IEEE} 
} 

@inproceedings{gunna2021transfer, 
  title={Transfer learning for scene text recognition in Indian languages}, 
  author={Gunna, Sanjana and Saluja, Rohit and Jawahar, CV}, 
  booktitle={International Conference on Document Analysis and Recognition}, 
  pages={182--197}, 
  year={2021}, 
  organization={Springer} 
} 

@inproceedings{lunia2023indicstr12, 
  title={IndicSTR12: A Dataset for Indic Scene Text Recognition}, 
  author={Lunia, Harsh and Mondal, Ajoy and Jawahar, CV}, 
  booktitle={International Conference on Document Analysis and Recognition}, 
  pages={233--250}, 
  year={2023}, 
  organization={Springer} 
}

License

This dataset is licensed under CC BY 4.0. For more details, please see the data_license.doc file.