Implementation of Hindi OCRs and Applications for Parliament Debates

Hindi OCR has been implemented for the Lok Sabha by the members of the OCR project consortium: IIIT Hyderabad, IIT Delhi, IIT Jodhpur, IIT Bombay, CDAC Noida, and Punjabi University Patiala. The Lok Sabha proceedings are available in scanned (PDF/image) format and are mainly in Hindi. The Parliament Digital Library (PDL) provides archived parliamentary documents at eparlib.nic.in. Because the data is available only as images, text extraction, searching, editing, indexing, and other text-processing techniques cannot be applied to the documents directly. Manual transcription is time-consuming and error-prone, so an OCR system was required to process the formatted scanned documents and convert their content into Unicode text.

The project team, led by IIIT Hyderabad, developed high-accuracy recognisers for printed, handwritten, and scene text for all 22 scheduled Indian languages. This will open up opportunities for technologies built on Indian language OCR systems. The technologies will be made available to start-ups and industries, State and Central Governments, banks, service centres, NGOs, researchers, and students.

The IIIT Hyderabad group has also developed OCR systems for 13 Indian languages under the “National Language Translation Mission (Bhashini)”, funded by the Ministry of Electronics and Information Technology (Government of India). They plan to deploy the OCR products in different organizations according to their requirements, as use-case implementations whose user feedback will be used to improve the OCR.

Basic Architectural features:

“National Language Translation Mission (Bhashini)”

OCR features for PDL

  • Making the scanned content available to CDAC/Consortium [PDL]
  • Running the consortium-developed Hindi OCR on the received scanned documents (approximately 15 lakh, i.e. 1.5 million, pages) [CDAC/Consortium]
  • Converting the OCRed documents to PDF/A format and transferring them to PDL [CDAC/Consortium]
  • This does not involve any manual correction or post-editing
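
The workflow listed above (receive scans, run OCR, package the output, no manual correction) can be pictured as a simple batch loop. The following Python sketch is only an illustration under stated assumptions: the names run_batch, write_text_sidecar, Recogniser, and Packager are hypothetical and do not come from the consortium's software, the recogniser is passed in as a plain function, and the packaging step writes a UTF-8 text file as a stand-in for the real PDF/A conversion.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List
import unicodedata

# Hypothetical stage signatures; the actual consortium tools are not
# described in the text, so both stages are injected as plain callables.
Recogniser = Callable[[Path], str]             # scanned page -> Unicode text
Packager = Callable[[Path, str, Path], Path]   # (scan, text, out_dir) -> output file


@dataclass
class PageResult:
    source: Path   # scanned page that was processed
    output: Path   # file produced by the packaging stage
    n_chars: int   # number of Unicode code points recognised


def write_text_sidecar(scan: Path, text: str, out_dir: Path) -> Path:
    """Stand-in packager: stores the recognised text as a UTF-8 file.

    A real deployment would build a searchable PDF/A here instead.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / (scan.stem + ".txt")
    out.write_text(text, encoding="utf-8")
    return out


def run_batch(scans: Iterable[Path], recognise: Recogniser,
              package: Packager, out_dir: Path) -> List[PageResult]:
    """Run OCR and packaging on every scan, with no manual correction step."""
    results = []
    for scan in sorted(scans):
        # Normalise to NFC so that search and indexing see one canonical form.
        text = unicodedata.normalize("NFC", recognise(scan))
        out = package(scan, text, out_dir)
        results.append(PageResult(scan, out, len(text)))
    return results
```

In this sketch the OCR engine and the output format can be swapped without touching the loop, which mirrors the division of work between the party supplying the scans and the party running the OCR.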

The current use-case implementation for Hindi OCR is being carried out by CDAC Noida (front-end and execution agency), IIIT Hyderabad, and Punjabi University Patiala.