Implementation of Hindi OCRs and Applications for Parliament Debates

Implementation of Hindi OCR for Parliament Debates under the project titled “OCRs and Applications in Indian Languages”, National Language Translation Mission (NLTM) - BHASHINI,

funded by the Ministry of Electronics and Information Technology (Government of India).

Objective of the OCR project

One of the objectives of the project is to develop robust recognizers for printed text in scanned documents in 13 Indian languages.

The first version of Hindi OCR has been developed under the project.

The OCR products are planned to be deployed in different organizations according to their requirements, as use-case implementations, so that the OCR can be improved based on user feedback.

Use-Case: Implementation of Hindi OCR for Parliament Debates

Statement: The Parliament Digital Library (PDL) provides archived information about various parliamentary documents and debates at eparlib.nic.in. The proceedings of the Lok Sabha are available in scanned (PDF/image) format and are in Hindi. Because the data is in image format, text extraction, searching, editing, indexing and other text-processing techniques cannot be applied to the documents.

Hence the debates cannot be searched, as they are in a non-Unicode format; they will be converted into a searchable Unicode format.

An OCR system is required to process the formatted scanned document and to generate the Unicode text output.
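To illustrate why Unicode output makes the debates searchable, the following is a minimal sketch in Python. It assumes the OCR output is already available as one Unicode string per page. The sample page texts, the function names `normalize`, `build_index` and `search`, and the choice of NFC normalization are illustrative assumptions and do not describe the project's actual system.

```python
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Normalize to NFC so equivalent Devanagari sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def build_index(pages: dict[int, str]) -> dict[str, set[int]]:
    """Map each whitespace-separated word to the page numbers that contain it."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in normalize(text).split():
            # Strip common punctuation, including the Devanagari danda.
            word = word.strip(".,;:!?\"'()।")
            if word:
                index[word].add(page_no)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Return the sorted page numbers on which the query word occurs."""
    return sorted(index.get(normalize(query), set()))


# Hypothetical OCR output for two pages (not real debate text).
pages = {
    1: "लोक सभा की कार्यवाही आरंभ हुई।",
    2: "सभा में विधेयक पर चर्चा हुई।",
}
index = build_index(pages)
print(search(index, "सभा"))     # pages containing the word
print(search(index, "विधेयक"))
```

The same lookup is not possible on the scanned images themselves, because an image contains pixels rather than character codes that can be compared.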

Dataset

The project entails the conversion of an estimated 10-15 lakh (1 to 1.5 million) pages, along with related services.

The current use-case implementation for Hindi OCR is being carried out by CDAC Noida (front-end and execution agency), IIIT Hyderabad and Punjabi University, Patiala.

Consortium members for the OCR project are IIIT Hyderabad, IIT Delhi, IIT Jodhpur, IIT Bombay, CDAC Noida and Punjabi University, Patiala.

Basic architectural diagram

How OCR works

Only three steps are necessary to digitize a Hindi document:

Step 01: Scan a Hindi document or open a scanned document.

Step 02: Let HindiOCR recognize the document.

Step 03: Export the digitized and editable Hindi text to an office program.
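The three steps above can be sketched as a small pipeline in Python. This is only an illustration under stated assumptions: `load_scan`, `export_text`, `digitize` and the `Recognizer` type are hypothetical names, and `fake_recognizer` merely stands in for the actual Hindi OCR engine, whose interface is not described here.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical interface: a recognizer takes image bytes and returns Unicode text.
Recognizer = Callable[[bytes], str]


def load_scan(path: Path) -> bytes:
    """Open a scanned document and return its raw bytes."""
    return path.read_bytes()


def export_text(text: str, out_path: Path) -> Path:
    """Write the recognized text as UTF-8 so it can be opened and edited."""
    out_path.write_text(text, encoding="utf-8")
    return out_path


def digitize(scan_path: Path, out_path: Path, recognize: Recognizer) -> Path:
    """Run the workflow: open the scan, recognize it, export the text."""
    image = load_scan(scan_path)
    text = recognize(image)
    return export_text(text, out_path)


def fake_recognizer(image: bytes) -> str:
    # Placeholder only: a real engine would analyse the image bytes.
    return "लोक सभा वाद-विवाद"


with tempfile.TemporaryDirectory() as tmp:
    scan = Path(tmp) / "page_001.png"
    scan.write_bytes(b"placeholder image bytes")  # stands in for a real scan
    out = digitize(scan, Path(tmp) / "page_001.txt", fake_recognizer)
    exported = out.read_text(encoding="utf-8")

print(exported)
```

Passing the recognizer in as a parameter keeps the opening and export steps independent of whichever OCR engine is eventually used.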