How to get started

OCRC is a system for the post-correction of OCR-generated text. This guide helps you create a new project for correcting an OCRed document and explains the basic functionality of the system.

Data for input

The OCRC system works on the output of the ABBYY FineReader OCR engine (subsequently called FR Engine). The FR Engine can save OCR results in an XML format that provides details about the recognition process and, in particular, image coordinates for every character it recognized. OCRC enables you to compare your OCR results directly to the source images, so it needs this kind of input together with the source images.
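To illustrate the kind of information this XML carries, here is a minimal Python sketch that reads per-character coordinates from a small, made-up fragment. It assumes the charParams element with l, t, r and b attributes as used in ABBYY FineReader XML exports. The sample fragment and the function name read_char_boxes are hypothetical and are not part of OCRC. Real files are namespaced and carry many more attributes.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only; real FR Engine output is
# namespaced and considerably richer.
SAMPLE = """<page>
  <line>
    <charParams l="10" t="20" r="18" b="34">O</charParams>
    <charParams l="19" t="20" r="27" b="34">C</charParams>
    <charParams l="28" t="20" r="36" b="34">R</charParams>
  </line>
</page>"""


def read_char_boxes(xml_text):
    """Return a list of (character, (left, top, right, bottom)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        # Strip a possible "{namespace}" prefix so the sketch works with
        # or without a namespace declaration.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag != "charParams":
            continue
        coords = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
        boxes.append((elem.text or "", coords))
    return boxes


char_boxes = read_char_boxes(SAMPLE)
```

Each entry pairs a recognized character with its rectangle on the page image, which is the information needed to show the image region next to the recognized text.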

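Before you create a new project in the OCRC system, make sure your data is available in the following form: one folder containing the XML output of the FR Engine, and one folder containing the corresponding source images.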

Note that, while the names and locations of both folders are up to you, the file extensions .xml and .tif are mandatory.
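Before starting the wizard, it can be handy to check the two folders with a small script. The following Python sketch is one possible check, not a part of OCRC. The function name check_project_files is hypothetical, and the pairing of XML and image files by identical base name is an assumption that this guide does not state.

```python
from pathlib import PurePath


def check_project_files(xml_names, tif_names):
    """Report unexpected extensions and unpaired base names.

    xml_names and tif_names are the file names found in the XML folder
    and in the image folder, respectively.
    """
    xml_paths = [PurePath(name) for name in xml_names]
    tif_paths = [PurePath(name) for name in tif_names]

    # The guide requires the extensions .xml and .tif.
    wrong_extension = sorted(
        p.name for p in xml_paths if p.suffix != ".xml"
    ) + sorted(p.name for p in tif_paths if p.suffix != ".tif")

    xml_stems = {p.stem for p in xml_paths if p.suffix == ".xml"}
    tif_stems = {p.stem for p in tif_paths if p.suffix == ".tif"}

    return {
        "wrong_extension": wrong_extension,
        # Assumption: a page's XML and image share the same base name.
        "xml_without_tif": sorted(xml_stems - tif_stems),
        "tif_without_xml": sorted(tif_stems - xml_stems),
    }


report = check_project_files(
    ["p1.xml", "p2.xml", "notes.txt"],
    ["p1.tif", "p3.tiff"],
)
```

With real folders, the two lists could come from os.listdir called on each folder.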

The New Project Wizard, and how to save and open existing projects

Create a new project

Once you have prepared your data in the form explained above, creating a new project for post-correction is straightforward. Choose File > New Project in the menu, and a wizard will ask you to specify both the XML folder and the TIF folder. After the wizard has finished, the system pre-processes your document, which can take several minutes depending on the size of the document and your settings. When pre-processing is finished, the first page of your document appears on the screen, and you are ready to start the correction process.

Save/Open existing projects

Projects created with the above procedure can be saved to disk and opened again at any time using the standard commands Open Project, Save and Save as in the File menu.

The Text View

The tool offers two basic modes for correcting errors. The Text View presents your document in its standard reading order: page by page, the contents of the document can be viewed and corrected. Each word of the OCR result is presented together with the corresponding snippet of the original image, which makes direct comparison and verification much faster and more reliable. In addition, another pane displays the complete image of the current page and highlights the currently selected word with a box. This helps you keep your orientation and decipher regions where the snippets are hard to read.
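Because the input provides coordinates for every character, the box around a word can be derived from the boxes of its characters. The following Python sketch shows one way to do this, as the smallest rectangle that encloses all character boxes. The function name word_box and the sample coordinates are hypothetical, and this is not necessarily how OCRC computes its snippets or highlight boxes.

```python
def word_box(char_boxes):
    """Return the smallest (left, top, right, bottom) box enclosing all
    character boxes, each given as (left, top, right, bottom)."""
    if not char_boxes:
        raise ValueError("a word needs at least one character box")
    lefts, tops, rights, bottoms = zip(*char_boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Three characters with slightly different heights (made-up values).
box = word_box([(10, 20, 18, 34), (19, 18, 27, 34), (28, 20, 36, 36)])
```

The resulting rectangle could be used both to cut a word snippet out of the page image and to draw the highlight box on the full page.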

Spot suspicious words: As in word processing programs, words that are considered erroneous are underlined in red. To explicitly remove the 'suspicious' status of a word, right-click the word and choose the option Mark as correct.

Correct words: Right-click a word to edit it, either by typing on the keyboard or by selecting one of the suggested correction candidates. Words that were changed manually are underlined in green.

Delete words: To remove a word, select it and press the Del key on your keyboard.

Split a word in two: The OCR engine often fails to recognize word borders correctly. In this case, simply add a space character to your correction to indicate the word border.
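The effect of a space in a correction can be pictured with a short Python sketch. It models the page as a plain list of word strings, which is a simplification for illustration. The function name apply_correction is hypothetical and does not describe OCRC's internal data model.

```python
def apply_correction(tokens, index, correction):
    """Replace tokens[index] by the word(s) contained in `correction`.

    A space in the correction marks a word border, so a single token
    may be replaced by several tokens.
    """
    parts = correction.split()
    if not parts:
        raise ValueError("correction must contain at least one word")
    return tokens[:index] + parts + tokens[index + 1:]


# The OCR engine missed the border between "the" and "OCR".
tokens = ["Often", "theOCR", "engine", "fails"]
corrected = apply_correction(tokens, 1, "the OCR")
```

A correction without a space simply replaces the word, while a correction with a space yields two tokens in place of one.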

The Text View offers more features. To find out how to merge words or how to start Batch Views from the Text View, read Section ???

The Batch View

While the Text View displays your document in

Basic features for the correction of erroneous words

Once you have created a project from ABBYY output, you are ready to start correcting errors. This section explains the basic edit functions for tokens: changing words manually, selecting a suitable correction candidate offered by the system, and deleting, splitting or merging tokens. The next section introduces the batch processing features for faster correction.

Editing word strings manually

Deleting, merging or splitting tokens

Batch processing for repeated and systematic errors