External OCR Processing

UnForm can be configured to use a third-party OCR processing tool instead of the default open source Tesseract tool. Commercial tools can provide higher OCR accuracy than Tesseract. The goal of this integration is to convert images and PDF files that do not contain text into PDF files that do. A PDF file with text enables automated text extraction during inbound document parsing, which lets Image Manager jobs process those documents automatically, often without human intervention.

This integration is controlled by a configuration section in uf101d.ini. This section is not provided by default, but can be added to the file. Here is an example:

[ocrdrop]

path=c:/sdsi/ocrdrop

resultscontain=fromuf.

The path=path setting defines where incoming documents are placed when they do not contain text that UnForm can use. This path should be a location that the OCR tool is configured to monitor and process. The OCR tool and UnForm do not have to be on the same system, but they must share a file system location, and both products need full permission to control the contents of that path.
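Because both products need full control of the shared path, it can help to check permissions before enabling the integration. The following Python sketch is not part of UnForm; the function name can_control_path is hypothetical. It tests whether the current account can create and delete a file in a directory, so it would need to be run under the account used by each product.

```python
import tempfile
from pathlib import Path


def can_control_path(path: str) -> bool:
    """Return True if this account can create and delete a file in path.

    Hypothetical helper for illustration; not an UnForm function.
    """
    folder = Path(path)
    if not folder.is_dir():
        return False
    try:
        # The temporary file is created in the folder and deleted on close.
        with tempfile.NamedTemporaryFile(dir=folder):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Example path taken from the [ocrdrop] section shown earlier.
    print(can_control_path("c:/sdsi/ocrdrop"))
```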

Whenever UnForm parses an incoming document and finds that it contains no text, it creates a copy of the incoming file in the specified path and then removes the now-redundant file from its inbound processing. The expectation is that the OCR processing tool will pick up the file, generate a PDF file with text, and drop that PDF in another path that UnForm is monitoring as an inbound source path. That PDF then becomes a new inbound document, this time with a text layer that can be extracted for processing.

To prevent UnForm from resubmitting a file for which the external tool is unable to produce text in its generated PDF, UnForm performs a name match when deciding whether to submit a file to the external tool. If an incoming file name contains the string given in resultscontain=string, the file is not submitted to the configured path but is instead processed normally by UnForm. All files that UnForm creates in path begin with the prefix "fromuf.", so if the OCR processing tool is configured to include the name of the file it receives from UnForm in its output file name, the string "fromuf." is an appropriate value. However, any string can be specified, as long as the names of the files generated by the OCR tool contain that string.
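The routing rule described above can be expressed as a small decision function. This Python sketch only illustrates the documented behavior and is not UnForm's implementation. The function and parameter names are hypothetical, and the case-sensitive substring match is an assumption, since the text does not say how case is handled.

```python
def should_send_to_ocr(file_name: str, has_text: bool, results_contain: str) -> bool:
    """Decide whether a document would go to the external OCR drop path.

    Hypothetical illustration of the rule described in the text.
    """
    if has_text:
        # Documents that already contain text are processed normally.
        return False
    if results_contain and results_contain in file_name:
        # The name marks this file as OCR output (assumed case-sensitive
        # match), so it is processed normally instead of being resubmitted.
        return False
    return True


if __name__ == "__main__":
    print(should_send_to_ocr("scan001.tif", False, "fromuf."))
    print(should_send_to_ocr("fromuf.scan001.pdf", False, "fromuf."))
```

The second call shows the loop guard: a file returned by the OCR tool still has no text, but its name contains "fromuf.", so it is not sent out again.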

In a nutshell, if a third-party tool can monitor a directory for documents to process, and can place resulting PDF files with text in another directory UnForm is monitoring, you can configure UnForm to send it documents, and receive the results through an inbound source path.
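To make the third-party side of this exchange concrete, here is a minimal Python sketch of what such a tool does with the two directories. It is a stand-in, not a real OCR product: process_drop_folder is a hypothetical name, and the ocr argument is a placeholder for whatever engine actually produces the searchable PDF. The sketch assumes the output keeps the received file name, so the "fromuf." prefix survives into the result.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_drop_folder(
    drop_dir: str,
    inbound_dir: str,
    ocr: Callable[[Path, Path], None],
) -> list[Path]:
    """Convert every file in drop_dir and place the PDF in inbound_dir.

    ocr(source, destination) is a placeholder for the real OCR engine.
    """
    results = []
    for src in sorted(Path(drop_dir).iterdir()):
        if not src.is_file():
            continue
        # Keep the received name (including "fromuf.") in the output name.
        dest = Path(inbound_dir) / f"{src.stem}.pdf"
        ocr(src, dest)
        # Remove the source so it is not converted again on the next pass.
        src.unlink()
        results.append(dest)
    return results


if __name__ == "__main__":
    # Demo only: copying bytes stands in for OCR, and the second
    # directory is a hypothetical UnForm inbound source path.
    fake_ocr = lambda s, d: shutil.copyfile(s, d)
    print(process_drop_folder("c:/sdsi/ocrdrop", "c:/sdsi/inbound", fake_ocr))
```

A real tool would run this on a schedule or in response to file system events, and would wait until a dropped file is completely written before reading it.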