CONFIGURING TESSERACT


If your installation uses the Image Manager, you may wish to install Tesseract, an open source OCR engine. When Tesseract is configured and UnForm receives images or PDF files without a text layer, it processes them through Tesseract to build a text layer. Note that if your implementation relies heavily on OCR processing, a commercial OCR tool that produces PDF files with a text layer UnForm can read typically gives better results.

Windows

Several sources provide Tesseract builds for Windows. One, recommended by the Tesseract project, is:

https://github.com/UB-Mannheim/tesseract/wiki

Linux

Tesseract can be installed on Linux using standard package management tools. Use the following commands:

•Red Hat/Fedora/CentOS: yum install tesseract

•Ubuntu/Debian: apt-get install tesseract-ocr

•SUSE Linux: visit https://software.opensuse.org to locate rpm install packages for your version.
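The choice between the commands above can be sketched as a small shell script. This is only an illustration: the function names install_cmd_for and detect_pkg_manager are hypothetical, and the script prints the matching command rather than running it.

```shell
#!/bin/sh
# Sketch: report which install command from the list above applies
# to this machine. Function names are illustrative, not part of UnForm.

# Print the install command for a given package manager name.
install_cmd_for() {
    case $1 in
        yum)     echo "yum install tesseract" ;;
        apt-get) echo "apt-get install tesseract-ocr" ;;
        *)       return 1 ;;
    esac
}

# Print the first supported package manager found on PATH.
detect_pkg_manager() {
    for pm in yum apt-get; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}

if pm=$(detect_pkg_manager); then
    echo "Run as root: $(install_cmd_for "$pm")"
else
    echo "No yum or apt-get found; locate a package manually."
fi
```

The script only prints a suggestion, so it is safe to run without root privileges. On Red Hat Enterprise Linux, the EPEL step described next may be required before the yum command succeeds.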

Red Hat Enterprise Linux users may need to install the EPEL repository extensions first (Fedora installations should already have these available). The following web page provides copy/paste command lines and information:

https://fedoraproject.org/wiki/EPEL

As of this writing, these are the relevant command lines:

•CentOS 6: yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm

•CentOS 7: yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

•Red Hat: subscription-manager repos --enable "rhel-*-optional-rpms" --enable "rhel-*-extras-rpms"

Alternate Repository

If your package manager installs an older version and you want a more recent one, you can find instructions for installing or updating from a custom repository here:

https://github.com/tesseract-ocr/tesseract/wiki

Note that Tesseract 4 and higher generally produce better results than version 3.

Once Tesseract is installed, restart the UnForm server. It should locate Tesseract automatically, assuming you chose the default installation path. On Linux, it looks in /usr/local/bin and /usr/bin. If necessary, you can set the path to the tesseract executable manually in the [tesseract] section of uf101d.ini with tesseract=path.
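Before restarting the server, it can help to confirm where the executable actually landed. The following is a minimal sketch, not UnForm's own lookup logic: the function name find_tesseract is hypothetical, and the order in which the two directories are checked is an assumption.

```shell
#!/bin/sh
# Sketch: look for the tesseract executable in the two Linux
# locations named above. The search order is an assumption.

# Print the first executable file named "tesseract" found in the
# given directories (default: /usr/local/bin, then /usr/bin).
find_tesseract() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/local/bin /usr/bin
    fi
    for dir in "$@"; do
        if [ -x "$dir/tesseract" ] && [ ! -d "$dir/tesseract" ]; then
            printf '%s\n' "$dir/tesseract"
            return 0
        fi
    done
    return 1
}

if tess_path=$(find_tesseract); then
    echo "Found: $tess_path"
    # Recent Tesseract releases accept --version; show the first line only.
    "$tess_path" --version 2>&1 | head -n 1
else
    echo "Not found in /usr/local/bin or /usr/bin; set tesseract=path in uf101d.ini."
fi
```

If the executable is installed elsewhere, passing that directory as an argument (for example, find_tesseract /opt/tesseract/bin, a hypothetical location) shows the value to use for tesseract=path.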

Preprocess Filter

When UnForm calls Tesseract, it provides an image or a set of image pages. These pages are either in the native image format of the inbound source or, if the incoming file is a PDF, JPEG page files that UnForm produces with Ghostscript.

If desired, you can configure an ImageMagick command line as a pre-process filter that runs against these image files. This takes two lines in uf101d.ini. First, define an ImageMagick command line fragment in the [images] section. The entry name can contain "jpg", "jpeg", or "tif" to indicate the image format the command produces. Any other name is treated as the default, "png", so PNG images are given to Tesseract. An example might be this:

tesspng="%i" -blur 1x1 -monochrome "%o" >/dev/null 2>&1

Then instruct UnForm to use this filter before running the file through Tesseract, by adding this line in the [tesseract] section:

filter=tesspng
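To reason about a filter before relying on it, the naming rule and the command fragment can be mimicked in a dry-run script. This sketch rests on assumptions the text does not confirm: that %i stands for the input file and %o for the output file, that the fragment is appended to ImageMagick's convert executable, and that name matching is case-sensitive. The function names and the format labels it prints are hypothetical.

```shell
#!/bin/sh
# Dry run only: nothing here calls ImageMagick or UnForm.
# ASSUMPTIONS: %i = input file, %o = output file, and the fragment
# is appended to ImageMagick's "convert" command.

# Map an [images] entry name to the format it implies
# (labels are illustrative; any other name falls back to png).
filter_format() {
    case $1 in
        *jpg*|*jpeg*) echo jpeg ;;
        *tif*)        echo tiff ;;
        *)            echo png ;;
    esac
}

# Replace the first occurrence of $2 in $1 with $3.
replace_first() {
    case $1 in
        *"$2"*)
            before=${1%%"$2"*}
            after=${1#*"$2"}
            printf '%s%s%s\n' "$before" "$3" "$after"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# expand_filter FRAGMENT INPUT OUTPUT: print the resulting command.
expand_filter() {
    step=$(replace_first "$1" '%i' "$2")
    step=$(replace_first "$step" '%o' "$3")
    printf 'convert %s\n' "$step"
}

fragment='"%i" -blur 1x1 -monochrome "%o" >/dev/null 2>&1'
echo "tesspng implies format: $(filter_format tesspng)"
expand_filter "$fragment" page1.jpg page1.png
```

Running the printed command by hand against a sample page, then passing the result to tesseract, is one way to judge whether the blur and monochrome settings improve recognition before enabling the filter.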