Document Sources


Document sources are locations where UnForm can obtain inbound documents; that is, documents that are not generated via printing or direct uploading. Each source has an associated inbound library where documents and their properties are maintained. These libraries are not normally accessible in the regular browser interface, but an administrator can view them under the ~inbound library category.

Sources are maintained with Inbound Sources on the Admin menu of the archive browser interface.

Each document in a source library is either assigned to a user or available for assignment by any user. Once a document passes all validation tests, which can be particularly extensive if a job is applied to it, the document can be transferred to a standard UnForm archive library.

There are three types of document sources:

•Directories (Folders) on the UnForm server system. A directory can be anywhere that the UnForm server can read and remove files from. As documents arrive in that directory, UnForm picks them up and moves them to an inbound library, performs initial parsing for image and text extraction, and then optionally applies jobs to those documents. A common use for directory sources is for scan-to-disk operations.

A wildcard filter can be applied to limit the documents that are picked up by that source definition.

A zip file can be placed in the directory and files will be extracted from it for processing, rather than attempting to process the zip file itself as a document.

File system sources support sub-directories and process files from those as well.

•Dedicated email mailboxes, monitored by IMAP. Attachments are extracted for processing, and the email itself and its key attributes are stored as well. It is important to note that this must be a dedicated mailbox, not one shared with any user or other program, as UnForm will remove all email, either to process it or to discard it.

A white list enables spam control, where only mail from specific addresses or domains will be accepted. In addition, the source can be restricted to accept only certain attachment types.

An email source can receive mail with alternate To addresses, either via aliasing or mail list configuration. When an email arrives at the source inbox, the actual To address is tested to see if it contains +sourceid or .sourceid for an alternate inbound source. If so, the documents are delivered to that source rather than the defined email source library. For example, email addressed to inbound+expvend@somewhere.com will deliver attachments to the source "expvend" if it exists.

•Image Manager users can upload documents or zip files into a source library, and those documents will be assigned to them automatically.
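The alternate To address rule for email sources can be illustrated with a minimal Python sketch. This is not UnForm code: the function name `resolve_source` and the `known_sources` set are hypothetical. The sketch also assumes the source id is whatever follows the last `+` or `.` in the part of the address before the `@`. The actual matching logic may differ.

```python
def resolve_source(to_address, default_source, known_sources):
    """Return the id of the source that should receive mail sent to to_address.

    Assumption: the alternate source id is the text after the last "+" or "."
    in the local part of the address. Falls back to default_source when no
    such suffix names an existing source.
    """
    local_part = to_address.split("@", 1)[0].lower()
    for separator in ("+", "."):
        # rpartition returns an empty separator when it is not found.
        _head, found, candidate = local_part.rpartition(separator)
        if found and candidate in known_sources:
            return candidate
    return default_source


sources = {"inbound", "expvend"}
print(resolve_source("inbound+expvend@somewhere.com", "inbound", sources))
print(resolve_source("inbound+nosuch@somewhere.com", "inbound", sources))
```

With the example address from the text, the first call selects "expvend". The second falls back to the defined email source because no source named "nosuch" exists.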

Note that unlike standard document libraries, the system will purge library recovery data for dates prior to any existing documents in the inbound source libraries. This is due to the intentional transient nature of inbound documents.
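To preview which files a directory source would be likely to pick up, the directory behavior described earlier (wildcard filter, zip extraction, sub-directories) can be approximated with a short Python sketch. This is an illustration only, not how UnForm is implemented. The function name `list_candidate_documents` is hypothetical. The sketch also assumes that the wildcard filter applies to file names inside a zip file, which the description above does not state.

```python
import fnmatch
import os
import zipfile


def list_candidate_documents(source_dir, pattern="*"):
    """Return (container_path, document_name) pairs under source_dir.

    For ordinary files the document name is the file name. For zip files the
    members are listed instead of the archive itself.
    """
    candidates = []
    # os.walk descends into sub-directories as well.
    for folder, _subdirs, filenames in os.walk(source_dir):
        for filename in sorted(filenames):
            path = os.path.join(folder, filename)
            if filename.lower().endswith(".zip"):
                # List the archive's contents rather than the archive.
                with zipfile.ZipFile(path) as archive:
                    for info in archive.infolist():
                        member = os.path.basename(info.filename)
                        # Assumption: the filter also applies to zip members.
                        if not info.is_dir() and fnmatch.fnmatch(member, pattern):
                            candidates.append((path, info.filename))
            elif fnmatch.fnmatch(filename, pattern):
                candidates.append((path, filename))
    return candidates
```

For example, `list_candidate_documents("/data/scans", "*.pdf")` would list PDF files in a hypothetical `/data/scans` directory, in its sub-directories, and inside any zip files found there.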

Document Types

When files arrive in a source, they are parsed in various ways automatically.

•Zip files are extracted, and the content files processed individually.

•Image files are scanned for barcodes, and if OCR processing is not disabled, an attempt will be made to use a local OCR engine (Tesseract) to create a PDF file with text, which is then processed as if the PDF file were uploaded directly. In addition, the image pages are converted to single page PNG files for viewing.

•PDF files are scanned for text. If no text is found and OCR processing is not disabled, the PDF file is converted to an image and processed using the local OCR engine (Tesseract). If text is found in the PDF file, that text is used for all OCR zones in Image Manager jobs. In addition, the PDF pages are converted to images for barcode scanning and viewing.

•All other file types, except text attachments, are simply loaded into the source library for browser viewing.

•If an email contains no attachments (other than text attachments), its body text is used as a document. This is particularly useful for sources that deliver to printing rather than the Image Manager, so that legacy email sources can be used for printing. Such emails can also produce an Image Manager job, though the document will be plain text and cannot be managed in jobs with OCR or barcode zones.
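The parsing rules above can be summarized as a decision sketch in Python. It only returns a description of the steps for a given file and performs no actual OCR or barcode work. The function name `plan_processing`, the step wording, and the list of image and text extensions are assumptions made for illustration. Email bodies are not modeled, and text attachments are only recognized so that they can be skipped.

```python
import os

# Assumed extension lists; the text above does not enumerate them.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif"}
TEXT_EXTENSIONS = {".txt"}


def plan_processing(filename, pdf_has_text=False, ocr_enabled=True):
    """Return the processing steps described above for one arriving file."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".zip":
        return ["extract content files", "process each content file individually"]
    if ext in IMAGE_EXTENSIONS:
        steps = ["scan for barcodes"]
        if ocr_enabled:
            steps.append("OCR to a PDF with text, then process that PDF")
        steps.append("convert pages to single-page PNG files for viewing")
        return steps
    if ext == ".pdf":
        steps = ["scan for text"]
        if pdf_has_text:
            steps.append("use the embedded text for OCR zones")
        elif ocr_enabled:
            steps.append("convert to image and run OCR")
        steps.append("convert pages to images for barcode scanning and viewing")
        return steps
    if ext in TEXT_EXTENSIONS:
        # Text attachments are treated differently; not modeled in this sketch.
        return []
    return ["load into the source library for browser viewing"]


for name in ("batch.zip", "scan.tif", "invoice.pdf", "sheet.xlsx"):
    print(name, "->", plan_processing(name))
```

The `pdf_has_text` flag stands in for the text scan itself. A PDF produced natively by an application would be passed with `pdf_has_text=True`, so the OCR step is skipped.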

Note that for document processing where the quality of OCR results is important, a commercial OCR tool can generally produce better text quality than free software. Any tool that can produce a PDF file with text can be used. Also, some scanner devices can perform OCR processing and scan to PDF files.

The very best "OCR" results are obtained without using OCR at all, but instead by processing PDF files directly from an application that produces PDF files natively. When automating the processing of inbound documents, it is worthwhile asking suppliers for such PDF files.

An AI-based OCR service is available that typically provides significantly better OCR accuracy than a local Tesseract engine.