Many organizations have great capture tools but aren’t using them effectively, often without realizing the gap between what they’re doing and what they could be doing. You may have invested in capture and OCR technology from ABBYY, AnyDoc, Captiva, Datacap, Kofax, Parascript, ReadSoft, or others, yet still suffer per-page costs that should be unacceptable. So what should you do? This post will get you heading in the right direction.
First, focus on the stages that are time-consuming and labor-intensive. This means that scanner speed doesn’t matter half as much as most folks think, while manual data entry, error correction, and rescanning have dramatic impacts on overall capture time. Automating character recognition and indexing therefore promises to significantly reduce time and labor.
So how can you get the best optical character recognition (OCR) and indexing possible? This post provides recommendations regarding how organizations should address automated indexing during document capture.
I’ve broken the 10 recommendations into two sets: five that apply to pre-recognition and five that apply to the recognition itself.
Five Key Pre-Recognition-related Factors
1. Address document design and availability, document preparation, and document feeding and handling.
- Aim to use the best source material that’s practical to get: original hard copies that are the cleanest version possible, with high contrast between text and background (black on white), no underlines, etc.
- Apply the 80/20 rule by addressing the most common, important, and differentiated document types first, to increase overall OCR accuracy.
- Approach document design with “round trip” in mind, where possible. If you have some control over the design of the documents, then try to get the documents designed for easier recognition. Use soft and hard incentives with the departments and/or organizations that send you documents to be scanned. These range from encouragement and guidelines, to SOPs and the guarantee of better (faster, cheaper, or higher quality) processing for those who are able to provide you with well-designed documents.
- Scanners should handle the widest range of expected documents – minus the low percentage of acceptable exceptions that should be addressed by a special scanner or other handling. In other words, it may be the case that you can address all the incoming documents with your high performance scanner. But often there are some exceptions, like very odd-sized or damaged documents. Some exception-handling is acceptable.
- Capture software products should handle the full range of document sizes for the scanners they support, allowing scanner operators to load documents of any size or color into the document feeder and begin scanning, thereby reducing document preparation time. Don’t neglect this point; many capture software products can’t do what the scanners (and you) want them to do.
2. Address capture, particularly scanning resolution and file format.
- Scan at 300 dpi resolution. In some cases, higher resolution can improve accuracy (e.g. for small fonts), but it doesn’t help general accuracy and it slows down processing.
- Resolutions below 300 dpi (e.g. for fax) usually yield lower accuracy, but can be improved with some of the techniques described below (e.g. filtering, validation, and lookups).
- Regarding file format, TIFF provides better OCR accuracy, but PDF is more flexible for post-capture use (e.g. for human readability and across multiple channels). Therefore, evaluate which of three options to use: TIFF only, PDF only, or TIFF-to-PDF (possibly keeping both).
3. Address image enhancement, image correction, and rescan.
- Use run-time image optimization for OCR to determine optimum settings, increase contrast and density, balance the different factors, and automatically convert documents into high-quality, black-and-white images for rapid transport into back-end systems.
- Using image optimization also improves productivity by minimizing the time required to obtain scanner settings for optimal image clarity for documents of mixed sizes, colors, contrasts, and brightness.
- Some tools include notification capabilities to alert operators of problems with the scanner, paper jams, folded corners, or other problems. This makes it easier for the operator to quickly fix the problem and continue scanning, minimizing downtime.
- An innovative approach (e.g. in Kofax VRS, either as part of the capture software or OEMed into the scanner) is to first convert documents to grayscale images. VRS then analyzes them, determines the proper settings for the document, and converts the image to black and white. The system thus produces high-quality images without the large file sizes and slower processing associated with grayscale images.
- A grayscale image can serve as a high-quality, “virtual hard copy” that can be resampled instead of requiring a hard-copy rescan to achieve acceptable bi-tonal image quality. The last resort is to save the (large) grayscale image instead of the (smaller) bi-tonal image.
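To make the grayscale-to-bitonal conversion concrete, here is a minimal sketch of one classic thresholding technique, Otsu’s method, which picks the binarization cutoff from the image’s own histogram. This is an illustration of the general idea, not how VRS actually works, and the flat list-of-pixel-values representation is an assumption:

```python
def otsu_threshold(gray):
    """Pick the binarization threshold that maximizes the
    between-class variance of the grayscale histogram (Otsu's
    method). `gray` is a flat list of 0-255 pixel values."""
    hist = [0] * 256
    for px in gray:
        hist[px] += 1
    total = len(gray)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0      # running sum of intensities below the cutoff
    weight_bg = 0     # running count of pixels below the cutoff
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, threshold):
    """Convert grayscale pixels to bi-tonal (0 or 255)."""
    return [0 if px <= threshold else 255 for px in gray]
```

On a document with dark text on a light background, the histogram is bimodal, and the chosen threshold falls in the valley between the two modes.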
4. Address de-skewing and page layout analysis and improvement.
- De-skew pages in preprocessing so that word lines are horizontal.
- Note that the layout of pages and white space cannot be changed in already-received documents. To address such issues requires a longer-term approach addressing document design (see #1 above).
- The goal should be to reduce layout complexity in each page, to reduce variability between documents in each class, and to increase variability between document classes.
- Documents should have sufficient white space between lines, columns, and at page edges. This will help the identification of text boundaries during page layout analysis.
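As a rough illustration of the de-skewing step above, here is a minimal projection-profile sketch in Python. It is a toy, not a production de-skewer: the input format (a list of black-pixel coordinates) and the brute-force search range are assumptions:

```python
import math

def profile_energy(pixels, angle_deg):
    """Project black pixels onto the y-axis after rotating the page
    by -angle_deg, then score the profile by the sum of squared
    per-row counts. Horizontal text lines concentrate pixels into
    a few rows, which maximizes this score."""
    theta = math.radians(angle_deg)
    counts = {}
    for x, y in pixels:
        y_rot = round(-x * math.sin(theta) + y * math.cos(theta))
        counts[y_rot] = counts.get(y_rot, 0) + 1
    return sum(c * c for c in counts.values())

def estimate_skew(pixels, search_deg=5.0, step_deg=0.25):
    """Brute-force search over candidate angles; the winner is the
    estimated skew, which a de-skew step would then rotate out."""
    steps = int(search_deg / step_deg)
    candidates = [i * step_deg for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda a: profile_energy(pixels, a))
```

Real capture software uses faster and more robust variants, but the principle is the same: the angle at which word lines come out horizontal is the angle at which the row profile is “spikiest.”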
5. Address character image optimization for OCR (character edge analysis).
- Character edge optimization should be tuned for OCR, which has different requirements than optimization for human readability (and aesthetics).
- Capture and OCR software should be designed for this kind of tuning. Often, both modules can do it, so you may have to decide on the division of labor.
Five Key Recognition- and Indexing-related Factors
1. Address OCR – specifically, matching character edges to pattern images and deciding on which character it is.
- Obviously, select OCR software adequate for your requirements. The relevant capabilities include the ability to accept the “optimized” output of the upstream scanner and capture software, and (to get a bit geeky) the appropriate pattern images in the OCR database and the appropriate algorithms in the OCR engine.
- The good news is that the best capture solutions OEM the best mainstream OCR engines and can integrate with others. The best approach is to start with the default approach and add or swap engines only if necessary.
2. Address OCR engine training.
- OCR engine training can be useful, e.g. for large, complex, document sets with high variability within document classes and low variability between document classes (e.g. if “complaint letters” all look different from each other, but don’t look very different from other kinds of letters). But training is very time-consuming and expensive, and it usually isn’t worth it.
3. Address OCR engine voting.
- OCR engine voting runs the same characters through an array of engines; the final answer is the most common result, or the output of some other voting scheme. A more complex configuration is a cascade, in which the engines are ordered from simple, fast, and cheap to smarter, slower, and more expensive. Each engine is triggered only if the one before it fails to reach a certain confidence level.
- Voting is useful in high-volume, complex use cases, as described in the training bullet above, or where the document class range is wide – e.g. in mailroom applications.
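Here is a minimal sketch of both configurations, assuming each engine is a callable that returns its recognized text along with a confidence score; the per-character voting additionally assumes the engines’ outputs are already aligned to the same length:

```python
from collections import Counter

def vote(results):
    """Majority vote across engines, character by character.
    Each element of `results` is the full text one engine produced;
    the strings are assumed to be aligned and of equal length."""
    voted = []
    for chars in zip(*results):
        voted.append(Counter(chars).most_common(1)[0][0])
    return "".join(voted)

def cascade(engines, image, threshold=0.9):
    """Run engines in cost order and stop at the first result whose
    confidence meets the threshold. Each engine is a callable
    returning (text, confidence)."""
    best = ("", 0.0)
    for engine in engines:
        text, conf = engine(image)
        if conf >= threshold:
            return text, conf
        if conf > best[1]:
            best = (text, conf)
    return best  # no engine was confident; fall back to the best attempt
```

The cascade keeps average cost low (most documents never reach the expensive engines), while voting trades raw throughput for accuracy on the hard cases.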
4. Try intelligent document recognition (IDR) – cautiously.
- Chaining together different recognitions into a cascade is often used for intelligent document recognition (IDR) to sort the documents into document classes, thus eliminating that part of prep work. This has the potential to improve capture center efficiency dramatically. But the technology is still relatively young, and the use case has to be appropriate. It is often marketed to address mailrooms, but it actually works effectively with only a subset of mailrooms – i.e. those with a relatively limited number of document types and with big differences between document types. So when you start introducing IDR into your operations, start with smaller, simpler applications.
5. Pursue filters, lookups, databases, and dictionaries.
- Filters, lookups, databases, and dictionaries can be extremely effective in two cases: 1) to reduce errors by constraining which characters or words are acceptable (e.g. a field may be numeric only or refer to these 1,000 place-names only); 2) to automatically constrain or populate other fields based on a key field (e.g. an account number will trigger the constraining or population of other fields). These should be pursued!
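Both cases can be sketched in a few lines of Python; the numeric account-number field, the OCR-confusion mapping, and the lookup table here are all made up for illustration:

```python
import re

# Hypothetical lookup table keyed by the account-number key field.
ACCOUNTS = {
    "100234": {"customer": "Acme Corp", "region": "EMEA"},
    "100567": {"customer": "Globex", "region": "APAC"},
}

def validate_numeric(raw):
    """Case 1: constrain a numeric-only field. Map common OCR
    confusions (O->0, l/I->1, S->5) before rejecting the value."""
    cleaned = raw.translate(str.maketrans("OolIS", "00115"))
    return cleaned if re.fullmatch(r"\d+", cleaned) else None

def populate_from_key(account_raw):
    """Case 2: use the validated key field to populate the
    dependent fields from the lookup table."""
    account = validate_numeric(account_raw)
    if account is None or account not in ACCOUNTS:
        return None  # route to manual correction
    record = dict(ACCOUNTS[account])
    record["account"] = account
    return record
```

For example, an OCR read of “1OO234” is corrected to “100234” by the character filter, and the account lookup then fills in the customer and region fields with no manual keying at all.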
The recognition phase of capture may seem to be the most important step for automated indexing, since it is, after all, the phase where OCR is performed. But notice that at least half of the factors relevant to successful indexing occur during the pre-recognition steps, particularly in obtaining appropriate image quality for OCR and indexing.
So make sure you address both. Do all you can on the pre-recognition factors, to make things that much easier for the documents that hit your OCR engine.