Date: 2021-08-03

The Canon copiers at Bryn Mawr College automatically apply English-language Optical Character Recognition (OCR) to new scans. However, the OCR process only works with decent quality scans. This article explains how to create scans that will be successfully converted by OCR software.

What is Optical Character Recognition?

A scan is simply a photograph of a page. The textual elements visible in that photograph are not editable, searchable text; they are simply patterns of light and dark. In order for a document to work with text-to-speech or Braille software, screen readers, highlighting and annotation tools, and other assistive technologies, these textual patterns need to be converted to actual text (that is, to a string of characters that can be edited, deleted, searched, etc.) through a process called optical character recognition (OCR). The OCR software looks at the patterns of light and dark and uses algorithms to determine which patterns are characters and which characters they are.
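For readers curious about what "determining which patterns are characters" can look like, here is a deliberately tiny sketch in Python. It is not how the Canon copiers or any real OCR product works; it only illustrates the general idea by comparing a small grid of dark and light pixels against a few stored shapes and picking the closest one. The glyph shapes, the `TEMPLATES` table, and the function names are all invented for this illustration.

```python
# Toy 3x5 bitmaps: '#' is a dark pixel, '.' is a light pixel.
# These shapes are hypothetical and exist only for this example.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(a, b):
    """Return the fraction of pixels that match between two same-sized bitmaps."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def recognize(bitmap):
    """Return the best-matching character and its similarity score."""
    best = max(TEMPLATES, key=lambda ch: similarity(bitmap, TEMPLATES[ch]))
    return best, similarity(bitmap, TEMPLATES[best])


# A "T" with one stray dark speck to the right of the stem.
noisy_t = ["###", ".#.", ".##", ".#.", ".#."]
character, score = recognize(noisy_t)
print(character, round(score, 2))
```

In this toy, a single stray speck still leaves "T" as the closest match, but each extra mark or missing pixel lowers the score and brings other shapes closer, which is the intuition behind the scanning advice that follows.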

The output of an OCR conversion is only as good as the input. If the OCR process can’t correctly identify and interpret characters, the text it generates will be nonsense and your PDF will not be accessible.
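If you can copy the recognized text out of a scanned PDF, a rough check like the sketch below can help flag output that is mostly nonsense. This is an assumed heuristic, not a feature of the Canon copiers: it simply measures what share of the words consist only of letters once surrounding punctuation is ignored. The function name and the sample strings are made up for illustration, and the check will undercount legitimate text that contains numbers, hyphens, or apostrophes.

```python
import string


def plausible_word_ratio(text):
    """Return the share of whitespace-separated tokens that look like plain words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    good = 0
    for token in tokens:
        # Ignore punctuation at the start or end of the token, e.g. "input."
        core = token.strip(string.punctuation)
        # Count the token only if what remains is entirely letters.
        if core and core.isalpha():
            good += 1
    return good / len(tokens)


clean = "The output of an OCR conversion is only as good as the input."
garbled = "Th3 0utpu+ 0f 4n 0CR c0nv3rs|on 1s 0nly"  # invented example of bad OCR

print(plausible_word_ratio(clean))
print(plausible_word_ratio(garbled))
```

A score near 1 suggests the text was recognized as ordinary words, while a much lower score for a page of normal prose suggests the scan should be redone. Treat the number as a hint for comparison between scans rather than a pass/fail test.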

Making Scans that OCR Correctly

Start with a clean original. Highlighting, underlining, and page damage are primary culprits in preventing the OCR process from properly recognizing text.

Marginalia can also confuse OCR software, producing extraneous characters and interfering with its ability to correctly interpret neighboring words. For best results, erase marginalia or find a clean copy.

Scan with all pages oriented in the same direction and as close to horizontal or vertical as possible. The OCR process can correct for slight skewing, but text on a page that is very tilted will not be interpreted correctly.

Avoid cutting off text or blocking it with your hands, bookmarks, etc. (If the text isn’t visible, it can’t be recognized!)

For best results, scan one page at a time. Most OCR software can recognize that documents scanned “two-up” (that is, with two facing pages in a book or journal scanned at the same time) have two columns of text. However, two-up scanning often creates shadows and distortions that can prevent parts of the text from being correctly interpreted. If each page of your original has multiple columns of text, you must scan one page at a time.



Documents That Will Need Special Attention

The OCR software used by the Canon copiers may not be able to successfully convert some documents, even if you have good scans.