Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example, from a television broadcast).

Widely used as a form of data entry from printed paper data records (passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any suitable documentation), it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of recognition accuracy for most fonts are now common, with support for a variety of digital image file formats as input. Some systems are capable of reproducing formatted output that closely approximates the original page, including images, columns, and other non-textual components.


History

Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.

In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted U.S. Patent number 1,838,389 for the invention. The patent was acquired by IBM.


Blind and visually impaired users

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976, the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.

In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Xerox eventually spun it off as Scansoft, which merged with Nuance Communications.

In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have OCR functionality built into the operating system will typically use an OCR API to extract the text from the image file captured and provided by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image, to the device app for further processing (such as text-to-speech) or display.

Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters.


Applications

OCR engines have been developed into many kinds of domain-specific OCR applications, such as receipt OCR, invoice OCR, check OCR, and legal billing document OCR. They can be used for:
* Data entry for business documents, e.g. cheque, passport, invoice, bank statement and receipt
* Automatic number plate recognition
* Passport recognition and information extraction in airports
* Automatic extraction of key information from insurance documents
* Traffic sign recognition
* Extracting business card information into a contact list
* Making textual versions of printed documents more quickly, e.g. book scanning for Project Gutenberg
* Making electronic images of printed documents searchable, e.g. Google Books
* Converting handwriting in real time to control a computer (pen computing)
* Defeating CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. The purpose can also be to test the robustness of CAPTCHA anti-bot systems.
* Assistive technology for blind and visually impaired users
* Writing the instructions for vehicles by identifying CAD images in a database that are appropriate to the vehicle design as it changes in real time
* Making scanned documents searchable by converting them to searchable PDFs


Types

* Optical character recognition (OCR): targets typewritten text, one glyph or character at a time.
* Optical word recognition: targets typewritten text, one word at a time (for languages that use a space as a word divider). (Usually just called "OCR".)
* Intelligent character recognition (ICR): also targets handwritten printscript or cursive text one glyph or character at a time, usually involving machine learning.
* Intelligent word recognition (IWR): also targets handwritten printscript or cursive text, one word at a time. This is especially useful for languages where glyphs are not separated in cursive script.

OCR is generally an "offline" process, which analyses a static document. There are cloud-based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motions, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the end-to-end process more accurate. This technology is also known as "on-line character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition".


Techniques


Pre-processing

OCR software often "pre-processes" images to improve the chances of successful recognition. Techniques include:
* De-skew: If the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical.
* Despeckle: Remove positive and negative spots, smoothing edges.
* Binarisation: Convert an image from color or greyscale to black-and-white (called a "binary image" because there are two colors). Binarisation is a simple way of separating the text (or any other desired image component) from the background. It is necessary because most commercial recognition algorithms work only on binary images, which are simpler to process. The effectiveness of the binarisation step significantly influences the quality of the character recognition stage, so the method must be chosen carefully for a given input image type: the quality of the binary result depends on the type of input image (scanned document, scene text image, historical degraded document, etc.).
* Line removal: Cleans up non-glyph boxes and lines.
* Layout analysis or "zoning": Identifies columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables.
* Line and word detection: Establishes a baseline for word and character shapes, and separates words if necessary.
* Script recognition: In multilingual documents, the script may change at the level of individual words, so the script must be identified before the right OCR can be invoked to handle it.
* Character isolation or "segmentation": For per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected.
* Normalize aspect ratio and scale.

Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.
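
The fixed-pitch case can be illustrated with a minimal sketch. It assumes the text line is already available as rows of greyscale values (0 to 255), that the character pitch in pixels is known, and that a single global threshold is good enough for binarisation; the function names and the threshold of 128 are illustrative choices, not part of any particular OCR engine.

```python
def binarise(gray, threshold=128):
    """Map a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def column_ink(binary):
    """Count ink pixels in every column (a vertical projection profile)."""
    return [sum(col) for col in zip(*binary)]


def best_grid_offset(profile, pitch):
    """Return the grid offset whose vertical lines cross the least ink."""
    def cost(offset):
        return sum(profile[x] for x in range(offset, len(profile), pitch))
    return min(range(pitch), key=cost)


def segment_fixed_pitch(binary, pitch):
    """Cut a binary text line into (start, end) column ranges of width `pitch`.

    Partial cells before the first grid line or after the last one are dropped.
    """
    profile = column_ink(binary)
    offset = best_grid_offset(profile, pitch)
    cuts = list(range(offset, len(profile) + 1, pitch))
    return list(zip(cuts, cuts[1:]))


# Toy line: three 3-pixel-wide glyphs, each preceded by a 1-pixel gap.
gray = [[255 if x % 4 == 0 else 0 for x in range(12)] for _ in range(5)]
print(segment_fixed_pitch(binarise(gray), pitch=4))  # [(0, 4), (4, 8), (8, 12)]
```

Proportional fonts would need more than this, since no single pitch fits every character.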


Text recognition

There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.
* ''Matrix matching'' involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching", "pattern recognition", or "image correlation". This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique that early physical photocell-based OCR implemented, rather directly.
* ''Feature extraction'' decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed most modern OCR software.
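
As a rough illustration of matrix matching, the sketch below scores an isolated glyph against stored templates pixel by pixel and returns a ranked candidate list. The 3x3 bitmaps and the agreement-fraction score are toy assumptions chosen for brevity; real systems use much larger bitmaps and may use other correlation measures.

```python
# Hypothetical stored glyphs: '1' is ink, '0' is background.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Fraction of pixels that agree between two equal-sized binary bitmaps."""
    total = agree = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            agree += (g == t)
    return agree / total


def rank_candidates(glyph, templates):
    """Return (character, score) pairs, best match first."""
    scores = [(ch, match_score(glyph, tpl)) for ch, tpl in templates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# A 'T' with one stray ink pixel in the bottom-right corner.
noisy_t = ["111", "010", "011"]
print([ch for ch, _ in rank_candidates(noisy_t, TEMPLATES)])  # ['T', 'I', 'L']
```

The comparison only makes sense when the glyph and the templates share the same size and alignment, which is why the technique depends on correct isolation and scale.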
Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.

Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The second pass is known as "adaptive recognition" and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).

Modern OCR software includes Google Docs OCR, ABBYY FineReader and Transym. Others like OCRopus and Tesseract use neural networks which are trained to recognize whole lines of text instead of focusing on single characters.

A newer technique known as iterative OCR automatically crops a document into sections based on page layout. OCR is performed on the sections individually using variable character confidence level thresholds to maximize page-level OCR accuracy. A patent from the United States Patent Office has been issued for this method.

The OCR result can be stored in the standardized ALTO format, a dedicated XML schema maintained by the United States Library of Congress. Other common formats include hOCR and PAGE XML. For a list of optical character recognition software see Comparison of optical character recognition software.
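
The nearest-neighbour comparison of glyph features mentioned in this section can be sketched as follows. The three-number feature vectors (closed loops, stroke end points, junctions) and the training values are invented for illustration only; they are not taken from any real OCR engine.

```python
import math
from collections import Counter

# Hypothetical stored glyph features: (closed loops, end points, junctions).
TRAINING = [
    ((1, 0, 0), "O"), ((1, 1, 0), "O"), ((1, 0, 1), "O"),
    ((2, 0, 2), "B"), ((2, 1, 2), "B"), ((2, 0, 1), "B"),
    ((0, 2, 0), "L"), ((0, 2, 1), "L"), ((0, 3, 0), "L"),
]


def knn_classify(features, training, k=3):
    """Label a feature vector by majority vote among its k nearest prototypes."""
    nearest = sorted(training, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((2, 0, 3), TRAINING))  # B
print(knn_classify((0, 2, 0), TRAINING))  # L
```

Euclidean distance and k = 3 are arbitrary choices here; the point is only that classification happens in the low-dimensional feature space rather than on raw pixels.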


Post-processing

OCR accuracy can be increased if the output is constrained by a lexicon, a list of words that are allowed to occur in a document. This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.

The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.

"Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together. For example, "Washington, D.C." is generally far more common in English than "Washington DOC". Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy. The Levenshtein distance algorithm has also been used in OCR post-processing to further optimize results from an OCR API.
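
A minimal sketch of lexicon-constrained correction using Levenshtein distance is shown below. The tiny lexicon, the `max_distance` cut-off of 2 and the function names are assumptions made for illustration; the cut-off is one simple way to avoid "correcting" words, such as proper nouns, that are not in the lexicon at all.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def correct_word(word, lexicon, max_distance=2):
    """Replace an OCR token with its closest lexicon entry, if close enough."""
    if word in lexicon:
        return word
    best = min(lexicon, key=lambda entry: levenshtein(word, entry))
    return best if levenshtein(word, best) <= max_distance else word


LEXICON = ["invoice", "receipt", "passport"]  # hypothetical domain lexicon
print(correct_word("inv0ice", LEXICON))   # invoice
print(correct_word("Kurzweil", LEXICON))  # Kurzweil (left unchanged)
```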


Application-specific optimizations

In recent years, the major OCR technology providers began to tweak OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be had by taking into account business rules, standard expression, or rich information contained in color images. This strategy is called "Application-Oriented OCR" or "Customized OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver licenses, and automobile manufacturing.

''The New York Times'' has adapted OCR technology into a proprietary tool called ''Document Helper'', which enables its interactive news team to accelerate the processing of documents that need to be reviewed. The team notes that it can process as many as 5,400 pages per hour in preparation for reporters to review the contents.


Workarounds

There are several techniques for solving the problem of character recognition by means other than improved OCR algorithms.


Forcing better input

Special fonts like OCR-A, OCR-B, or MICR fonts, with precisely specified sizing, spacing, and distinctive character shapes, allow a higher accuracy rate during transcription in bank check processing. Ironically, however, several prominent OCR engines were designed to capture text in popular fonts such as Arial or Times New Roman, and are incapable of capturing text in these specialized fonts, which differ considerably from popularly used fonts. As Google Tesseract can be trained to recognize new fonts, it can recognize OCR-A, OCR-B and MICR fonts.

"Comb fields" are pre-printed boxes that encourage humans to write more legibly, one glyph per box. These are often printed in a "dropout color" which can be easily removed by the OCR system.

Palm OS used a special set of glyphs, known as "Graffiti", which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware. Users would need to learn how to write these special glyphs.

Zone-based OCR restricts the image to a specific part of a document. This is often referred to as "Template OCR".


Crowdsourcing

Crowdsourcing the character recognition to humans can process images quickly, like computer-driven OCR, but with higher accuracy than that obtained via computers. Practical systems include the Amazon Mechanical Turk and reCAPTCHA. The National Library of Finland has developed an online interface for users to correct OCRed texts in the standardized ALTO format. Crowdsourcing has also been used not to perform character recognition directly but to invite software developers to develop image processing algorithms, for example, through the use of rank-order tournaments.


Accuracy

Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative of the ''Annual Test of OCR Accuracy'' from 1992 to 1996.

Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%; total accuracy can be achieved by human review or Data Dictionary Authentication. Other areas, including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character), are still the subject of active research. The MNIST database is commonly used for testing systems' ability to recognise handwritten digits.

Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was recognized with no incorrect letters. Using a large enough dataset is important in neural-network-based handwriting recognition solutions. On the other hand, producing natural datasets is very complicated and time-consuming.

An example of the difficulties inherent in digitizing old text is the inability of OCR to differentiate between the "long s" and "f" characters.

Web-based OCR systems for recognizing hand-printed text on the fly have become well known as commercial products in recent years (see Tablet PC history). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by pen computing software, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.

Recognition of cursive text is an active area of research, with recognition rates even lower than that of hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognizing entire words from a dictionary is easier than trying to parse individual characters from script. Reading the ''Amount'' line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly. The shapes of individual cursive characters themselves simply do not contain enough information to accurately (greater than 98%) recognize all handwritten cursive script.

Most programs allow users to set "confidence rates". This means that if the software does not achieve the desired level of accuracy, a user can be notified for manual review.

An error introduced by OCR scanning is sometimes termed a "scanno" (by analogy with the term "typo") (http://www.hoopoes.com/jargon/entry/scanno.shtml).
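
The relationship between character-level and word-level accuracy described in this section can be checked with a short calculation. The sketch assumes that character errors are independent of one another and that a typical word has five letters; both are simplifying assumptions, not figures from the studies cited.

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every character of a word is right (independent errors)."""
    return char_accuracy ** word_length


def expected_errors(char_accuracy, characters_on_page):
    """Expected number of wrong characters on a page of the given size."""
    return (1 - char_accuracy) * characters_on_page


# 99% per-character accuracy on five-letter words.
word_error = 1 - word_accuracy(0.99, 5)
print(f"{word_error:.1%}")  # 4.9%
```

Under these assumptions a 1% character error rate gives roughly a 5% word error rate, matching the figure quoted above; longer words or correlated errors would change the result.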


Unicode

Characters to support OCR were added to the Unicode Standard in June 1993, with the release of version 1.1. Some of these characters are mapped from fonts specific to MICR, OCR-A or OCR-B.
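
The assigned characters in this block can be listed with Python's standard `unicodedata` module. This is a small sketch that assumes the hex range 2440-245F given under External links; the function name is illustrative, and code points in the range that have no assigned name are skipped.

```python
import unicodedata


def ocr_block_characters(start=0x2440, end=0x245F):
    """Return (code point, name) pairs for named characters in the range."""
    named = []
    for code_point in range(start, end + 1):
        name = unicodedata.name(chr(code_point), None)  # None if unassigned
        if name is not None:
            named.append((f"U+{code_point:04X}", name))
    return named


for code, name in ocr_block_characters():
    print(code, name)
```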


External links


Unicode OCR, hex range: 2440-245F
Optical Character Recognition in Unicode
