OCRFeeder

OCRFeeder is an optical character recognition suite for GNOME, which also supports virtually any command-line OCR engine, such as CuneiForm, GOCR, Ocrad and Tesseract. It converts paper documents to digital document files and can serve to make them accessible to visually impaired users. OCRFeeder is free and open-source software subject to the terms of the GNU General Public License (GPL) version 3 or later. It is available for Linux and other Unix-like operating systems.


History

OCRFeeder was started as a master's thesis in computer science by Joaquim Rocha, who was later hired by Igalia, S.L. and continued development there. The first version was published in March 2009. The OCRFeeder project was initially published and hosted on Google Code, temporarily used Gitorious and now uses the GNOME infrastructure. A software package has been included in the official Debian repositories since 5 April 2010. Version 0.7, released on July 30, 2010, brought image pre-processing features; version 0.7.1 (November 8, 2010) enabled scanner access from within OCRFeeder.


Features

OCRFeeder has a simple graphical user interface that is designed according to the GNOME Human Interface Guidelines. It performs document layout analysis and transfers the layout to capable output formats: it searches for content areas, outlines them, guesses the content type (text or image) and processes text areas through the OCR back-end. It can use virtually any command-line OCR engine as a back-end and features auto-detection and auto-configuration for all popular free engines. OCR back-ends can be auto-configured, set up by entering the necessary command line in a GUI dialogue, or configured directly via an XML file.

Scanned images can be post-processed, including de-skewing. All recognition results can be reviewed and edited before saving to the desired output format. Sessions can be saved and loaded. The suite also includes a spell checker. OCRFeeder has built-in procedures for post-processing the raw OCR results returned by the OCR engine. It can remove the remaining segmentation of the text into printed lines, including removal of hyphenation.

Although OCRFeeder is a GUI tool, it can also run in command-line mode (as ocrfeeder-cli), which may be useful for automatic document batch processing. In this mode OCRFeeder uses the default OCR engine, which the user can set in the application's preferences.

The program is written in Python and uses the GTK+ library (through PyGTK). It acts as a graphical front-end for other existing tools. For example, it does not perform the actual character recognition itself, but uses an external program, an “OCR engine”, that is installed on the system. It can automatically detect and configure CuneiForm, GOCR, Ocrad and Tesseract as back-end OCR engines. Scanners are accessed via SANE. For post-processing of scanned images, the command-line tool “Unpaper” is integrated, among other things. PDF files are processed using Ghostscript in the back-end.
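
The general pattern of driving a command-line OCR engine and then cleaning up its raw output can be illustrated with a minimal Python sketch. This is not OCRFeeder's actual code: the `$IMAGE` placeholder, the function names and the simple de-hyphenation rule are assumptions made for illustration only. The example template `tesseract $IMAGE stdout` relies on Tesseract writing its result to standard output when the output base is given as `stdout`.

```python
import shlex
import subprocess
from string import Template


def build_command(template: str, image_path: str) -> list[str]:
    """Expand a hypothetical engine template such as 'tesseract $IMAGE stdout'."""
    # Split first and substitute afterwards, so that an image path
    # containing spaces still ends up as a single argument.
    return [Template(part).safe_substitute(IMAGE=image_path)
            for part in shlex.split(template)]


def run_engine(template: str, image_path: str) -> str:
    """Run the external engine and return the text it prints on stdout."""
    result = subprocess.run(build_command(template, image_path),
                            capture_output=True, text=True, check=True)
    return result.stdout


def unwrap_lines(raw_text: str) -> str:
    """Join printed lines into paragraphs and undo end-of-line hyphenation."""
    paragraphs = []
    # Blank lines are treated as paragraph separators.
    for block in raw_text.split("\n\n"):
        joined = ""
        for line in (part.strip() for part in block.splitlines()):
            if not line:
                continue
            if joined.endswith("-"):
                # Naive assumption: a trailing hyphen means the word continues.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        if joined:
            paragraphs.append(joined)
    return "\n\n".join(paragraphs)
```

With an engine installed, `unwrap_lines(run_engine("tesseract $IMAGE stdout", "page.png"))` would return the recognized text as flowing paragraphs. The hyphenation rule is deliberately simple and would also merge genuine compounds that happen to break at a line end (for example "well-" followed by "known"); a real implementation would likely need a dictionary check.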


Input and output

OCRFeeder can import data from PDF or graphic files. From version 0.7.1a it supports grabbing images directly from the scanner device. The results can be saved in HTML, OpenDocument, plain text or PDF file formats. hOCR file output is also planned. Initial formatting can be done directly in the program.

