eScriptorium is a platform for manual or automated segmentation and text recognition of historical manuscripts and prints.
Details

The software is open source and was developed at Paris Sciences et Lettres University as part of the projects ''Scripta'' and ''RESILIENCE'', with contributions from other institutions. It was partly funded by the EU's Horizon 2020 funding program and a grant from the Andrew W. Mellon Foundation.
Scanned pages from manuscripts and prints can be imported into eScriptorium and exported as text in various formats (plain text, ALTO or PAGE XML, TEI). The text areas and the text lines within them are first identified in the images, either manually or automatically (segmentation). The text lines are then transcribed manually or automatically.
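As an illustration of working with one of the export formats named above, the following minimal sketch reads the text lines out of an ALTO document using only the Python standard library. The sample document is hand-written for this example and is not actual eScriptorium output; real exports may carry additional elements and attributes. The names `SAMPLE_ALTO`, `local_name` and `alto_lines` are hypothetical helpers introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment used only for illustration.
# It is NOT real eScriptorium output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="page_1">
      <PrintSpace>
        <TextBlock ID="block_1">
          <TextLine ID="line_1">
            <String CONTENT="In principio erat verbum"/>
          </TextLine>
          <TextLine ID="line_2">
            <String CONTENT="et verbum"/>
            <String CONTENT="erat apud Deum"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the transcription of every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # A line may hold one String per line or one per word; join them all.
        parts = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(parts))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local tag name rather than a fixed namespace is a design choice made for this sketch, so that the same function tolerates different ALTO schema versions.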
Both automatic segmentation and text recognition can be trained using manually created or corrected examples (ground truth). The new models created in this way can be shared with others and can therefore be easily reused.
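The article does not say how a trained recognition model is evaluated. One common way to compare an automatic transcription with ground truth is the character error rate: the edit distance between the two strings divided by the length of the ground truth. The sketch below is a generic illustration of that measure, not a description of what eScriptorium or Kraken computes internally; the function names `levenshtein` and `character_error_rate` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, prediction):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, prediction) / len(ground_truth)


if __name__ == "__main__":
    # Illustrative strings only; not taken from any real transcription.
    print(character_error_rate("manuscript", "manuscrlpt"))
```

A lower value means the automatic transcription is closer to the manually created one. Because the prediction may be longer than the ground truth, the value can exceed 1.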
eScriptorium is built on top of the free OCR software ''Kraken'' by Benjamin Kiessling, a derivative of the OCR software ''OCRopus''. Kraken is suitable for handwritten and printed texts and also supports scripts such as Hebrew and Arabic, which are written from right to left.
Comparable programs with similar functions are OCR4All and Transkribus.
External links
{{Commons category|EScriptorium|eScriptorium}}
Optical character recognition software
Free software programmed in JavaScript
Free software programmed in Python