Analyzed Layout and Text Object (ALTO) is an open XML Schema developed by the EU-funded project called METAe.
The standard was initially developed to describe the OCR text and layout information of pages of digitized material. The goal was to describe layout and text in a form that allows the original appearance to be reconstructed from the digitized information, similar to the approach of lossless image storage.
ALTO is often used in combination with the Metadata Encoding and Transmission Standard (METS), which describes the whole digitized object and creates references across the ALTO files, e.g. a description of the reading sequence.
The standard has been hosted by the Library of Congress since 2010 and is maintained by an Editorial Board established at the same time.
From the final version of the ALTO standard in June 2004 (version 1.0) up to version 1.4, ALTO was maintained by CCS Content Conversion Specialists GmbH, Hamburg.
Structure
An ALTO file consists of three major sections as children of the root element:
* The <Description> section contains metadata about the ALTO file itself and processing information on how the file was created.
* The <Styles> section contains the text and paragraph styles with their individual descriptions:
** <TextStyle> has font descriptions
** <ParagraphStyle> has paragraph descriptions, e.g. alignment information
* The <Layout> section contains the content information. It is subdivided into <Page> elements.
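The three-section structure described above can be illustrated with a short Python sketch that reads an ALTO document using only the standard library. The sample document is hand-written for illustration and has not been validated against the ALTO schema. The element and attribute names (Description, Styles, Layout, TextStyle, ParagraphStyle, Page, TextLine, String, CONTENT) and the version 3 namespace URI are assumptions based on the published schema and should be checked against the ALTO version in use. The function name summarize_alto is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative sample; NOT validated against the ALTO schema.
# The namespace URI and element names are assumptions (see lead-in).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Styles>
    <TextStyle ID="TS1" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PS1" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="Hello"/>
            <SP/>
            <String CONTENT="world"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def summarize_alto(xml_text):
    """Return the top-level sections, styles, page IDs and text lines."""
    root = ET.fromstring(xml_text)

    # Direct children of the root element are the major sections.
    sections = [local_name(child.tag) for child in root]

    styles, pages, lines = [], [], []
    for element in root.iter():
        name = local_name(element.tag)
        if name in ("TextStyle", "ParagraphStyle"):
            styles.append((name, element.get("ID")))
        elif name == "Page":
            pages.append(element.get("ID"))
        elif name == "TextLine":
            # Join the CONTENT of each String child with single spaces.
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))

    return {
        "sections": sections,
        "styles": styles,
        "pages": pages,
        "lines": lines,
    }


if __name__ == "__main__":
    summary = summarize_alto(SAMPLE_ALTO)
    print(summary["sections"])
    print(summary["lines"])
```

Matching on local element names rather than a fixed namespace is a deliberate simplification, so the same sketch can be tried on files from different ALTO versions. Production code would normally bind to the namespace of the specific schema version.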
Software support
* ABBYY FineReader
* eScriptorium
* Kitodo
* Tesseract OCR
* Transkribus
See also
* Metadata Encoding and Transmission Standard (METS)
* Dublin Core, an ISO metadata standard
* Preservation Metadata: Implementation Strategies (PREMIS)
* Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
* hOCR
* PAGE (XML)
External links
* ALTO (Analyzed Layout and Text Object) standards on Library of Congress website
* ALTOxml on GitHub: https://altoxml.github.io/ and https://github.com/altoxml
* More info about METS/ALTO by CCS GmbH
* METS ALTO Introduction by CCS GmbH (archived 2014-09-04): https://web.archive.org/web/20140904022519/http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140902.pdf
* XSLT-Transformations from and to ALTO