Analyzed Layout And Text Object

Analyzed Layout and Text Object (ALTO) is an open XML Schema developed by the EU-funded project METAe. The standard was initially developed to describe the OCR text and layout information of pages of digitized material. The goal was to describe layout and text in enough detail that the original appearance can be reconstructed from the digitized information, similar to the approach of lossless image saving. ALTO is often used in combination with the Metadata Encoding and Transmission Standard (METS), which describes the whole digitized object and creates references across the ALTO files, e.g. a description of the reading sequence. The standard has been hosted by the Library of Congress since 2010 and is maintained by an Editorial Board established at the same time. From the final version of the initial ALTO standard in June 2004 (version 1.0) up to version 1.4, ALTO was maintained by CCS Content Conversion Specialists GmbH, Hamburg.


Structure

An ALTO file consists of three major sections as children of the root element:
* The <Description> section contains metadata about the ALTO file itself and processing information on how the file was created.
* The <Styles> section contains the text and paragraph styles with their individual descriptions:
** <TextStyle> has font descriptions
** <ParagraphStyle> has paragraph descriptions, e.g. alignment information
* The <Layout> section contains the content information. It is subdivided into <Page> elements.
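The three-section structure can be illustrated with a short Python sketch. It assumes the sections are the Description, Styles and Layout elements of the ALTO schema, and it parses a small hand-written sample that has not been validated against the schema. The namespace URI, attribute values, and the helper names `local_name` and `summarize` are illustrative assumptions, not part of any official tooling.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative sample; NOT validated against the ALTO schema.
# The namespace URI and all attribute values are assumptions for the demo.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description>
    <MeasurementUnit>pixel</MeasurementUnit>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1"/>
    <Page ID="P2" PHYSICAL_IMG_NR="2"/>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def summarize(xml_text: str) -> dict:
    """Map each child section of the root element to its child element names."""
    root = ET.fromstring(xml_text)
    return {
        local_name(section.tag): [local_name(child.tag) for child in section]
        for section in root
    }


summary = summarize(SAMPLE_ALTO)
for section_name, children in summary.items():
    print(section_name, children)
```

Stripping the namespace prefix keeps the sketch independent of the ALTO version a file declares; code that targets one specific version would more likely pass an explicit namespace mapping to `find` and `findall`.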


Software support

* ABBYY FineReader
* eScriptorium
* Kitodo
* Tesseract OCR
* Transkribus


See also

* Metadata Encoding and Transmission Standard (METS)
* Dublin Core, an ISO metadata standard
* Preservation Metadata: Implementation Strategies (PREMIS)
* Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
* hOCR
* PAGE (XML)


External links


* ALTO (Analyzed Layout and Text Object) standards on the Library of Congress website
* ALTOxml on GitHub: https://altoxml.github.io/ and https://github.com/altoxml
* More info about METS/ALTO by CCS GmbH
* METS ALTO Introduction by CCS GmbH (archived 2014-09-04 at https://web.archive.org/web/20140904022519/http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140902.pdf)
* XSLT-Transformations from and to ALTO