Page Analysis And Ground Truth Elements

Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents. Comparable to ALTO (Analyzed Layout and Text Object), it allows the organisation and structure of a page and its contents to be described. PAGE XML can be used to describe:
* page content (regions, lines of text, words, glyphs, reading order, text content, ...)
* the evaluation of the layout analysis (evaluation profiles, evaluation results, ...)
* the dewarping of the document image (dewarping grids)

The format is developed by the Pattern Recognition & Image Analysis Lab (PRIMA) at the University of Salford in Greater Manchester. It was designed to be used in conjunction with automatic segmentation and transcription techniques (optical character recognition, OCR, and handwritten text recognition, HTR): PAGE aims to support each step in the processing chain for document image analysis, from image enhancement to layout analysis to OCR.

The PAGE XML schema is notably used as an export and import format by automatic transcription software such as eScriptorium and Transkribus. It is also an export format used by Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts.
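As an illustration of the page content description above, the following minimal sketch reads text lines and their outlines from a PAGE XML document using Python's standard library. The embedded sample is hand-written and heavily simplified for this example; it has not been validated against the official schema, and the namespace version it declares is an assumption. The helper names `parse_points` and `extract_lines` are hypothetical and do not belong to any PAGE tooling.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample (assumption: not validated against the schema).
# The namespace version below is one published PAGE content namespace; real
# files may declare a different version.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>First line of text</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,220 900,220 900,320 100,320"/>
        <TextEquiv><Unicode>Second line of text</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def extract_lines(xml_text):
    """Return id, outline polygon and text of each TextLine, in document order."""
    root = ET.fromstring(xml_text)
    # Read the namespace from the root tag so the code does not depend on one
    # schema version (assumes the document is namespaced, as PAGE files are).
    ns = {"pc": root.tag.split("}")[0].lstrip("{")}
    lines = []
    for region in root.iterfind(".//pc:TextRegion", ns):
        for line in region.iterfind("pc:TextLine", ns):
            coords = line.find("pc:Coords", ns)
            text = line.find("pc:TextEquiv/pc:Unicode", ns)
            lines.append({
                "id": line.get("id"),
                "polygon": parse_points(coords.get("points")) if coords is not None else [],
                "text": text.text if text is not None else "",
            })
    return lines


if __name__ == "__main__":
    for entry in extract_lines(SAMPLE):
        print(entry["id"], entry["polygon"][0], entry["text"])
```

The sketch walks the file in document order only; a fuller reader would also consult the reading order information that PAGE can store, which is left out here for brevity.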




External links


* Documentation
* Encoding example in the OCR-D project, funded by the Deutsche Forschungsgemeinschaft
* Documentation "Page Content - Ground Truth and Storage"
* Documentation "Evaluation - Metadata, Profile and Results"
* Documentation "Dewarping - Ground Truth and Storage"