The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations.
The protocol is usually just referred to as the OAI Protocol.
OAI-PMH uses XML over HTTP. Version 2.0 of the protocol was released in 2002; the document was last updated in 2015. It is published under a Creative Commons BY-SA license.
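As an illustration of the Dublin Core requirement, the following minimal Python sketch reads the Dublin Core elements out of a single OAI-PMH record. The record itself, its identifier, and the function name `dublin_core_fields` are invented for this example; the three XML namespaces are the ones used by OAI-PMH 2.0 and unqualified Dublin Core.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses carrying unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record, invented for illustration only.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1234</identifier>
    <datestamp>2002-06-14</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example e-print</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:creator>Roe, Richard</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
"""


def dublin_core_fields(record_xml: str) -> dict[str, list[str]]:
    """Collect the Dublin Core elements of one record as {name: [values]}."""
    record = ET.fromstring(record_xml)
    fields: dict[str, list[str]] = {}
    container = record.find("oai:metadata/oai_dc:dc", NS)
    if container is None:
        # Deleted records, for example, carry no metadata section.
        return fields
    for element in container:
        # Tags look like "{namespace}title"; keep only the local name.
        name = element.tag.split("}", 1)[-1]
        fields.setdefault(name, []).append((element.text or "").strip())
    return fields


if __name__ == "__main__":
    for name, values in dublin_core_fields(SAMPLE_RECORD).items():
        print(name, values)
```

Values are kept as lists because Dublin Core elements are repeatable, as with the two `dc:creator` entries in the sample.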
History
In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability problems among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention".
Several workshops were held in 2000 at the ACM Digital Libraries conference, at the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, and elsewhere to share the ideas from the Santa Fe Convention. The workshops revealed that the problems faced by the e-print community were shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information and the Digital Library Federation provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 to improve the interface developed at the Santa Fe Convention. The specifications were then refined over e-mail.
OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington, D.C., and at another in February in Berlin, Germany. Subsequent modifications to the XML standard by the W3C required minor modifications to OAI-PMH, resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible.
OAI workshops
Since 2001, CERN, later in collaboration with the University of Geneva, has organized bi-annual OAI workshops, which over time have developed to cover most aspects of open science. Since 2021 the workshop series has been named the Geneva Workshop on Innovations in Scholarly Communication, with the nickname OAI reflecting its origin.
Uses
Some commercial search engines use OAI-PMH to acquire more resources. Google initially included support for OAI-PMH when launching sitemaps, but in May 2008 decided to support only the standard XML Sitemaps format. In 2004, Yahoo! acquired content from OAIster (University of Michigan) that had been obtained through metadata harvesting with OAI-PMH.
Wikimedia uses an OAI-PMH repository to provide feeds of Wikipedia and related site updates for search engines and other bulk analysis/republishing endeavors. When thousands of files are harvested every day, OAI-PMH can reduce network traffic and other resource usage through incremental harvesting. NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day.
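Incremental harvesting can be sketched as follows: a harvester processes one page of a `ListRecords` response, notes which records changed or were deleted, and remembers the latest datestamp so that the next request only needs to ask for newer records. The sample response, the token value, and the function name `summarize_page` are hypothetical; record metadata is omitted from the sample for brevity, and the string comparison assumes all datestamps share one granularity.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical response page, invented for illustration.
# Real non-deleted records would also carry a <metadata> section.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-03-02T00:00:00Z</responseDate>
  <request verb="ListRecords">https://repository.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2024-03-01</datestamp>
      </header>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:2</identifier>
        <datestamp>2024-02-27</datestamp>
      </header>
    </record>
    <resumptionToken>hypothetical-token-1</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def summarize_page(response_xml):
    """Return changed/deleted identifiers, the latest datestamp, and the next token."""
    root = ET.fromstring(response_xml)
    changed, deleted = [], []
    latest = None
    for header in root.iterfind("oai:ListRecords/oai:record/oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            changed.append(identifier)
        # ISO 8601 datestamps of equal granularity sort correctly as strings.
        if datestamp and (latest is None or datestamp > latest):
            latest = datestamp
    token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=OAI_NS)
    return {
        "changed": changed,
        "deleted": deleted,
        "next_from": latest,               # lower bound for the next harvest
        "resumption_token": token or None  # None when the list is complete
    }


if __name__ == "__main__":
    print(summarize_page(SAMPLE_RESPONSE))
```

A harvester would typically persist `next_from` between runs and keep requesting pages while `resumption_token` is not `None`.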
The mod_oai project uses OAI-PMH to expose content that is accessible from Apache Web servers to web crawlers.
OAI-PMH has since also been applied to the sharing of scientific data.
Software
OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.
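A harvester's request can be sketched as a plain HTTP GET URL. The minimal Python example below builds a `ListRecords` request with an optional datestamp range and set restriction, using the protocol's `verb`, `metadataPrefix`, `from`, `until` and `set` arguments. The base URL, the set name, and the helper name `list_records_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for a repository's base URL."""
    # "oai_dc" is the prefix for the Dublin Core format every repository must offer.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # lower bound of the datestamp range
    if until_date:
        params["until"] = until_date    # upper bound of the datestamp range
    if set_spec:
        params["set"] = set_spec        # restrict to a set defined by the provider
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical repository and set name.
    print(list_records_url("https://repository.example.org/oai",
                           from_date="2024-01-01", set_spec="physics"))
```

Only the arguments that are actually supplied end up in the query string, so the same helper covers both full and selective harvests.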
A number of software systems support OAI-PMH, including Fedora, EThOS from the British Library, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Digibib from Digibis, MyCoRe, Koha, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, ArchivalWare from PTFS, DOOR from the eLab in Lugano, Switzerland, panFMP from the PANGAEA data library, SimpleDL from Roaring Development, and jOAI from the National Center for Atmospheric Research.
Archives
A number of large archives support the protocol, including arXiv and the CERN Document Server.
See also
* Data format management
* Digital curation
* Digital preservation
* File format
* Dublin Core, an ISO metadata standard
* National Digital Information Infrastructure and Preservation Program (NDIIPP)
* National Digital Library Program (NDLP)
* Metadata Encoding and Transmission Standard (METS) maintained by the Library of Congress
* Preservation Metadata: Implementation Strategies (PREMIS)
* LOCKSS
* Search as a service
* Web archiving
* Object Reuse and Exchange (OAI-ORE)
* Geneva Workshop on Innovations in Scholarly Communication
External links
* Suleyman Demirel University Open Archives Harvester
* http://www.digitalpreservation.gov/ Library of Congress, National Digital Information Infrastructure and Preservation Program
* Library of Congress, Web Capture