Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web. The largest web archiving organization based on a bulk crawling approach is the Wayback Machine, which strives to maintain an archive of the entire Web.

The growing portion of human culture created and recorded on the web makes it inevitable that more and more libraries and archives will have to face the challenges of web archiving. National libraries, national archives and various consortia of organizations are also involved in archiving culturally important Web content. Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.


History and development

While curation and organization of the web has been prevalent since the mid- to late-1990s, one of the first large-scale web archiving projects was the Internet Archive, a non-profit organization created by Brewster Kahle in 1996. The Internet Archive released its own search engine for viewing archived web content, the Wayback Machine, in 2001. As of 2018, the Internet Archive was home to 40 petabytes of data. The Internet Archive also developed many of its own tools for collecting and storing its data, including PetaBox for storing large amounts of data efficiently and safely, and Heritrix, a web crawler developed in conjunction with the Nordic national libraries.

Other projects launched around the same time included a web archiving project by the National Library of Canada, Australia's Pandora, Tasmanian web archives and Sweden's Kulturarw3. From 2001 the International Web Archiving Workshop (IWAW) provided a platform to share experiences and exchange ideas. The International Internet Preservation Consortium (IIPC), established in 2003, has facilitated international collaboration in developing standards and open source tools for the creation of web archives.

The now-defunct Internet Memory Foundation was founded in 2004 by the European Commission in order to archive the web in Europe. The project developed and released many open source tools for "rich media capturing, temporal coherence analysis, spam assessment, and terminology evolution detection." The data from the foundation is now housed by the Internet Archive but is not currently publicly accessible.

Despite the lack of centralized responsibility for its preservation, web content is rapidly becoming the official record. For example, in 2017 the United States Department of Justice affirmed that the government treats the President's tweets as official statements.


Collecting the web

Web archivists generally archive various types of web content including HTML web pages, style sheets, JavaScript, images, and video. They also archive metadata about the collected resources such as access time, MIME type, and content length. This metadata is useful in establishing the authenticity and provenance of the archived collection.
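Many web archives store captures and this per-resource metadata together in WARC container files. As a rough illustration (not a description of any particular archive's pipeline), the following Python sketch uses the open-source warcio library to read back the capture time, MIME type, and content length of each archived response; the file name is a placeholder.

```python
# Sketch: read per-record metadata from a WARC file with warcio.
# "example.warc.gz" is a placeholder file name.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            capture_time = record.rec_headers.get_header("WARC-Date")
            mime_type = record.http_headers.get_header("Content-Type")
            length = record.rec_headers.get_header("Content-Length")
            print(url, capture_time, mime_type, length)
```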


Methods of collection


Remote harvesting

The most common web archiving technique uses web crawlers to automate the process of collecting web pages. Web crawlers typically access web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content. Examples of web crawlers used for web archiving include:

* Heritrix
* HTTrack
* Wget

Various free services may also be used to archive web resources "on-demand", using web crawling techniques. These services include the Wayback Machine and WebCite.
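As a rough illustration of what remote harvesting involves, the sketch below fetches pages breadth-first from a seed URL and saves each response to disk. It is a deliberately minimal stand-in, not how Heritrix, HTTrack, or Wget actually work: the seed URL, page limit, and file-naming scheme are arbitrary assumptions, and real crawlers also handle politeness, robots.txt, media files, and deduplication.

```python
# Minimal remote-harvesting sketch: fetch same-site pages breadth-first
# from a seed URL and save each response body to disk.
import hashlib
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                       # third-party: requests
from bs4 import BeautifulSoup         # third-party: beautifulsoup4

SEED = "https://example.org/"         # placeholder seed URL
LIMIT = 50                            # arbitrary page budget

seen, queue = set(), deque([SEED])
while queue and len(seen) < LIMIT:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    resp = requests.get(url, timeout=10)
    # Store the raw bytes under a content-addressed file name.
    name = hashlib.sha1(url.encode()).hexdigest() + ".html"
    with open(name, "wb") as f:
        f.write(resp.content)
    # Queue only links that stay on the same host as the seed.
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == urlparse(SEED).netloc:
            queue.append(link)
```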


Database archiving

Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system. This approach is exemplified by the DeepArc and Xinq tools developed by the Bibliothèque Nationale de France and the National Library of Australia respectively. DeepArc enables the structure of a relational database to be mapped to an XML schema, and the content exported into an XML document. Xinq then allows that content to be delivered online. Although the original layout and behavior of the website cannot be preserved exactly, Xinq does allow the basic querying and retrieval functionality to be replicated.
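DeepArc performs this relational-to-XML mapping against a schema chosen by the archivist. As a much simpler illustration of the same idea, the sketch below (Python standard library only; the database file, table, and column names are hypothetical) exports the rows of one table into an XML document that can be preserved and queried independently of the original database engine.

```python
# Sketch: serialise the rows of a relational table into XML so the
# archived content no longer depends on the original database engine.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("site_content.db")                    # placeholder database
cur = conn.execute("SELECT id, title, body FROM articles")   # placeholder table

root = ET.Element("articles")
for row_id, title, body in cur:
    article = ET.SubElement(root, "article", id=str(row_id))
    ET.SubElement(article, "title").text = title
    ET.SubElement(article, "body").text = body

ET.ElementTree(root).write("articles.xml", encoding="utf-8",
                           xml_declaration=True)
conn.close()
```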


Transactional archiving

Transactional archiving is an event-driven approach which collects the actual transactions that take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website on a given date. This may be particularly important for organizations which need to comply with legal or regulatory requirements for disclosing and retaining information. A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bitstreams.
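As an illustration of that interception step, the sketch below wraps a Python WSGI application so that every response body is hashed, stored once as a bitstream, and logged against the requested path and capture time. It is only a sketch under assumed names; production transactional archiving systems typically operate at the web server or proxy level and record full request/response pairs.

```python
# Sketch of transactional archiving as WSGI middleware: every response
# passing through is deduplicated by content hash and stored as a
# bitstream, with the requested path and capture time logged.
import hashlib
import json
import os
import time


class TransactionalArchiver:
    def __init__(self, app, store_dir="archive"):
        self.app = app
        self.store_dir = store_dir
        os.makedirs(store_dir, exist_ok=True)

    def __call__(self, environ, start_response):
        # Buffer the wrapped application's response body.
        body = b"".join(self.app(environ, start_response))
        digest = hashlib.sha256(body).hexdigest()
        path = os.path.join(self.store_dir, digest)
        if not os.path.exists(path):          # skip duplicate content
            with open(path, "wb") as f:
                f.write(body)
        # Record which request produced this bitstream, and when.
        with open(os.path.join(self.store_dir, "requests.jsonl"), "a") as log:
            log.write(json.dumps({"path": environ.get("PATH_INFO", "/"),
                                  "time": time.time(),
                                  "sha256": digest}) + "\n")
        return [body]
```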


Difficulties and limitations


Crawlers

Web archives which rely on web crawling as their primary means of collecting the Web are influenced by the difficulties of web crawling:

* The robots exclusion protocol may request that crawlers not access portions of a website (a minimal robots.txt check is sketched at the end of this section). Some web archivists may ignore the request and crawl those portions anyway.
* Large portions of a web site may be hidden in the Deep Web. For example, the results page behind a web form can lie in the Deep Web if crawlers cannot follow a link to the results page.
* Crawler traps (e.g., calendars) may cause a crawler to download an infinite number of pages, so crawlers are usually configured to limit the number of dynamic pages they crawl.
* Most archiving tools do not capture the page exactly as it is; ad banners and images are often missed while archiving.

However, a native format web archive, i.e., a fully browsable web archive with working links, media, and so on, is only really possible using crawler technology. The Web is so large that crawling a significant portion of it takes a large number of technical resources, and it changes so fast that portions of a website may change before a crawler has even finished crawling it.
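For crawlers that do honor the robots exclusion protocol, the check typically happens before each fetch. A minimal example using Python's standard library (the site, path, and user-agent string are placeholders):

```python
# Check the robots exclusion protocol before fetching a page.
# Archiving crawlers may be configured to honor or ignore this result.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")   # placeholder site
rp.read()                                      # fetch and parse robots.txt

if rp.can_fetch("ExampleArchiveBot", "https://example.org/private/page"):
    print("allowed to crawl")
else:
    print("robots.txt requests that this path not be crawled")
```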


General limitations

Some web servers are configured to return different pages to web archiver requests than they would in response to regular browser requests. This is typically done to fool search engines into directing more user traffic to a website, and is often done to avoid accountability or to provide enhanced content only to those browsers that can display it.

Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web". However, national libraries in some countries have a legal right to copy portions of the web under an extension of legal deposit.

Some private non-profit web archives that are made publicly accessible, such as WebCite, the Internet Archive or the Internet Memory Foundation, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage. WebCite cites a lawsuit against Google's caching, which Google won.


Laws

In 2017 the Financial Industry Regulatory Authority, Inc. (FINRA), a United States financial regulatory organization, released a notice stating that all businesses doing digital communications are required to keep a record. This includes website data, social media posts, and messages. Some copyright laws may inhibit web archiving. For instance, academic archiving by Sci-Hub falls outside the bounds of contemporary copyright law. The site provides enduring access to academic works, including those that do not have an open access license, and thereby contributes to the archival of scientific research which may otherwise be lost.


See also

* Anna's Archive
* Archive site
* Archive Team
* archive.today (formerly archive.is)
* Collective memory
* Common Crawl
* Digital hoarding
* Digital preservation
* Digital library
* Google Cache
* List of Web archiving initiatives
* Wikipedia:List of web archives on Wikipedia
* Memento Project
* Minerva Initiative
* Mirror website
* National Digital Information Infrastructure and Preservation Program (NDIIPP)
* National Digital Library Program (NDLP)
* PADICAT
* PageFreezer
* Pandora Archive
* UK Web Archive
* Virtual artifact
* Wayback Machine
* Web crawling
* WebCite




External links


* International Internet Preservation Consortium (IIPC) – international consortium whose mission is to acquire, preserve, and make accessible knowledge and information from the Internet for future generations
* International Web Archiving Workshop (IWAW) – annual workshop that focuses on web archiving
* Library of Congress – Web Archiving: https://www.loc.gov/webarchiving/
* Web archiving bibliography – lengthy list of web-archiving resources
* Julien Masanès, Bibliothèque Nationale de France
* Comparison of web archiving services