Archive Site

In web archiving, an archive site is a website that stores information on webpages from the past for anyone to view.


Common techniques

Two common techniques for archiving websites are using a web crawler or soliciting user submissions:

# Using a web crawler: By using a web crawler (e.g., the Internet Archive), the service does not depend on an active community for its content, and can thereby build a larger database faster. However, web crawlers can only index and archive information that the public has chosen to post to the Internet, or that is available to be crawled, because website developers and system administrators can block web crawlers from accessing certain web pages (using a robots.txt file).
# User submissions: While it can be difficult to start a user submission service because of potentially low submission rates, this approach can yield some of the best results. Crawling web pages only captures the information the public has chosen to post online; potential content providers may not bother to post certain information because they assume no one would be interested in it, because they lack a proper venue in which to post it, or because of copyright concerns. Users who see that someone wants their information, however, may be more apt to submit it.
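
The robots.txt restriction on crawlers can be illustrated with a minimal sketch in Python, using the standard library's urllib.robotparser module. The robots.txt rules, the example.org URLs, the bot name ExampleArchiveBot, and the helper allowed_urls are all hypothetical and chosen for illustration; this is not how any particular archive site implements its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this file
# from the root of the target site before requesting any page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical candidate pages for the crawler to archive.
candidates = [
    "https://example.org/index.html",
    "https://example.org/private/notes.html",
]

# Only pages outside the disallowed /private/ path remain.
crawlable = allowed_urls(ROBOTS_TXT, "ExampleArchiveBot", candidates)
print(crawlable)
```

A crawler that honors these rules would archive only the pages left in crawlable, which is why blocked content tends to be missing from crawler-built archives.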


Examples


Google Groups

On 12 February 2001, Google acquired the Usenet discussion group archives from Deja.com and turned them into its Google Groups service. The service allows users to search old discussions with Google's search technology, while still allowing users to post to the mailing lists.


Internet Archive

The Internet Archive is building a compendium of websites and digital media. Since 1996, the Archive has used a web crawler to build up its database. It is one of the best-known archive sites.


NBCUniversal Archives

NBCUniversal Archives offers access to exclusive content from NBCUniversal and its subsidiaries. Its website provides easy viewing of past and recent news clips, and it is a prime example of a news archive.


Nextpoint

Nextpoint offers an automated, cloud-based SaaS product for marketing, compliance, and litigation-related needs, including electronic discovery.


PANDORA Archive

PANDORA (Pandora Archive), founded in 1996 by the National Library of Australia, stands for Preserving and Accessing Networked Documentary Resources of Australia, which encapsulates its mission. It provides a long-term catalog of selected online publications and websites that are authored by Australians or are about Australian topics. The catalog is built using PANDAS (PANDORA Digital Archiving System).


textfiles.com

textfiles.com is a large library of old text files maintained by Jason Scott Sadofsky. Its mission is to archive the old documents that floated around the bulletin board systems (BBS) of his youth and to document other people's experiences on bulletin board systems.


See also

* Internet Archive
* Pandora Archive
* WebCite
* Web archiving

