Mass digitization is a term used to describe "large-scale digitization projects of varying scopes." Such projects include efforts to digitize physical books on a mass scale in order to make knowledge openly and publicly accessible. They are made possible by selecting cultural objects, preparing them, scanning them, and constructing the necessary digital infrastructures, including digital libraries. These projects are often piloted by cultural institutions and private bodies; however, individuals may attempt a mass digitization effort as well. Mass digitization efforts occur quite often; millions of files (books, photos, color swatches, etc.) are uploaded to large-scale public or private online archives every single day. This practice of moving the physical to the digital on a mass scale changes the way we interact with knowledge. The history of mass digitization can be traced back as early as the mid-1800s with the advent of microfilm, and technical infrastructures such as the
internet, data farms, and computer data storage make these efforts technologically possible. This seemingly simple process of digitization of physical knowledge, or even products, has vast implications that can be explored.
History of Mass Digitization Initiatives
Fictional Considerations
Perhaps one of the most notable fictional considerations of mass digitization is the speculation in "The Library of Babel" by Jorge Luis Borges. In this account, Borges describes a vision of a library in which every possible permutation of books is available. Although Borges describes the preservation and archiving of all knowledge in a physical space (a library), his fictional vision has, in a digital sense, already taken place. Countless copies of online books are freely available to the public through internet archives and library databases. Accounts like this were quite common, and they convey the idea that "the dream and practice of mass digitization cultural works have been around for decades."
Non-fictional considerations
Some of the earliest digitization programs started before the age of the internet and include the adoption of technologies such as microfilm in the 19th century. The technical affordances of microfilm made it a significant medium in efforts to preserve and extend library materials, and it has also been noted for "graphically dramatizing questions of scale." Microfilm, also known as microphotography, was developed in 1839, and its capabilities demonstrated (perhaps for the first time) the ability to store large amounts of information, in this case photographs, in a physically small space. Discussing the affordances of microfilm, one observer noted that "the whole archives of the nation might be packed away in a snuffbox." Such remarks demonstrate ''how'' the technical infrastructure of microfilm could be leveraged to archive and preserve on a mass scale.
Paul Otlet, a Belgian author often considered one of the founders of information science, "outlined the benefits of microfilm as a stable and long-term remediation format that could be used to extend the reach of literature" in his 1906 work ''Sur une forme nouvelle du livre : le livre microphotographique''. His claim proved right: the Library of Congress and other bodies used microfilm to "digitize" cultural objects such as manuscripts, books, images, and newspapers in the early 20th century.
Technical Infrastructures
Microfilm
Microfilm represents a shift in the infrastructure of data storage: an immense number of pictures could be stored in a physically small space and then enlarged for viewing with the help of a microfilm machine. Microfilm, in combination with the microfilm viewer, was leveraged to allow objects to be digitized, preserved, and viewed on a mass scale. It is interesting to note that students needed the help of staff to use the machine, whereas accessing digital materials today is a swift, easy process that one can carry out independently. More information on microfilm can be found under the "Non-fictional considerations" section of this page.
Server Farms
Another large shift in the infrastructure of data storage was the advent of server farms. Websites rely on server farms for "scalability, reliability, and low-latency access to Internet content". According to Randal Burns, these technologies are essential when building a high-performance infrastructure for content delivery. The move from microfilm to complex server farms with their own schemas demonstrates how the infrastructural demands of mass digitization have grown over time. Server farms both facilitate mass digitization and are the place where its results reside. Without them, data could not be stored or accessed on the scale necessary for mass digitization projects. However, it is important to note that server farms do not act alone in storing data. Other infrastructures, such as hard drives on personal computers, also aid greatly in the storage of data.
Encryption tools and services also work to protect and secure data in sensitive or internal-use mass digitization projects.
Databases
Databases are often seen as the "home" of a variety of mass digitization efforts. Databases such as Google Books allow one to view an entire collection of digitized objects. In the case of Google Books, the database allows a user to search, research, and preview an estimated 40 million titles that the Google team has scanned and uploaded, corresponding to roughly 30% of the estimated number of all books ever published. However, faults do exist within such databases; the hands of a scanner operator can accidentally be scanned and posted in place of the page of the book itself. Errors such as these in public, and often permanent, databases call into question the reliability of human efforts in mass digitization projects.
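As a rough check on the figures above, the share quoted for Google Books implies an estimate for the total number of books ever published. The following is a minimal sketch in Python; the function name `implied_total` is hypothetical, and both inputs are the approximate figures given in the text rather than measured values.

```python
def implied_total(scanned_titles: int, share_of_all_books: float) -> float:
    """Return the total implied by a scanned count and the share it represents."""
    if not 0 < share_of_all_books <= 1:
        raise ValueError("share_of_all_books must be in (0, 1]")
    return scanned_titles / share_of_all_books


# Approximate figures quoted in the text, not measured values.
scanned = 40_000_000
share = 0.30

total = implied_total(scanned, share)
print(f"Implied total: about {total / 1_000_000:.0f} million books")
```

Because both inputs are estimates, the result is best read as an order of magnitude rather than a precise count.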
Other databases allow researchers from all over the world to upload or view data for scientific inquiry. In this case, raw data from scientific experiments, anonymized for participant privacy, is uploaded and stored on a mass scale. A prime example of such a research database is the Child Language Data Exchange System (CHILDES) database. This database houses raw data on language acquisition and includes videos, audio, transcripts, and de-identified participant information. Databases that store published research articles also exist, and include sites such as PubMed, ScienceDirect, JSTOR, and EBSCO.
Databases, in conjunction with server farms and other web-based infrastructures, allow for crucial collaboration in the scientific realm. Here, mass digitization has expanded from the digitization of physical objects (such as books) to the digitization of interactions for scientific inquiry.
Implications
References
* Auerbach, J.; Gitelman, L. (2007-06-13). "Microfilm, Containment, and the Cold War". ''American Literary History''. 19 (3): 745–768. ISSN 0896-7148.
* Luther, Frederic. Microfilm: A History, 1839–1900. Annapolis, MD: The National Microfilm Association, 1959.
* Goldschmidt, & Otlet, P. (1906). ''Sur une forme nouvelle du livre : le livre microphotographique''. Institut international de bibliographie.
* La Hood, Charles G. "Microfilm for the Library of Congress." ''College & Research Libraries'' 34.4 (1973): 291–294.
* Duncan, Virginia L., and Frances E. Parsons. "Use of Microfilm in an Industrial Research Library." ''Spec Libr'' 61.6 (1970): 288–290.