In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content). The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines. Additionally, several significant general-purpose mainframe-based database management systems have used inverted list architectures, including ADABAS, DATACOM/DB, and Model 204.
There are two main variants of inverted indexes: A record-level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word-level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document.
The latter form offers more functionality (like phrase searches), but needs more processing power and space to be created.
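The difference between the two variants can be illustrated with a minimal sketch. The three sample texts, the whitespace tokenizer, and all function names below are assumptions chosen for illustration; they do not come from any particular library. The record-level index stores only document IDs per word, while the word-level index also stores positions, which is what makes the phrase check possible.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_record_level(docs):
    """Record-level index: word -> sorted list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def build_word_level(docs):
    """Word-level index: word -> {document ID: [positions in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return {word: dict(postings) for word, postings in index.items()}

def phrase_search(word_index, phrase):
    """Return IDs of documents in which the words of `phrase` occur consecutively."""
    words = tokenize(phrase)
    if not words or any(w not in word_index for w in words):
        return []
    # Only documents containing every word can contain the phrase.
    candidates = set(word_index[words[0]])
    for w in words[1:]:
        candidates &= set(word_index[w])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase starts at p if word i is found at position p + i.
        for p in word_index[words[0]][doc_id]:
            if all(p + i in word_index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits

# Hypothetical sample documents, identified by their list position.
docs = ["it is what it is", "what is it", "it is a banana"]
record_index = build_record_level(docs)
word_index = build_word_level(docs)
```

With these sample texts, `record_index["what"]` lists documents 0 and 1, but only the word-level index can tell that the phrase "what is it" occurs in document 1 alone, since document 0 contains the same words in a different order.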
Applications
The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and through each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document as in the forward index, the inverted index data structure lists the documents per word.
With the inverted index created, the query can now be resolved by jumping to the word ID (via random access) in the inverted index.
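As a rough sketch of the inversion step described above, the following code builds an inverted index from a small forward index and contrasts the two query strategies. The document names, word lists, and function names are hypothetical, and a Python dictionary stands in for the random-access structure; a production system would typically use on-disk structures instead.

```python
from collections import defaultdict

# Hypothetical forward index: document -> list of words it contains.
forward_index = {
    "doc1": ["the", "cow", "says", "moo"],
    "doc2": ["the", "cat", "and", "the", "hat"],
    "doc3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

def scan_forward(forward, word):
    """Answer the query by visiting every document and scanning its words."""
    return [doc for doc, words in forward.items() if word in words]

def invert(forward):
    """Invert the forward index: word -> list of documents containing it."""
    inverted = defaultdict(list)
    for doc, words in forward.items():
        # dict.fromkeys drops repeated words while keeping their order.
        for word in dict.fromkeys(words):
            inverted[word].append(doc)
    return dict(inverted)

def lookup(inverted, word):
    """Answer the query with a single keyed access into the inverted index."""
    return inverted.get(word, [])

inverted_index = invert(forward_index)
```

Both `scan_forward(forward_index, "cat")` and `lookup(inverted_index, "cat")` return the same documents, but the first touches every word of every document, while the second is a single dictionary access whose cost does not grow with the number of documents.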
In pre-computer times, concordances to important books were manually assembled. These were effectively inverted indexes with a small amount of accompanying commentary that required a tremendous amount of effort to produce.
In bioinformatics, inverted indexes are very important in the sequence assembly of short fragments of sequenced DNA. One way to find the source of a fragment is to search for it against a reference DNA sequence. A small number of mismatches (due to differences between the sequenced DNA and reference DNA, or errors) can be accounted for by dividing the fragment into smaller fragments, since at least one subfragment is likely to match the reference DNA sequence exactly. The matching requires constructing an inverted index of all substrings of a certain length from the reference DNA sequence. Since human DNA contains more than 3 billion base pairs, and the index must store a DNA substring for every entry along with a 32-bit integer for the position itself, the storage requirement for such an inverted index would probably be in the tens of gigabytes.
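The subfragment matching idea can be sketched as follows. This is a toy illustration under stated assumptions: the reference string, the fragment, the substring length `k = 4`, and all function names are made up for the example, and real assemblers and aligners use far more compact index structures.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Inverted index: every length-k substring -> list of start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)

def candidate_positions(index, fragment, k):
    """Split the fragment into non-overlapping length-k subfragments and
    collect the fragment start positions implied by each exact hit."""
    candidates = set()
    for offset in range(0, len(fragment) - k + 1, k):
        seed = fragment[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)

def mismatches(reference, fragment, start):
    """Count differing characters when the fragment is placed at `start`;
    return None if the fragment would run past the end of the reference."""
    window = reference[start:start + len(fragment)]
    if len(window) < len(fragment):
        return None
    return sum(a != b for a, b in zip(window, fragment))

# Hypothetical data: the fragment differs from reference[5:17] in one base.
reference = "ACGTTGCAAGGCTTACGATCG"
fragment = "GCAAGGGTTACG"
k = 4

index = build_kmer_index(reference, k)
starts = candidate_positions(index, fragment, k)
```

Here the middle subfragment contains the mismatch and finds no hit, but the first and last subfragments both point to the same start position, which can then be verified with `mismatches`.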
Compression
For historical reasons, inverted list compression and bitmap compression were developed as separate lines of research, and only later were recognized as solving essentially the same problem.
[Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression". 2017. doi: 10.1145/3035918.3064007]
See also
* Index (search engine)
* Reverse index
* Vector space model
External links
* NIST's Dictionary of Algorithms and Data Structures: inverted index
* Managing Gigabytes for Java: a free full-text search engine for large document collections written in Java.
* Lucene: Apache Lucene is a full-featured text search engine library written in Java.
* Sphinx Search: open source high-performance, full-featured text search engine library used by craigslist and others employing an inverted index.
* Example implementations on Rosetta Code
* Caltech Large Scale Image Search Toolbox: a Matlab toolbox implementing Inverted File Bag-of-Words image search.