computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...

, an FM-index is a compressed full-text

substring index In computer science, a substring index is a data structure which gives substring search in a text or text collection in sublinear time. If you have a document S of length n, or a set of documents D=\ of total length n, you can locate all occurrenc ...

based on the Burrows–Wheeler transform, with some similarities to the

suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by a ...

. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manzini (2000)
"Opportunistic Data Structures with Applications".
Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390. who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. The query time, as well as the required storage space, has a sublinear complexity with respect to the size of the input data. The original authors have devised improvements to their original approach and dubbed it "FM-Index version 2". A further improvement, the alphabet-friendly FM-index, combines the use of compression boosting and wavelet treesP. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. An Alphabet-Friendly FM-index. ''In Proc. SPIRE'04'', pages 150-160. LNCS 3246. to significantly reduce the space usage for large alphabets. The FM-index has found use in, among other places,

bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...

Background

Using an index is a common strategy to efficiently search a large body of text. When the text is larger than what reasonably fits within a computer's main memory, there is a need to compress not only the text but also the index. When the FM-index was introduced, there were several suggested solutions that were based on traditional compression methods and tried to solve the compressed matching problem. In contrast, the FM-index is a compressed self-index, which means that it compresses the data and indexes it at the same time.

FM-index data structure

An FM-index is created by first taking the Burrows–Wheeler transform (BWT) of the input text. For example, the BWT of the string "abracadabra$" is "ard$rcaaaabb", and here it is represented by the matrix where each row is a rotation of the text, and the rows have been sorted lexicographically. The transform corresponds to the last column labeled . The BWT in itself allows for some compression with, for instance, move to front and Huffman encoding, but the transform has even more uses. The rows in the matrix are essentially the sorted suffixes of the text and the first column F of the matrix shares similarities with

s. How the suffix array relates to the BWT lies at the heart of the FM-index.

Count

The operation ''count'' takes a pattern and returns the number of occurrences of that pattern in the original text . Since the rows of matrix are sorted, and it contains every suffix of , the occurrences of pattern will be next to each other in a single continuous range. The operation iterates backwards over the pattern. For every character in the pattern, the range that has the character as a suffix is found. For example, the count of the pattern "bra" in "abracadabra" follows these steps: # The first character we look for is , the last character in the pattern. The initial range is set to . This range over represents every character of that has a suffix beginning with ''a''. # The next character to look for is . The new range is , if is the index of the beginning of the range and is the end. This range over is all the characters of that have suffixes beginning with ''ra''. # The last character to look at is . The new range is . This range over is all the characters that have a suffix that begins with ''bra''. Now that the whole pattern has been processed, the count is the same as the size of the range: . If the range becomes empty or the range boundaries cross each other before the whole pattern has been looked up, the pattern does not occur in . Because can be performed in constant time, count can complete in linear time in the length of the pattern: time.

Locate

The operation ''locate'' takes as input an index of a character in and returns its position in . For instance . To locate every occurrence of a pattern, first the range of character is found whose suffix is the pattern in the same way the ''count'' operation found the range. Then the position of every character in the range can be located. To map an index in to one in , a subset of the indices in are associated with a position in . If has a position associated with it, is trivial. If it's not associated, the string is followed with until an associated index is found. By associating a suitable number of indices, an upper bound can be found. ''Locate'' can be implemented to find ''occ'' occurrences of a pattern in a text in time with

O \left(H_k(T) +  \right)

bits per input symbol for any .

Applications

DNA read mapping

FM index with backtracking has been successfully (>2000 citations) applied to approximate string matching/sequence alignment, See Bowtie http://bowtie-bio.sourceforge.net/index.shtml

References

{{reflist Substring indices String data structures