computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...

, compressed pattern matching (abbreviated as CPM) is the process of searching for patterns in compressed data with little or no decompression. Searching in a compressed string is faster than searching an uncompressed string and requires less space.

Compressed matching problem

If the compressed file uses a variable width encoding it could be present a problem: for example, let “100” be the

codeword Code word may refer to: * Code word (communication), an element of a standardized code or protocol * Code word (figure of speech), designed to convey a predetermined meaning to a receptive audience, while remaining inconspicuous to others ** Proce ...

for ''a'' and let “110100” be the codeword for ''b''. If we are looking for an occurrence of ''a'' in the text we could obtain as result also an occurrence that is within the codeword of ''b'': we call this event ''false match''. So we have to verify if the occurrence detected is effectively aligned on a codeword boundary. However we could always decode the entire text and then apply a classic

string matching algorithm A string-searching algorithm, sometimes called string-matching algorithm, is an algorithm that searches a body of text for portions that match by pattern. A basic example of string searching is when the pattern and the searched text are arrays of ...

, but this usually requires more space and time and often is not possible, for example if the compressed file is hosted online. This problem of verifying the match returned by the compressed pattern matching algorithm is a true or a false match together with the impossibility of decoding an entire text is called the compressed matching problem.

Strategies

Many strategies exist for finding the boundaries of codewords and avoiding full decompression of the text, for example: *List of the indices of first bit of each codeword, where we can apply a binary search; *List of the indices of first bit of each codeword with differential coding, so we can take less space within the file; * Mask of bit, where bit 1 marks the starting bit of each codeword; *Subdivision in blocks, for a partial and aimed decompression. There were introduced algorithms that provide running time that grows logarithmically with the increase of string and pattern length.

References

*Shmuel T. Klein and Dana Shapir
PATTERN MATCHING IN HUFFMAN ENCODED TEXTS
(2003)

External links

* * * * * * {{DEFAULTSORT:Compressed Pattern Matching Data compression Pattern matching Computer data