Compressed matching problem
If the compressed file uses a variable width encoding it could be present a problem: for example, let “100” be the codeword for ''a'' and let “110100” be the codeword for ''b''. If we are looking for an occurrence of ''a'' in the text we could obtain as result also an occurrence that is within the codeword of ''b'': we call this event ''false match''. So we have to verify if the occurrence detected is effectively aligned on a codeword boundary. However we could always decode the entire text and then apply a classicStrategies
Many strategies exist for finding the boundaries of codewords and avoiding full decompression of the text, for example: *List of the indices of first bit of each codeword, where we can apply a binary search; *List of the indices of first bit of each codeword with differential coding, so we can take less space within the file; * Mask of bit, where bit 1 marks the starting bit of each codeword; *Subdivision in blocks, for a partial and aimed decompression. There were introduced algorithms that provide running time that grows logarithmically with the increase of string and pattern length.References
*Shmuel T. Klein and Dana ShapirExternal links
* * * * * * {{DEFAULTSORT:Compressed Pattern Matching Data compression Pattern matching Computer data