coding theory Coding theory is the study of the properties of codes and their respective fitness for specific applications. Codes are used for data compression, cryptography, error detection and correction, data transmission and data storage. Codes are stud ...

, the Sardinas–Patterson algorithm is a classical algorithm for determining in

polynomial time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by ...

whether a given

variable-length code In coding theory a variable-length code is a code which maps source symbols to a ''variable'' number of bits. Variable-length codes can allow sources to be compressed and decompressed with ''zero'' error (lossless data compression) and still b ...

is uniquely decodable, named after August Albert Sardinas and George W. Patterson, who published it in 1953. The algorithm carries out a systematic search for a string which admits two different decompositions into codewords. As Knuth reports, the algorithm was rediscovered about ten years later in 1963 by Floyd, despite the fact that it was at the time already well known in coding theory.

Idea of the algorithm

Consider the code

\

. This code, which is based on an example by Berstel, is an example of a code which is not uniquely decodable, since the string :011101110011 can be interpreted as the sequence of codewords :01110 – 1110 – 011, but also as the sequence of codewords :011 – 1 – 011 – 10011. Two possible decodings of this encoded string are thus given by ''cdb'' and ''babe''. In general, a codeword can be found by the following idea: In the first round, we choose two codewords

x_1

and

y_1

such that

x_1

is a

prefix A prefix is an affix which is placed before the stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy''. Particu ...

y_1

, that is,

x_1w = y_1

for some "dangling suffix"

w

. If one tries first

x_1=011

and

y_1=01110

, the dangling suffix is

w = 10

. If we manage to find two sequences

x_2,\ldots,x_p

and

y_2,\ldots,y_q

of codewords such that

x_2\cdots x_p = wy_2\cdots y_q

, then we are finished: For then the string

x = x_1x_2\cdots x_p

can alternatively be decomposed as

y_1y_2\cdots y_q

, and we have found the desired string having at least two different decompositions into codewords. In the second round, we try out two different approaches: the first trial is to look for a codeword that has ''w'' as prefix. Then we obtain a new dangling suffix ''w, with which we can continue our search. If we eventually encounter a dangling suffix that is itself a codeword (or the empty word), then the search will terminate, as we know there exists a string with two decompositions. The second trial is to seek for a codeword that is itself a prefix of ''w''. In our example, we have

w = 10

, and the sequence ''1'' is a codeword. We can thus also continue with ''w'=0'' as the new dangling suffix.

Precise description of the algorithm

The algorithm is described most conveniently using quotients of

formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of s ...

s. In general, for two sets of strings ''D'' and ''N'', the (left) quotient

N^D

is defined as the residual words obtained from ''D'' by removing some prefix in ''N''. Formally,

N^D = \

. Now let

C

denote the (finite) set of codewords in the given code. The algorithm proceeds in rounds, where we maintain in each round not only one dangling suffix as described above, but the (finite) set of all potential dangling suffixes. Starting with round

i=1

, the set of potential dangling suffixes will be denoted by

S_i

. The sets

S_i

are defined inductively as follows:

S_1 = C^C \setminus \

. Here, the symbol

\varepsilon

denotes the empty word.

S_ = C^S_i\cup S_i^C

, for all

i\ge 1

. The algorithm computes the sets

S_i

in increasing order of

i

. As soon as one of the

S_i

contains a word from ''C'' or the empty word, then the algorithm terminates and answers that the given code is not uniquely decodable. Otherwise, once a set

S_i

equals a previously encountered set

S_j

with

j, then the algorithm would enter in principle an endless loop. Instead of continuing endlessly, it answers that the given code is uniquely decodable.

Termination and correctness of the algorithm

Since all sets

S_i

are sets of suffixes of a finite set of codewords, there are only finitely many different candidates for

S_i

. Since visiting one of the sets for the second time will cause the algorithm to stop, the algorithm cannot continue endlessly and thus must always terminate. More precisely, the total number of dangling suffixes that the algorithm considers is at most equal to the total of the lengths of the codewords in the input, so the algorithm runs in

as a function of this input length. By using a

suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow parti ...

to speed the comparison between each dangling suffix and the codewords, the time for the algorithm can be bounded by O(''nk''), where ''n'' is the total length of the codewords and ''k'' is the number of codewords. The algorithm can be implemented using a pattern matching machine. The algorithm can also be implemented to run on a nondeterministic Turing machine that uses only logarithmic space; the problem of testing unique decipherability is

NL-complete In computational complexity theory, NL-complete is a complexity class containing the languages that are complete for NL, the class of decision problems that can be solved by a nondeterministic Turing machine using a logarithmic amount of mem ...

, so this space bound is optimal. A proof that the algorithm is correct, i.e. that it always gives the correct answer, is found in the textbooks by Salomaa and by Berstel et al.Berstel et al. (2009), Chapter 2.3

Notes

References

* * * *. *. *. * *. ;Further reading *

Robert G. Gallager Robert Gray Gallager (born May 29, 1931) is an American electrical engineer known for his work on information theory and communications networks. Gallager was elected a member of the National Academy of Engineering (NAE) in 1979 for contributio ...

: ''Information Theory and Reliable Communication.'' Wiley, 1968 {{DEFAULTSORT:Sardinas-Patterson algorithm Algorithms Coding theory Data compression

Idea of the algorithm

Precise description of the algorithm

Termination and correctness of the algorithm

See also

Notes

References