In information theory and signal processing, the Discrete Universal Denoiser (DUDE) is a denoising scheme for recovering sequences over a finite alphabet that have been corrupted by a discrete memoryless channel. The DUDE was proposed in 2005 by Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú and Marcelo J. Weinberger.T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger. Universal discrete denoising: Known channel. IEEE Transactions on Information Theory, 51(1):5–28, 2005.


Overview

The Discrete Universal Denoiser (DUDE) is a denoising scheme that estimates an unknown signal x^n=\left( x_1,\ldots,x_n \right) over a finite alphabet from a noisy version z^n=\left( z_1,\ldots,z_n \right). While most denoising schemes in the signal processing and statistics literature deal with signals over an infinite alphabet (notably, real-valued signals), the DUDE addresses the finite alphabet case. The noisy version z^n is assumed to be generated by transmitting x^n through a known discrete memoryless channel.

For a fixed ''context length'' parameter k, the DUDE counts the occurrences of all strings of length 2k+1 appearing in z^n. The estimated value \hat{x}_i is determined from the two-sided length-k ''context'' \left( z_{i-k},\ldots,z_{i-1},z_{i+1},\ldots,z_{i+k} \right) of z_i, taking into account all the other tokens in z^n with the same context, as well as the known channel matrix and the loss function being used.

The idea underlying the DUDE is best illustrated when x^n is a realization of a random vector X^n. If the conditional distribution of X_i given \left( Z_{i-k},\ldots,Z_{i-1},Z_{i+1},\ldots,Z_{i+k} \right), namely the distribution of the noiseless symbol X_i conditional on its noisy context, were available, the optimal estimator \hat{X}_i would be the Bayes response to this distribution. Fortunately, when the channel matrix is known and non-degenerate, this conditional distribution can be expressed in terms of the conditional distribution of Z_i given its noisy context, namely the distribution of the noisy symbol Z_i conditional on its noisy context. This conditional distribution, in turn, can be estimated from an individual observed noisy signal Z^n by virtue of the law of large numbers, provided n is “large enough”.

Applying the DUDE scheme with context length k to a sequence of length n over a finite alphabet \mathcal{Z} requires O(n) operations and space O\left( \min\left( n, |\mathcal{Z}|^{2k} \right) \right).

Under certain assumptions, the DUDE is a universal scheme in the sense of asymptotically performing as well as an optimal denoiser with oracle access to the unknown sequence. More specifically, assume that the denoising performance is measured using a given single-character fidelity criterion, and consider the regime where the sequence length n tends to infinity and the context length k=k_n tends to infinity “not too fast”. In the stochastic setting, where a doubly infinite noiseless sequence \mathbf{x} is a realization of a stationary process \mathbf{X}, the DUDE asymptotically performs, in expectation, as well as the best denoiser with oracle access to the source distribution of \mathbf{X}. In the single-sequence, or “semi-stochastic”, setting with a ''fixed'' doubly infinite sequence \mathbf{x}, the DUDE asymptotically performs as well as the best “sliding window” denoiser, namely any denoiser that determines \hat{x}_i from the window \left( z_{i-k},\ldots,z_{i+k} \right), with oracle access to \mathbf{x}.


The discrete denoising problem

Let \mathcal{X} be the finite alphabet of a fixed but unknown original “noiseless” sequence x^n=\left( x_1,\ldots,x_n \right)\in\mathcal{X}^n. The sequence is fed into a discrete memoryless channel (DMC). The DMC operates on each symbol x_i independently, producing a corresponding random symbol Z_i in a finite alphabet \mathcal{Z}. The DMC is known and given as a |\mathcal{X}|-by-|\mathcal{Z}| Markov matrix \Pi, whose entries are \pi(x,z)=\mathbb{P}\left( Z=z \mid X=x \right). It is convenient to write \pi_z for the z-column of \Pi. The DMC produces a random noisy sequence Z^n=\left( Z_1,\ldots,Z_n \right)\in\mathcal{Z}^n. A specific realization of this random vector is denoted by z^n.

A denoiser is a function \hat{X}^n:\mathcal{Z}^n\to\mathcal{X}^n that attempts to recover the noiseless sequence x^n from a distorted version z^n. A specific denoised sequence is denoted by \hat{x}^n=\hat{X}^n\left( z^n \right)=\left( \hat{X}_1(z^n),\ldots,\hat{X}_n(z^n) \right). The problem of choosing the denoiser \hat{X}^n is known as signal estimation, filtering or smoothing.

To compare candidate denoisers, we choose a single-symbol fidelity criterion \Lambda:\mathcal{X}\times\mathcal{X}\to[0,\infty) (for example, the Hamming loss) and define the per-symbol loss of the denoiser \hat{X}^n at (x^n,z^n) by

L_{\hat{X}^n}\left( x^n,z^n \right) = \frac{1}{n}\sum_{i=1}^n \Lambda\left( x_i,\hat{X}_i(z^n) \right)\,.

Ordering the elements of the alphabet \mathcal{X} as \mathcal{X}=\left( a_1,\ldots,a_{|\mathcal{X}|} \right), the fidelity criterion can be given by a |\mathcal{X}|-by-|\mathcal{X}| matrix, with columns of the form

\lambda_{\hat{x}} = \left( \begin{matrix} \Lambda(a_1,\hat{x}) \\ \vdots \\ \Lambda(a_{|\mathcal{X}|},\hat{x}) \end{matrix} \right)\,.
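To make the setup concrete, here is a minimal Python sketch (our own illustration, not from the original paper) of a binary instance: a binary symmetric channel with crossover probability 0.1 as the known DMC, and the Hamming loss as the fidelity criterion. The names PI, LAMBDA and dmc are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary alphabets X = Z = {0, 1}; the known DMC is a binary symmetric
# channel (BSC) with crossover probability delta.
delta = 0.1
PI = np.array([[1 - delta, delta],
               [delta, 1 - delta]])    # PI[x, z] = P(Z = z | X = x)

# Hamming loss: LAMBDA[x, xhat] = 0 if xhat == x, else 1.
LAMBDA = 1.0 - np.eye(2)

def dmc(x, PI, rng):
    """Pass each symbol of x independently through the channel PI."""
    return np.array([rng.choice(PI.shape[1], p=PI[xi]) for xi in x])

x = np.zeros(1000, dtype=int)          # a noiseless all-zeros sequence
z = dmc(x, PI, rng)                    # its noisy observation
print(z.mean())                        # fraction of corrupted symbols, about delta
```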


The DUDE scheme


Step 1: Calculating the empirical distribution in each context

The DUDE corrects symbols according to their context. The context length k is a tuning parameter of the scheme. For k+1\leq i\leq n-k, define the left context of the i-th symbol in z^n by l^k(z^n,i)=\left( z_{i-k},\ldots,z_{i-1} \right) and the corresponding right context by r^k(z^n,i)=\left( z_{i+1},\ldots,z_{i+k} \right). A two-sided context is a pair (l^k,r^k) of a left and a right context.

The first step of the DUDE scheme is to calculate the empirical distribution of symbols in each possible two-sided context along the noisy sequence z^n. Formally, a given two-sided context (l^k,r^k)\in\mathcal{Z}^k\times\mathcal{Z}^k that appears once or more along z^n determines an empirical probability distribution over \mathcal{Z}, whose value at the symbol z is

\mu\left( z^n,l^k,r^k \right)[z] = \frac{\left| \left\{ k+1\leq i\leq n-k \,:\, \left( z_{i-k},\ldots,z_{i+k} \right) = l^k\,z\,r^k \right\} \right|}{\left| \left\{ k+1\leq i\leq n-k \,:\, \left( l^k(z^n,i),r^k(z^n,i) \right) = \left( l^k,r^k \right) \right\} \right|}\,.

Thus, the first step of the DUDE scheme with context length k is to scan the input noisy sequence z^n once and store the length-|\mathcal{Z}| empirical distribution vector \mu\left( z^n,l^k,r^k \right) (or its non-normalized version, the count vector) for each two-sided context found along z^n. Since there are at most N_{n,k}=\min\left( n,|\mathcal{Z}|^{2k} \right) possible two-sided contexts along z^n, this step requires O(n) operations and storage O(N_{n,k}).
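The counting pass can be sketched in a few lines of Python (an illustration assuming a small integer alphabet; the helper name context_counts is ours):

```python
from collections import defaultdict
import numpy as np

def context_counts(z, k, alphabet_size):
    """Step 1 of the DUDE: for each two-sided context (l, r), build the
    count vector whose a-entry is the number of positions i with left
    context l, right context r and z_i = a."""
    counts = defaultdict(lambda: np.zeros(alphabet_size))
    for i in range(k, len(z) - k):
        l = tuple(z[i - k:i])
        r = tuple(z[i + 1:i + k + 1])
        counts[(l, r)][z[i]] += 1
    return counts

z = [0, 1, 0, 1, 0, 1, 0]
m = context_counts(z, 1, 2)
print(m[((0,), (0,))])   # prints [0. 3.]: the context (0, _, 0) always holds a 1
```

Normalizing each count vector by its sum gives the empirical distribution \mu; since the Bayes response is scale invariant, the DUDE can work with either.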


Step 2: Calculating the Bayes response to each context

Denote the column of the single-symbol fidelity criterion \Lambda corresponding to the symbol \hat{x}\in\mathcal{X} by \lambda_{\hat{x}}. We define the ''Bayes response'' to any vector \mathbf{v} of length |\mathcal{X}| with non-negative entries as

\hat{x}_{\rm Bayes}(\mathbf{v}) = \operatorname{argmin}_{\hat{x}\in\mathcal{X}}\lambda_{\hat{x}}^\top\mathbf{v}\,.

This definition is motivated in the background section below.

The second step of the DUDE scheme is to calculate, for each two-sided context (l^k,r^k) observed in the previous step along z^n, and for each symbol z\in\mathcal{Z} observed in each context (namely, any z such that l^k z r^k is a substring of z^n), the Bayes response to the vector \Pi^{-\top}\mu\left( z^n,l^k,r^k \right)\odot\pi_z, namely

g(l^k,z,r^k) := \hat{x}_{\rm Bayes}\left( \Pi^{-\top}\mu\left( z^n,l^k,r^k \right)\odot\pi_z \right)\,.

Note that the sequence z^n and the context length k are implicit. Here, \pi_z is the z-column of \Pi, and for vectors \mathbf{a} and \mathbf{b}, \mathbf{a}\odot\mathbf{b} denotes their Schur (entrywise) product, defined by \left( \mathbf{a}\odot\mathbf{b} \right)_i = a_i b_i. Matrix multiplication is evaluated before the Schur product, so that \Pi^{-\top}\mu\odot\pi_z stands for (\Pi^{-\top}\mu)\odot\pi_z.

This formula assumes that the channel matrix \Pi is square (|\mathcal{X}|=|\mathcal{Z}|) and invertible. When |\mathcal{X}|\leq|\mathcal{Z}| and \Pi is not invertible, under the reasonable assumption that it has full row rank, we replace \Pi^{-\top}=(\Pi^\top)^{-1} above with the Moore-Penrose pseudo-inverse \left( \Pi\Pi^\top \right)^{-1}\Pi of \Pi^\top and calculate instead

g(l^k,z,r^k) := \hat{x}_{\rm Bayes}\left( (\Pi\Pi^\top)^{-1}\Pi\,\mu\left( z^n,l^k,r^k \right)\odot\pi_z \right)\,.

By caching the inverse or pseudo-inverse \Pi^{-\top}, and the values \lambda_{\hat{x}}\odot\pi_z for the relevant pairs (\hat{x},z)\in\mathcal{X}\times\mathcal{Z}, this step requires O(N_{n,k}) operations and O(N_{n,k}) storage.
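The Bayes-response computation can be sketched as follows (a minimal illustration for a square invertible channel, here a BSC(0.1) with Hamming loss; the function names and numbers are ours). In a context where the symbol 1 strongly dominates, an observed 0 is corrected to 1:

```python
import numpy as np

def bayes_response(v, LAMBDA):
    """argmin over xhat of lambda_xhat^T v, the columns of LAMBDA being lambda_xhat."""
    return int(np.argmin(LAMBDA.T @ v))

def g(mu, z, PI, LAMBDA):
    """Step 2 of the DUDE: Bayes response to (Pi^T)^{-1} mu ⊙ pi_z."""
    v = (np.linalg.inv(PI.T) @ mu) * PI[:, z]   # Schur product with the z-column
    return bayes_response(v, LAMBDA)

delta = 0.1
PI = np.array([[1 - delta, delta], [delta, 1 - delta]])
LAMBDA = 1.0 - np.eye(2)
mu = np.array([2.0, 98.0])     # empirical counts of 0 and 1 in this context
print(g(mu, 0, PI, LAMBDA))    # prints 1: the observed 0 is corrected to 1
```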


Step 3: Estimating each symbol by the Bayes response to its context

The third and final step of the DUDE scheme is to scan z^n again and compute the actual denoised sequence \hat{x}^n(z^n)=\left( \hat{x}_1(z^n),\ldots,\hat{x}_n(z^n) \right). The denoised symbol chosen to replace z_i is the Bayes response to the two-sided context of the symbol, namely

\hat{x}_i(z^n) := g\left( l^k(z^n,i)\,,\,z_i\,,\,r^k(z^n,i) \right)\,.

This step requires O(n) operations and uses the data structure constructed in the previous step. In summary, the entire DUDE requires O(n) operations and O(N_{n,k}) storage.
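Putting the three steps together, a minimal (unoptimized) DUDE for a square invertible channel matrix can be sketched as follows; the all-zeros input and BSC(0.1) are our own toy example, not from the original paper:

```python
from collections import defaultdict
import numpy as np

def dude(z, k, PI, LAMBDA):
    """A minimal DUDE: two passes over z with context length k.
    PI[x, z] is the known channel matrix; LAMBDA[x, xhat] is the loss."""
    z = np.asarray(z)
    n, nz = len(z), PI.shape[1]
    inv_PI_T = np.linalg.inv(PI.T)         # assumes square, invertible PI

    # Pass 1: count vector m(z^n, l^k, r^k) for each two-sided context.
    counts = defaultdict(lambda: np.zeros(nz))
    for i in range(k, n - k):
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        counts[ctx][z[i]] += 1

    # Pass 2: replace z_i by the Bayes response to its context.
    xhat = z.copy()                        # boundary symbols left unchanged
    for i in range(k, n - k):
        ctx = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        v = (inv_PI_T @ counts[ctx]) * PI[:, z[i]]   # Schur product
        xhat[i] = int(np.argmin(LAMBDA.T @ v))
    return xhat

# Denoise an all-zeros sequence observed through a BSC(0.1).
rng = np.random.default_rng(1)
delta = 0.1
PI = np.array([[1 - delta, delta], [delta, 1 - delta]])
LAMBDA = 1.0 - np.eye(2)
x = np.zeros(10000, dtype=int)
z = (rng.random(10000) < delta).astype(int)
xhat = dude(z, 2, PI, LAMBDA)
print((z != x).sum(), (xhat != x).sum())   # the DUDE removes most of the noise
```

The first and last k symbols are simply copied from z here; the asymptotic guarantees are unaffected by how the boundary is handled.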


Asymptotic optimality properties

The DUDE is designed to be universally optimal, namely optimal (in some sense, under some assumptions) regardless of the original sequence x^n. Let \hat{X}^n_{\rm DUDE}:\mathcal{Z}^n\to\mathcal{X}^n denote a sequence of DUDE schemes, as described above, where \hat{X}^n_{\rm DUDE} uses a context length k_n that is implicit in the notation. We only require that \lim_{n\to\infty}k_n=\infty and that k_n|\mathcal{Z}|^{2k_n}=o\left( \frac{n}{\log n} \right).


For a stationary source

Denote by \mathcal{D}_n the set of all n-block denoisers, namely all maps \hat{X}^n:\mathcal{Z}^n\to\mathcal{X}^n. Let \mathbf{X} be an unknown stationary source and let \mathbf{Z} be the corresponding noisy process. Then

\lim_{n\to\infty}\mathbb{E}\left[ L_{\hat{X}^n_{\rm DUDE}}\left( X^n,Z^n \right) \right] = \lim_{n\to\infty}\min_{\hat{X}^n\in\mathcal{D}_n}\mathbb{E}\left[ L_{\hat{X}^n}\left( X^n,Z^n \right) \right]\,,

and both limits exist. If, in addition, the source \mathbf{X} is ergodic, then

\limsup_{n\to\infty} L_{\hat{X}^n_{\rm DUDE}}\left( X^n,Z^n \right) = \lim_{n\to\infty}\min_{\hat{X}^n\in\mathcal{D}_n}\mathbb{E}\left[ L_{\hat{X}^n}\left( X^n,Z^n \right) \right]\,,\ \text{almost surely}\,.


For an individual sequence

Denote by \mathcal{D}_{n,k} the set of all n-block k-th order sliding window denoisers, namely all maps \hat{x}^n:\mathcal{Z}^n\to\mathcal{X}^n of the form \hat{x}_i(z^n)=f\left( z_{i-k},\ldots,z_{i+k} \right) with f:\mathcal{Z}^{2k+1}\to\mathcal{X} arbitrary. Let \mathbf{x}\in\mathcal{X}^\infty be an unknown fixed noiseless sequence and let \mathbf{Z} be the corresponding noisy process. Then

\lim_{n\to\infty}\left[ L_{\hat{X}^n_{\rm DUDE}}\left( x^n,Z^n \right) - \min_{\hat{x}^n\in\mathcal{D}_{n,k_n}} L_{\hat{x}^n}\left( x^n,Z^n \right) \right] = 0\,,\ \text{almost surely}\,.
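The sliding-window benchmark in this statement can be probed empirically. The sketch below (our own toy experiment: a binary Markov source through a BSC(0.1), Hamming loss, and a minimal self-contained DUDE) computes the loss of the best k-th order sliding-window denoiser chosen with oracle access to x^n and compares it with the DUDE's loss on the interior symbols:

```python
from collections import defaultdict
import numpy as np

def dude(z, k, PI, LAMBDA):
    """Minimal DUDE: count two-sided contexts, then apply the Bayes response."""
    z = np.asarray(z)
    n, nz = len(z), PI.shape[1]
    inv_PI_T = np.linalg.inv(PI.T)
    counts = defaultdict(lambda: np.zeros(nz))
    for i in range(k, n - k):
        counts[(tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))][z[i]] += 1
    xhat = z.copy()
    for i in range(k, n - k):
        mu = counts[(tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))]
        xhat[i] = int(np.argmin(LAMBDA.T @ ((inv_PI_T @ mu) * PI[:, z[i]])))
    return xhat

def best_sliding_window_loss(x, z, k):
    """Hamming loss of the best k-th order sliding-window denoiser, chosen
    with oracle access to x: for each (2k+1)-window of z, output the
    majority value of x over the positions where that window occurs."""
    groups = defaultdict(lambda: np.zeros(2))
    for i in range(k, len(z) - k):
        groups[tuple(z[i - k:i + k + 1])][x[i]] += 1
    errors = sum(c.sum() - c.max() for c in groups.values())
    return errors / (len(z) - 2 * k)

rng = np.random.default_rng(2)
n, k, delta = 30000, 2, 0.1
PI = np.array([[1 - delta, delta], [delta, 1 - delta]])
LAMBDA = 1.0 - np.eye(2)
x = np.zeros(n, dtype=int)          # binary Markov chain, switching prob. 0.05
for i in range(1, n):
    x[i] = x[i - 1] ^ int(rng.random() < 0.05)
z = x ^ (rng.random(n) < delta).astype(int)

xhat = dude(z, k, PI, LAMBDA)
interior = slice(k, n - k)
dude_loss = float((xhat[interior] != x[interior]).mean())
best_loss = float(best_sliding_window_loss(x, z, k))
print(dude_loss, best_loss)         # DUDE loss is close to the oracle benchmark
```

For a fixed z^n the DUDE's rule on interior positions is itself a k-th order sliding-window rule, so its loss can never beat the oracle benchmark; the theorem says the gap vanishes as n grows.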


Non-asymptotic performance

Let \hat{X}^n_k denote the DUDE with context length k defined on n-blocks. Then there exist explicit constants A,C>0 and B>1, depending on \left( \Pi,\Lambda \right) alone, such that for any n,k and any x^n\in\mathcal{X}^n we have

\frac{A}{\sqrt{n}}B^k \,\leq\, \mathbb{E}\left[ L_{\hat{X}^n_k}\left( x^n,Z^n \right) - \min_{\hat{x}^n\in\mathcal{D}_{n,k}} L_{\hat{x}^n}\left( x^n,Z^n \right) \right] \,\leq\, C\sqrt{\frac{k}{n}}\,|\mathcal{Z}|^{2k}\,,

where Z^n is the noisy sequence corresponding to x^n (whose randomness is due to the channel alone).K. Viswanathan and E. Ordentlich. Lower limits of discrete universal denoising. IEEE Transactions on Information Theory, 55(3):1374–1386, 2009. In fact, the lower bound holds with the same constants A,B for ''any'' n-block denoiser \hat{x}^n\in\mathcal{D}_n. The lower bound proof requires that the channel matrix \Pi be square and that the pair \left( \Pi,\Lambda \right) satisfy a certain technical condition.


Background

To motivate the particular definition of the DUDE via the Bayes response to a particular vector, we now find the optimal denoiser in the non-universal case, where the unknown sequence x^n is a realization of a random vector X^n whose distribution is known.

Consider first the case n=1. Since the joint distribution of (X,Z) is known, given the observed noisy symbol z, the unknown symbol X\in\mathcal{X} is distributed according to the known distribution \mathbb{P}(X=x\mid Z=z). By ordering the elements of \mathcal{X}, we can describe this conditional distribution on \mathcal{X} using a probability vector \mathbf{P}_{X\mid z}, indexed by \mathcal{X}, whose x-entry is \mathbb{P}\left( X=x\mid Z=z \right). Clearly, the expected loss for the choice of estimated symbol \hat{x} is \lambda_{\hat{x}}^\top\mathbf{P}_{X\mid z}.

Define the ''Bayes envelope'' of a probability vector \mathbf{v}, describing a probability distribution on \mathcal{X}, as the minimal expected loss U(\mathbf{v})=\min_{\hat{x}\in\mathcal{X}}\mathbf{v}^\top\lambda_{\hat{x}}, and the ''Bayes response'' to \mathbf{v} as the prediction that achieves this minimum, \hat{x}_{\rm Bayes}(\mathbf{v})=\operatorname{argmin}_{\hat{x}\in\mathcal{X}}\mathbf{v}^\top\lambda_{\hat{x}}. Observe that the Bayes response is scale invariant, in the sense that \hat{x}_{\rm Bayes}(\mathbf{v})=\hat{x}_{\rm Bayes}(\alpha\mathbf{v}) for \alpha>0.

For the case n=1, then, the optimal denoiser is \hat{x}(z)=\hat{x}_{\rm Bayes}\left( \mathbf{P}_{X\mid z} \right). This optimal denoiser can be expressed using the marginal distribution of Z alone, as follows. When the channel matrix \Pi is invertible, we have \mathbf{P}_{X\mid z}\propto \Pi^{-\top}\mathbf{P}_Z\odot\pi_z, where \pi_z is the z-column of \Pi. This implies that the optimal denoiser is given equivalently by \hat{x}(z)=\hat{x}_{\rm Bayes}\left( \Pi^{-\top}\mathbf{P}_Z\odot\pi_z \right). When |\mathcal{X}|\leq|\mathcal{Z}| and \Pi is not invertible, under the reasonable assumption that it has full row rank, we can replace \Pi^{-\top} with the Moore-Penrose pseudo-inverse of \Pi^\top and obtain \hat{x}(z)=\hat{x}_{\rm Bayes}\left( (\Pi\Pi^\top)^{-1}\Pi\,\mathbf{P}_Z\odot\pi_z \right)\,.
Turning now to arbitrary n, the optimal denoiser \hat{x}^{\rm opt}(z^n) (with minimal expected loss) is given by the Bayes response to \mathbf{P}_{X_i\mid z^n},

\hat{x}^{\rm opt}_i(z^n) = \hat{x}_{\rm Bayes}\left( \mathbf{P}_{X_i\mid z^n} \right) = \operatorname{argmin}_{\hat{x}\in\mathcal{X}}\lambda_{\hat{x}}^\top\mathbf{P}_{X_i\mid z^n}\,,

where \mathbf{P}_{X_i\mid z^n} is a vector indexed by \mathcal{X}, whose x-entry is \mathbb{P}\left( X_i=x\mid Z^n=z^n \right). The conditional probability vector \mathbf{P}_{X_i\mid z^n} is hard to compute. A derivation analogous to the case n=1 above shows that the optimal denoiser admits an alternative representation, namely \hat{x}^{\rm opt}_i(z^n)=\hat{x}_{\rm Bayes}\left( \Pi^{-\top}\mathbf{P}_{Z_i\mid z^{n\setminus i}}\odot\pi_{z_i} \right), where z^{n\setminus i}=\left( z_1,\ldots,z_{i-1},z_{i+1},\ldots,z_n \right)\in\mathcal{Z}^{n-1} is a given vector and \mathbf{P}_{Z_i\mid z^{n\setminus i}} is the vector indexed by \mathcal{Z} whose z-entry is \mathbb{P}\left( (Z_1,\ldots,Z_n)=(z_1,\ldots,z_{i-1},z,z_{i+1},\ldots,z_n) \right)\,. Again, \Pi^{-\top} is replaced by a pseudo-inverse if \Pi is not square or not invertible.

When the distribution of X^n (and therefore of Z^n) is not available, the DUDE replaces the unknown vector \mathbf{P}_{Z_i\mid z^{n\setminus i}} with an empirical estimate obtained along the noisy sequence z^n itself, namely \mu\left( z^n, l^k(z^n,i), r^k(z^n,i) \right). This leads to the above definition of the DUDE. While the convergence arguments behind the optimality properties above are more subtle, we note that the above, combined with the Birkhoff ergodic theorem, is enough to prove that for a stationary ergodic source, the DUDE with context length k is asymptotically optimal among all k-th order sliding window denoisers.
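The identity \mathbf{P}_{X\mid z}\propto \Pi^{-\top}\mathbf{P}_Z\odot\pi_z for the case n=1 is easy to verify numerically; the following check uses made-up numbers for the prior on X and a BSC(0.2) as the channel:

```python
import numpy as np

# Made-up prior on X and a BSC(0.2) as the channel.
PX = np.array([0.7, 0.3])
delta = 0.2
PI = np.array([[1 - delta, delta], [delta, 1 - delta]])
PZ = PI.T @ PX                        # marginal distribution of Z

for z in (0, 1):
    direct = PX * PI[:, z]            # proportional to P(X = x, Z = z)
    direct = direct / direct.sum()    # the conditional P(X = x | Z = z)
    via_marginal = np.linalg.inv(PI.T) @ PZ * PI[:, z]
    via_marginal = via_marginal / via_marginal.sum()
    assert np.allclose(direct, via_marginal)
print("identity verified for both channel outputs")
```

The check works because \Pi^{-\top}\mathbf{P}_Z recovers \mathbf{P}_X exactly, and multiplying entrywise by \pi_z yields the joint probabilities \mathbb{P}(X=x,Z=z), which normalize to the conditional.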


Extensions

The basic DUDE as described here assumes a signal with a one-dimensional index set over a finite alphabet, a known memoryless channel, and a context length that is fixed in advance. Relaxations of each of these assumptions have been considered in turn. Specifically:
* Infinite alphabets G. Motta, E. Ordentlich, I. Ramírez, G. Seroussi, and M. Weinberger, “The DUDE framework for continuous tone image denoising,” IEEE Transactions on Image Processing, 20(1), January 2011. K. Sivaramakrishnan and T. Weissman. Universal denoising of continuous amplitude signals with applications to images. In Proc. IEEE International Conference on Image Processing, Atlanta, GA, USA, October 2006, pp. 2609–2612.
* Channels with memory
* Unknown channel matrix
* Variable context and adaptive choice of context length G. Gimel'farb. Adaptive context for a discrete universal denoiser. In Proc. Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops, SSPR 2004 and SPR 2004, Lisbon, Portugal, August 18–20, 2004, pp. 477–485.
* Two-dimensional signals E. Ordentlich, G. Seroussi, S. Verdú, M. J. Weinberger, and T. Weissman. A universal discrete image denoiser and its application to binary images. In Proc. IEEE International Conference on Image Processing, Barcelona, Catalonia, Spain, September 2003.


Applications


Application to image denoising

A DUDE-based framework for grayscale image denoising achieves state-of-the-art denoising for impulse-type noise channels (e.g., "salt and pepper" or "M-ary symmetric" noise), and good performance on the Gaussian channel (comparable to the non-local means image denoising scheme on that channel). A different DUDE variant applicable to grayscale images has also been proposed.


Application to channel decoding of uncompressed sources

The DUDE has led to universal algorithms for channel decoding of uncompressed sources. E. Ordentlich, G. Seroussi, S. Verdú, and K. Viswanathan, "Universal Algorithms for Channel Decoding of Uncompressed Sources," IEEE Trans. Information Theory, vol. 54, no. 5, pp. 2243–2262, May 2008

