Discrete Universal Denoiser Explained

In information theory and signal processing, the Discrete Universal Denoiser (DUDE) is a denoising scheme for recovering sequences over a finite alphabet, which have been corrupted by a discretememoryless channel. The DUDE was proposed in 2005 by Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú and Marcelo J. Weinberger.^[1]

Overview

The Discrete Universal Denoiser (DUDE) is a denoising scheme that estimates anunknown signal

x^n=\left(x₁\ldotsx_n\right)

over a finitealphabet from a noisy version

z^n=\left(z₁\ldotsz_n\right)

.While most denoising schemes in the signal processingand statistics literature deal with signals overan infinite alphabet (notably, real-valued signals), the DUDE addresses thefinite alphabet case. The noisy version

zⁿ

is assumed to be generated by transmitting

xⁿ

through a known discretememoryless channel.

For a fixed context length parameter

, the DUDE counts of the occurrences of all the strings of length

2k+1

appearing in

zⁿ

. The estimated value

\hat{x}_i

is determined based the two-sided length-

context

\left(z_i-k,\ldots,z_i-1,z_i+1,\ldots,z_i+k\right)

z_i

, taking into account all the other tokens in

zⁿ

with the same context, as well as the known channel matrix and the loss function being used.

The idea underlying the DUDE is best illustrated when

xⁿ

is arealization of a random vector

Xⁿ

. If the conditional distribution

X_i|Z_i-k,\ldots,Z_i-1,Z_i+1,\ldots,Z_i+k

, namelythe distribution of the noiseless symbol

X_i

conditional on its noisy context

\left(Z_i-k,\ldots, Z_i-1,Z_i+1,\ldots,Z_i+k\right)

was available, the optimalestimator

\hat{X}_i

would be the Bayes Response to

X_i|Z_i-k,\ldots,Z_i-1,Z_i+1,\ldots,Z_i+k

.Fortunately, whenthe channel matrix is known and non-degenerate, this conditional distributioncan be expressed in terms of the conditional distribution

Z_i|Z_i-k,\ldots,Z_i-1,Z_i+1,\ldots,Z_i+k

, namelythe distribution of the noisy symbol

Z_i

conditional on its noisycontext. This conditional distribution, in turn, can be estimated from anindividual observed noisy signal

Zⁿ

by virtue of the Law of Large Numbers, provided

is “large enough”.

Applying the DUDE scheme with a context length

to a sequence oflength

over a finite alphabet

l{Z}

requires

O(n)

operations and space

O\left(min(n,|l{Z}|^2k) \right)

Under certain assumptions, the DUDE is a universal scheme in the sense of asymptotically performing as well as an optimal denoiser, which has oracle access to the unknown sequence. More specifically, assume that the denoising performance is measured using a given single-character fidelity criterion, and consider the regime where the sequence length

tends to infinity and the context length

k=k_n

tends to infinity “not too fast”. In the stochastic setting, where a doubly infinite sequence noiseless sequence

is a realization of a stationary process

, the DUDE asymptotically performs, in expectation, as well as the best denoiser, which has oracle access to the source distribution

. In the single-sequence, or “semi-stochastic” setting with a fixed doubly infinite sequence

, the DUDE asymptotically performs as well as the best “sliding window” denoiser, namely any denoiser that determines

\hat{x}_i

from the window

\left(z_i-k,\ldots,z_i+k\right)

, which has oracle access to

The discrete denoising problem

Let

l{X}

be the finite alphabet of a fixed but unknown original “noiseless” sequence

x^n=\left(x_1,\ldots,x_n\right)\inl{X}ⁿ

. The sequence is fed into a discretememoryless channel (DMC). The DMC operates on each symbol

x_i

independently, producing a corresponding random symbol

Z_i

in a finite alphabet

l{Z}

. The DMC is known and given as a

l{X}

-by-

l{Z}

Markov matrix

\Pi

, whose entries are

\pi(x,z)=P\left(Z=z|X=x\right)

. It is convenient to write

\pi_z

for the

-column of

\Pi

. The DMC produces a random noisy sequence

Z^n=\left(z_1,\ldots,z_n\right)\inl{Z}ⁿ

. A specific realization of this random vector will be denoted by

zⁿ

.A denoiser is a function

\hat{X}^n:l{Z}ⁿ\tol{X}ⁿ

that attempts to recover the noiseless sequence

xⁿ

from a distorted version

zⁿ

. A specific denoised sequence is denoted by

\hat{x}^n=\hat{X}^n\left(z^{n
\right)=\left(}\hat{X}₁(z^n),\ldots,

	n)
\hat{X}
	n(z

\right)

.The problem of choosing the denoiser

\hat{X}ⁿ

is known as signalestimation, filtering or smoothing. To compare candidate denoisers, we choose a single-symbol fidelity criterion

Λ:l{X} x l{X}\to[0,infty)

(for example, the Hamming loss) and define the per-symbol loss of the denoiser

\hat{X}ⁿ

(x^n,zⁿ⁾

\begin{align}

	n}\left(
L
	\hat{X

x^n,zⁿ\right)=

	1
	n

	nΛ\left(
\sum
	i=1

x_i,

	n)
\hat{X}
	i(z

\right). \end{align}

Ordering the elements of the alphabet

l{X}

l{X}=\left(a₁,\ldots, a_|l{X|}\right)

, the fidelity criterion can be given by a

|l{X}|

-by-

|l{X}|

matrix, with columns of the form

\begin{align} λ_\hat{x

} = \left(\begin \Lambda(a_1,\hat) \\ \vdots \\ \Lambda(a_

,\hat) \end \right) \,. \end

The DUDE scheme

Step 1: Calculating the empirical distribution in each context

The DUDE corrects symbols according to their context. The context length

used is a tuning parameter of the scheme. For

k+1\leqi\leqn-k

, define the left context of the

-th symbol in

zⁿ

l^k(z

	n,i)=\left(z

	i-k

,\ldots,z_i-1\right)

and the corresponding right context as

r^k(z^n,i)=\left(z_i+1,\ldots,z_i+k\right)

. A two-sided context is a combination

(l^k,r^k)

of a left and a right context.

The first step of the DUDE scheme is to calculate the empirical distribution of symbols in each possible two-sided context along the noisy sequence

zⁿ

. Formally, a given two-sided context

(l^k,r^k)\inl{Z}^{k x}l{Z}^k

that appears once or more along

zⁿ

determines an empirical probability distribution over

l{Z}

, whose value at the symbol

\begin{align} \mu\left(z^n,l^k,r^k\right)[z]=

	\| \left\{k+1\leqi\leqn-k\|(z_i-k,\ldots,z_i+k)=l^kzr^k\right\
	\|}

{| \left\{k+1\leqi\leqn-k|l^k(z^n,i)=l^kandr^k(z^n,i)=r^k\right\}|}. \end{align}

Thus, the first step of the DUDE scheme with context length

is to scan the input noisy sequence

zⁿ

once, and store the length-

|l{Z}|

empirical distribution vector

\mu\left(z^n,l^k,r^k\right)

(or its non-normalized version, the count vector) for each two-sided context found along

zⁿ

. Since there are at most

N_n,k=min\left(n,|l{Z}|^2k\right)

possible two-sided contexts along

zⁿ

, this step requires

O(n)

operations and storage

O(N_n,k)

Step 2: Calculating the Bayes response to each context

Denote the column of single-symbol fidelity criterion

, corresponding to the symbol

\hat{x}\inl{X}

, by

λ_\hat{x

}. We define the Bayes Response to any vector

of length

|l{X}|

with non-negative entries as

\begin{align} \hat{X}_Bayes(v)= argmin_\hat{x\inl{X}}λ_\hat{x

}^\top\mathbf\,. \end

This definition is motivated in the background below.

The second step of the DUDE scheme is to calculate, for each two-sided context

(l^k,r^k)

observed in the previous step along

zⁿ

, and for each symbol

z\inl{Z}

observed in each context (namely, any

such that

l^rzr^k

is a substring of

zⁿ

) the Bayes response to the vector

\Pi^-\top\mu\left(z^n,l^k,r^k\right)\odot\pi_z

, namely

\begin{align} g(l^k,z,r^k):=\hat{X}_Bayes\left(\Pi^-\top\mu\left(z^n,l^k,r^k\right)\odot\pi_z\right).\end{align}

Note that the sequence

zⁿ

and the context length

are implicit. Here,

\pi_z

is the

-column of

\Pi

and for vectors

and

a\odotb

denotes their Schur (entrywise) product, defined by

\left(a\odotb\right)_i=a_ib_i

. Matrix multiplication is evaluated before the Schur product, so that

\Pi^-\top\mu\odot\pi_z

stands for

(\Pi^-\top\mu)\odot\pi_z

This formula assumed that the channel matrix

\Pi

is square (

|l{X}|=|l{Z}|

) and invertible. When

|l{X}|\leq|l{Z}|

and

\Pi

is not invertible, under the reasonable assumption that it has full row rank, we replace

(\Pi^\top)^-1

above with its Moore-Penrose pseudo-inverse

\left(\Pi\Pi^\top\right)^-1\Pi

and calculate instead

\begin{align} g(l^k,z,r

	k):=\hat{X}

	Bayes

\left((\Pi\Pi^\top)^-1\Pi\mu\left(z^n,l^k,r^k\right)\odot \pi_z\right). \end{align}

By caching the inverse or pseudo-inverse

\Pi^-\top

, and the values

λ_\hat{x

}\odot \pi_z for the relevant pairs

(\hat{x},z)\inl{X} x l{Z}

, this step requires

O(N_k,n)

operations and

O(N_k,n)

storage.

Step 3: Estimating each symbol by the Bayes response to its context

The third and final step of the DUDE scheme is to scan

zⁿ

again and compute the actual denoised sequence

\hat{X}^n(z^n)=\left(

	n),
\hat{X}
	1(z

\ldots

	n)
, \hat{X}
	n(z

\right)

. The denoised symbol chosen to replace

z_i

is the Bayes response to the two-sided context of the symbol, namely

\begin{align}

	n)
\hat{X}
	i(z

:=g\left(l^k(z^n,i),z_i,r^k(z^{n,i)\right).\end{align}}

This step requires

O(n)

operations and used the data structure constructed in the previous step.

In summary, the entire DUDE requires

O(n)

operations and

O(N_k,n)

storage.

Asymptotic optimality properties

The DUDE is designed to be universally optimal, namely optimal (is some sense, under some assumptions) regardless of the original sequence

xⁿ

Let

	n
\hat{X}
	DUDE

:l{Z}^n\tol{X}ⁿ

denote a sequence of DUDE schemes, as described above, where

	n
\hat{X}
	DUDE

uses a context length

k_n

that is implicit in the notation. We only require that

\lim_n\toinftyk_n=infty

and that

k_n

	2K_n
\|l{Z}\|

=o\left(

	n
	logn

\right)

For a stationary source

Denote by

l{D}_n

the set of all

-block denoisers, namely all maps

\hat{X}^n:l{Z}^n\tol{X}ⁿ

Let

be an unknown stationary source and

be the distribution of the corresponding noisy sequence. Then

\begin{align} \lim_n\toinftyE\left[

	n
L
	DUDE

}\left(X^n,Z^n \right) \right]= \lim_\min_\mathbf \left[L_{\hat{X}^n}\left(X^n,Z^n \right)\right]\,, \end

and both limits exist. If, in addition the source

is ergodic, then

\begin{align} \limsup_n\toinfty

	n
L
	DUDE

}\left(X^n,Z^n \right) = \lim_\min_\mathbf \left[L_{\hat{X}^n}\left(X^n,Z^n \right)\right]\,,\,\text\,. \end

For an individual sequence

Denote by

l{D}_n,k

the set of all

-block

-th order sliding window denoisers, namely all maps

\hat{X}^{n:l{Z}\tol{X}}

of the form

	n)
\hat{X}
	i(z

=f\left(z_i-k,\ldots,z_i+k\right)

with

f:l{Z}^2k+1\tol{X}

arbitrary.

Let

x\inl{X}^infty

be an unknown noiseless sequence stationary source and

be the distribution of the corresponding noisy sequence. Then

\begin{align} \lim_n\toinfty\left[

	n
L
	DUDE

}\left(x^n,Z^n \right) - \min_ L_\left(x^n,Z^n \right) \right ] =0 \,,\,\text\,. \end

Non-asymptotic performance

Let

	n
\hat{X}
	k

denote the DUDE on with context length

defined on

-blocks. Then there exist explicit constants

A,C>0

and

B>1

that depend on

\left(\Pi,Λ\right)

alone, such that for any

n,k

and any

x^n\inl{X}ⁿ

we have

\begin{align}

	A
	\sqrt{n

}B^k\,\leq \mathbf \left[L_{\hat{X}^n_{k}}\left(x^n,Z^n \right) - \min_{\hat{X}^n\in\mathcal{D}_{n,k}} L_{\hat{X}^n}\left(x^n,Z^n \right) \right] \leq \sqrt\frac |\mathcal|^ \,, \end

where

Zⁿ

is the noisy sequence corresponding to

xⁿ

(whose randomness is due to the channel alone)^[2] .

In fact holds with the same constants

A,B

as above for any

-block denoiser

\hat{X}^n\inl{D}ⁿ

. The lower bound proof requires that the channel matrix

\Pi

be square and the pair

\left(\Pi,Λ\right)

satisfies a certain technical condition.

Background

To motivate the particular definition of the DUDE using the Bayes response to a particular vector, we now find the optimal denoiser in the non-universal case, where the unknown sequence

xⁿ

is a realization of a random vector

Xⁿ

, whose distribution is known.

Consider first the case

n=1

. Since the joint distribution of

(X,Z)

is known, given the observed noisy symbol

, the unknown symbol

X\inl{X}

is distributed according to the known distribution

P(X=x|Z=z)

. By ordering the elements of

l{X}

, we can describe this conditional distribution on

l{X}

using a probability vector

P_X|z

, indexed by

l{X}

, whose

-entry is

P\left(X=x|Z=z\right)

. Clearly the expected loss for the choice of estimated symbol

\hat{x}

λ_\hat{x

}^\top \mathbf_.

Define the Bayes Envelope of a probability vector

, describing a probability distribution on

l{X}

, as the minimal expected loss

U(v)= min_\hat{x\inl{X}}v^\topλ_\hat{x

}, and the Bayes Response to

as the prediction that achieves this minimum,

\hat{X}_Bayes(v)=argmin_\hat{x

	\top λ
\inl{X}}v
	\hat{x

}. Observe that the Bayes response is scale invariant in the sense that

\hat{X}_Bayes(v)=\hat{X}_Bayes(\alphav)

for

\alpha>0

For the case

n=1

, then, the optimal denoiser is

\hat{X}(z)=\hat{X}_Bayes\left(P_X|z\right)

. This optimal denoiser can be expressed using the marginal distribution of

alone, as follows. When the channel matrix

\Pi

is invertible, we have

P_X|z\propto\Pi^-\topP_Z\odot\pi_z

where

\pi_z

is the

-th column of

\Pi

. This implies that the optimal denoiser is given equivalently by

\hat{X}(z)=\hat{X}_Bayes\left(\Pi^-\topP_Z\odot\pi_z\right)

. When

|l{X}|\leq|l{Z}|

and

\Pi

is not invertible, under the reasonable assumption that it has full row rank, we can replace

\Pi^-1

with its Moore-Penrose pseudo-inverse and obtain

\hat{X}(z)=\hat{X}_Bayes\left((\Pi\Pi^\top)^-1\PiP_Z\odot\pi_z
\right).

Turning now to arbitrary

, the optimal denoiser

\hat{X}^opt(zⁿ⁾

(with minimal expected loss) is therefore given by the Bayes response to

	n
X
	i\|z

\begin{align}

	n)
\hat{X}
	i(z

=\hat{X}_Bayes

	n
X
	i\|z

= argmin_\hat{x\inl{X}}λ_\hat{x

}^\top \mathbf_\,, \end

where

	n
X
	i\|z

is a vector indexed by

l{X}

, whose

-entry is

P\left(X_i=x|Z^n=zⁿ\right)

. The conditional probability vector

	n
X
	i\|z

is hard to compute. A derivation analogous to the case

n=1

above shows that the optimal denoiser admits an alternative representation, namely

	n)=\hat{X}
\hat{X}
	Bayes

\left(\Pi^-\top

	n\backslashi
Z
	i,z

\odot\pi
	z_i

\right)

, where

zⁿ=\left(z_1,\ldots,z_i-1,z_i+1,\ldots,z_n\right)\in l{Z}^n-1

is a given vector and

	n\backslashi
Z
	i,z

is the probability vector indexed by

l{Z}

whose

-entry is

P\left((Z_1,\ldots,Z_n)= (z_1,\ldots,z_i-1,z,z_i+1,\ldots,z_n)\right).

Again,

\Pi^-\top

is replaced by a pseudo-inverse if

\Pi

is not square or not invertible.

When the distribution of

(and therefore, of

) isnot available, the DUDE replaces the unknown vector

	n\backslashi
Z
	i,z

with an empirical estimateobtained along the noisy sequence

zⁿ

itself, namely with

\mu\left(Z_i,l^k(Z^n,i),r^k(Z^n,i)\right)

. This leads to theabove definition of the DUDE.

While the convergence arguments behind the optimality properties above are moresubtle, we note that the above, combined with theBirkhoff Ergodic Theorem, is enough to prove that for a stationary ergodic source, the DUDE with context-length

is asymptotically optimal all

-th order sliding window denoisers.

Extensions

The basic DUDE as described here assumes a signal with a one-dimensional indexset over a finite alphabet, a known memorylesschannel and a context length that is fixed in advance. Relaxations of each of these assumptions have been considered in turn.^[3] Specifically:

Infinite alphabets^[4] ^[5] ^[6] ^[7]
Channels with memory^[8] ^[9]
Unknown channel matrix^[10] ^[11]
Variable context and adaptive choice of context length^[12] ^[13] ^[14] ^[15]
Two-dimensional signals^[16]

Applications

Application to image denoising

A DUDE-based framework for grayscale image denoising achieves state-of-the-art denoising for impulse-type noise channels (e.g., "salt and pepper" or "M-ary symmetric" noise), and good performance on the Gaussian channel (comparable to the Non-local means image denoising scheme on this channel). A different DUDE variant applicable to grayscale images is presented in.

Application to channel decoding of uncompressed sources

The DUDE has led to universal algorithms for channel decoding of uncompressed sources.^[17]

Notes and References

T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu ́, and M.J. Weinberger. Universal discrete denoising: Known channel. IEEE Transactions on Information Theory,, 51(1):5–28, 2005.
K. Viswanathan and E. Ordentlich. Lower limits of discrete universal denoising. IEEE Transactions on Information Theory, 55(3):1374–1386, 2009.
E. . Ordentlich . G. . Seroussi . S. . Verd´u . M. J. . Weinberger . T. . Weissman . Reflections on the DUDE .
A. Dembo and T. Weissman. Universal denoising for the finite-input-general-output channel.IEEE Trans. Inf. Theory, 51(4):1507–1517, April 2005.
K. Sivaramakrishnan and T. Weissman. Universal denoising of discrete-time continuous amplitude signals. In Proc. of the 2006 IEEE Intl. Symp. on Inform. Theory, (ISIT’06),Seattle, WA, USA, July 2006.
G. Motta, E. Ordentlich, I. Ramírez, G. Seroussi, and M. Weinberger, “TheDUDE framework for continuous tone image denoising,” IEEE Transactions onImage Processing, 20, No. 1, January 2011.
K. Sivaramakrishnan and T. Weissman. Universal denoising of continuous amplitude signalswith applications to images. In Proc. of IEEE International Conference on Image Processing,Atlanta, GA, USA, October 2006, pp. 2609–2612
C. D. Giurcaneanu and B. Yu. Efficient algorithms for discrete universal denoising for channelswith memory. In Proc. of the 2005 IEEE Intl. Symp. on Inform. Theory, (ISIT’05), Adelaide,Australia, Sept. 2005.
R. Zhang and T. Weissman. Discrete denoising for channels with memory. Communicationsin Information and Systems (CIS), 5(2):257–288, 2005.
G. M. Gemelos, S. Sigurjonsson, T. Weissman. Universal minimax discrete denoising underchannel uncertainty. IEEE Trans. Inf. Theory, 52:3476–3497, 2006.
G. M. Gemelos, S. Sigurjonsson and T. Weissman. Algorithms for discrete denoising underchannel uncertainty. IEEE Trans. Signal Process., 54(6):2263–2276, June 2006.
E. Ordentlich, M.J. Weinberger, and T. Weissman. Multi-directional context sets with applications to universal denoising and compression. In Proc. of the 2005 IEEE Intl. Symp. onInform. Theory, (ISIT’05), Adelaide, Australia, Sept. 2005.
J. Yu and S. Verd´u. Schemes for bidirectional modeling of discrete stationary sources. IEEETrans. Inform. Theory, 52(11):4789–4807, 2006.
S. Chen, S. N. Diggavi, S. Dusad and S. Muthukrishnan. Efficient string matching algorithmsfor combinatorial universal denoising. In Proc. of IEEE Data Compression Conference (DCC),Snowbird, Utah, March 2005.
G. Gimel’farb. Adaptive context for a discrete universal denoiser. In Proc. Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops, SSPR 2004and SPR 2004, Lisbon, Portugal, August 18–20, pp. 477–485
E. Ordentlich, G. Seroussi, S. Verd´u, M.J. Weinberger, and T. Weissman. A universal discreteimage denoiser and its application to binary images. In Proc. IEEE International Conferenceon Image Processing, Barcelona, Catalonia, Spain, September 2003.
E. Ordentlich, G. Seroussi, S. Verdú, and K. Viswanathan, "UniversalAlgorithms for Channel Decoding of Uncompressed Sources," IEEE Trans.Information Theory, vol. 54, no. 5, pp. 2243–2262, May 2008