Least Mean Squares Filter

Least mean squares (LMS) algorithms are a class of adaptive filter used to mimic a desired filter by finding the filter coefficients that produce the least mean square of the error signal (the difference between the desired and the actual signal). It is a stochastic gradient descent method in that the filter is only adapted based on the error at the current time. It was invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff, based on their research in single-layer neural networks (ADALINE). Specifically, they used gradient descent to train ADALINE to recognize patterns, and called the algorithm the "delta rule". They then applied the rule to filters, resulting in the LMS algorithm.


Problem formulation

The picture shows the various parts of the filter. x is the input signal, which is transformed by an unknown filter h that we wish to match using \hat h. The output of the unknown filter is y, which is corrupted by a noise signal \nu, producing d = y + \nu. The error signal e = d - \hat y = y + \nu - \hat y is then computed and fed back to the adaptive filter, which adjusts its parameters so as to minimize the mean squared error, \sum e^2 / n.
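A minimal NumPy sketch of this setup (not part of the original article) generates the signals in the block diagram; the particular unknown filter h, the noise level and the variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unknown FIR system h and a white input signal x
h = np.array([0.5, -0.4, 0.25, 0.1])            # unknown filter to be matched
x = rng.standard_normal(10_000)                  # input signal x(n)
y = np.convolve(x, h)[: len(x)]                  # output y(n) of the unknown filter
nu = 0.05 * rng.standard_normal(len(x))          # interfering noise nu(n)
d = y + nu                                       # observable desired signal d(n)
```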


Relationship to the Wiener filter

The realization of the causal Wiener filter looks a lot like the solution to the least squares estimate, except in the signal processing domain. The least squares solution for input matrix \mathbf{X} and output vector \boldsymbol y is
: \boldsymbol{\hat\beta} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\mathbf{X}^\mathsf{T}\boldsymbol y .
The finite impulse response (FIR) least mean squares filter is related to the Wiener filter, but minimizing the error criterion of the former does not rely on cross-correlations or auto-correlations. Its solution converges to the Wiener filter solution. Most linear adaptive filtering problems can be formulated using the block diagram above. That is, an unknown system \mathbf{h}(n) is to be identified, and the adaptive filter attempts to adapt the filter \hat{\mathbf{h}}(n) to make it as close as possible to \mathbf{h}(n), while using only the observable signals x(n), d(n) and e(n); y(n), v(n) and h(n) are not directly observable. Its solution is closely related to the Wiener filter.
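As a hedged illustration of this relationship, the batch least-squares estimate above can be computed directly once all data are available; the data-matrix construction and the function name below are assumptions for illustration, not from the source:

```python
import numpy as np

def batch_least_squares(x, d, p):
    """Batch least-squares FIR estimate beta = (X^T X)^{-1} X^T d.

    Rows of X hold [x(n), x(n-1), ..., x(n-p+1)]; this is the offline
    reference solution that the LMS approaches iteratively.
    """
    N = len(x)
    X = np.zeros((N, p))
    for k in range(p):
        X[k:, k] = x[: N - k]      # column k holds x(n-k), zero-padded at the start
    return np.linalg.solve(X.T @ X, X.T @ d)
```

Applied to simulated data such as the sketch above, batch_least_squares(x, d, len(h)) should recover an estimate close to the unknown h.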


Definition of symbols

:n is the number of the current input sample
:p is the number of filter taps
:\{\cdot\}^H (Hermitian transpose or conjugate transpose)
:\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T
:\mathbf{h}(n) = \left[h_0(n), h_1(n), \dots, h_{p-1}(n)\right]^T,\quad \mathbf{h}(n) \in \mathbb{C}^p
:y(n) = \mathbf{h}^H(n) \cdot \mathbf{x}(n)
:d(n) = y(n) + \nu(n)
:\hat{\mathbf{h}}(n) is the estimated filter; interpret it as the estimate of the filter coefficients after n samples
:e(n) = d(n) - \hat{y}(n) = d(n) - \hat{\mathbf{h}}^H(n) \cdot \mathbf{x}(n)


Idea

The basic idea behind the LMS filter is to approach the optimum filter weights (R^{-1}P) by updating the filter weights in a manner that converges to the optimum. This is based on the gradient descent algorithm. The algorithm starts by assuming small weights (zero in most cases) and, at each step, the weights are updated by finding the gradient of the mean square error. That is, if the MSE gradient is positive, the error would keep increasing if the same weight were used for further iterations, which means we need to reduce the weights; in the same way, if the gradient is negative, we need to increase the weights. The weight update equation is
: W_{n+1} = W_n - \mu\nabla \varepsilon[n],
where \varepsilon represents the mean-square error and \mu is a convergence coefficient. The negative sign shows that we go down the slope of the error, \varepsilon, to find the filter weights, W_i, which minimize the error. The mean-square error as a function of the filter weights is a quadratic function, so it has only one extremum, which minimizes the mean-square error; this is the optimal weight. The LMS thus approaches this optimal weight by descending along the mean-square-error versus filter-weight curve.
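A sketch of the underlying steepest-descent step, written in terms of the exact second-order statistics R and P rather than the instantaneous estimates the LMS actually uses (function and variable names are illustrative assumptions):

```python
import numpy as np

def steepest_descent_step(w, R, P, mu):
    """One idealized update W_{n+1} = W_n - mu * grad(MSE).

    For the quadratic MSE surface the gradient is 2*(R @ w - P), where R is
    the input autocorrelation matrix and P the cross-correlation between the
    input and the desired signal.  The LMS replaces these exact statistics
    with instantaneous estimates.
    """
    grad = 2.0 * (R @ w - P)
    return w - mu * grad
```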


Derivation

The idea behind LMS filters is to use steepest descent to find filter weights \hat{\mathbf{h}}(n) which minimize a cost function. We start by defining the cost function as
: C(n) = E\left\{|e(n)|^2\right\},
where e(n) is the error at the current sample ''n'' and E\{\cdot\} denotes the expected value. This cost function (C(n)) is the mean square error, and it is minimized by the LMS. This is where the LMS gets its name. Applying steepest descent means taking the partial derivatives with respect to the individual entries of the filter coefficient (weight) vector:
: \nabla_{\hat{\mathbf{h}}^H} C(n) = \nabla_{\hat{\mathbf{h}}^H} E\left\{e(n) \, e^{*}(n)\right\} = 2E\left\{\nabla_{\hat{\mathbf{h}}^H}(e(n)) \, e^{*}(n)\right\},
where \nabla is the gradient operator. Since
: \nabla_{\hat{\mathbf{h}}^H}(e(n)) = \nabla_{\hat{\mathbf{h}}^H}\left(d(n) - \hat{\mathbf{h}}^H \cdot \mathbf{x}(n)\right) = -\mathbf{x}(n),
we obtain
: \nabla C(n) = -2E\left\{\mathbf{x}(n) \, e^{*}(n)\right\}.
Now, \nabla C(n) is a vector which points towards the steepest ascent of the cost function. To find the minimum of the cost function we need to take a step in the opposite direction of \nabla C(n). Expressed in mathematical terms,
: \hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) - \frac{\mu}{2} \nabla C(n) = \hat{\mathbf{h}}(n) + \mu \, E\left\{\mathbf{x}(n) \, e^{*}(n)\right\},
where \frac{\mu}{2} is the step size (adaptation constant). That means we have found a sequential update algorithm which minimizes the cost function. Unfortunately, this algorithm is not realizable until we know E\left\{\mathbf{x}(n) \, e^{*}(n)\right\}. Generally, the expectation above is not computed. Instead, to run the LMS in an online environment (updating after each new sample is received), we use an instantaneous estimate of that expectation. See below.


Simplifications

For most systems the expectation function E\left\{\mathbf{x}(n) \, e^{*}(n)\right\} must be approximated. This can be done with the following unbiased estimator:
: \hat{E}\left\{\mathbf{x}(n) \, e^{*}(n)\right\} = \frac{1}{N}\sum_{i=0}^{N-1}\mathbf{x}(n-i) \, e^{*}(n-i),
where N indicates the number of samples we use for that estimate. The simplest case is N = 1:
: \hat{E}\left\{\mathbf{x}(n) \, e^{*}(n)\right\} = \mathbf{x}(n) \, e^{*}(n).
For that simple case the update algorithm follows as
: \hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \mu \, \mathbf{x}(n) \, e^{*}(n).
Indeed, this constitutes the update algorithm for the LMS filter.
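A minimal single-sample implementation of this update (the use of NumPy and the names below are assumptions for illustration) could look like:

```python
import numpy as np

def lms_update(h_hat, x_vec, d_n, mu):
    """One LMS step using the instantaneous estimate x(n) e*(n).

    h_hat : current coefficient estimate (length p)
    x_vec : input vector [x(n), x(n-1), ..., x(n-p+1)]
    d_n   : desired sample d(n)
    """
    e_n = d_n - np.vdot(h_hat, x_vec)              # e(n) = d(n) - h_hat^H(n) x(n)
    h_hat = h_hat + mu * x_vec * np.conj(e_n)      # h_hat(n+1) = h_hat(n) + mu x(n) e*(n)
    return h_hat, e_n
```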


LMS algorithm summary

The LMS algorithm for a pth order filter can be summarized as
: Parameters: p = filter order, \mu = step size
: Initialisation: \hat{\mathbf{h}}(0) = \operatorname{zeros}(p)
: Computation: for n = 0, 1, 2, \dots
:: \mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T
:: e(n) = d(n) - \hat{\mathbf{h}}^H(n) \mathbf{x}(n)
:: \hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \mu \, e^{*}(n) \mathbf{x}(n)
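A complete, illustrative loop over a whole signal following the summary above might read as follows; the zero initialisation and the zero-padding of the first input vectors are common conventions, not mandated by the algorithm:

```python
import numpy as np

def lms_filter(x, d, p, mu):
    """Run the LMS recursion over the full signal; returns the final
    coefficient estimate and the error sequence."""
    h_hat = np.zeros(p, dtype=complex)
    e = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        # x(n) = [x(n), x(n-1), ..., x(n-p+1)]^T, zero-padded before the first sample
        x_vec = np.zeros(p, dtype=complex)
        m = min(p, n + 1)
        x_vec[:m] = x[n::-1][:m]
        e[n] = d[n] - np.vdot(h_hat, x_vec)          # e(n) = d(n) - h_hat^H(n) x(n)
        h_hat = h_hat + mu * x_vec * np.conj(e[n])   # coefficient update
    return h_hat, e
```

For simulated signals like the sketch in the problem formulation, lms_filter(x, d, p=4, mu=0.01) should converge to coefficients close to the unknown h.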


Convergence and stability in the mean

As the LMS algorithm does not use the exact values of the expectations, the weights would never reach the optimal weights in the absolute sense, but convergence in the mean is possible. That is, even though the weights may change by small amounts, they fluctuate about the optimal weights. However, if the variance with which the weights change is large, convergence in the mean would be misleading. This problem may occur if the value of the step size \mu is not chosen properly.

If \mu is chosen to be large, the amount by which the weights change depends heavily on the gradient estimate, and so the weights may change by a large value, so that a gradient which was negative at the first instant may now become positive. At the second instant the weights may then change in the opposite direction by a large amount because of the negative gradient, and would thus keep oscillating with a large variance about the optimal weights. On the other hand, if \mu is chosen to be too small, the time to converge to the optimal weights will be too large. Thus, an upper bound on \mu is needed, which is given as
: 0 < \mu < \frac{2}{\lambda_{\max}},
where \lambda_{\max} is the greatest eigenvalue of the autocorrelation matrix \mathbf{R} = E\left\{\mathbf{x}(n)\mathbf{x}^H(n)\right\}. If this condition is not fulfilled, the algorithm becomes unstable and \hat{\mathbf{h}}(n) diverges.

Maximum convergence speed is achieved when
: \mu = \frac{2}{\lambda_{\max} + \lambda_{\min}},
where \lambda_{\min} is the smallest eigenvalue of \mathbf{R}. Given that \mu is less than or equal to this optimum, the convergence speed is determined by \lambda_{\min}, with a larger value yielding faster convergence. This means that faster convergence can be achieved when \lambda_{\min} is close to \lambda_{\max}; that is, the maximum achievable convergence speed depends on the eigenvalue spread of \mathbf{R}.

A white noise signal has autocorrelation matrix \mathbf{R} = \sigma^2 \mathbf{I}, where \sigma^2 is the variance of the signal. In this case all eigenvalues are equal, and the eigenvalue spread is the minimum over all possible matrices. The common interpretation of this result is therefore that the LMS converges quickly for white input signals, and slowly for colored input signals, such as processes with low-pass or high-pass characteristics.

It is important to note that the above upper bound on \mu only enforces stability in the mean; the coefficients of \hat{\mathbf{h}}(n) can still grow infinitely large, i.e. divergence of the coefficients is still possible. A more practical bound is
: 0 < \mu < \frac{2}{\mathrm{tr}\left[\mathbf{R}\right]},
where \mathrm{tr}[\mathbf{R}] denotes the trace of \mathbf{R}. This bound guarantees that the coefficients of \hat{\mathbf{h}}(n) do not diverge (in practice, the value of \mu should not be chosen close to this upper bound, since it is somewhat optimistic due to approximations and assumptions made in the derivation of the bound).
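As an illustrative helper (assuming a stationary input whose power can be estimated from samples; the function name is an assumption), the practical bound above can be estimated as follows:

```python
import numpy as np

def lms_step_size_bound(x, p):
    """Estimate the practical upper bound 2 / tr[R] on the step size.

    For a stationary input, tr[R] = p * E[|x(n)|^2], so the sample mean
    power of the input is sufficient; treat the result as a rough guide
    rather than a guarantee.
    """
    input_power = np.mean(np.abs(x) ** 2)    # estimate of E[|x(n)|^2]
    return 2.0 / (p * input_power)
```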


Normalized least mean squares filter (NLMS)

The main drawback of the "pure" LMS algorithm is that it is sensitive to the scaling of its input x(n). This makes it very hard (if not impossible) to choose a learning rate \mu that guarantees stability of the algorithm (Haykin 2002). The ''Normalised least mean squares filter'' (NLMS) is a variant of the LMS algorithm that solves this problem by normalising with the power of the input. The NLMS algorithm can be summarised as
: Parameters: p = filter order, \mu = step size
: Initialisation: \hat{\mathbf{h}}(0) = \operatorname{zeros}(p)
: Computation: for n = 0, 1, 2, \dots
:: \mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T
:: e(n) = d(n) - \hat{\mathbf{h}}^H(n) \mathbf{x}(n)
:: \hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \frac{\mu \, e^{*}(n) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}
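A sketch of the NLMS recursion follows; the small regularisation constant eps added to the denominator is a common implementation safeguard, not part of the textbook update:

```python
import numpy as np

def nlms_filter(x, d, p, mu=0.5, eps=1e-8):
    """NLMS recursion: the LMS step is normalised by the input power
    x^H(n) x(n), making the choice of mu insensitive to input scaling."""
    h_hat = np.zeros(p, dtype=complex)
    e = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        x_vec = np.zeros(p, dtype=complex)
        m = min(p, n + 1)
        x_vec[:m] = x[n::-1][:m]
        e[n] = d[n] - np.vdot(h_hat, x_vec)                      # e(n)
        norm = np.real(np.vdot(x_vec, x_vec)) + eps              # x^H(n) x(n)
        h_hat = h_hat + (mu / norm) * x_vec * np.conj(e[n])      # normalised update
    return h_hat, e
```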


Optimal learning rate

It can be shown that if there is no interference (v(n) = 0), then the optimal learning rate for the NLMS algorithm is
: \mu_{opt} = 1
and is independent of the input x(n) and the real (unknown) impulse response \mathbf{h}(n). In the general case with interference (v(n) \ne 0), the optimal learning rate is
: \mu_{opt} = \frac{E\left[\left|y(n) - \hat{y}(n)\right|^2\right]}{E\left[|e(n)|^2\right]}.
The results above assume that the signals v(n) and x(n) are uncorrelated to each other, which is generally the case in practice.
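In a simulation where the noiseless output y(n) is available (it is not observable in a real application), this optimal learning rate can be estimated directly; the helper below is an illustrative assumption:

```python
import numpy as np

def nlms_optimal_mu(y, y_hat, e):
    """Estimate mu_opt = E[|y(n) - y_hat(n)|^2] / E[|e(n)|^2] from
    simulated signals (requires access to the noiseless output y)."""
    return np.mean(np.abs(y - y_hat) ** 2) / np.mean(np.abs(e) ** 2)
```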


Proof

Let the filter misalignment be defined as \Lambda(n) = \left|\mathbf{h}(n) - \hat{\mathbf{h}}(n)\right|^2. We can derive the expected misalignment for the next sample as:
: E\left[\Lambda(n+1)\right] = E\left[\left|\hat{\mathbf{h}}(n) + \frac{\mu \, e^{*}(n) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} - \mathbf{h}(n)\right|^2\right]
: E\left[\Lambda(n+1)\right] = E\left[\left|\hat{\mathbf{h}}(n) + \frac{\mu \left(v^{*}(n) + y^{*}(n) - \hat{y}^{*}(n)\right)\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} - \mathbf{h}(n)\right|^2\right]
Let \boldsymbol{\delta}(n) = \hat{\mathbf{h}}(n) - \mathbf{h}(n) and r(n) = \hat{y}(n) - y(n), so that e^{*}(n) = -\left(r(n) - v(n)\right)^{*}:
: E\left[\Lambda(n+1)\right] = E\left[\left|\boldsymbol{\delta}(n) - \frac{\mu \left(r(n) - v(n)\right)^{*}\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}\right|^2\right]
: E\left[\Lambda(n+1)\right] = E\left[\left(\boldsymbol{\delta}(n) - \frac{\mu \left(r(n) - v(n)\right)^{*}\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}\right)^H \left(\boldsymbol{\delta}(n) - \frac{\mu \left(r(n) - v(n)\right)^{*}\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}\right)\right]
Assuming independence, we have:
: E\left[\Lambda(n+1)\right] = \Lambda(n) + E\left[\left(\frac{\mu \left(r(n) - v(n)\right)^{*}\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}\right)^H \left(\frac{\mu \left(r(n) - v(n)\right)^{*}\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}\right)\right] - 2 E\left[\frac{\mu |r(n)|^2}{\mathbf{x}^H(n)\mathbf{x}(n)}\right]
: E\left[\Lambda(n+1)\right] = \Lambda(n) + \frac{\mu^2 \, E\left[|e(n)|^2\right]}{\mathbf{x}^H(n)\mathbf{x}(n)} - \frac{2 \mu \, E\left[|r(n)|^2\right]}{\mathbf{x}^H(n)\mathbf{x}(n)}
The optimal learning rate is found at \frac{dE\left[\Lambda(n+1)\right]}{d\mu} = 0, which leads to:
: 2 \mu \, E\left[|e(n)|^2\right] - 2 E\left[|r(n)|^2\right] = 0
: \mu = \frac{E\left[|r(n)|^2\right]}{E\left[|e(n)|^2\right]}


See also

* Recursive least squares
* For statistical techniques relevant to LMS filter see Least squares.
* Similarities between Wiener and LMS
* Multidelay block frequency domain adaptive filter
* Zero-forcing equalizer
* Kernel adaptive filter
* Matched filter
* Wiener filter


References

* Monson H. Hayes: ''Statistical Digital Signal Processing and Modeling,'' Wiley, 1996.
* Simon Haykin: ''Adaptive Filter Theory,'' Prentice Hall, 2002.
* Simon S. Haykin, Bernard Widrow (Editor): ''Least-Mean-Square Adaptive Filters,'' Wiley, 2003.
* Bernard Widrow, Samuel D. Stearns: ''Adaptive Signal Processing,'' Prentice Hall, 1985.
* Weifeng Liu, Jose Principe and Simon Haykin: ''Kernel Adaptive Filtering: A Comprehensive Introduction,'' John Wiley, 2010.
* Paulo S. R. Diniz: ''Adaptive Filtering: Algorithms and Practical Implementation,'' Kluwer Academic Publishers, 1997, ISBN 0-7923-9912-9.


External links


* LMS Algorithm in Adaptive Antenna Arrays, www.antenna-theory.com
* www.advsolned.com

