The Kaczmarz method or Kaczmarz's algorithm is an iterative algorithm for solving linear equation systems $Ax = b$. It was first discovered by the Polish mathematician Stefan Kaczmarz, and was rediscovered in the field of image reconstruction from projections by Richard Gordon, Robert Bender, and Gabor Herman in 1970, where it is called the Algebraic Reconstruction Technique (ART). ART includes the positivity constraint, making it nonlinear.

The Kaczmarz method is applicable to any linear system of equations, but its computational advantage relative to other methods depends on the system being sparse. It has been demonstrated to be superior, in some biomedical imaging applications, to other methods such as the filtered backprojection method.

It has many applications ranging from computed tomography (CT) to signal processing. It can also be obtained by applying the method of successive projections onto convex sets (POCS) to the hyperplanes described by the linear system.

Algorithm 1: Kaczmarz algorithm

The original Kaczmarz algorithm solves a complex-valued system of linear equations $Ax = b$. Let $a_i$ be the conjugate transpose of the $i$-th row of $A$. Initialize $x_0$ to be an arbitrary complex-valued initial approximation (e.g. $x_0 = 0$). For $k = 0, 1, \ldots$ compute:

: $x_{k+1} = x_k + \frac{b_{i_k} - \langle a_{i_k}, x_k \rangle}{\|a_{i_k}\|^2} a_{i_k} \qquad (1)$

where $i_0, i_1, i_2, \dots$ iterates over the rows of $A$ in any order, deterministic or random. It is only necessary that each row is iterated infinitely often.

When we are in the space of real vectors, the Kaczmarz iteration has a clear geometric meaning: it projects $x_k$ orthogonally onto the hyperplane $\{x : \langle a_{i_k}, x \rangle = b_{i_k}\}$. In this interpretation, it is clear that if the Kaczmarz iteration converges, then it must converge to one of the solutions to $Ax = b$.

A more general algorithm can be defined using a relaxation parameter $\lambda_k$:

: $x_{k+1} = x_k + \lambda_k \frac{b_{i_k} - \langle a_{i_k}, x_k \rangle}{\|a_{i_k}\|^2} a_{i_k}$

If the system has a solution, $x_k$ converges to the minimum-norm solution, provided that the iterations start with the zero vector. If the rows are iterated in order and $\lambda_k = 1$, then convergence is exponential.

There are versions of the method that converge to a regularized weighted least squares solution when applied to a system of inconsistent equations and, at least as far as initial behavior is concerned, at a lesser cost than other iterative methods, such as the conjugate gradient method.
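The iteration of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `kaczmarz`, the fixed number of sweeps, the constant relaxation value and the cyclic row order are all assumptions made for the example.

```python
import numpy as np

def kaczmarz(A, b, sweeps=200, relaxation=1.0, x0=None):
    """Cyclic Kaczmarz iteration with a constant relaxation parameter.

    Assumptions of this sketch: rows are visited in order 0..m-1, and the
    loop stops after a fixed number of sweeps (no convergence test).
    """
    A = np.asarray(A)
    b = np.asarray(b)
    m, n = A.shape
    dtype = np.result_type(A, b, float)
    x = np.zeros(n, dtype=dtype) if x0 is None else np.array(x0, dtype=dtype)
    row_norms_sq = np.sum(np.abs(A) ** 2, axis=1)
    for _ in range(sweeps):
        for i in range(m):
            if row_norms_sq[i] == 0.0:
                continue  # a zero row defines no hyperplane to project onto
            # <a_i, x> is (row i of A) times x, since a_i is the conjugate
            # transpose of that row
            residual = b[i] - A[i] @ x
            x = x + relaxation * residual / row_norms_sq[i] * A[i].conj()
    return x

# Consistent 2x2 system whose solution is (0.8, 1.4)
x = kaczmarz(np.array([[2.0, 1.0], [1.0, 3.0]]), np.array([3.0, 5.0]))
```

Starting from the zero vector on an underdetermined consistent system, the same routine returns the minimum-norm solution, in line with the statement above.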

Algorithm 2: Randomized Kaczmarz algorithm

In 2009, a randomized version of the Kaczmarz method for overdetermined linear systems was introduced by Thomas Strohmer and Roman Vershynin, in which the $i$-th equation is selected randomly with probability proportional to $\|a_i\|^2.$

This method can be seen as a particular case of stochastic gradient descent.

Under such circumstances $x_k$ converges exponentially fast to the solution of $Ax = b,$ and the rate of convergence depends only on the scaled condition number $\kappa(A)$.

:Theorem. Let $x$ be the solution of $Ax = b.$ Then Algorithm 2 converges to $x$ in expectation, with the average error:

:: $\mathbb{E}\|x_k - x\|^2 \leq \left(1 - \kappa(A)^{-2}\right)^k \cdot \|x_0 - x\|^2.$
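A minimal NumPy sketch of the randomized row selection follows. The function name `randomized_kaczmarz`, the fixed iteration count and the seed are assumptions of this example, and the code is restricted to real-valued systems for brevity.

```python
import numpy as np

def randomized_kaczmarz(A, b, iterations=5000, seed=0):
    """Randomized Kaczmarz: row i is drawn with probability ||a_i||^2 / ||A||_F^2."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    row_norms_sq = np.sum(A * A, axis=1)
    probabilities = row_norms_sq / row_norms_sq.sum()
    x = np.zeros(A.shape[1])
    # Draw all row indices up front; zero rows have probability 0
    for i in rng.choice(A.shape[0], size=iterations, p=probabilities):
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]
    return x

# Overdetermined consistent test system (synthetic data for illustration)
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
x_true = rng.standard_normal(5)
x_rk = randomized_kaczmarz(A, A @ x_true)
```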

Proof

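We have

: $\sum_{j=1}^m |\langle z, a_j \rangle|^2 \geq \frac{\|z\|^2}{\|A^{-1}\|^2} \quad \text{for all } z \in \mathbb{C}^n. \qquad (2)$

Using $\|A\|^2 = \sum_{j=1}^m \|a_j\|^2$ (so $\|A\|$ denotes the Frobenius norm here) and $\kappa(A) = \|A\|\,\|A^{-1}\|$, we can write (2) as

: $\sum_{j=1}^m \frac{\|a_j\|^2}{\|A\|^2} \left|\left\langle z, \frac{a_j}{\|a_j\|} \right\rangle\right|^2 \geq \kappa(A)^{-2} \|z\|^2 \quad \text{for all } z \in \mathbb{C}^n. \qquad (3)$

The main point of the proof is to view the left hand side in (3) as an expectation of some random variable. Namely, recall that the solution space of the $j$-th equation of $Ax = b$ is the hyperplane

: $\{y : \langle y, a_j \rangle = b_j\},$

whose normal is $\tfrac{a_j}{\|a_j\|}.$ Define a random vector $Z$ whose values are the normals to all the equations of $Ax = b$, with probabilities as in our algorithm:

: $Z = \frac{a_j}{\|a_j\|}$ with probability $\frac{\|a_j\|^2}{\|A\|^2}, \qquad j = 1, \ldots, m.$

Then (3) says that

: $\mathbb{E}|\langle z, Z \rangle|^2 \geq \kappa(A)^{-2} \|z\|^2 \quad \text{for all } z \in \mathbb{C}^n. \qquad (4)$

The orthogonal projection $P$ onto the solution space of a random equation of $Ax = b$ is given by $Pz = z - \langle z - x, Z \rangle Z.$

Now we are ready to analyze our algorithm. We want to show that the error $\|x_k - x\|^2$ reduces at each step in average (conditioned on the previous steps) by at least the factor of $(1 - \kappa(A)^{-2}).$ The next approximation $x_k$ is computed from $x_{k-1}$ as $x_k = P_k x_{k-1},$ where $P_1, P_2, \ldots$ are independent realizations of the random projection $P.$ The vector $x_{k-1} - x_k$ is in the kernel of $P_k.$ It is orthogonal to the solution space of the equation onto which $P_k$ projects, which contains the vector $x_k - x$ (recall that $x$ is the solution to all equations). The orthogonality of these two vectors then yields

: $\|x_k - x\|^2 = \|x_{k-1} - x\|^2 - \|x_{k-1} - x_k\|^2.$

To complete the proof, we have to bound $\|x_{k-1} - x_k\|^2$ from below. By the definition of $x_k$, we have

: $\|x_{k-1} - x_k\| = |\langle x_{k-1} - x, Z_k \rangle|,$

where $Z_1, Z_2, \ldots$ are independent realizations of the random vector $Z.$ Thus

: $\|x_k - x\|^2 \leq \left(1 - \left|\left\langle \frac{x_{k-1} - x}{\|x_{k-1} - x\|}, Z_k \right\rangle\right|^2\right) \|x_{k-1} - x\|^2.$

Now we take the expectation of both sides conditional upon the choice of the random vectors $Z_1, \ldots, Z_{k-1}$ (hence we fix the choice of the random projections $P_1, \ldots, P_{k-1}$ and thus the random vectors $x_1, \ldots, x_{k-1}$, and we average over the random vector $Z_k$). Then

: $\mathbb{E}_{|Z_1, \ldots, Z_{k-1}} \|x_k - x\|^2 = \left(1 - \mathbb{E}_{|Z_1, \ldots, Z_{k-1}} \left|\left\langle \frac{x_{k-1} - x}{\|x_{k-1} - x\|}, Z_k \right\rangle\right|^2\right) \|x_{k-1} - x\|^2.$

By (4) and the independence,

: $\mathbb{E}_{|Z_1, \ldots, Z_{k-1}} \|x_k - x\|^2 \leq (1 - \kappa(A)^{-2}) \|x_{k-1} - x\|^2.$

Taking the full expectation of both sides, we conclude that

: $\mathbb{E}\|x_k - x\|^2 \leq (1 - \kappa(A)^{-2})\, \mathbb{E}\|x_{k-1} - x\|^2. \qquad \blacksquare$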
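The key inequality of the proof, that the expected squared inner product with a random unit normal is at least $\kappa(A)^{-2}\|z\|^2$, can be checked numerically because the expectation is a finite weighted sum over the rows. The sketch below assumes a real full-column-rank matrix and takes $\kappa(A) = \|A\|_F / \sigma_{\min}(A)$, where $\sigma_{\min}$ is the smallest singular value; the matrix and the vector `z` are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # synthetic full-column-rank matrix
z = rng.standard_normal(3)        # arbitrary test vector

row_norms = np.linalg.norm(A, axis=1)
probabilities = row_norms**2 / np.sum(row_norms**2)  # P(Z = a_j / ||a_j||)
normals = A / row_norms[:, None]                     # unit normals a_j / ||a_j||

# E|<z, Z>|^2 evaluated exactly as a weighted sum over the m rows
expected_sq = np.sum(probabilities * (normals @ z) ** 2)

# kappa(A)^{-2} = sigma_min(A)^2 / ||A||_F^2 (assumed definition, see lead-in)
sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
kappa_inv_sq = sigma_min**2 / np.linalg.norm(A, "fro") ** 2

lower_bound = kappa_inv_sq * (z @ z)
```

The weighted sum collapses to $\|Az\|^2 / \|A\|_F^2$, which makes the role of the smallest singular value in the bound explicit.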

The superiority of this selection was illustrated with the reconstruction of a bandlimited function from its nonuniformly spaced sampling values. However, it has been pointed out that the reported success by Strohmer and Vershynin depends on the specific choices that were made there in translating the underlying problem, whose geometrical nature is to ''find a common point of a set of hyperplanes'', into a system of algebraic equations. There will always be legitimate algebraic representations of the underlying problem for which the selection method of Strohmer and Vershynin will perform in an inferior manner.

The Kaczmarz iteration of Algorithm 1 has a purely geometric interpretation: the algorithm successively projects the current iterate onto the hyperplane defined by the next equation. Hence, any scaling of the equations is irrelevant; it can also be seen from the update formula that any (nonzero) scaling of the equations cancels out. Thus, in the randomized Kaczmarz (RK) method, one can use $\|a_i\|$ or any other weights that may be relevant. Specifically, in the above-mentioned reconstruction example, the equations were chosen with probability proportional to the average distance of each sample point from its two nearest neighbors, a concept introduced by Feichtinger and Gröchenig. Further progress on this topic has been reported in the literature.

Algorithm 3: Gower-Richtarik algorithm

In 2015, Robert M. Gower and Peter Richtarik developed a versatile randomized iterative method for solving a consistent system of linear equations $Ax = b$ which includes the randomized Kaczmarz algorithm as a special case. Other special cases include randomized coordinate descent, randomized Gaussian descent and the randomized Newton method. Block versions and versions with importance sampling of all these methods also arise as special cases. The method is shown to enjoy an exponential rate of decay (in expectation), also known as linear convergence, under very mild conditions on the way randomness enters the algorithm. The Gower-Richtarik method is the first algorithm uncovering a "sibling" relationship between these methods, some of which were independently proposed before, while many were new.

Insights about Randomized Kaczmarz

Interesting new insights about the randomized Kaczmarz method that can be gained from the analysis of the method include:
* The general rate of the Gower-Richtarik algorithm precisely recovers the rate of the randomized Kaczmarz method in the special case when it reduces to it.
* The choice of probabilities for which the randomized Kaczmarz algorithm was originally formulated and analyzed (probabilities proportional to the squares of the row norms) is not optimal. Optimal probabilities are the solution of a certain semidefinite program. The theoretical complexity of randomized Kaczmarz with the optimal probabilities can be arbitrarily better than the complexity for the standard probabilities. However, the amount by which it is better depends on the matrix $A$. There are problems for which the standard probabilities are optimal.
* When applied to a system with a positive definite matrix $A$, the randomized Kaczmarz method is equivalent to the stochastic gradient descent (SGD) method (with a very special stepsize) for minimizing the strongly convex quadratic function $f(x) = \tfrac{1}{2} x^T A x - b^T x.$ Note that since $f$ is convex, the minimizers of $f$ must satisfy $\nabla f(x) = 0$, which is equivalent to $Ax = b.$ The "special stepsize" is the stepsize which leads to a point which, in the one-dimensional line spanned by the stochastic gradient, minimizes the Euclidean distance from the unknown(!) minimizer of $f$, namely, from $x^* = A^{-1} b.$ This insight is gained from a dual view of the iterative process (below described as "Optimization Viewpoint: Constrain and Approximate").

Six Equivalent Formulations

The Gower-Richtarik method enjoys six seemingly different but equivalent formulations, shedding additional light on how to interpret it (and, as a consequence, how to interpret its many variants, including randomized Kaczmarz):
* 1. Sketching viewpoint: Sketch & Project
* 2. Optimization viewpoint: Constrain and Approximate
* 3. Geometric viewpoint: Random Intersect
* 4. Algebraic viewpoint 1: Random Linear Solve
* 5. Algebraic viewpoint 2: Random Update
* 6. Analytic viewpoint: Random Fixed Point

We now describe some of these viewpoints. The method depends on 2 parameters:
* a positive definite matrix $B$ giving rise to a weighted Euclidean inner product $\langle x, y \rangle_B := x^T B y$ and the induced norm $\|x\|_B = \left(\langle x, x \rangle_B\right)^{1/2},$
* and a random matrix $S$ with as many rows as $A$ (and a possibly random number of columns).

1. Sketch and Project

Given the previous iterate $x^k,$ the new point $x^{k+1}$ is computed by drawing a random matrix $S$ (in an iid fashion from some fixed distribution), and setting

: $x^{k+1} = \underset{x}{\operatorname{arg\,min}} \|x - x^k\|_B \quad \text{subject to} \quad S^T A x = S^T b.$

That is, $x^{k+1}$ is obtained as the projection of $x^k$ onto the randomly sketched system $S^T A x = S^T b$. The idea behind this method is to pick $S$ in such a way that a projection onto the sketched system is substantially simpler than the solution of the original system $Ax = b$. The randomized Kaczmarz method is obtained by picking $B$ to be the identity matrix, and $S$ to be the $i^{th}$ unit coordinate vector with probability $p_i = \|a_i\|_2^2 / \|A\|_F^2.$ Different choices of $B$ and $S$ lead to different variants of the method.
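For the special case where $B$ is the identity, the projection onto the sketched system can be sketched with a pseudo-inverse of the (small) sketched matrix. The function name `sketch_and_project_step` and the test data are assumptions of this example; it is restricted to real matrices and to $B = I$.

```python
import numpy as np

def sketch_and_project_step(A, b, x, S):
    """One sketch-and-project step for B = I.

    Returns the Euclidean projection of x onto {x : S^T A x = S^T b}.
    S must be a 2-D array with as many rows as A.
    """
    sketched_A = S.T @ A
    sketched_residual = S.T @ (b - A @ x)
    # pinv yields the minimum-norm correction that satisfies the sketched system
    return x + np.linalg.pinv(sketched_A) @ sketched_residual

# Synthetic consistent system
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
b = A @ rng.standard_normal(4)
x0 = np.zeros(4)

# S = e_i (as a single column) reproduces one Kaczmarz step on row i
i = 2
S = np.eye(6)[:, [i]]
x_sketch = sketch_and_project_step(A, b, x0, S)
x_kaczmarz = x0 + (b[i] - A[i] @ x0) / (A[i] @ A[i]) * A[i]
```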

2. Constrain and Approximate

A seemingly different but entirely equivalent formulation of the method (obtained via Lagrangian duality) is

: $x^{k+1} = \underset{x}{\operatorname{arg\,min}} \|x - x^*\|_B \quad \text{subject to} \quad x = x^k + B^{-1} A^T S y,$

where $y$ is also allowed to vary, and where $x^*$ is any solution of the system $Ax = b.$ Hence, $x^{k+1}$ is obtained by first constraining the update to the linear subspace spanned by the columns of the random matrix $B^{-1} A^T S$, i.e., to

: $\left\{ B^{-1} A^T S y : y \text{ arbitrary} \right\},$

and then choosing the point $x$ from this subspace which best approximates $x^*$. This formulation may look surprising, as it seems impossible to perform the approximation step due to the fact that $x^*$ is not known (after all, this is what we are trying to compute!). However, it is still possible to do this, simply because $x^{k+1}$ computed this way is the same as $x^{k+1}$ computed via the sketch and project formulation, and $x^*$ does not appear there.
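As a worked illustration of why the unknown $x^*$ drops out, consider the real case with $B = I$ and $S$ equal to the $i$-th unit coordinate vector, so that $A^T S = a_i$ and $y$ is a scalar. The short derivation below is a sketch under exactly these assumptions; it uses only $\langle a_i, x^* \rangle = b_i$.

```latex
\begin{aligned}
y^k &= \operatorname*{arg\,min}_{y \in \mathbb{R}} \left\| x^k + y\,a_i - x^* \right\|_2^2
     = \frac{\langle a_i,\, x^* - x^k \rangle}{\|a_i\|_2^2}
     = \frac{b_i - \langle a_i, x^k \rangle}{\|a_i\|_2^2}, \\
x^{k+1} &= x^k + \frac{b_i - \langle a_i, x^k \rangle}{\|a_i\|_2^2}\, a_i .
\end{aligned}
```

The minimizer depends on $x^*$ only through the known right-hand side entry $b_i$, and the resulting step is the Kaczmarz projection onto the $i$-th hyperplane.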

5. Random Update

The update can also be written explicitly as

: $x^{k+1} = x^k - B^{-1} A^T S \left(S^T A B^{-1} A^T S\right)^{\dagger} S^T \left(Ax^k - b\right),$

where by $M^\dagger$ we denote the Moore-Penrose pseudo-inverse of the matrix $M$. Hence, the method can be written in the form $x^{k+1} = x^k + h^k$, where $h^k$ is a random update vector.

Letting $M = S^T A B^{-1} A^T S,$ it can be shown that the system $M y = S^T (Ax^k - b)$ always has a solution $y^k$, and that for all such solutions the vector $x^k - B^{-1} A^T S y^k$ is the same. Hence, it does not matter which of these solutions is chosen, and the method can also be written as $x^{k+1} = x^k - B^{-1} A^T S y^k$. The pseudo-inverse just leads to one particular solution. The role of the pseudo-inverse is twofold:
* It allows the method to be written in the explicit "random update" form as above,
* It makes the analysis simple through the final, sixth, formulation.
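The explicit update $x^{k+1} = x^k - B^{-1} A^T S (S^T A B^{-1} A^T S)^\dagger S^T (Ax^k - b)$ translates almost literally into NumPy. The function name `random_update_step`, the Gaussian sketch and the diagonal choice of $B$ below are assumptions made for illustration, for real matrices only.

```python
import numpy as np

def random_update_step(A, b, x, S, B=None):
    """One step x - B^{-1} A^T S (S^T A B^{-1} A^T S)^+ S^T (A x - b)."""
    n = A.shape[1]
    B = np.eye(n) if B is None else B
    B_inv_At_S = np.linalg.solve(B, A.T @ S)   # B^{-1} A^T S without forming B^{-1}
    M = S.T @ A @ B_inv_At_S                   # S^T A B^{-1} A^T S
    y = np.linalg.pinv(M) @ (S.T @ (A @ x - b))
    return x - B_inv_At_S @ y

# Synthetic consistent system, a two-column Gaussian sketch, and a diagonal B
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
b = A @ rng.standard_normal(4)
x0 = rng.standard_normal(4)
S = rng.standard_normal((6, 2))
B = np.diag([1.0, 2.0, 3.0, 4.0])

x_next = random_update_step(A, b, x0, S, B)
```

After the step, the new iterate satisfies the sketched equations $S^T A x = S^T b$; with $B = I$ and $S$ a unit coordinate vector, the step reduces to a single Kaczmarz projection.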

6. Random Fixed Point

If we subtract $x^*$ from both sides of the random update formula, denote

: $Z := A^T S \left(S^T A B^{-1} A^T S\right)^\dagger S^T A,$

and use the fact that $Ax^* = b,$ we arrive at the last formulation:

: $x^{k+1} - x^* = \left(I - B^{-1} Z\right) \left(x^k - x^*\right),$

where $I$ is the identity matrix. The iteration matrix, $I - B^{-1} Z,$ is random, whence the name of this formulation.

Convergence

By taking conditional expectations in the 6th formulation (conditional on $x^k$), we obtain

: $\mathbb{E}\left[x^{k+1} - x^* \mid x^k\right] = \left(I - B^{-1} \mathbb{E}[Z]\right) \left[x^k - x^*\right].$

By taking expectation again, and using the tower property of expectations, we obtain

: $\mathbb{E}\left[x^{k+1} - x^*\right] = \left(I - B^{-1} \mathbb{E}[Z]\right) \mathbb{E}\left[x^k - x^*\right].$

Gower and Richtarik show that

: $\rho := \left\|I - B^{-1} \mathbb{E}[Z]\right\|_B = \lambda_{\max}\left(I - B^{-1} \mathbb{E}[Z]\right),$

where the matrix norm is defined by

: $\|M\|_B := \max_{x \neq 0} \frac{\|Mx\|_B}{\|x\|_B}.$

Moreover, without any assumptions on $S$ one has $0 \leq \rho \leq 1.$ By taking norms and unrolling the recurrence, we obtain

Theorem [Gower & Richtarik 2015]
: $\left\| \mathbb{E}\left[x^k - x^*\right] \right\|_B \leq \rho^k \|x^0 - x^*\|_B.$

''Remark''. A sufficient condition for the expected residuals to converge to 0 is $\rho < 1.$ This can be achieved if $A$ has full column rank and under very mild conditions on $S.$ Convergence of the method can also be established without the full column rank assumption in a different way. It is also possible to show a stronger result:

Theorem [Gower & Richtarik 2015]
The expected squared norms (rather than norms of expectations) converge at the same rate:

: $\mathbb{E}\left[ \left\| x^k - x^* \right\|^2_B \right] \leq \rho^k \left\| x^0 - x^* \right\|^2_B.$

''Remark''. This second type of convergence is stronger due to the following identity, which holds for any random vector $x$ and any fixed vector $x^*$:

: $\left\| \mathbb{E}\left[x - x^*\right] \right\|^2 = \mathbb{E}\left[ \left\| x - x^* \right\|^2 \right] - \mathbb{E}\left[ \left\| x - \mathbb{E}[x] \right\|^2 \right].$

Convergence of Randomized Kaczmarz
We have seen that the randomized Kaczmarz method appears as a special case of the Gower-Richtarik method for $B = I$ and $S$ being the $i^{th}$ unit coordinate vector with probability $p_i = \|a_i\|_2^2 / \|A\|_F^2,$ where $a_i$ is the $i^{th}$ row of $A.$ It can be checked by direct calculation that

: $\rho = \left\|I - B^{-1} \mathbb{E}[Z]\right\|_B = 1 - \frac{\lambda_{\min}(A^T A)}{\|A\|_F^2}.$
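The direct calculation can be mirrored numerically. With $B = I$ and $S = e_i$, the matrix $Z$ becomes $a_i a_i^T / \|a_i\|^2$, so the expectation is a finite sum over the rows. The snippet below is a sketch on a synthetic real matrix; the variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 4))          # synthetic full-column-rank matrix
fro_sq = np.linalg.norm(A, "fro") ** 2

# E[Z] = sum_i p_i * a_i a_i^T / ||a_i||^2 with p_i = ||a_i||^2 / ||A||_F^2
expected_Z = np.zeros((4, 4))
for a_i in A:
    p_i = (a_i @ a_i) / fro_sq
    expected_Z += p_i * np.outer(a_i, a_i) / (a_i @ a_i)

# Spectral norm of the expected iteration matrix (B = I)
rho_from_norm = np.linalg.norm(np.eye(4) - expected_Z, 2)

# Closed form 1 - lambda_min(A^T A) / ||A||_F^2; eigvalsh sorts ascending
rho_closed_form = 1.0 - np.linalg.eigvalsh(A.T @ A)[0] / fro_sq
```

The sum collapses to $\mathbb{E}[Z] = A^T A / \|A\|_F^2$, which is why only the smallest eigenvalue of $A^T A$ and the Frobenius norm enter the rate.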

Further Special Cases

Algorithm 4: PLSS-Kaczmarz
Since the convergence of the (randomized) Kaczmarz method depends on a rate of convergence, the method may make slow progress on some practical problems. To ensure finite termination of the method, Johannes Brust and Michael Saunders have developed a process that generalizes the (randomized) Kaczmarz iteration and terminates in at most $m$ iterations to a solution for the consistent system $Ax = b$. The process is based on dimensionality reduction, or projections onto lower dimensional spaces, which is how it derives its name PLSS (Projected Linear Systems Solver). An iteration of PLSS-Kaczmarz can be regarded as the generalization

: $x^{k+1} = x^k + A_{1:k,:}^T \left(A_{1:k,:} A_{1:k,:}^T\right)^{-1} \left(b_{1:k} - A_{1:k,:} x^k\right)$

where $A_{1:k,:}$ is the selection of rows 1 to $k$ and all columns of $A$. A randomized version of the method uses $k$ non-repeated row indices at each iteration: $\{i_1, \ldots, i_{k-1}, i_k\}$, where each $i_j$ is in $1, 2, \ldots, m$. The iteration converges to a solution when $k = m$. In particular, since $A_{1:m,:} = A$ it holds that

: $Ax^{m+1} = Ax^m + AA^T (AA^T)^{-1} (b - Ax^m) = b$

and therefore $x^{m+1}$ is a solution to the linear system. The computation of iterates in PLSS-Kaczmarz can be simplified and organized effectively. The resulting algorithm only requires matrix-vector products and has a direct form:

 algorithm PLSS-Kaczmarz is
     input: matrix ''A'', right hand side ''b''
     output: solution ''x'' such that ''Ax=b''

     ''x'' := 0, ''P'' = [0]
     for ''k'' in 1,2,...,m do
         ''a'' := ''A(i_k,:)' ''                     // Select an index i_k in 1,...,m without resampling
         ''d'' := ''P' * a''
         ''c₁'' := ''norm(a)''
         ''c₂'' := ''norm(d)''
         ''c₃'' := ''(b_{i_k} - x'a) / ((c₁-c₂)(c₁+c₂))''
         ''p'' := ''c₃ (a - P(P'a))''
         ''P'' := ''[P, p/norm(p)]''              // Append a normalized update
         ''x'' := ''x + p''
     return ''x''
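The direct-form pseudocode can be sketched in NumPy as follows. This is an illustrative transcription under stated assumptions, not the authors' implementation: the function name `plss_kaczmarz`, the `order` and `tol` parameters, and the rule that skips rows which are numerically dependent on earlier ones are all additions for this example. It also appends the normalized orthogonal component of the row to `P` rather than `p/norm(p)`; the two agree up to sign, and the former avoids a division by zero when the current residual vanishes.

```python
import numpy as np

def plss_kaczmarz(A, b, order=None, tol=1e-12):
    """PLSS-Kaczmarz sketch for a consistent real system A x = b."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = A.shape
    order = range(m) if order is None else order
    x = np.zeros(n)
    P = np.zeros((n, 0))                  # orthonormal directions of earlier updates
    for i in order:                       # each row index used at most once
        a = A[i]
        d = P.T @ a
        a_perp = a - P @ d                # part of a orthogonal to earlier rows
        gap = (a @ a) - (d @ d)           # equals (c1 - c2)(c1 + c2) = ||a_perp||^2
        if gap <= tol * (a @ a):
            continue                      # assumption: skip (near-)dependent rows
        p = (b[i] - a @ x) / gap * a_perp
        x = x + p                         # row i is now satisfied; earlier rows stay satisfied
        P = np.column_stack([P, a_perp / np.linalg.norm(a_perp)])
    return x

# Synthetic consistent system with more unknowns than equations
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
b = A @ rng.standard_normal(8)
x_plss = plss_kaczmarz(A, b, order=rng.permutation(5))
```

Because every update lies in the row space of $A$ and the iteration starts at zero, the result in this example coincides with the minimum-norm solution.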


External links

A randomized Kaczmarz algorithm with exponential convergence

Comments on the randomized Kaczmarz method

Kaczmarz algorithm in training Kolmogorov-Arnold network