Conjugate Gradient Method (Another Perspective)
Krylov Subspace
Let us introduce a different way to derive the CG method.
Krylov Subspace: Given \(\boldsymbol{r}_0\), let
\[
\mathbf{K}_n=\mathrm{span}\left\{ \boldsymbol{r}_0,\boldsymbol{Ar}_0,\boldsymbol{A}^2\boldsymbol{r}_0,\cdots ,\boldsymbol{A}^{n-1}\boldsymbol{r}_0 \right\} \triangleq \mathbf{K}_n\left( \boldsymbol{A},\boldsymbol{r}_0 \right)
\]
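For concreteness, here is a minimal NumPy sketch that stacks the Krylov vectors \(\boldsymbol{r}_0,\boldsymbol{Ar}_0,\cdots ,\boldsymbol{A}^{n-1}\boldsymbol{r}_0\) as columns (the names `A`, `r0`, `n` are illustrative placeholders, not from the text):

```python
import numpy as np

def krylov_matrix(A, r0, n):
    """Columns r0, A r0, ..., A^{n-1} r0; in exact arithmetic they span K_n(A, r0)."""
    cols = [r0]
    for _ in range(n - 1):
        cols.append(A @ cols[-1])   # multiply the previous column by A
    return np.column_stack(cols)
```

In floating-point arithmetic these columns quickly become nearly linearly dependent (each step is a power iteration), which is why the derivation below works with an orthonormal basis of \(\mathbf{K}_n\) instead.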
Claim
: If we take \(\mathbf{L}_n=\mathbf{K}_n\) and \(\boldsymbol{A}\) is SPD, then applying the projection method yields the CG algorithm.
To apply the projection method, we need a basis for \(\mathbf{L}_n=\mathbf{K}_n\).
Goal
: Construct an orthonormal basis for \(\mathbf{K}_n\), denoted by \(\boldsymbol{V}_n\).
We can apply the Modified Gram-Schmidt algorithm to construct an orthonormal basis for the Krylov subspace \(\mathbf{K}_n\). When \(\boldsymbol{A}\) is a Hermitian matrix, the resulting algorithm is called the Lanczos algorithm.
Algorithm
( Lanczos ):
Choose an initial vector \(\boldsymbol{v}_1\) with \(\left\| \boldsymbol{v}_1 \right\| =1\);
Set \(\beta _1=0\) and \(\boldsymbol{v}_0=\boldsymbol{0}\);
For \(j=1,2,\cdots ,n\), do:
- \(\boldsymbol{w}_j=\boldsymbol{Av}_j-\beta _j\boldsymbol{v}_{j-1}\);
- \(\alpha _j=\left< \boldsymbol{w}_j,\boldsymbol{v}_j \right>\);
- \(\boldsymbol{w}_j=\boldsymbol{w}_j-\alpha _j\boldsymbol{v}_j\);
- \(\beta _{j+1}=\left\| \boldsymbol{w}_j \right\|\); If \(\beta _{j+1}=0\), stop;
- \(\boldsymbol{v}_{j+1}=\frac{\boldsymbol{w}_j}{\beta _{j+1}}\);
End
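A minimal NumPy sketch of this iteration, following the steps above (assuming \(\boldsymbol{A}\) is symmetric; the function name and return conventions are illustrative):

```python
import numpy as np

def lanczos(A, v1, n):
    """Return V (columns v_1..v_n), alphas (alpha_1..alpha_n), betas (beta_2..beta_n)."""
    m = v1.shape[0]
    V = np.zeros((m, n))
    V[:, 0] = v1 / np.linalg.norm(v1)        # ||v_1|| = 1
    alphas = np.zeros(n)
    betas = np.zeros(n + 1)                  # betas[j] stores beta_{j+1}; betas[0] = beta_1 = 0
    for j in range(n):
        w = A @ V[:, j] - betas[j] * (V[:, j - 1] if j > 0 else 0.0)
        alphas[j] = w @ V[:, j]
        w = w - alphas[j] * V[:, j]
        betas[j + 1] = np.linalg.norm(w)
        if betas[j + 1] == 0.0:              # breakdown: the Krylov subspace is A-invariant
            return V[:, :j + 1], alphas[:j + 1], betas[1:j + 1]
        if j + 1 < n:
            V[:, j + 1] = w / betas[j + 1]
    return V, alphas, betas[1:n]
```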
\(\left\{ \boldsymbol{v}_1,\cdots ,\boldsymbol{v}_n \right\}\) forms an orthonormal basis for \(\mathbf{K}_n\). Also:
\[
\beta _{j+1}\boldsymbol{v}_{j+1}=\boldsymbol{w}_j=\boldsymbol{Av}_j-\alpha _j\boldsymbol{v}_j-\beta _j\boldsymbol{v}_{j-1};
\]
\[
\boldsymbol{Av}_j=\beta _{j+1}\boldsymbol{v}_{j+1}+\alpha _j\boldsymbol{v}_j+\beta _j\boldsymbol{v}_{j-1}
\]
Note that, writing
\[
\boldsymbol{V}_n=\left[ \begin{matrix}
\boldsymbol{v}_1& \cdots& \boldsymbol{v}_n\\
\end{matrix} \right] ,
\]
the three-term recurrence takes the matrix form
\[
\boldsymbol{AV}_n=\boldsymbol{V}_n\boldsymbol{T}_n+\beta _{n+1}\boldsymbol{v}_{n+1}{\boldsymbol{e}_n}^T,
\]
where \(\boldsymbol{e}_n\) is the \(n\)-th standard basis vector and:
\[
\boldsymbol{T}_n=\left[ \begin{matrix}
\alpha _1& \beta _2& & & & \boldsymbol{O}\\
\beta _2& \alpha _2& \beta _3& & & \\
& \beta _3& \alpha _3& \beta _4& & \\
& & \ddots& \ddots& \ddots& \\
& & & \ddots& \ddots& \beta _n\\
\boldsymbol{O}& & & & \beta _n& \alpha _n\\
\end{matrix} \right]
\]
Since \({\boldsymbol{V}_n}^T\boldsymbol{V}_n=\boldsymbol{I}\) and \({\boldsymbol{V}_n}^T\boldsymbol{v}_{n+1}=\boldsymbol{0}\), we obtain:
\[
{\boldsymbol{V}_n}^T\boldsymbol{AV}_n=\boldsymbol{T}_n
\]
Apply Projection Method:
\[
\begin{cases}
\boldsymbol{x}^{\left( n \right)}=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{V}_n\boldsymbol{y}^{\left( n \right)}\\
\boldsymbol{y}^{\left( n \right)}=\left( {\boldsymbol{V}_n}^T\boldsymbol{AV}_n \right) ^{-1}{\boldsymbol{V}_n}^T\boldsymbol{r}^{\left( 0 \right)}\\
\end{cases};
\]
Then, choosing \(\boldsymbol{v}_1=\boldsymbol{r}^{\left( 0 \right)}/\left\| \boldsymbol{r}^{\left( 0 \right)} \right\|\) so that \({\boldsymbol{V}_n}^T\boldsymbol{r}^{\left( 0 \right)}=\beta _1\boldsymbol{e}_1\):
\[
\boldsymbol{y}^{\left( n \right)}=\left( \boldsymbol{T}_n \right) ^{-1}\beta _1\boldsymbol{e}_1, \beta _1=\left\| \boldsymbol{r}^{\left( 0 \right)} \right\|
\]
(here \(\beta _1\) denotes \(\left\| \boldsymbol{r}^{\left( 0 \right)} \right\|\), not the zero initialization used in the Lanczos algorithm).
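These relations are easy to check numerically. A small sketch on a random SPD system (illustrative setup, not from the text; it reuses the `lanczos` function from the sketch above):

```python
import numpy as np

# Illustrative setup: random SPD matrix, right-hand side, and x^(0) = 0.
rng = np.random.default_rng(0)
m, n = 50, 10
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)                      # SPD by construction
b = rng.standard_normal(m)
x0 = np.zeros(m)
r0 = b - A @ x0

V, alphas, betas = lanczos(A, r0, n)             # Lanczos sketch above (assumed available)
T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)

print(np.allclose(V.T @ A @ V, T, atol=1e-6))    # V_n^T A V_n = T_n

beta1 = np.linalg.norm(r0)                       # beta_1 = ||r^(0)||
y = np.linalg.solve(T, beta1 * np.eye(n)[:, 0])  # y^(n) = T_n^{-1} (beta_1 e_1)
xn = x0 + V @ y                                  # n-th projected iterate
print(np.linalg.norm(b - A @ xn))                # its residual norm
```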
We now use the LU factorization of \(\boldsymbol{T}_n\) to evaluate \(\left( \boldsymbol{T}_n \right) ^{-1}\beta _1\boldsymbol{e}_1\):
\[
\boldsymbol{T}_n=\boldsymbol{L}_n\boldsymbol{U}_n,
\]
\[
\boldsymbol{L}_n=\left[ \begin{matrix}
1& & & \boldsymbol{O}\\
\lambda _2& \ddots& & \\
& \ddots& \ddots& \\
\boldsymbol{O}& & \lambda _n& 1\\
\end{matrix} \right] , \boldsymbol{U}_n=\left[ \begin{matrix}
\eta _1& \beta _2& & \boldsymbol{O}\\
& \ddots& \ddots& \\
& & \ddots& \beta _n\\
\boldsymbol{O}& & & \eta _n\\
\end{matrix} \right] ;
\]
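Since \(\boldsymbol{A}\) is SPD, \(\boldsymbol{T}_n={\boldsymbol{V}_n}^T\boldsymbol{AV}_n\) is also SPD, so this factorization exists without pivoting, and comparing the entries of \(\boldsymbol{L}_n\boldsymbol{U}_n\) with \(\boldsymbol{T}_n\) gives the short recurrence \(\eta _1=\alpha _1\), \(\lambda _j=\beta _j/\eta _{j-1}\), \(\eta _j=\alpha _j-\lambda _j\beta _j\) for \(j=2,\cdots ,n\). A minimal sketch of this computation (an illustrative helper, not from the text):

```python
import numpy as np

def tridiag_lu(alphas, betas):
    """Bidiagonal LU of the symmetric tridiagonal T_n.

    alphas = alpha_1..alpha_n (diagonal), betas = beta_2..beta_n (off-diagonal).
    Returns lambdas = lambda_2..lambda_n and etas = eta_1..eta_n."""
    n = len(alphas)
    etas = np.zeros(n)
    lambdas = np.zeros(n - 1)
    etas[0] = alphas[0]                                      # eta_1 = alpha_1
    for j in range(1, n):
        lambdas[j - 1] = betas[j - 1] / etas[j - 1]          # lambda_{j+1} = beta_{j+1} / eta_j
        etas[j] = alphas[j] - lambdas[j - 1] * betas[j - 1]  # eta_{j+1} = alpha_{j+1} - lambda_{j+1} beta_{j+1}
    return lambdas, etas
```

As a quick check, `np.eye(n) + np.diag(lambdas, -1)` and `np.diag(etas) + np.diag(betas, 1)` multiply back to \(\boldsymbol{T}_n\) up to rounding.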
\[
\boldsymbol{x}^{\left( n \right)}=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{V}_n\boldsymbol{y}^{\left( n \right)}=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{V}_n\left( \boldsymbol{T}_n \right) ^{-1}\beta _1\boldsymbol{e}_1
\]
\[
=\boldsymbol{x}^{\left( 0 \right)}+\underset{\boldsymbol{P}_n}{\underbrace{\boldsymbol{V}_n\left( \boldsymbol{U}_n \right) ^{-1}}}\cdot \underset{\boldsymbol{z}_n}{\underbrace{\left( \boldsymbol{L}_n \right) ^{-1}\beta _1\boldsymbol{e}_1}}
\]
\[
=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{P}_n\boldsymbol{z}_n=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{P}_{n-1}\boldsymbol{z}_{n-1}+\xi _n\boldsymbol{p}_n
\]
\[
=\boldsymbol{x}^{\left( n-1 \right)}+\xi _n\boldsymbol{p}_n
\]
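The last equality uses the block structure of the factors: since \(\boldsymbol{L}_n\) is unit lower bidiagonal, its leading \(\left( n-1 \right) \times \left( n-1 \right)\) block is \(\boldsymbol{L}_{n-1}\), so the first \(n-1\) entries of \(\boldsymbol{z}_n={\boldsymbol{L}_n}^{-1}\beta _1\boldsymbol{e}_1\) are exactly \(\boldsymbol{z}_{n-1}\):
\[
\boldsymbol{z}_n=\left[ \begin{matrix}
\boldsymbol{z}_{n-1}\\
\xi _n\\
\end{matrix} \right] , \boldsymbol{P}_n=\left[ \begin{matrix}
\boldsymbol{P}_{n-1}& \boldsymbol{p}_n\\
\end{matrix} \right] \Rightarrow \boldsymbol{P}_n\boldsymbol{z}_n=\boldsymbol{P}_{n-1}\boldsymbol{z}_{n-1}+\xi _n\boldsymbol{p}_n
\]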
We need to determine \(\xi _n\) and \(\boldsymbol{p}_n\) such that \(\boldsymbol{x}^{\left( n \right)}\) is the solution of the projection method.
To summarize what we have so far:
\[
\boldsymbol{x}^{\left( n \right)}=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{P}_n\boldsymbol{z}_n,
\]
\[
\boldsymbol{P}_n=\boldsymbol{V}_n{\boldsymbol{U}_n}^{-1}, \boldsymbol{z}_n={\boldsymbol{L}_n}^{-1}\beta _1\boldsymbol{e}_1;
\]
\[
\boldsymbol{P}_n=\left[ \begin{matrix}
\boldsymbol{p}_1& \boldsymbol{p}_2& \cdots& \boldsymbol{p}_n\\
\end{matrix} \right] ,
\]
\[
\boldsymbol{x}^{\left( n \right)}=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{P}_{n-1}\boldsymbol{z}_{n-1}+\xi _n\boldsymbol{p}_n=\boldsymbol{x}^{\left( n-1 \right)}+\xi _n\boldsymbol{p}_n
\]
Note that:
\[
\boldsymbol{P}_n=\boldsymbol{V}_n{\boldsymbol{U}_n}^{-1}\Leftrightarrow \boldsymbol{P}_n\boldsymbol{U}_n=\boldsymbol{V}_n,
\]
\[
\left[ \begin{matrix}
\boldsymbol{p}_1& \boldsymbol{p}_2& \cdots& \boldsymbol{p}_n\\
\end{matrix} \right] \cdot \left[ \begin{matrix}
\eta _1& \beta _2& 0& \cdots& \boldsymbol{O}\\
0& \eta _2& \beta _3& \ddots& \vdots\\
& \ddots& \ddots& \ddots& 0\\
& & \ddots& \ddots& \beta _n\\
\boldsymbol{O}& & & 0& \eta _n\\
\end{matrix} \right] =\left[ \begin{matrix}
\boldsymbol{v}_1& \boldsymbol{v}_2& \cdots& \boldsymbol{v}_n\\
\end{matrix} \right] ,
\]
\[
\boldsymbol{v}_n=\beta _n\boldsymbol{p}_{n-1}+\eta _n\boldsymbol{p}_n
\]
\[
\Rightarrow \boldsymbol{p}_n=\frac{1}{\eta _n}\left( \boldsymbol{v}_n-\beta _n\boldsymbol{p}_{n-1} \right)
\]
Claim
: \(\boldsymbol{v}_n\parallel \boldsymbol{r}^{\left( n-1 \right)}\), the residual after \(n-1\) steps. (Both vectors lie in \(\mathbf{K}_n\) and, by the Galerkin condition, are orthogonal to \(\mathbf{K}_{n-1}\); this can also be proved by induction.)
Then, for some scalar \(\gamma _n\):
\[
\boldsymbol{p}_n=\frac{1}{\eta _n}\left( \gamma _n\boldsymbol{r}^{\left( n-1 \right)}-\beta _n\boldsymbol{p}_{n-1} \right)
\]
At this point, the projection method becomes the CG algorithm: each new search direction \(\boldsymbol{p}_n\) is a linear combination of the current residual and the previous search direction, and each iterate is updated as \(\boldsymbol{x}^{\left( n \right)}=\boldsymbol{x}^{\left( n-1 \right)}+\xi _n\boldsymbol{p}_n\).
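As a concrete reference, here is a minimal NumPy sketch of the standard CG iteration that this derivation arrives at (the scalar formulas below are the usual equivalent expressions for the step length and the direction update, not a literal transcription of \(\xi _n,\eta _n,\gamma _n\); names are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Standard CG for SPD A."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()                       # first search direction is parallel to r^(0)
    rs_old = r @ r
    max_iter = max_iter if max_iter is not None else len(b)
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step length (the xi_n of the derivation)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # new direction: current residual plus a multiple of the old direction
        rs_old = rs_new
    return x
```

In exact arithmetic the iterates produced this way coincide with the projected iterates \(\boldsymbol{x}^{\left( n \right)}=\boldsymbol{x}^{\left( 0 \right)}+\boldsymbol{V}_n\boldsymbol{y}^{\left( n \right)}\) above.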
Properties of Conjugate Gradient Method
Basic Property
Theorem
: In the CG method,
- \(\left< \boldsymbol{r}^{\left( j \right)},\boldsymbol{r}^{\left( i \right)} \right> =0, i\ne j\);
- \(\left< \boldsymbol{Ap}^{\left( j \right)},\boldsymbol{p}^{\left( i \right)} \right> =0,i\ne j\).
Proof
: Note that:
\[
\boldsymbol{P}_n=\boldsymbol{V}_n{\boldsymbol{U}_n}^{-1},
\]
\[
{\boldsymbol{P}_n}^T\boldsymbol{AP}_n=\left( {\boldsymbol{U}_n}^{-1} \right) ^T{\boldsymbol{V}_n}^T\boldsymbol{AV}_n{\boldsymbol{U}_n}^{-1}
\]
\[
=\left( {\boldsymbol{U}_n}^{-1} \right) ^T\boldsymbol{T}_n{\boldsymbol{U}_n}^{-1}=\left( {\boldsymbol{U}_n}^{-1} \right) ^T\boldsymbol{L}_n\boldsymbol{U}_n{\boldsymbol{U}_n}^{-1}=\left( {\boldsymbol{U}_n}^{-1} \right) ^T\boldsymbol{L}_n
\]
Since \(\left( {\boldsymbol{U}_n}^{-1} \right) ^T\boldsymbol{L}_n\) is lower triangular and \({\boldsymbol{P}_n}^T\boldsymbol{AP}_n\) is symmetric, \({\boldsymbol{P}_n}^T\boldsymbol{AP}_n\) must be diagonal. Therefore,
\[
{\boldsymbol{p}_i}^T\boldsymbol{Ap}_j=0, i\ne j\Longleftrightarrow \left< \boldsymbol{Ap}^{\left( j \right)},\boldsymbol{p}^{\left( i \right)} \right> =0,i\ne j
\]
The first statement follows from the Galerkin condition: \(\boldsymbol{r}^{\left( j \right)}\perp \mathbf{K}_j\), while \(\boldsymbol{r}^{\left( i \right)}\in \mathbf{K}_{i+1}\subseteq \mathbf{K}_j\) for \(i<j\). End of proof.
We call \(\boldsymbol{p}^{\left( j \right)}\) and \(\boldsymbol{p}^{\left( i \right)}\) (\(\boldsymbol{A}\)-)conjugate to each other.
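Both properties are easy to verify numerically. A small self-contained sketch on a random SPD system (illustrative; the update rules are the standard CG recurrences):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)           # SPD by construction
b = rng.standard_normal(m)

x = np.zeros(m)
r = b - A @ x
p = r.copy()
R, P = [r.copy()], [p.copy()]
for _ in range(8):                    # a few CG steps, storing residuals and directions
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    R.append(r.copy()); P.append(p.copy())

R, P = np.column_stack(R), np.column_stack(P)
offdiag = lambda M: M - np.diag(np.diag(M))
print(np.max(np.abs(offdiag(R.T @ R))))       # ~0: <r^(i), r^(j)> = 0 for i != j
print(np.max(np.abs(offdiag(P.T @ A @ P))))   # ~0: <A p^(j), p^(i)> = 0 for i != j
```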
Theorem
: Assume \(\boldsymbol{A}\) is SPD and \(\mathbf{L}=\mathbf{K}\). Then a vector \(\boldsymbol{x}\) is the result of the projection method onto \(\mathbf{K}\), orthogonal to \(\mathbf{L}\), with initial guess \(\boldsymbol{x}^{\left( 0 \right)}\), if and only if \(\boldsymbol{x}\) minimizes the \(\boldsymbol{A}\)-norm of the error over \(\boldsymbol{x}^{\left( 0 \right)}+\mathbf{K}\), i.e., minimizes
\[
E\left( \boldsymbol{y} \right) =\left\| \boldsymbol{x}^*-\boldsymbol{y} \right\| _{\boldsymbol{A}}^{2}=\left( \boldsymbol{x}^*-\boldsymbol{y} \right) ^T\boldsymbol{A}\left( \boldsymbol{x}^*-\boldsymbol{y} \right)
\]
where \(\boldsymbol{x}^*\) is the true solution of \(\boldsymbol{Ax}=\boldsymbol{b}\), and:
\[
\boldsymbol{x}=\mathrm{arg}\min E\left( \boldsymbol{y} \right) , \boldsymbol{y}\in \boldsymbol{x}^{\left( 0 \right)}+\mathbf{K}
\]
Proof
(idea): Show that for every \(\boldsymbol{v}\in \mathbf{K}=\mathbf{L}\), the directional derivative of \(E\) at \(\boldsymbol{y}=\boldsymbol{x}\) vanishes:
\[
\left. \frac{d}{dt}E\left( \boldsymbol{x}+t\boldsymbol{v} \right) \right|_{t=0}=-2\boldsymbol{v}^T\boldsymbol{r}=0, \forall \boldsymbol{v}\in \mathbf{K}=\mathbf{L}, \boldsymbol{r}=\boldsymbol{b}-\boldsymbol{Ax},
\]
which is exactly the Galerkin condition \(\boldsymbol{r}\perp \mathbf{L}\).
Convergence Rate
Theorem
: In exact arithmetic, the CG method converges in at most \(m\) steps if \(\boldsymbol{A}\in \mathbb{R} ^{m\times m}\) is SPD.
Theorem
: In general, the CG method converges according to the following estimate ( \(\boldsymbol{A}\) is SPD):
\[
\left\| \boldsymbol{x}^{\left( n \right)}-\boldsymbol{x}^{\left( * \right)} \right\| _{\boldsymbol{A}}\leqslant 2\left( \frac{\sqrt{\kappa \left( \boldsymbol{A} \right)}-1}{\sqrt{\kappa \left( \boldsymbol{A} \right)}+1} \right) ^n\cdot \left\| \boldsymbol{x}^{\left( 0 \right)}-\boldsymbol{x}^{\left( * \right)} \right\| _{\boldsymbol{A}}
\]
Theorem
: The Steepest Descent Method converges with the following estimate ( \(\boldsymbol{A}\) is SPD):
\[
\left\| \boldsymbol{x}^{\left( n \right)}-\boldsymbol{x}^{\left( * \right)} \right\| _{\boldsymbol{A}}\leqslant \left( \frac{\kappa \left( \boldsymbol{A} \right) -1}{\kappa \left( \boldsymbol{A} \right) +1} \right) ^n\cdot \left\| \boldsymbol{x}^{\left( 0 \right)}-\boldsymbol{x}^{\left( * \right)} \right\| _{\boldsymbol{A}}
\]
The difference between the convergence rates of CG and the Steepest Descent Method is significant.
Example
: If \(\kappa \left( \boldsymbol{A} \right) =999\), then:
\[
\frac{\kappa \left( \boldsymbol{A} \right) -1}{\kappa \left( \boldsymbol{A} \right) +1}=0.998,
\]
\[
n=100: \left( 0.998 \right) ^{100}\approx 1-100\times 0.002=0.8;
\]
\[
\frac{\sqrt{\kappa \left( \boldsymbol{A} \right)}-1}{\sqrt{\kappa \left( \boldsymbol{A} \right)}+1}\approx 0.94,
\]
\[
n=100: \left( 0.94 \right) ^{100}\approx 2\times 10^{-3}
\]
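The arithmetic in this example is easy to reproduce; a tiny sketch comparing the two rate factors for a given condition number (illustrative):

```python
import numpy as np

kappa = 999.0
sd_factor = (kappa - 1) / (kappa + 1)                      # Steepest Descent rate
cg_factor = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)    # CG rate
print(sd_factor, sd_factor ** 100)   # ~0.998, ~0.82
print(cg_factor, cg_factor ** 100)   # ~0.94,  ~2e-3
```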
Preconditioning Techniques: reduce the condition number of the linear system so that convergence is faster.
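For instance, with a simple Jacobi (diagonal) preconditioner \(\boldsymbol{M}=\mathrm{diag}\left( \boldsymbol{A} \right)\), one can compare \(\kappa \left( \boldsymbol{A} \right)\) with the condition number of the symmetrically preconditioned matrix \(\boldsymbol{M}^{-1/2}\boldsymbol{AM}^{-1/2}\). A small sketch (illustrative construction; how much \(\kappa\) drops depends strongly on \(\boldsymbol{A}\)):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
D = np.diag(10.0 ** rng.uniform(-3, 3, m))       # widely varying scales (illustrative)
B = rng.standard_normal((m, m))
A = D @ (B @ B.T + m * np.eye(m)) @ D            # badly scaled SPD matrix

M_inv_half = np.diag(1.0 / np.sqrt(np.diag(A)))  # Jacobi preconditioner M = diag(A)
A_prec = M_inv_half @ A @ M_inv_half

print(np.linalg.cond(A))        # very large condition number
print(np.linalg.cond(A_prec))   # typically much smaller for badly scaled A
```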