Steepest Descent Method⚓︎
Introduction to Steepest Descent⚓︎
Example
: Assume \(\boldsymbol{A}\) is SPD, and \(\boldsymbol{r}=\boldsymbol{b}-\boldsymbol{Ax}\) is the current residual. We pick \(\mathbf{L}=\mathbf{K}=\mathrm{span}\left\{ \boldsymbol{r} \right\}\) if \(\boldsymbol{r} \ne 0\). Then the new approximation has the form:
\[
\tilde{\boldsymbol{x}}=\boldsymbol{x}+\alpha \boldsymbol{r}.
\]
Also, the new residual is:
\[
\tilde{\boldsymbol{r}}=\boldsymbol{b}-\boldsymbol{A}\tilde{\boldsymbol{x}}=\boldsymbol{r}-\alpha \boldsymbol{Ar}.
\]
Note that the orthogonality condition \(\tilde{\boldsymbol{r}}\perp \mathbf{L}\) requires:
\[
\left( \boldsymbol{r}-\alpha \boldsymbol{Ar} \right) ^T\boldsymbol{r}=0.
\]
We get:
\[
\alpha =\frac{\boldsymbol{r}^T\boldsymbol{r}}{\boldsymbol{r}^T\boldsymbol{Ar}}.
\]
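As a quick numerical check of this derivation (a minimal NumPy sketch; the \(2\times 2\) SPD matrix and right-hand side are arbitrary choices, not from the notes), one projection step with this \(\alpha\) indeed makes the new residual orthogonal to \(\boldsymbol{r}\):

```python
import numpy as np

# Arbitrary small SPD test problem (illustration only, not from the notes).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)                      # current approximation

r = b - A @ x                        # current residual
alpha = (r @ r) / (r @ (A @ r))      # step length derived above

x_new = x + alpha * r                # one projection step
r_new = b - A @ x_new                # new residual

# Galerkin condition: the new residual is orthogonal to span{r}.
print(r_new @ r)                     # ~ 0 (up to rounding)
```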
Algorithm
(Steepest Descent Method):
For \(k=1,2,\cdots\):
- \(\boldsymbol{r}^{\left( k \right)}=\boldsymbol{b}-\boldsymbol{Ax}^{\left( k \right)}\);
- \(\alpha ^{\left( k \right)}=\frac{\left( \boldsymbol{r}^{\left( k \right)} \right) ^T\boldsymbol{r}^{\left( k \right)}}{\left( \boldsymbol{r}^{\left( k \right)} \right) ^T\boldsymbol{Ar}^{\left( k \right)}}\);
- \(\boldsymbol{x}^{\left( k+1 \right)}=\boldsymbol{x}^{\left( k \right)}+\alpha ^{\left( k \right)}\boldsymbol{r}^{\left( k \right)}\);
End
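A direct NumPy translation of this loop might look as follows (a sketch: the function name `steepest_descent`, the tolerance `tol`, and the iteration cap `max_iter` are my own additions, since the notes do not specify a stopping test):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=10_000):
    """Solve A x = b for SPD A by the steepest descent iteration above."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = b - A @ x                    # r^(k) = b - A x^(k)
        if np.linalg.norm(r) < tol:      # stopping test (my addition)
            break
        alpha = (r @ r) / (r @ (A @ r))  # alpha^(k) = (r^T r) / (r^T A r)
        x = x + alpha * r                # x^(k+1) = x^(k) + alpha^(k) r^(k)
    return x

# Example usage on a small SPD system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = steepest_descent(A, b, np.zeros(2))
print(np.allclose(A @ x, b))             # True
```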
Discussion on Steepest Descent⚓︎
Question
: Why is it called "Steepest Descent"? Why is \(\alpha\) optimal?
First Perspective (From Error Viewpoint)⚓︎
Assume \(\boldsymbol{x}^*\) is the true solution that we want to compute. We define the error as \(\boldsymbol{e}=\boldsymbol{x}-\boldsymbol{x}^*\). Also define \(f\left( \boldsymbol{x} \right) =\boldsymbol{e}^T\boldsymbol{Ae}\triangleq \left\| \boldsymbol{e} \right\| _{\boldsymbol{A}}^{2}\), the squared \(\boldsymbol{A}\)-norm of the error (since \(\boldsymbol{A}\) is SPD here, \(\left\| \cdot \right\| _{\boldsymbol{A}}\) is indeed a norm).
Note that since \(f\left( \boldsymbol{x} \right)\) is a strictly convex quadratic function (\(\boldsymbol{A}\) is SPD), it has a unique minimizer, namely \(\boldsymbol{x}^*\), satisfying:
\[
f\left( \boldsymbol{x}^* \right) =\underset{\boldsymbol{x}}{\min}\,f\left( \boldsymbol{x} \right) =0.
\]
We use gradient descent: to find the minimizer, we search along the direction of the negative gradient:
\[
-\nabla f\left( \boldsymbol{x} \right) =-2\boldsymbol{A}\left( \boldsymbol{x}-\boldsymbol{x}^* \right) =-2\left( \boldsymbol{Ax}-\boldsymbol{b} \right) =2\boldsymbol{r}.
\]
This means that the residual and negative gradient have the same direction.
What is the best step size? We would like to solve the one-dimensional minimization problem:
\[
\underset{\alpha}{\min}\,f\left( \boldsymbol{x}+\alpha \boldsymbol{r} \right) .
\]
Therefore, we want to find \(\alpha\) such that \(\frac{\mathrm{d}}{\mathrm{d}\alpha}\left( f\left( \boldsymbol{x}+\alpha \boldsymbol{r} \right) \right) =0\). It turns out that (see the derivation sketched below):
\[
\alpha =\frac{\boldsymbol{r}^T\boldsymbol{r}}{\boldsymbol{r}^T\boldsymbol{Ar}}.
\]
This is the optimal solution for \(\alpha\).
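Sketch of the derivation, using \(\boldsymbol{Ae}=\boldsymbol{A}\left( \boldsymbol{x}-\boldsymbol{x}^* \right) =\boldsymbol{Ax}-\boldsymbol{b}=-\boldsymbol{r}\):
\[
\begin{aligned}
f\left( \boldsymbol{x}+\alpha \boldsymbol{r} \right) &=\left( \boldsymbol{e}+\alpha \boldsymbol{r} \right) ^T\boldsymbol{A}\left( \boldsymbol{e}+\alpha \boldsymbol{r} \right)\\
&=\boldsymbol{e}^T\boldsymbol{Ae}+2\alpha \boldsymbol{r}^T\boldsymbol{Ae}+\alpha ^2\boldsymbol{r}^T\boldsymbol{Ar}\\
&=\left\| \boldsymbol{e} \right\| _{\boldsymbol{A}}^{2}-2\alpha \boldsymbol{r}^T\boldsymbol{r}+\alpha ^2\boldsymbol{r}^T\boldsymbol{Ar},
\end{aligned}
\]
so \(\frac{\mathrm{d}}{\mathrm{d}\alpha}f\left( \boldsymbol{x}+\alpha \boldsymbol{r} \right) =-2\boldsymbol{r}^T\boldsymbol{r}+2\alpha \boldsymbol{r}^T\boldsymbol{Ar}=0\) gives exactly the \(\alpha\) above.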
Second Perspective (Quadratic Optimization)⚓︎
We define the quadratic objective:
\[
g\left( \boldsymbol{x} \right) =\frac{1}{2}\boldsymbol{x}^T\boldsymbol{Ax}-\boldsymbol{b}^T\boldsymbol{x}.
\]
Note that:
\[
\nabla g\left( \boldsymbol{x} \right) =\boldsymbol{Ax}-\boldsymbol{b}=-\boldsymbol{r},
\]
so minimizing \(g\) is equivalent to solving \(\boldsymbol{Ax}=\boldsymbol{b}\), and the steepest descent direction of \(g\) at \(\boldsymbol{x}\) is again \(\boldsymbol{r}\).
It turns out that:
\[
g\left( \boldsymbol{x}+\alpha \boldsymbol{r} \right) =g\left( \boldsymbol{x} \right) -\alpha \boldsymbol{r}^T\boldsymbol{r}+\frac{\alpha ^2}{2}\boldsymbol{r}^T\boldsymbol{Ar}.
\]
Similarly, setting \(\frac{\mathrm{d}}{\mathrm{d}\alpha}g\left( \boldsymbol{x}+\alpha \boldsymbol{r} \right) =0\), we get that:
\[
\alpha =\frac{\boldsymbol{r}^T\boldsymbol{r}}{\boldsymbol{r}^T\boldsymbol{Ar}}
\]
is the optimal choice along \(\boldsymbol{r}\).
We can examine the level sets of \(f\left( \boldsymbol{x} \right)\) or \(g\left( \boldsymbol{x} \right)\), which are concentric ellipsoids centered at \(\boldsymbol{x}^*\) when \(\boldsymbol{A}\) is SPD, to understand the geometry of the iteration.
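A small NumPy sanity check of this second perspective (a sketch; the SPD test matrix, the point \(\boldsymbol{x}\), and the finite-difference step are arbitrary illustrative choices): the residual equals \(-\nabla g\left( \boldsymbol{x} \right)\), and the closed-form \(\alpha\) minimizes \(g\) along \(\boldsymbol{r}\):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])       # arbitrary SPD test matrix
b = np.array([1.0, 2.0])
x = np.array([0.3, -0.7])                    # arbitrary current point

def g(y):
    """Quadratic objective g(y) = 1/2 y^T A y - b^T y."""
    return 0.5 * y @ A @ y - b @ y

r = b - A @ x                                # residual at x

# Check grad g(x) = A x - b = -r via central finite differences.
eps = 1e-6
grad = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(2)])
print(np.allclose(grad, -r, atol=1e-5))      # True

# Check that the closed-form alpha minimizes g along the direction r.
alpha = (r @ r) / (r @ (A @ r))
grid = np.linspace(alpha - 1.0, alpha + 1.0, 2001)
best = grid[np.argmin([g(x + a * r) for a in grid])]
print(np.isclose(best, alpha, atol=1e-3))    # True
```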
Further Remarks on Steepest Descent⚓︎
Consider two cases for \(\boldsymbol{A}\):
Case 1
: Eigenvalues satisfy \(\frac{\lambda _{\max}}{\lambda _{\min}}\sim O\left( 1 \right)\) (the level sets are nearly circular; fast convergence);

Case 2
: Eigenvalues satisfy \(\frac{\lambda _{\max}}{\lambda _{\min}}\gg 1\) (the level sets are very elongated ellipses; slow convergence).
Conclusion
: The condition number of \(\boldsymbol{A}\) determines the speed of convergence (for SPD \(\boldsymbol{A}\), \(\kappa \left( \boldsymbol{A} \right) =\frac{\lambda _{\max}}{\lambda _{\min}}\)):
- If \(\kappa \left( \boldsymbol{A} \right) \gg 1\), we call \(\boldsymbol{A}\) ill-conditioned;
- Otherwise, we call \(\boldsymbol{A}\) well-conditioned.
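To see the conclusion numerically, here is a small experiment (a sketch; the helper `sd_iterations`, the diagonal test matrices, and the tolerance are illustrative choices, not from the notes) comparing a well-conditioned and an ill-conditioned SPD matrix:

```python
import numpy as np

def sd_iterations(A, b, tol=1e-8, max_iter=100_000):
    """Count steepest descent iterations until ||r|| < tol."""
    x = np.zeros_like(b)
    for k in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            return k
        x = x + (r @ r) / (r @ (A @ r)) * r
    return max_iter

b = np.ones(2)
A_well = np.diag([1.0, 2.0])     # lambda_max / lambda_min = 2   (well-conditioned)
A_ill  = np.diag([1.0, 200.0])   # lambda_max / lambda_min = 200 (ill-conditioned)

print(sd_iterations(A_well, b))  # few iterations
print(sd_iterations(A_ill,  b))  # many more iterations
```

The ill-conditioned case requires far more iterations, consistent with the classical steepest descent bound \(\left\| \boldsymbol{e}^{\left( k+1 \right)} \right\| _{\boldsymbol{A}}\le \frac{\kappa \left( \boldsymbol{A} \right) -1}{\kappa \left( \boldsymbol{A} \right) +1}\left\| \boldsymbol{e}^{\left( k \right)} \right\| _{\boldsymbol{A}}\) (a known result, not stated in these notes).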