
Least Squares Problems

Given \(\boldsymbol{A}\in \mathbb{R} ^{m\times n}\) with \(m\geqslant n\), and \(\boldsymbol{b}\in \mathbb{R} ^m\), the least squares problem is defined as: find \(\boldsymbol{x}^*\in \mathbb{R} ^n\) such that:

\[ \boldsymbol{x}^*=\mathrm{arg}\min_{\boldsymbol{x}} \left\| \boldsymbol{b}-\boldsymbol{Ax} \right\| _2 \]

where \(\boldsymbol{r}=\boldsymbol{b}-\boldsymbol{Ax}\) is called the residual. Since the system is overdetermined (\(m\geqslant n\)), in general the residual \(\boldsymbol{r}\ne \mathbf{0}\).

We will introduce three different methods to solve the least squares problem.

Normal Equation

Traditional Method

We solve the normal equation \(\boldsymbol{A}^*\boldsymbol{Ax}=\boldsymbol{A}^*\boldsymbol{b}\), where \(\boldsymbol{A}^*\boldsymbol{A}\in \mathbb{R} ^{n\times n}\). If \(\boldsymbol{A}\) has full (column) rank, then \(\boldsymbol{A}^*\boldsymbol{A}\) is invertible.

The Least Square solution is:

\[ \boldsymbol{x}^*=\left( \boldsymbol{A}^*\boldsymbol{A} \right) ^{-1}\boldsymbol{A}^*\boldsymbol{b} \]

Question: what if \(\boldsymbol{A}\) is not full rank?

Comment: this method is not recommended if \(\kappa \left( \boldsymbol{A} \right)\) is large.

Note: \(\kappa \left( \boldsymbol{A}^*\boldsymbol{A} \right) =\left( \kappa \left( \boldsymbol{A} \right) \right) ^2\), so forming the normal equation squares the condition number.

In practice, the number of rows is usually much larger than the number of columns.
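
For illustration, here is a minimal NumPy sketch of the normal-equation approach (the names `A`, `b`, `x_star` and the random test data are assumptions for this sketch, not part of the notes):

```python
# Normal-equation least squares: a minimal sketch, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5                        # many more rows than columns, as in practice
A = rng.standard_normal((m, n))      # full column rank with probability 1
b = rng.standard_normal(m)

# Solve A* A x = A* b directly, without forming (A* A)^{-1} explicitly.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# Sanity checks: agreement with NumPy's least-squares solver, and the
# condition-number squaring noted above.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_star, x_ref))
print(np.isclose(np.linalg.cond(A.T @ A), np.linalg.cond(A) ** 2))
```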

QR Factorization

Compute a QR factorization of \(\boldsymbol{A}\); since \(\boldsymbol{Q}\) is unitary, multiplying by \(\boldsymbol{Q}^*\) preserves the 2-norm:

\[ \boldsymbol{A}=\boldsymbol{QR}\,\,\Longleftrightarrow \boldsymbol{Q}^*\boldsymbol{A}=\boldsymbol{R} \]
\[ \left\| \boldsymbol{Ax}-\boldsymbol{b} \right\| _2=\left\| \boldsymbol{Q}^*\left( \boldsymbol{Ax}-\boldsymbol{b} \right) \right\| _2 \]
\[ =\left\| \boldsymbol{Q}^*\boldsymbol{Ax}-\boldsymbol{Q}^*\boldsymbol{b} \right\| _2=\left\| \boldsymbol{Rx}-\boldsymbol{Q}^*\boldsymbol{b} \right\| _2 \]

Let \(\boldsymbol{c}=\boldsymbol{Q}^*\boldsymbol{b}\). Taking \(\boldsymbol{R}\in \mathbb{R} ^{n\times n}\) as the upper-triangular block of the reduced QR factorization (invertible when \(\boldsymbol{A}\) has full rank), the minimizer is:

\[ \boldsymbol{x}^*=\boldsymbol{R}^{-1}\boldsymbol{c}=\boldsymbol{R}^{-1}\boldsymbol{Q}^*\boldsymbol{b} \]
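
A minimal sketch of the QR route, assuming NumPy/SciPy (`lstsq_qr` is a made-up name); `np.linalg.qr` returns the reduced factorization by default, so `R` is the invertible \(n\times n\) upper-triangular block used above:

```python
# Least squares via reduced QR: a sketch, assuming NumPy and SciPy.
import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(A, b):
    Q, R = np.linalg.qr(A)           # reduced QR: Q is m x n, R is n x n
    c = Q.T @ b                      # c = Q* b
    return solve_triangular(R, c)    # back-substitution solves R x = c

# Usage (with A, b as in the previous sketch):
# x_star = lstsq_qr(A, b)            # matches np.linalg.lstsq(A, b, rcond=None)[0]
```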

SVD

Assume \(\boldsymbol{A}=\boldsymbol{U}\mathbf{\Sigma }\boldsymbol{V}^*\) is the full SVD of \(\boldsymbol{A}\). Since \(\boldsymbol{U}\) and \(\boldsymbol{V}\) are unitary, we get:

\[ \boldsymbol{A}=\boldsymbol{U}\mathbf{\Sigma }\boldsymbol{V}^*\Leftrightarrow \boldsymbol{U}^*\boldsymbol{AV}=\mathbf{\Sigma }; \]
\[ \left\| \boldsymbol{Ax}-\boldsymbol{b} \right\| _2=\left\| \boldsymbol{U}^*\boldsymbol{AVV}^*\boldsymbol{x}-\boldsymbol{U}^*\boldsymbol{b} \right\| _2=\left\| \mathbf{\Sigma }\boldsymbol{V}^*\boldsymbol{x}-\boldsymbol{U}^*\boldsymbol{b} \right\| _2 \]

Define \(\boldsymbol{V}^*\boldsymbol{x}=\boldsymbol{y},\ \boldsymbol{w}=\boldsymbol{U}^*\boldsymbol{b}\), then:

\[ \left\| \boldsymbol{Ax}-\boldsymbol{b} \right\| _2=\left\| \mathbf{\Sigma }\boldsymbol{y}-\boldsymbol{w} \right\| _2; \]
\[ \mathrm{arg}\min_{\boldsymbol{y}} \left\| \mathbf{\Sigma }\boldsymbol{y}-\boldsymbol{w} \right\| _2 \]
\[ \Longrightarrow \boldsymbol{y}^*=\mathbf{\Sigma }^{-1}\boldsymbol{w}=\mathbf{\Sigma }^{-1}\boldsymbol{U}^*\boldsymbol{b}, \]
\[ \boldsymbol{x}^*=\boldsymbol{Vy}^*=\boldsymbol{V}\mathbf{\Sigma }^{-1}\boldsymbol{U}^*\boldsymbol{b} \]

We get the formula:

\[ \boldsymbol{x}^*=\boldsymbol{V}\mathbf{\Sigma }^{-1}\boldsymbol{U}^*\boldsymbol{b} \]

However, \(\mathbf{\Sigma }^{-1}\) may not exist! Assume:

\[ \mathbf{\Sigma }=\left[ \begin{matrix} \sigma _1& & & & & \\ & \ddots& & & & \\ & & \sigma _r& & & \\ & & & 0& & \\ & & & & \ddots& \\ & & & & & 0\\ \end{matrix} \right] , \quad \sigma _1\geqslant \sigma _2\geqslant \cdots \geqslant \sigma _r>\sigma _{r+1}=\cdots =\sigma _n=0 \]

The pseudoinverse of \(\mathbf{\Sigma }\) (commonly written \(\mathbf{\Sigma }^{+}\), here still denoted \(\mathbf{\Sigma }^{-1}\)) is defined as:

\[ \mathbf{\Sigma }^{-1}=\left[ \begin{matrix} \frac{1}{\sigma _1}& & & & & \\ & \ddots& & & & \\ & & \frac{1}{\sigma _r}& & & \\ & & & 0& & \\ & & & & \ddots& \\ & & & & & 0\\ \end{matrix} \right] \]

With this pseudoinverse, \(\boldsymbol{x}^*=\boldsymbol{V}\mathbf{\Sigma }^{-1}\boldsymbol{U}^*\boldsymbol{b}\) still holds; the same formula can equally be applied with the reduced SVD.
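
A sketch of the SVD-based solve in NumPy, with the pseudoinverse realized by zeroing the reciprocals of (numerically) zero singular values; the tolerance `tol` and the function name are assumptions of this sketch:

```python
# Least squares via the reduced SVD with a pseudoinverse cutoff: a sketch.
import numpy as np

def lstsq_svd(A, b, tol=1e-12):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V*
    s_inv = np.zeros_like(s)
    keep = s > tol * s[0]                              # treat tiny sigma_i as zero
    s_inv[keep] = 1.0 / s[keep]                        # pseudoinverse of Sigma
    return Vt.T @ (s_inv * (U.T @ b))                  # x* = V Sigma^{-1} U* b

# Usage:
# x_star = lstsq_svd(A, b)            # agrees with np.linalg.pinv(A) @ b
```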

Equivalently, the result can be written as:

\[ \boldsymbol{x}^*=\sum_{i=1}^r{\frac{{\boldsymbol{u}_i}^*\boldsymbol{b}}{\sigma _i}\boldsymbol{v}_i} \]

where \(\boldsymbol{u}_i, \boldsymbol{v}_i\) are columns of \(\boldsymbol{U}, \boldsymbol{V}\) respectively.

Comment: if \(\sigma _i\) is small, errors in \(\boldsymbol{b}\) are amplified by a factor of \(1/\sigma _i\).
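
A small numerical illustration of this amplification (the matrices and numbers below are made up for the demo): a perturbation of \(\boldsymbol{b}\) along \(\boldsymbol{u}_i\) changes \(\boldsymbol{x}^*\) by a factor \(1/\sigma _i\) along \(\boldsymbol{v}_i\).

```python
# Error amplification by 1/sigma_i: a sketch, assuming NumPy.
import numpy as np

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((50, 3)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
s = np.array([1.0, 1e-1, 1e-6])                     # one very small singular value
A = U @ np.diag(s) @ V.T

b = rng.standard_normal(50)
db = 1e-8 * U[:, 2]                                 # tiny perturbation along u_3

x  = np.linalg.lstsq(A, b, rcond=None)[0]
xp = np.linalg.lstsq(A, b + db, rcond=None)[0]
print(np.linalg.norm(xp - x))                       # about 1e-8 / 1e-6 = 1e-2
```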

We can instead compute a truncated approximation ("the best approximation of \(\boldsymbol{x}^*\) in \(\mathbb{R} ^k\)"):

\[ \boldsymbol{x}_k^*=\sum_{i=1}^k{\frac{{\boldsymbol{u}_i}^*\boldsymbol{b}}{\sigma _i}\boldsymbol{v}_i}, \quad k<r \]

This is the idea behind Principal Component Analysis (where to cut off?).
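
A sketch of this truncated solution, assuming NumPy (the function name and the choice of `k` are illustrative); only the `k` largest singular values are kept:

```python
# Truncated SVD least squares: keep only the k dominant singular directions; a sketch.
import numpy as np

def lstsq_svd_truncated(A, b, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # x_k* = sum_{i=1}^{k} (u_i* b / sigma_i) v_i
    return Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])
```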