A = [1 1; 1 (1-1e-12)]
b = [0; 1e-12]
x = A\b
2-element Vector{Float64}:
1.0000221222095027
-1.0000221222095027
Cours ENPC — Pratique du calcul scientifique
Goal of this chapter
Study numerical methods for the linear equation \[ \mathsf A \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{b}}, \] where \(\mathsf A \in \mathbf R^{n \times n}\) and \(\mathbf{\boldsymbol{b}} \in \mathbf R^n.\)
Two classes of methods:
Direct methods:
LU decomposition (for general invertible matrices);
Cholesky decomposition (for symmetric positive definite matrices)
Iterative methods:
Basic iterative methods based on a splitting;
Conjugate gradients;
And many more: GMRES, BiCGStab, etc.
Before discussing these methods, we introduce the concept of conditioning.
In Julia, we can calculate the solution with the \ operator:
2-element Vector{Float64}:
1.0000221222095027
-1.0000221222095027
Why is the relative error much larger than the machine epsilon?
Question: Can we estimate \(\frac{{\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}}\) in terms of \(\frac{{\lVert {\Delta \mathsf A} \rVert}}{{\lVert {\mathsf A} \rVert}}\) and \(\frac{{\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}}\)?
Given \(p \in [1, \infty]\), the \(p\)-norm of a vector \(\mathbf{\boldsymbol{x}} \in \mathbf R^n\) is defined as follows: \[
{\lVert {\mathbf{\boldsymbol{x}}} \rVert}_p :=
\begin{cases}
\left( \sum_{i=1}^{n} {\lvert {x_i} \rvert}^p \right)^{\frac{1}{p}} & \text{if $p < \infty$}, \\
\max \Bigl\{ {\lvert {x_1} \rvert}, \dotsc, {\lvert {x_n} \rvert} \Bigr\} & \text{if $p = \infty$}.
\end{cases}
\] In Julia, calculate \({\lVert {\mathbf{\boldsymbol{x}}} \rVert}_p\) using norm(x, p) from the LinearAlgebra standard library.
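As a quick check of the definition, the following minimal sketch evaluates the three most common \(p\)-norms of a small vector. The vector x is an arbitrary choice made for illustration and does not come from the course.

```julia
using LinearAlgebra

x = [3.0, -4.0]   # illustrative vector (assumption)

n1   = norm(x, 1)     # |3| + |-4| = 7
n2   = norm(x, 2)     # sqrt(3^2 + 4^2) = 5
ninf = norm(x, Inf)   # max(|3|, |-4|) = 4
```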
using LaTeXStrings  # provides the L"..." string macro used in the labels
import Plots
Plots.plot(aspect_ratio=:equal, xlims=(-1.1,1.1), ylims=(-1.1, 1.1))
Plots.plot!(framestyle=:origin, grid=true, legend=:outertopright)
Plots.plot!(title="Unit circle in different norms")
Plots.plot!([0, 1, 0, -1, 0], [-1, 0, 1, 0, -1], label=L"$p = 1$ (taxicab norm)")
Plots.plot!(t -> cos(t), t -> sin(t), 0, 2π, label=L"$p = 2$ (Euclidean norm)")
Plots.plot!([1, 1, -1, -1, 1], [1, -1, -1, 1, 1], label=L"p = \infty")
The operator norm on matrices induced by the vector \(p\)-norm is given by \[
{\lVert {\mathsf A} \rVert}_{p} := \sup_{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}_{p} \leqslant 1} {\lVert {\mathsf A \mathbf{\boldsymbol{x}}} \rVert}_{p}
\] It follows from the definition that \({\lVert {\mathsf A \mathsf B} \rVert}_p \leqslant{\lVert {\mathsf A} \rVert}_p {\lVert {\mathsf B} \rVert}_p\). In Julia, calculate \({\lVert {\mathsf A} \rVert}_p\) using opnorm(A, p).
Exercises
Show that
The matrix 2-norm is given by \(\sqrt{\lambda_{\rm max}(\mathsf A^* \mathsf A)}\).
The matrix 1-norm is the maximum absolute column sum:
\[{\lVert {\mathsf A} \rVert}_1 = \max_{1 \leqslant j \leqslant n} \sum_{i=1}^{n} {\lvert {a_{ij}} \rvert}.\]
The matrix \(\infty\)-norm is the maximum absolute row sum:
\[{\lVert {\mathsf A} \rVert}_{\infty} = \max_{1 \leqslant i \leqslant n} \sum_{j=1}^{n} {\lvert {a_{ij}} \rvert}.\]
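The three identities can be checked numerically before proving them. The sketch below does this for one small matrix chosen for illustration; it is a sanity check under that assumption, not a proof.

```julia
using LinearAlgebra

A = [1.0 -2.0; 3.0 4.0]   # illustrative matrix (assumption)

# 1-norm: maximum absolute column sum
check_1 = opnorm(A, 1) ≈ maximum(sum(abs, A; dims=1))
# ∞-norm: maximum absolute row sum
check_inf = opnorm(A, Inf) ≈ maximum(sum(abs, A; dims=2))
# 2-norm: square root of the largest eigenvalue of A'A
check_2 = opnorm(A, 2) ≈ sqrt(eigmax(Symmetric(A' * A)))
```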
From matrix norms, we define the condition number of a matrix as \[ \kappa_p(\mathsf A) = {\lVert {\mathsf A} \rVert}_p {\lVert {\mathsf A^{-1}} \rVert}_p \]
Properties:
\(\kappa_p(\mathsf I) = 1\)
\(\kappa_p(\mathsf A) \geqslant 1\)
\(\kappa_p(\alpha \mathsf A) = \kappa_p(\mathsf A)\) for all \(\alpha \neq 0\)
In Julia, calculate the condition number using cond(A, p).
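The definition and the listed properties can be verified on a small example. The matrix below is an illustrative choice (symmetric, with eigenvalues 1 and 3), so its 2-norm condition number should be close to 3.

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 2.0]   # illustrative matrix (assumption)

κ2 = cond(A, 2)
by_definition = opnorm(A, 2) * opnorm(inv(A), 2)   # ‖A‖₂ ‖A⁻¹‖₂
scaled = cond(5A, 1)                               # compare with cond(A, 1)
identity_cond = cond(Matrix(1.0I, 2, 2), Inf)      # κ_p(I) should be 1
```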
Proposition
Let \(\mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}}\) denote the solution to \[ \mathsf A (\mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}}) = \mathbf{\boldsymbol{b}} + \Delta \mathbf{\boldsymbol{b}} \] The following inequality holds: \[ \frac{{\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}} \leqslant\kappa(\mathsf A) \frac{{\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}} \]
Proof
It holds by definition of \(\Delta \mathbf{\boldsymbol{x}}\) that \(\mathsf A \Delta \mathbf{\boldsymbol{x}} = \Delta \mathbf{\boldsymbol{b}}\). Therefore \[ \begin{aligned} {\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert} &= {\lVert {\mathsf A^{-1} \Delta \mathbf{\boldsymbol{b}}} \rVert} \leqslant{\lVert {\mathsf A^{-1}} \rVert} {\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert} \\ &= \frac{{\lVert {\mathsf A \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}} {\lVert {\mathsf A^{-1}} \rVert} {\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert} \leqslant\frac{{\lVert {\mathsf A} \rVert} {\lVert {\mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{b}}} \rVert}} {\lVert {\mathsf A^{-1}} \rVert} {\lVert {\Delta \mathbf{\boldsymbol{b}}} \rVert}. \end{aligned} \] Rearranging, we obtain the statement.
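A numerical illustration of this bound, as a minimal sketch: the matrix, right-hand side and perturbation Δb below are hypothetical values chosen so that the inequality is easy to inspect.

```julia
using LinearAlgebra

A  = [2.0 1.0; 1.0 2.0]
b  = [7.0, 8.0]
Δb = [1e-6, 0.0]          # hypothetical perturbation of the right-hand side

x  = A \ b
Δx = A \ (b + Δb) - x

lhs = norm(Δx) / norm(x)                # relative change in the solution
rhs = cond(A, 2) * norm(Δb) / norm(b)   # bound given by the proposition
bound_holds = lhs ≤ rhs
```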
Proposition
Let \(\mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}}\) denote the solution to \[ (\mathsf A + \Delta \mathsf A) (\mathbf{\boldsymbol{x}} + \Delta \mathbf{\boldsymbol{x}}) = \mathbf{\boldsymbol{b}} \] If \(\mathsf A\) is invertible and \({\lVert {\Delta \mathsf A} \rVert} < \frac{1}{2} {\lVert {\mathsf A^{-1}} \rVert}^{-1}\), then \[ \frac{{\lVert {\Delta \mathbf{\boldsymbol{x}}} \rVert}}{{\lVert {\mathbf{\boldsymbol{x}}} \rVert}} \leqslant 2\kappa(\mathsf A) \frac{{\lVert {\Delta \mathsf A} \rVert}}{{\lVert {\mathsf A} \rVert}} \]
Conclusion: \(\kappa(\mathsf A)\) measures sensitivity to perturbations:
useful to estimate the impact of round-off errors;
influences convergence speed of methods (see later).
When \(\kappa_p(\mathsf A) \gg 1\), the system is called ill-conditioned.
Consider the linear system \[ \mathsf A(α) \mathbf x := \begin{pmatrix} 1 & 1 \\ 1 & 1 - α \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} π \\ π - πα \end{pmatrix} =: \mathbf b(α). \]
We plot, as functions of \(\alpha\), the following two quantities on the same graph:
the quantity \(\kappa_2\bigl(\mathsf A(α)\bigr) \times ε\), where \(ε\) is the machine epsilon for the Float64 format;
the relative error in Euclidean norm obtained when solving the linear system with the backslash \ operator.
using LinearAlgebra  # norm, cond
Af(α) = [1. 1.; 1. (1-α)]
bf(α) = [π; π - π*α]
# Exact solution
x_exact = [0.; π]
# Range of α
αs = 10. .^ (-15:.1:-2)
rel_err(α) = norm(Af(α)\bf(α) - x_exact)/norm(x_exact)
Plots.plot(αs, cond.(Af.(αs)) * eps(), label=L"κ_2(A) \times ε")
Plots.plot!(αs, rel_err.(αs), label=L"$ℓ^2$ error")
Plots.plot!(xscale=:log10, yscale=:log10, xlabel=L"α", bottom_margin=5Plots.mm)
First calculate the \(\mathsf L \mathsf U\) decomposition of \(\mathsf A\) with
\(\mathsf U\) upper triangular matrix;
\(\mathsf L\) unit lower triangular.
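Then solve \(\mathsf L \mathbf{\boldsymbol{y}} = \mathbf{\boldsymbol{b}}\) using forward substitution.
Finally solve \(\mathsf U \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{y}}\) using backward substitution.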
Remark: the LU decomposition may not exist, even for an invertible matrix (for instance when \(a_{11} = 0\)).
In practice, use decomposition \(\mathsf P \mathsf A = \mathsf L \mathsf U\), with \(\mathsf P\) a permutation matrix.
Guaranteed to exist if \(\mathsf A\) is invertible;
More numerically stable.
\[ \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix} =: \mathsf A \]
\[ \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}\frac{3}{2}} & 1 & 0 \\ {\color{red}{1}} & 0 & 1 \end{pmatrix} }_{=: \, \mathsf M_1}~ \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix} }_{\mathsf A} = \begin{pmatrix} 2 & 1 & -1 \\ {\color{red} 0} & \frac{1}{2} & \frac{1}{2} \\ {\color{red} 0} & 2 & 1 \end{pmatrix} \]
\[ \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & {\color{blue} -4} & 1 \end{pmatrix} }_{=: \, \mathsf M_2}~ \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}\frac{3}{2}} & 1 & 0 \\ {\color{red}{1}} & 0 & 1 \end{pmatrix} }_{=: \, \mathsf M_1}~ \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ -3 & -1 & 2 \\ -2 & 1 & 2 \end{pmatrix} }_{\mathsf A} = \underbrace{ \begin{pmatrix} 2 & 1 & -1 \\ {\color{red} 0} & \frac{1}{2} & \frac{1}{2} \\ {\color{red} 0} & {\color{blue} 0} & -1 \end{pmatrix} }_{=: \, \mathsf U} \]
Gaussian transformations \(\mathsf M_1\), \(\mathsf M_2\) are simple to invert and multiply. In particular \[ \mathsf A = \mathsf M_1^{-1} \mathsf M_2^{-1} \mathsf U = \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}-\frac{3}{2}} & 1 & 0 \\ {\color{red}{-1}} & 0 & 1 \end{pmatrix} }_{=: \, \mathsf M_1^{-1}}~ \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & {\color{blue} 4} & 1 \end{pmatrix} }_{=: \, \mathsf M_2^{-1}}~ \mathsf U = \underbrace{ \begin{pmatrix} 1 & 0 & 0 \\ {\color{red}-\frac{3}{2}} & 1 & 0 \\ {\color{red}{-1}} & {\color{blue} 4} & 1 \end{pmatrix} }_{\mathsf M_1^{-1} \mathsf M_2^{-1} =: \, \mathsf L}~ \mathsf U \]
# A is an invertible matrix of size n x n
n = size(A, 1)
L = [r == c ? 1.0 : 0.0 for r in 1:n, c in 1:n]
U = float(copy(A))  # floating-point working copy, so that A is left untouched
for c in 1:n-1, r in c+1:n
U[c, c] == 0 && error("Pivotal entry is zero!")
L[r, c] = U[r, c] / U[c, c]
U[r, c:end] -= U[c, c:end] * L[r, c]
end
# L is unit lower triangular and U is upper triangular
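Once \(\mathsf L\) and \(\mathsf U\) are available, the two triangular solves can be sketched as follows. The function names forward_substitution and backward_substitution are hypothetical, and the factors are those of the worked \(3 \times 3\) example; the right-hand side is chosen so that the solution is \((1, 2, 3)^T\).

```julia
# Solve L y = b for a lower triangular matrix L
function forward_substitution(L, b)
    n = length(b)
    y = zeros(n)
    for i in 1:n
        y[i] = (b[i] - sum(L[i, j] * y[j] for j in 1:i-1; init=0.0)) / L[i, i]
    end
    return y
end

# Solve U x = y for an upper triangular matrix U
function backward_substitution(U, y)
    n = length(y)
    x = zeros(n)
    for i in n:-1:1
        x[i] = (y[i] - sum(U[i, j] * x[j] for j in i+1:n; init=0.0)) / U[i, i]
    end
    return x
end

L = [1.0 0.0 0.0; -1.5 1.0 0.0; -1.0 4.0 1.0]
U = [2.0 1.0 -1.0; 0.0 0.5 0.5; 0.0 0.0 -1.0]
b = [1.0, 1.0, 6.0]               # equals L * U * [1, 2, 3]

y = forward_substitution(L, b)
x = backward_substitution(U, y)   # ≈ [1, 2, 3]
```

Each solve costs \(\mathcal O(n^2)\) operations, since row \(i\) involves at most \(n\) multiplications.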
Computational cost: \(\frac{2}{3} n^3 + \mathcal O(n^2)\) floating point operations (flops);
In comparison, the computational cost of forward/backward substitution scales as \(\mathcal O(n^2)\);
The LU decomposition can be reused for different right-hand sides;
If \(\mathsf A\) is a banded matrix, so are \(\mathsf L\) and \(\mathsf U\), with the same bandwidth (when no pivoting is performed);
\(\mathsf L\) and \(\mathsf U\) are not necessarily sparse when \(\mathsf A\) is sparse.
Alternative 1: Calculate \(\mathsf A^{-1}\) by applying row operations to extended matrix \([\mathsf A | \mathsf I]\), then set \(\mathbf x = \mathsf A^{-1} \mathbf b\)
Amounts to solving \(\mathsf A \mathbf{\boldsymbol{x}}_i = \mathbf{\boldsymbol{e}}_i\) for all \(i\), then linearly combining solutions;
❌ Less numerically stable (more rounding errors);
❌ Computational cost also \(\mathcal O(n^3)\), but with a larger prefactor.
Alternative 2: Calculate \(\mathbf{\boldsymbol{x}}\) by Gaussian elimination on extended matrix \([\mathsf A | \mathbf{\boldsymbol{b}}]\)
✔️ Roughly the same computational cost as \(\mathsf L \mathsf U\) factorization;
❌ Not ideal when solving many systems \(\mathsf A \mathbf{\boldsymbol{x}}_i = \mathbf{\boldsymbol{b}}_i\) with same matrix.
using LinearAlgebra
A = [1 1 1; 1 2 3; 1 4 9]
A = factorize(A)
@show A # Note that A can still be used as a matrix later!
A = LU{Float64, Matrix{Float64}, Vector{Int64}}([1.0 1.0 1.0; 1.0 3.0 8.0; 1.0 0.3333333333333333 -0.6666666666666665], [1, 3, 3], 0)
LU{Float64, Matrix{Float64}, Vector{Int64}}
L factor:
3×3 Matrix{Float64}:
1.0 0.0 0.0
1.0 1.0 0.0
1.0 0.333333 1.0
U factor:
3×3 Matrix{Float64}:
1.0 1.0 1.0
0.0 3.0 8.0
0.0 0.0 -0.666667
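To illustrate how a factorization is reused for several right-hand sides, here is a minimal sketch based on lu from LinearAlgebra. The right-hand sides b1 and b2 are arbitrary illustrative vectors.

```julia
using LinearAlgebra

A = [1.0 1 1; 1 2 3; 1 4 9]
F = lu(A)                          # pivoted factorization P A = L U

consistent = F.L * F.U ≈ A[F.p, :] # rows of A permuted according to F.p

b1 = [1.0, 2.0, 3.0]               # illustrative right-hand sides (assumption)
b2 = [0.0, 1.0, 0.0]
x1 = F \ b1                        # each solve reuses the stored factors
x2 = F \ b2
```

Only the two triangular solves are repeated for each new right-hand side; the \(\mathcal O(n^3)\) factorization is performed once.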
using Polynomials, Statistics
generate_defpos_matrix(n) = begin A = rand(n,n); A'A + I end
generate_defpos_matrix(1000) |> cholesky # To force compilation
nb_samples = 10
Plots.plot(xaxis=:log10, yaxis=:log10, xlabel="n", ylabel="CPU time", legend=:topleft, size=(900,400),
bottom_margin=5Plots.mm, left_margin=5Plots.mm, top_margin=5Plots.mm)
tf, tb = Float64[], Float64[]
tn = 2 .^(6:9)
for n in tn
A = generate_defpos_matrix(n)
push!(tf, mean(@elapsed cholesky(A) for _ in 1:nb_samples))
end
Pf = fit(log10.(tn), log10.(tf), 1) ; af = round(coeffs(Pf)[2]; digits=2)
Plots.plot!(tn, tf, marker=:o, label=L"Dense matrix $n^{%$af}$")
Plots.xticks!(tn, [L"2^{%$p}" for p in 6:9])
\[∀j=1\ldots n\quad c_{jj}=\sqrt{a_{jj}-\sum_{k=\max{(1,j-b)}}^{j-1}\lvert c_{jk}\rvert^2} \quad\textrm{ and }\quad ∀i=j+1\ldots \min{(n,j+b)} \quad c_{ij}=\frac{a_{ij}-\sum_{k=\max{(1,j-b)}}^{j-1}c_{ik}\overline{c_{jk}}}{c_{jj}}\]
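One possible transcription of this formula is sketched below under the name cholesky_banded. The name matches the call used in the timing code, but the body and the return type (a LowerTriangular factor \(\mathsf C\) with \(\mathsf A = \mathsf C \mathsf C^*\)) are assumptions, not the course's reference implementation.

```julia
using LinearAlgebra

# Hypothetical banded Cholesky: A is Hermitian positive definite
# with A[i, j] = 0 whenever |i - j| > b.
function cholesky_banded(A, b)
    n = size(A, 1)
    C = zeros(float(eltype(A)), n, n)
    for j in 1:n
        ks = max(1, j - b):j-1           # only entries inside the band contribute
        s = real(A[j, j])
        for k in ks
            s -= abs2(C[j, k])
        end
        C[j, j] = sqrt(s)                # throws DomainError if A is not positive definite
        for i in j+1:min(n, j + b)
            t = A[i, j]
            for k in ks
                t -= C[i, k] * conj(C[j, k])
            end
            C[i, j] = t / C[j, j]
        end
    end
    return LowerTriangular(C)
end

# Small tridiagonal test matrix (bandwidth 1), chosen for illustration
A = Matrix(SymTridiagonal(fill(4.0, 5), fill(-1.0, 4)))
C = cholesky_banded(A, 1)
```

With this loop structure the work is \(\mathcal O(n b^2)\) rather than \(\mathcal O(n^3)\), although the sketch still stores \(\mathsf C\) as a dense array.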
function generate_banded_matrix(n, b)
C = [j≤i≤j+b ? rand() : 0.0 for i in 1:n, j in 1:n]
return C*C'+I
end
b = 4; A = generate_banded_matrix(10, b)
cholesky_banded(A, b) # To force compilation
Plots.plot(xaxis=:log10, yaxis=:log10, xlabel="n", ylabel="CPU time", legend=:topleft, size=(900,400),
bottom_margin=5Plots.mm, left_margin=5Plots.mm, top_margin=5Plots.mm)
tf, tb = Float64[], Float64[]
tn = 2 .^(6:9)
for n in tn
A = generate_banded_matrix(n, b)
push!(tf, mean(@elapsed cholesky(A) for _ in 1:nb_samples))
push!(tb, mean(@elapsed cholesky_banded(A,b) for _ in 1:nb_samples))
end
Pf = fit(log10.(tn), log10.(tf), 1) ; af = round(coeffs(Pf)[2]; digits=2)
Plots.plot!(tn, tf, marker=:o, label=L"Dense matrix $n^{%$af}$")
ntn = 2 .^(10:12)
for n in ntn
A = generate_banded_matrix(n, b)
push!(tb, mean(@elapsed cholesky_banded(A,b) for _ in 1:nb_samples))
end
append!(tn,ntn)
Pb = fit(log10.(tn),log10.(tb),1) ; ab = round(coeffs(Pb)[2]; digits=2)
Plots.plot!(tn, tb, marker=:diamond, label="Banded matrix "*latexstring("n^{$(ab)}"))
Plots.xticks!(tn, [L"2^{%$p}" for p in 6:12])  # tn now holds all sizes 2^6, ..., 2^12
Motivation: General-purpose direct methods are
exact up to roundoff errors
but computationally expensive: \(\mathcal O(n^3)\) flops for full matrices…
Additionally, they require the storage of \(\mathsf L\) and \(\mathsf U\).
In contrast, iterative methods
are usually approximate (but there are exceptions);
usually have a cost per iteration scaling as \(\mathcal O(n^2)\) at most;
\(\rightarrow\) often computationally more economical
can be stopped at any point when the residual \(\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}\) is sufficiently small;
For simplicity, we consider from now on only the case where \(\mathsf A \in \mathbf R^{n \times n}\) is symmetric positive definite.
Richardson’s method
\[\begin{align*} \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} + \omega (\mathbf{\boldsymbol{b}} - \mathsf A \mathbf{\boldsymbol{x}}^{(k)}) \end{align*}\]
Proposition
Assume \(\omega \neq 0\). If \((\mathbf{\boldsymbol{x}}^{(k)})\) converges in the iteration above, then it converges towards the solution of \[\begin{align*} \mathsf A \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{b}}. \end{align*}\]
Questions:
Does this method converge and how quickly?
What is a good stopping criterion?
Definition
The spectral radius of \(\mathsf A \in \mathbf R^{n \times n}\) is given by \[ \rho(\mathsf A) := \max_{\lambda \in \mathop{\mathrm{spectrum}}\mathsf A} {\lvert {\lambda} \rvert} \]
Remarks:
\(\rho(\mathsf A)\) is not a norm;
\(\rho(\mathsf A) \leqslant{\lVert {\mathsf A} \rVert}\) for any induced matrix norm.
🔎 Gelfand’s formula: For any \(\mathsf A \in \mathbf R^{n \times n}\) and any matrix norm \({\lVert {\cdot} \rVert}\), \[ \lim_{k \to \infty} {\lVert {\mathsf A^k} \rVert}^{1/k} = \rho(\mathsf A). \]
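Gelfand's formula can be observed numerically. In the sketch below, the matrix B is an illustrative non-symmetric example whose 2-norm is strictly larger than its spectral radius; the quantities \({\lVert \mathsf B^k \rVert}^{1/k}\) nevertheless approach \(\rho(\mathsf B) = 0.5\), although slowly.

```julia
using LinearAlgebra

B = [0.5 1.0; 0.0 0.5]             # illustrative matrix (assumption)
ρ = maximum(abs, eigvals(B))       # spectral radius

nrm = opnorm(B, 2)                 # larger than ρ for this matrix
approx = [opnorm(B^k, 2)^(1/k) for k in (1, 10, 100, 200)]
```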
\[\begin{align*} \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} + \omega (\mathbf{\boldsymbol{b}} - \mathsf A \mathbf{\boldsymbol{x}}^{(k)}) \end{align*}\]
Proposition
The Richardson iteration converges to the exact solution \(\mathbf{\boldsymbol{x}}_*\) for every choice of \(\mathbf{\boldsymbol{x}}^{(0)}\) if and only if \[\begin{align*} \lambda := \rho(\mathsf I - \omega \mathsf A) = \max_{\mu \in \mathop{\mathrm{spectrum}}\mathsf A} {\lvert {1 - \omega \mu} \rvert} < 1. \end{align*}\] Furthermore, since \(\mathsf A\) is symmetric, \(\| \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* \|_2 \leqslant\lambda^k \| \mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_* \|_2\) for all \(k \in \mathbb N\).
Proof
We notice that \[ \mathbf{\boldsymbol{x}}^{(k+1)} - \mathbf{\boldsymbol{x}}_* = \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* + \omega (\mathsf A \mathbf{\boldsymbol{x}}_* - \mathsf A \mathbf{\boldsymbol{x}}^{(k)}) = (\mathsf I - \omega \mathsf A) \left( \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* \right). \] Therefore \[ \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_* = (\mathsf I - \omega \mathsf A)^k \left( \mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_* \right) \] The statement follows by diagonalising the matrix \(\mathsf I - \omega \mathsf A\).
Corollary: It is necessary for convergence that all the eigenvalues have nonzero real parts of the same sign.
Exercise: find \(\omega\) that minimizes \(\rho(\mathsf I - \omega \mathsf A)\)
Simple calculations lead to \[ \omega_* = \frac{2}{\lambda_{\min}(\mathsf A) + \lambda_{\max}(\mathsf A)}, \qquad \rho(\mathsf I - \omega_* \mathsf A) = \frac{\kappa_2(\mathsf A) - 1}{\kappa_2(\mathsf A) + 1} \]
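The calculation can be sketched as follows, assuming as in the rest of this part that \(\mathsf A\) is symmetric positive definite, so that its eigenvalues lie in \([\lambda_{\min}, \lambda_{\max}]\) with \(\lambda_{\min} > 0\) and \(\kappa_2(\mathsf A) = \lambda_{\max} / \lambda_{\min}\).

```latex
% For fixed \omega, the map \lambda \mapsto |1 - \omega\lambda| is convex,
% so its maximum over the spectrum is attained at an extreme eigenvalue:
\rho(\mathsf I - \omega \mathsf A)
  = \max \bigl\{ \lvert 1 - \omega \lambda_{\min} \rvert, \,
                 \lvert 1 - \omega \lambda_{\max} \rvert \bigr\}.
% The maximum of the two is smallest when they are equal with opposite signs:
1 - \omega_* \lambda_{\min} = - (1 - \omega_* \lambda_{\max})
  \quad \Longrightarrow \quad
  \omega_* = \frac{2}{\lambda_{\min} + \lambda_{\max}}.
% Substituting back:
\rho(\mathsf I - \omega_* \mathsf A)
  = 1 - \omega_* \lambda_{\min}
  = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}
  = \frac{\kappa_2(\mathsf A) - 1}{\kappa_2(\mathsf A) + 1}.
```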
Analysis of Richardson’s method (2/2)
Since \(\mathsf A\) is assumed symmetric positive definite, the solution to \(\mathsf A \mathbf{\boldsymbol{x}} = \mathbf{\boldsymbol{b}}\) is the minimizer of \[ f(\mathbf{\boldsymbol{x}}) = \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}}. \] Since \(\nabla f(\mathbf{\boldsymbol{x}}) = \mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}\), Richardson’s iteration may be rewritten as \[ \mathbf{\boldsymbol{x}}^{(k+1)} = \mathbf{\boldsymbol{x}}^{(k)} - \omega \nabla f(\mathbf{\boldsymbol{x}}^{(k)}). \]
Consider the linear system \[ \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 8 \end{pmatrix} \]
The exact solution is \((2, 3)^T\);
The eigenvalues of \(\mathsf A\) are 1 and 3;
The condition number \(\kappa_2(\mathsf A)\) is 3;
The optimal \(\omega\) is \(\frac{1}{2}\).
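For this system, Richardson's method with the optimal \(\omega = \frac{1}{2}\) can be sketched as below. The function name richardson, its keyword arguments, and the stopping test on the relative residual are assumptions made for illustration; other stopping criteria are possible.

```julia
using LinearAlgebra

# Hypothetical helper: Richardson iteration x ← x + ω (b - A x),
# stopped when the relative residual drops below tol.
function richardson(A, b, ω; x0=zeros(length(b)), tol=1e-12, maxiter=10_000)
    x = copy(x0)
    for k in 1:maxiter
        r = b - A * x
        if norm(r) ≤ tol * norm(b)
            return (x, k - 1)      # solution and number of updates performed
        end
        x += ω * r
    end
    return (x, maxiter)
end

A = [2.0 1.0; 1.0 2.0]
b = [7.0, 8.0]
x, iters = richardson(A, b, 1/2)   # x ≈ [2, 3]
```

Since \(\rho(\mathsf I - \omega_* \mathsf A) = \frac{1}{2}\) here, the error is at least halved at every iteration.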
Contour plot of \[f(\mathbf{\boldsymbol{x}}) = \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}}.\]
Consider the splitting \[ \mathsf A = {\color{green}\mathsf M} - {\color{blue} \mathsf N} \] where \(\mathsf M\) is invertible and easy to invert. Then \[ \mathsf A \mathbf{\boldsymbol{x}}_* = \mathbf{\boldsymbol{b}} \quad \Leftrightarrow \quad {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}_* = {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}_* + \mathbf{\boldsymbol{b}} \] which suggests the iterative method \[ {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}^{(k+1)} = {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}^{(k)} + \mathbf{\boldsymbol{b}} \]
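Standard basic iterative methods (\(\mathsf A = \mathsf L + \mathsf D + \mathsf U\), with \(\mathsf L\) strictly lower triangular, \(\mathsf D\) diagonal and \(\mathsf U\) strictly upper triangular)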
Richardson’s method corresponds to \[ \mathsf A = \underbrace{{\color{green}\frac{1}{\omega} \mathsf I }}_{\mathsf M} - \underbrace{\left({\color{blue} \frac{1}{\omega} \mathsf I - \mathsf A}\right)}_{\mathsf N} \]
Jacobi’s iteration: \(\mathsf A = {\color{green}\mathsf D} - ({\color{blue}-\mathsf L - \mathsf U})\).
Gauss-Seidel iteration: \(\mathsf A = ({\color{green}\mathsf D + \mathsf L}) - ({\color{blue}- \mathsf U})\)
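The generic splitting iteration and the Jacobi and Gauss-Seidel choices of \(\mathsf M\) and \(\mathsf N\) can be sketched as follows. The helper names splitting_iteration and spectral_radius, the fixed iteration count, and the strictly diagonally dominant test matrix are all assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical helper: iterate M x⁽ᵏ⁺¹⁾ = N x⁽ᵏ⁾ + b a fixed number of times
function splitting_iteration(M, N, b; x0=zeros(length(b)), maxiter=100)
    x = copy(x0)
    for _ in 1:maxiter
        x = M \ (N * x + b)
    end
    return x
end

spectral_radius(B) = maximum(abs, eigvals(B))

A = [4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0]   # strictly diagonally dominant
b = [1.0, 2.0, 3.0]

D  = Diagonal(A)      # diagonal part
Lo = tril(A, -1)      # strictly lower triangular part
Up = triu(A, 1)       # strictly upper triangular part

# Jacobi: M = D, N = -(L + U)
x_jacobi = splitting_iteration(D, -(Lo + Up), b)
ρ_jacobi = spectral_radius(D \ -(Lo + Up))

# Gauss-Seidel: M = D + L, N = -U
M_gs = LowerTriangular(D + Lo)
x_gs = splitting_iteration(M_gs, -Up, b)
ρ_gs = spectral_radius(M_gs \ -Up)
```

In both cases \(\mathsf M\) is diagonal or triangular, so each application of M \ costs at most \(\mathcal O(n^2)\) operations.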
Proposition: convergence of the splitting method
The iteration \[ {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}^{(k+1)} = {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}^{(k)} + \mathbf{\boldsymbol{b}} \] converges for any initial \(\mathbf{\boldsymbol{x}}^{(0)}\) if and only if \(\rho({\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N}) < 1\).
In addition, for any \(\varepsilon > 0\) there exists \(K > 0\) such that \[ \forall k \geqslant K, \qquad {\lVert {\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*} \rVert} \leqslant\bigl(\rho({\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N}) + \varepsilon\bigr)^k {\lVert {\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*} \rVert}. \]
Exercise: prove this result using Gelfand’s formula.
Proof
Subtracting the equations \[ \begin{aligned} {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}^{(k+1)} &= {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}^{(k)} + \mathbf{\boldsymbol{b}} \\ {\color{green}\mathsf M} \mathbf{\boldsymbol{x}}_* &= {\color{blue} \mathsf N} \mathbf{\boldsymbol{x}}_* + \mathbf{\boldsymbol{b}} \end{aligned} \] and rearranging gives \[ \mathbf{\boldsymbol{x}}^{(k+1)} - \mathbf{\boldsymbol{x}}_* = {\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N} (\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*) = \dots = ({\color{green}\mathsf M}^{-1} {\color{blue} \mathsf N})^{k+1} (\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*). \] The “only if” part of the first item is simple. The other claims follow from Gelfand’s formula: \[ \forall \mathsf B \in \mathbf R^{n \times n}, \qquad \lim_{k \to \infty} {\lVert {\mathsf B^k} \rVert}^{1/k} = \rho(\mathsf B). \]
Settings where \(\rho(\mathsf M^{-1} \mathsf N) < 1\) are identified on a case by case basis:
Richardson’s iteration, sufficient condition: \(\mathsf A\) symmetric positive definite;
Jacobi’s iteration, sufficient condition: \(\mathsf A\) strictly row or column diagonally dominant:
\[ \lvert a_{ii} \rvert > \sum_{j \neq i} \lvert a_{ij} \rvert \quad \forall i \qquad \text{ or } \qquad \lvert a_{jj} \rvert > \sum_{i \neq j} \lvert a_{ij} \rvert \quad \forall j. \]
Gauss-Seidel iteration, sufficient condition: \(\mathsf A\) strictly row or column diagonally dominant;
Relaxation method, sufficient condition: \(\mathsf A\) symmetric positive definite and \(\omega \in (0, 2)\);
Relaxation method, necessary condition: \(\omega \in (0, 2)\);
…
Recall that, until the end of this lecture, the matrix \(\mathsf A\) is assumed symmetric positive definite.
We recall Richardson’s iteration: \[\begin{align*} \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} - \omega \nabla f(\mathbf{\boldsymbol{x}}^{(k)}), \qquad f(\mathbf{\boldsymbol{x}}) := \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}}. \end{align*}\]
Proposition
For any nonzero \(\mathbf{\boldsymbol{d}} \in \mathbf R^n\), it holds that \[ \mathop{\mathrm{arg\,min}}_{\omega \in \mathbf R} f \left(\mathbf{\boldsymbol{x}} - \omega \mathbf{\boldsymbol{d}} \right) = \frac{\mathbf{\boldsymbol{d}}^T\mathbf{\boldsymbol{r}}}{\mathbf{\boldsymbol{d}}^T\mathsf A \mathbf{\boldsymbol{d}}}, \qquad \mathbf{\boldsymbol{r}} = \mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}. \]
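A short derivation of this formula, using only the symmetry and positive definiteness of \(\mathsf A\); the auxiliary function \(g\) is introduced here for convenience.

```latex
% Expand f along the line x - \omega d, using A^T = A:
g(\omega) := f(\mathbf{x} - \omega \mathbf{d})
  = f(\mathbf{x})
    - \omega \, \mathbf{d}^T (\mathsf A \mathbf{x} - \mathbf{b})
    + \frac{\omega^2}{2} \, \mathbf{d}^T \mathsf A \mathbf{d}.
% g is a quadratic in \omega with leading coefficient d^T A d / 2 > 0 for d \neq 0,
% so its unique minimizer is the root of g':
g'(\omega) = - \mathbf{d}^T \mathbf{r} + \omega \, \mathbf{d}^T \mathsf A \mathbf{d} = 0
  \quad \Longleftrightarrow \quad
  \omega = \frac{\mathbf{d}^T \mathbf{r}}{\mathbf{d}^T \mathsf A \mathbf{d}},
  \qquad \mathbf{r} = \mathsf A \mathbf{x} - \mathbf{b}.
```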
We can improve upon Richardson’s iteration by letting \[ \mathbf{\boldsymbol{x}}^{(k + 1)} = \mathbf{\boldsymbol{x}}^{(k)} - \omega_{\color{red}k} \nabla f(\mathbf{\boldsymbol{x}}^{(k)}), \qquad \omega_{\color{red}k} := \frac{\mathbf{\boldsymbol{d}}_k^T\mathbf{\boldsymbol{d}}_k}{\mathbf{\boldsymbol{d}}_k^T\mathsf A \mathbf{\boldsymbol{d}}_k}, \qquad \mathbf{\boldsymbol{d}}_k := \mathsf A \mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{b}}. \]
This is the steepest descent method with optimal step.
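A minimal sketch of this iteration is given below. The function name steepest_descent, its keyword arguments, and the relative-residual stopping test are assumptions; the test system is the \(2 \times 2\) example used for Richardson's method.

```julia
using LinearAlgebra

# Hypothetical helper: steepest descent with optimal step for SPD A
function steepest_descent(A, b; x0=zeros(length(b)), tol=1e-12, maxiter=1000)
    x = float(copy(x0))
    for _ in 1:maxiter
        d = A * x - b                       # gradient of f at x
        norm(d) ≤ tol * norm(b) && break    # stop on small relative residual
        ω = dot(d, d) / dot(d, A * d)       # optimal step along -d
        x -= ω * d
    end
    return x
end

A = [2.0 1.0; 1.0 2.0]
b = [7.0, 8.0]
x = steepest_descent(A, b)   # ≈ [2, 3]
```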
\[ f(\mathbf{\boldsymbol{x}}) := \frac{1}{2} \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{b}}^T\mathbf{\boldsymbol{x}} = \frac{1}{2} {\lVert {\mathbf{\boldsymbol{x}} - \mathbf{\boldsymbol{x}}_*} \rVert}_{\mathsf A}^2 + \text{constant}, \qquad {\lVert {\mathbf{\boldsymbol{y}}} \rVert}_{\mathsf A} := \sqrt{{\langle {\mathbf{\boldsymbol{y}}, \mathbf{\boldsymbol{y}}} \rangle}_{\mathsf A}}, \qquad {\langle {\mathbf{\boldsymbol{x}}, \mathbf{\boldsymbol{y}}} \rangle}_{\mathsf A} := \mathbf{\boldsymbol{x}}^T\mathsf A \mathbf{\boldsymbol{y}}. \] Given \(n\) conjugate directions \(\{\mathbf{\boldsymbol{d}}_0, \dotsc, \mathbf{\boldsymbol{d}}_{n-1}\}\) such that \({\langle {\mathbf{\boldsymbol{d}}_i, \mathbf{\boldsymbol{d}}_j} \rangle}_{\mathsf A} = \delta_{ij}\), we have \[ \mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)} = \sum_{i=0}^{n-1} \mathbf{\boldsymbol{d}}_i \mathbf{\boldsymbol{d}}_i^T\mathsf A (\mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)}) = \sum_{i=0}^{n-1} \mathbf{\boldsymbol{d}}_i \mathbf{\boldsymbol{d}}_i^T({\color{green}\mathbf{\boldsymbol{b}}} - \mathsf A \mathbf{\boldsymbol{x}}^{(0)}) \]
\(\color{green}\rightarrow\) the \(\mathsf A\)-projections of \(\mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)}\) can be calculated even though \(\mathbf{\boldsymbol{x}}_*\) is unknown!
Minimization property (by construction)
\(\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}^{(0)}\) is the \(\mathsf A\)-orthogonal projection of \(\mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)}\) onto \(\mathop{\mathrm{Span}}\{\mathbf{\boldsymbol{d}}_0, \dotsc, \mathbf{\boldsymbol{d}}_{k-1}\}\).
Therefore, \(\mathbf{\boldsymbol{x}}^{(k)}\) minimizes \(f\) over the affine subspace \(\mathbf{\boldsymbol{x}}^{(0)} + \mathop{\mathrm{Span}}\{\mathbf{\boldsymbol{d}}_0, \dotsc, \mathbf{\boldsymbol{d}}_{k-1}\}\).
Consider again the linear system \[ \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 7 \\ 8 \end{pmatrix} \]
(non-normalized) Conjugate directions calculated by Gram-Schmidt: \[ \mathbf{\boldsymbol{d}}_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \mathbf{\boldsymbol{d}}_1 = \begin{pmatrix} -1 \\ 2 \end{pmatrix}. \]
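With these two directions, the expansion of \(\mathbf{\boldsymbol{x}}_* - \mathbf{\boldsymbol{x}}^{(0)}\) in conjugate directions can be evaluated directly. The sketch below first normalizes the directions in the \(\mathsf A\)-norm, as the formula requires; the starting point \(\mathbf{\boldsymbol{x}}^{(0)} = 0\) is an assumption.

```julia
using LinearAlgebra

A  = [2.0 1.0; 1.0 2.0]
b  = [7.0, 8.0]
x0 = zeros(2)                # assumed starting point

# Directions from the example, rescaled so that dᵀ A d = 1
ds = [[1.0, 0.0], [-1.0, 2.0]]
ds = [d / sqrt(dot(d, A * d)) for d in ds]

conjugacy = dot(ds[1], A * ds[2])                   # ≈ 0: A-orthogonal directions
x = x0 + sum(d * dot(d, b - A * x0) for d in ds)    # ≈ [2, 3]
```

Only products with \(\mathsf A\) and the known vector \(\mathbf{\boldsymbol{b}}\) appear; the unknown solution is never needed.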
Question: how can conjugate directions be generated?
\({\color{green} \rightarrow}\) the conjugate gradient method makes it possible to generate conjugate directions on the fly.
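A textbook form of the conjugate gradient method is sketched below. The function name conjugate_gradient and its keyword arguments are assumptions, and the residual is taken here as \(\mathbf{\boldsymbol{r}} = \mathbf{\boldsymbol{b}} - \mathsf A \mathbf{\boldsymbol{x}}\), which is the opposite sign convention to the one used earlier for steepest descent.

```julia
using LinearAlgebra

# Hypothetical helper: conjugate gradient method for SPD A
function conjugate_gradient(A, b; x0=zeros(length(b)), tol=1e-12, maxiter=length(b))
    x = float(copy(x0))
    r = b - A * x            # residual (note the sign convention)
    d = copy(r)              # first direction: the residual itself
    rs = dot(r, r)
    for _ in 1:maxiter
        sqrt(rs) ≤ tol * norm(b) && break
        Ad = A * d
        α = rs / dot(d, Ad)          # optimal step along d
        x += α * d
        r -= α * Ad
        rs_new = dot(r, r)
        d = r + (rs_new / rs) * d    # new direction, A-conjugate to the previous ones
        rs = rs_new
    end
    return x
end

A = [2.0 1.0; 1.0 2.0]
b = [7.0, 8.0]
x = conjugate_gradient(A, b)   # ≈ [2, 3]
```

In exact arithmetic the method terminates after at most \(n\) iterations, which is why maxiter defaults to length(b) in this sketch; in practice a larger limit combined with the tolerance test is common.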
Theorem
Suppose that \(\mathsf A\) is symmetric positive definite. Then \[ \forall k \geqslant 0, \qquad {\lVert {\mathbf{\boldsymbol{x}}^{(k)} - \mathbf{\boldsymbol{x}}_*} \rVert}_{\mathsf A} \leqslant 2 \left( \frac{{\color{red}\sqrt{\kappa_2(\mathsf A)}} - 1}{{\color{red}\sqrt{\kappa_2(\mathsf A)}} + 1} \right)^{k} {\lVert {\mathbf{\boldsymbol{x}}^{(0)} - \mathbf{\boldsymbol{x}}_*} \rVert}_{\mathsf A}, \]