conjugate gradient
Here, we want to solve a sparse linear system $A \mathbf{x} = \mathbf{b}$ in parallel using the conjugate gradient (CG) method, an iterative method for solving linear systems whose matrix $A$ is symmetric and positive-definite.
Here we recall the conjugate gradient algorithm, where $|\mathbf{p}|_A := \sqrt{\mathbf{p}^T A \mathbf{p}}$ denotes the norm induced by $A$:
$\textbf{function}$ CG($A, \mathbf{b}$, $\epsilon_{\text{TOL}}$)
$\quad\epsilon = \epsilon_{\text{TOL}} | \mathbf{b} |_2$
$\quad\mathbf{x} = 0$
$\quad\mathbf{p} = \mathbf{b}$
$\quad\mathbf{r} = \mathbf{b}$
$\quad\textbf{while} \; |\mathbf{r}|_2 \geq \epsilon \; \textbf{do}$
$\qquad\alpha = |\mathbf{r}|_2^2 / |\mathbf{p}|_A^2$
$\qquad\mathbf{x} = \mathbf{x} + \alpha \mathbf{p}$
$\qquad\mathbf{r} = \mathbf{r} - \alpha A \mathbf{p}$
$\qquad\beta = |\mathbf{r}|_2^2 / (\alpha |\mathbf{p}|_A^2)$
$\qquad\mathbf{p} = \mathbf{r} + \beta \mathbf{p}$
$\quad\textbf{end while}$
$\quad\textbf{return} \; (\mathbf{x})$
$\textbf{end function}$
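Note that the update formula for $\beta$ is the usual ratio of successive residual norms in disguise: since $\alpha = |\mathbf{r}_{\text{old}}|_2^2 / |\mathbf{p}|_A^2$, the denominator satisfies $\alpha |\mathbf{p}|_A^2 = |\mathbf{r}_{\text{old}}|_2^2$, so that
$$ \beta = \frac{|\mathbf{r}_{\text{new}}|_2^2}{\alpha |\mathbf{p}|_A^2} = \frac{|\mathbf{r}_{\text{new}}|_2^2}{|\mathbf{r}_{\text{old}}|_2^2}. $$
This is why the implementation below can reuse the scalar $\alpha |\mathbf{p}|_A^2$ (computed as alpha*np2) without recomputing any norm.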
We also give a sequential C++
implementation of the conjugate gradient algorithm using the previous MapMatrix
matrix class for the matrix-vector product, along with the necessary vector operations:
#include <cassert>
#include <cmath>
#include <iostream>
#include <vector>

// scalar product (u,v)
double operator,(const std::vector<double>& u, const std::vector<double>& v){
  assert(u.size()==v.size());
  double sp=0.;
  for(std::size_t j=0; j<u.size(); j++){sp+=u[j]*v[j];}
  return sp;
}

// euclidean norm of a vector u
double Norm(const std::vector<double>& u){
  return std::sqrt((u,u));
}

// addition of two vectors u+v
std::vector<double> operator+(const std::vector<double>& u, const std::vector<double>& v){
  assert(u.size()==v.size());
  std::vector<double> w=u;
  for(std::size_t j=0; j<u.size(); j++){w[j]+=v[j];}
  return w;
}

// multiplication of a vector by a scalar a*u
std::vector<double> operator*(const double& a, const std::vector<double>& u){
  std::vector<double> w(u.size());
  for(std::size_t j=0; j<w.size(); j++){w[j]=a*u[j];}
  return w;
}

// addition assignment operator, add v to u
void operator+=(std::vector<double>& u, const std::vector<double>& v){
  assert(u.size()==v.size());
  for(std::size_t j=0; j<u.size(); j++){u[j]+=v[j];}
}

// conjugate gradient solver: returns x such that |b - A*x|_2 <= tol*|b|_2
std::vector<double> CG(const MapMatrix& A, const std::vector<double>& b, double tol=1e-6){
  assert(b.size() == A.NbCol());
  assert(A.NbCol() == A.NbRow());
  std::vector<double> x(b.size(),0.);
  double nr = Norm(b);
  double epsilon = tol*nr;
  std::vector<double> p=b, r=b, Ap=A*p;
  double np2=(p,Ap), alpha=0., beta=0.;
  int num_it = 0;
  while(nr>epsilon){
    alpha = (nr*nr)/np2;        // alpha = |r|^2 / |p|_A^2
    x += alpha*p;
    r += (-alpha)*Ap;
    nr = Norm(r);
    beta = (nr*nr)/(alpha*np2); // beta = |r_new|^2 / |r_old|^2
    p = r+beta*p;
    Ap = A*p;                   // single matrix-vector product per iteration
    np2 = (p,Ap);
    num_it++;
    if(!(num_it%20)){
      std::cout << "iteration: " << num_it << "\t";
      std::cout << "residual: " << nr << "\n";
    }
  }
  return x;
}
The goal is to parallelize the conjugate gradient method on $p$ processes. As in the previous exercise, we suppose that $n := N/p$ is an integer, where $N$ is the size of the linear system. Again, each MPI process will hold a subset of rows of the matrix, and the vectors will be distributed on the $p$ processes, with each process storing $n$ values of the vector corresponding to the indices of the rows of the matrix it holds.
The parallelization of the matrix-vector product has been addressed in the previous exercise.
The conjugate gradient algorithm involves additional vector operations: sums of vectors, scalar products, … Taking into account that the vectors are stored in a distributed way among the $p$ processes, which operation needs special attention for the parallelization? Which MPI routine is appropriate?
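To fix ideas, here is a minimal sketch of a distributed scalar product, assuming each process stores its local block of $n$ vector entries and that the sequential operator, overload above is in scope for the local partial sum (the function name ParallelScalarProduct is ours):
// distributed scalar product: each process holds a local block of u and v;
// the local partial sums must be combined across all processes so that every
// process obtains the same global value
#include <mpi.h>
double ParallelScalarProduct(const std::vector<double>& u, const std::vector<double>& v){
  double local = (u,v);   // partial sum over the locally stored entries
  double global = 0.;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}
The norm then follows as std::sqrt(ParallelScalarProduct(u,u)); sums of vectors and multiplications by a scalar remain purely local.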
scalability
In order to assess the efficiency of the parallel implementation, we can perform a strong scaling test: for a given problem size $N$, increase the number of MPI processes $p$. As $p$ increases, the workload per process decreases, and you should observe a decrease in computing time for solving the linear system (use MPI_Wtime
to measure the computing time).
You can report the speedup $S$ obtained as $p$ increases: $S = \frac{T_s}{T_p}$, where $T_s$ is the time taken to solve the problem with the sequential implementation, and $T_p$ is the time taken to solve the same problem on $p$ computing cores with the parallel implementation. You can take $p \in \{2,4,8\}$ for example.
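A minimal timing sketch, assuming your parallel solver is called ParallelCG (a placeholder name; substitute your own function, and the distributed data A and b from the previous questions):
#include <mpi.h>
#include <iostream>
#include <vector>

// sketch: measure T_p for the parallel solve with MPI_Wtime
int main(int argc, char* argv[]){
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // ... assemble the local block of A and the local part of b here ...

  double t0 = MPI_Wtime();
  // std::vector<double> x = ParallelCG(A, b);   // your parallel CG solver
  double t1 = MPI_Wtime();
  if(rank == 0)
    std::cout << "solve time on " << size << " processes: " << (t1-t0) << " s\n";

  MPI_Finalize();
  return 0;
}
The sequential time $T_s$ is measured the same way from a run of the sequential implementation, and the speedup follows as $S = T_s / T_p$.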
You can perform these scalability tests for the two approaches described previously regarding the parallelization of the matrix-vector product, using for example the tridiagonal matrix below:
$$ A = \begin{pmatrix} 2 & -1 & 0 & \dots & 0 & 0 \\ -1 & 2 & -1 & \dots & 0 & 0 \\ 0 & -1 & 2 & \ddots & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \ddots & 2 & -1 \\ 0 & 0 & 0 & \dots & -1 & 2 \end{pmatrix} $$
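For reference, a sequential sketch of assembling this tridiagonal matrix and solving with the CG function above; it assumes a MapMatrix(N,N) constructor and a hypothetical Assign(i,j,value) insertion method, so adapt it to the actual interface of your MapMatrix class from the previous exercise:
// build the N x N tridiagonal matrix and solve A x = b sequentially;
// Assign(i,j,v) is a hypothetical MapMatrix method -- adapt to your class
int N = 1000;
MapMatrix A(N, N);
for(int i=0; i<N; i++){
  A.Assign(i, i, 2.);
  if(i > 0)   A.Assign(i, i-1, -1.);
  if(i < N-1) A.Assign(i, i+1, -1.);
}
std::vector<double> b(N, 1.);   // arbitrary right-hand side
std::vector<double> x = CG(A, b);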