conjugate gradient

Here, we want to solve a sparse linear system $A \mathbf{x} = \mathbf{b}$ in parallel using the conjugate gradient method, an iterative method for solving linear systems whose matrix is symmetric and positive-definite.
We first recall the conjugate gradient algorithm:

$\textbf{function}$ CG($A, \mathbf{b}$, $\epsilon_{\text{TOL}}$)
$\quad\epsilon = \epsilon_{\text{TOL}} | \mathbf{b} |_2$
$\quad\mathbf{x} = 0$
$\quad\mathbf{p} = \mathbf{b}$
$\quad\mathbf{r} = \mathbf{b}$
$\quad\textbf{while} \; |\mathbf{r}|_2 \geq \epsilon \; \textbf{do}$
$\qquad\alpha = |\mathbf{r}|_2^2 / |\mathbf{p}|_A^2$
$\qquad\mathbf{x} = \mathbf{x} + \alpha \mathbf{p}$
$\qquad\mathbf{r} = \mathbf{r} - \alpha A \mathbf{p}$
$\qquad\beta = |\mathbf{r}|_2^2 / (\alpha |\mathbf{p}|_A^2)$
$\qquad\mathbf{p} = \mathbf{r} + \beta \mathbf{p}$
$\quad\textbf{end while}$
$\quad\textbf{return} \; (\mathbf{x})$
$\textbf{end function}$

Above, $|\cdot|_2$ denotes the Euclidean norm and $|\mathbf{p}|_A^2 := (\mathbf{p}, A \mathbf{p})$ the squared $A$-norm of $\mathbf{p}$. We also give a sequential C++ implementation of the conjugate gradient algorithm, using the MapMatrix class from the previous exercise for the matrix-vector product, along with the necessary vector operations:

#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// scalar product (u,v)
double operator,(const std::vector<double>& u, const std::vector<double>& v){ 
  assert(u.size()==v.size());
  double sp=0.;
  for(std::size_t j=0; j<u.size(); j++){sp+=u[j]*v[j];}
  return sp; 
}

// norm of a vector u
double Norm(const std::vector<double>& u) { 
  return std::sqrt((u,u));
}

// addition of two vectors u+v
std::vector<double> operator+(const std::vector<double>& u, const std::vector<double>& v){ 
  assert(u.size()==v.size());
  std::vector<double> w=u;
  for(std::size_t j=0; j<u.size(); j++){w[j]+=v[j];}
  return w;
}

// multiplication of a vector by a scalar a*u
std::vector<double> operator*(const double& a, const std::vector<double>& u){ 
  std::vector<double> w(u.size());
  for(std::size_t j=0; j<w.size(); j++){w[j]=a*u[j];}
  return w;
}

// addition assignment operator, add v to u
void operator+=(std::vector<double>& u, const std::vector<double>& v){ 
  assert(u.size()==v.size());
  for(std::size_t j=0; j<u.size(); j++){u[j]+=v[j];}
}

std::vector<double> CG(const MapMatrix& A, const std::vector<double>& b, double tol=1e-6) {

  assert(b.size() == A.NbCol());
  assert(A.NbCol()== A.NbRow());
  std::vector<double> x(b.size(),0.);

  double nr=Norm(b);            // current residual norm |r|_2 (r=b initially)
  double epsilon = tol*nr;      // stopping threshold
  std::vector<double> p=b,r=b, Ap=A*p;
  double np2=(p,Ap);            // |p|_A^2 = (p, A p)
  double alpha=0.,beta=0.;

  int num_it = 0;
  while(nr>epsilon) {
    alpha = (nr*nr)/np2;        // step length
    x += alpha*p;               // update iterate
    r += (-alpha)*Ap;           // update residual (no operator- defined)
    nr = Norm(r);    
    beta = (nr*nr)/(alpha*np2); // = |r_new|_2^2 / |r_old|_2^2
    p = r+beta*p;               // new search direction
    Ap=A*p;
    np2=(p,Ap);

    num_it++;
    if(!(num_it%20)) {
      std::cout << "iteration: " << num_it << "\t";
      std::cout << "residual:  " << nr     << "\n";
    }
  }
  return x;
}

The goal is to parallelize the conjugate gradient method on $p$ processes. As in the previous exercise, we suppose that $n := N/p$ is an integer, where $N$ is the size of the linear system. Again, each MPI process holds a subset of rows of the matrix, and the vectors are distributed among the $p$ processes: each process stores the $n$ values of a vector corresponding to the indices of the rows it holds.
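For illustration, here is a minimal sketch of this block distribution, assuming (as one possible convention) that process rank owns the $n$ consecutive rows starting at global index rank*n:

#include <mpi.h>
#include <vector>

int main(int argc, char* argv[]){
  MPI_Init(&argc, &argv);
  int rank, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  const int N = 1000;        // global problem size (example value)
  const int n = N/p;         // local number of rows, assuming p divides N
  const int offset = rank*n; // global index of the first local row

  // each process stores only its n entries of a distributed vector
  std::vector<double> x_local(n, 0.);

  MPI_Finalize();
  return 0;
}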

The parallelization of the matrix-vector product was addressed in the previous exercise.
The conjugate gradient algorithm involves additional vector operations: sums of two vectors, scalar products, … Taking into account that the vectors are stored in a distributed way among the $p$ processes, which operation needs attention for the parallelization? Which MPI routine is appropriate?
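For reference, a minimal sketch of what such a distributed operation could look like is given below; this is one possible answer, which would replace the corresponding sequential routine in the parallel code and assumes the MPI environment is initialized:

#include <mpi.h>
#include <cassert>
#include <cstddef>
#include <vector>

// distributed scalar product: each process computes the partial scalar
// product of its local entries, then the p partial results are summed
// so that every process obtains the global result
double operator,(const std::vector<double>& u, const std::vector<double>& v){
  assert(u.size()==v.size());
  double sp_local=0.;
  for(std::size_t j=0; j<u.size(); j++){sp_local+=u[j]*v[j];}
  double sp=0.;
  MPI_Allreduce(&sp_local, &sp, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return sp;
}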

scalability

In order to assess the efficiency of the parallel implementation, we can perform a strong scaling test: for a given problem size $N$, increase the number of MPI processes $p$. As $p$ increases, the workload per process decreases, and you should observe a decrease in computing time for solving the linear system (use MPI_Wtime to measure the computing time).
You can report the speedup $S$ obtained as $p$ increases: $S = \frac{T_s}{T_p}$, where $T_s$ is the time taken to solve the problem with the sequential implementation, and $T_p$ is the time taken to solve the same problem on $p$ computing cores with the parallel implementation. You can take $p \in \{2,4,8\}$ for example.
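For instance, the solve can be timed as follows; this is a sketch that assumes the parallel CG and the distributed MapMatrix A and vector b are defined as above:

#include <mpi.h>
#include <iostream>
#include <vector>

// measure the wall-clock time of the parallel solve with MPI_Wtime
double TimedSolve(const MapMatrix& A, const std::vector<double>& b){
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Barrier(MPI_COMM_WORLD);  // synchronize before starting the clock
  double t0 = MPI_Wtime();
  std::vector<double> x = CG(A,b);
  MPI_Barrier(MPI_COMM_WORLD);  // wait until every process has finished
  double t1 = MPI_Wtime();
  if(rank==0) std::cout << "solve time: " << t1-t0 << " s\n";
  return t1-t0;
}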

You can perform these scalability tests for the two approaches described previously regarding the parallelization of the matrix-vector product, using for example the tridiagonal matrix below:

$$ A = \begin{pmatrix} 2 & -1 & 0 & 0 & \dots & 0 & 0 \\\ -1 & 2 & -1 & 0 & \dots & 0 & 0 \\\ 0 & -1 & 2 & -1 & \dots & 0 & 0 \\\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots & \vdots \\\ 0 & 0 & 0 & 0 & \dots & 2 & -1 \\\ 0 & 0 & 0 & 0 & \dots & -1 & 2 \end{pmatrix} $$
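For example, each process can assemble its own rows of this matrix. The sketch below assumes a hypothetical insertion method MapMatrix::Assign(i,j,val), to be adapted to the actual interface of your MapMatrix class from the previous exercise:

// assemble the local rows [offset, offset+n) of the tridiagonal matrix;
// Assign(i,j,val) is a hypothetical insertion method of MapMatrix, to be
// replaced by the actual interface from the previous exercise
void FillTridiag(MapMatrix& A, int offset, int n, int N){
  for(int i=offset; i<offset+n; i++){
    A.Assign(i, i, 2.);                // diagonal
    if(i>0)   A.Assign(i, i-1, -1.);   // lower diagonal
    if(i<N-1) A.Assign(i, i+1, -1.);   // upper diagonal
  }
}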
