Matrix-vector product

Our goal for this first exercise is to parallelize the product of a sparse matrix and a vector. This will be useful when we consider the solution of a sparse linear system $A x = b$ with an iterative method, where $A$ comes, for example, from the discretization of a PDE by the finite element method.

We will start from a sequential C++ code implementing sparse matrices using a sorted COO (COOrdinates) format with std::map. You can find some information about a simple non-sorted COO format here (in French, sorry).

Here is the sequential C++ code:

#include <iostream>
#include <map>
#include <vector>
#include <cassert>

class MapMatrix{
private:
  typedef std::pair<int,int>   N2;   // (row, column) index pair

  std::map<N2,double>  data;         // nonzero entries, sorted by (row, column)
  int nbrow;                         // number of rows
  int nbcol;                         // number of columns

public:
  MapMatrix(const int& nr, const int& nc):
    nbrow(nr), nbcol(nc) {}

  MapMatrix(const MapMatrix& m):
    data(m.data), nbrow(m.nbrow), nbcol(m.nbcol) {}

  MapMatrix& operator=(const MapMatrix& m){
    if(this!=&m){
      nbrow=m.nbrow;
      nbcol=m.nbcol;
      data=m.data;
    }
    return *this;
  }

  int NbRow() const {return nbrow;}
  int NbCol() const {return nbcol;}

  // Read access: returns the stored value, or 0 for an absent entry.
  double operator()(const int& j, const int& k) const {
    auto search = data.find(std::make_pair(j,k));
    if(search!=data.end()) return search->second;
    return 0;
  }

  // Write access: creates the entry (j,k) if needed and returns a reference to it.
  double& Assign(const int& j, const int& k) {
    return data[std::make_pair(j,k)];
  }

  // Sparse matrix-vector product b = A*x, iterating over the stored nonzeros only.
  std::vector<double> operator*(const std::vector<double>& x) const {
    assert(int(x.size())==NbCol());
    std::vector<double> b(NbRow(),0.);
    for(auto it=data.begin(); it!=data.end(); ++it){
      int j = (it->first).first;     // row index
      int k = (it->first).second;    // column index
      double Mjk = it->second;       // value of the nonzero entry
      b[j] += Mjk*x[k];
    }
    return b;
  }
};

int main() {
  // Assemble a small 4x4 sparse matrix with a few nonzero entries.
  MapMatrix A(4,4);
  A.Assign(0,0)=2; A.Assign(0,1)=6;
  A.Assign(1,1)=1;
  A.Assign(2,2)=5;
  A.Assign(3,1)=3;

  // Compute b = A*x with x = (1,1,1,1) and print the result: 8 1 5 3
  std::vector<double> x(4,1), b = A*x;
  for (const double& val : b) std::cout << val << " ";
  std::cout << std::endl;

  return 0;
}

The goal is to parallelize the code above on $p$ processes: each MPI process will hold a subset of the matrix. We will consider a band decomposition, in which each process holds a contiguous block of rows. For simplicity, we suppose in the following that $n := N/p$ is an integer, where $N$ is the total number of rows of the matrix.
For the parallel matrix-vector product, the input and output vectors $u$ and $b$ will also be distributed across the $p$ processes; each process only stores the $n$ values corresponding to the indices of the rows of the matrix it holds. One possible storage convention is sketched right below.
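Concretely, one possible (but by no means mandatory) convention is that process rank holds the $n$ consecutive global rows $\mathrm{rank}\cdot n, \dots, \mathrm{rank}\cdot n + n - 1$ as a MapMatrix(n, N), i.e. with local row indices in $[0,n)$ and global column indices in $[0,N)$. The names A_loc, u_loc and offset in the sketch below are illustrative only, not part of the exercise.

#include <mpi.h>
#include <vector>
// MapMatrix is the class from the sequential code above.

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int N = 6;              // global number of rows (illustrative value)
  const int n = N / size;       // local number of rows, assuming N % size == 0
  const int offset = rank * n;  // first global row owned by this process

  // Local block: n rows (local indices 0..n-1), N (global) columns.
  MapMatrix A_loc(n, N);
  // Local piece of the input vector u: global entries offset .. offset+n-1.
  std::vector<double> u_loc(n, 1.0);

  // A global nonzero A(j,k) owned by this process is inserted as
  //   A_loc.Assign(j - offset, k) = value;

  MPI_Finalize();
  return 0;
}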
Consider the following example, with 3 processes:

$$ \begin{equation} \begin{array}{lr} \textcolor{red}{proc. 0} \quad \begin{cases} \quad \\ \quad \end{cases}\\ \textcolor{blue}{proc. 1} \quad \begin{cases} \quad \\ \quad \end{cases}\\ \textcolor{limegreen}{proc. 2} \quad \begin{cases} \quad \\ \quad \end{cases} \end{array} \begin{pmatrix} \color{red} 2 & \color{red} 1 & \color{red} 0 & \color{red} 0 & \color{red} 4 & \color{red} 0 \\\ \color{red} 0 & \color{red} 4 & \color{red} -1 & \color{red} 9 & \color{red} 0 & \color{red} 6 \\\ \color{blue} 0 & \color{blue} 0 & \color{blue} 2 & \color{blue} 3 & \color{blue} 0 & \color{blue} 2 \\\ \color{blue} 0 & \color{blue} 1 & \color{blue} 0 & \color{blue} 0 & \color{blue} -2 & \color{blue} 0 \\\ \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} -2 & \color{limegreen} 0 & \color{limegreen} 5 & \color{limegreen} 0 \\\ \color{limegreen} 1 & \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} -1 & \color{limegreen} 4 \\ \end{pmatrix} \times \begin{pmatrix} \color{red} u_0 \\\ \color{red} u_1 \\\ \color{blue} u_2 \\\ \color{blue} u_3 \\\ \color{limegreen} u_4 \\\ \color{limegreen} u_5 \\ \end{pmatrix} = \begin{pmatrix} \color{red} b_0 \\\ \color{red} b_1 \\\ \color{blue} b_2 \\\ \color{blue} b_3 \\\ \color{limegreen} b_4 \\\ \color{limegreen} b_5 \\ \end{pmatrix} \end{equation} $$

For the parallel matrix-vector product, we can see that in order to compute its contribution to the output vector ($b_0, b_1$), process 0 needs the data corresponding to the whole input vector $u$:

$$ \begin{pmatrix} \color{red} 2 & \color{red} 1 & \color{red} 0 & \color{red} 0 & \color{red} 4 & \color{red} 0 \\\ \color{red} 0 & \color{red} 4 & \color{red} -1 & \color{red} 9 & \color{red} 0 & \color{red} 6 \\\ &&&&& \\\ &&&&& \\\ &&&&& \\\ &&&&& \\ \end{pmatrix} \times \begin{pmatrix} \color{red} u_0 \\\ \color{red} u_1 \\\ \color{blue} u_2 \\\ \color{blue} u_3 \\\ \color{limegreen} u_4 \\\ \color{limegreen} u_5 \\ \end{pmatrix} = \begin{pmatrix} \color{red} b_0 \\\ \color{red} b_1 \\\ \\\ \\\ \\\ \\ \end{pmatrix} $$

If we consider a sparse matrix with a generic pattern of non-zero elements, a first, non-optimal way of handling the data dependencies is to gather the whole input vector $u$ on all processes. Then, having all the necessary input data, each process can compute its contribution to the matrix-vector product concurrently.
In order to gather the whole vector $u$ on all processes, each process sends its local contribution of size $n$ to every process. This operation can be performed by the collective MPI operation MPI_Allgather.
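A minimal sketch of this gather-everything approach, assuming the local storage convention introduced above (local block stored as a MapMatrix(n, N)), could look as follows; the function name MatVecAllgather is illustrative.

#include <mpi.h>
#include <vector>
// MapMatrix is the class from the sequential code above.

std::vector<double> MatVecAllgather(const MapMatrix& A_loc,
                                    const std::vector<double>& u_loc) {
  const int n = static_cast<int>(u_loc.size());  // local block size
  const int N = A_loc.NbCol();                   // global size, N = n*p

  // Gather the n local entries of u from every process, in rank order,
  // so that the full vector u is identical on all processes afterwards.
  std::vector<double> u(N);
  MPI_Allgather(u_loc.data(), n, MPI_DOUBLE,
                u.data(),     n, MPI_DOUBLE, MPI_COMM_WORLD);

  // Each process multiplies its own rows by the complete vector;
  // the result has size n and corresponds to the locally owned rows of b.
  return A_loc * u;
}

Note that with this approach every process stores a full copy of $u$, so both the memory footprint and the communication volume per product grow with $N$, independently of the sparsity pattern.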

Now, let us assume that the pattern of non-zero elements of the matrix is known a priori; consider for example the following tridiagonal matrix:

$$ \begin{equation} \begin{array}{lr} \textcolor{red}{proc. 0} \quad \begin{cases} \quad \\ \quad \end{cases}\\ \textcolor{blue}{proc. 1} \quad \begin{cases} \quad \\ \quad \end{cases}\\ \textcolor{limegreen}{proc. 2} \quad \begin{cases} \quad \\ \quad \end{cases} \end{array} \begin{pmatrix} \color{red} 2 & \color{red} -1 & \color{red} 0 & \color{red} 0 & \color{red} 0 & \color{red} 0 \\\ \color{red} -1 & \color{red} 2 & \color{red} -1 & \color{red} 0 & \color{red} 0 & \color{red} 0 \\\ \color{blue} 0 & \color{blue} -1 & \color{blue} 2 & \color{blue} -1 & \color{blue} 0 & \color{blue} 0 \\\ \color{blue} 0 & \color{blue} 0 & \color{blue} -1 & \color{blue} 2 & \color{blue} -1 & \color{blue} 0 \\\ \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} -1 & \color{limegreen} 2 & \color{limegreen} -1 \\\ \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} 0 & \color{limegreen} -1 & \color{limegreen} 2 \\ \end{pmatrix} \times \begin{pmatrix} \color{red} u_0 \\\ \color{red} u_1 \\\ \color{blue} u_2 \\\ \color{blue} u_3 \\\ \color{limegreen} u_4 \\\ \color{limegreen} u_5 \\ \end{pmatrix} = \begin{pmatrix} \color{red} b_0 \\\ \color{red} b_1 \\\ \color{blue} b_2 \\\ \color{blue} b_3 \\\ \color{limegreen} b_4 \\\ \color{limegreen} b_5 \\ \end{pmatrix} \end{equation} $$

Now, we can see that each process $i$ only needs the last value of the local input vector held by process $i-1$ and the first value of the local input vector held by process $i+1$. For example, process $0$ only needs $u_2$ from process $1$:

$$ \begin{pmatrix} \color{red} 2 & \color{red} -1 & \color{red} 0 & \color{red} 0 & \color{red} 0 & \color{red} 0 \\\ \color{red} -1 & \color{red} 2 & \color{red} -1 & \color{red} 0 & \color{red} 0 & \color{red} 0 \\\ &&&&&& \\\ &&&&&& \\\ &&&&&& \\\ &&&&&& \\ \end{pmatrix} \times \begin{pmatrix} \color{red} u_0 \\\ \color{red} u_1 \\\ \color{blue} u_2 \\\ \\\ \\\ \\ \end{pmatrix} = \begin{pmatrix} \color{red} b_0 \\\ \color{red} b_1 \\\ \\\ \\\ \\\ \\ \end{pmatrix} $$

In other words, each process $i$ only needs to exchange local data with its two direct neighbouring processes $i-1$ and $i+1$ in the band decomposition. Thus, the amount of data exchanged between MPI processes to perform the parallel matrix-vector product will be much smaller than in the generic case where the whole vector is gathered on all processes, and the communication will be much cheaper.
Direct communication between neighbouring processes can be performed with the non-blocking point-to-point MPI routines MPI_Isend and MPI_Irecv. Note that you need to call MPI_Wait where appropriate in order to wait for each operation to complete.
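A minimal sketch of this neighbour (halo) exchange for the tridiagonal case, with the same local layout as above, is given below. MPI_Waitall is used here as the array variant of MPI_Wait; the names ExchangeHalo, u_prev and u_next are illustrative only.

#include <mpi.h>
#include <vector>

// Exchange the boundary values of u_loc with the two neighbouring processes.
// On return, u_prev holds the last entry of process rank-1 and u_next the
// first entry of process rank+1 (0 on the boundary processes).
void ExchangeHalo(const std::vector<double>& u_loc,
                  double& u_prev, double& u_next,
                  int rank, int size) {
  const int n = static_cast<int>(u_loc.size());
  u_prev = 0.; u_next = 0.;
  MPI_Request reqs[4];
  int nreq = 0;

  if (rank > 0) {        // exchange one value with the previous process
    MPI_Isend(&u_loc[0],   1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &reqs[nreq++]);
    MPI_Irecv(&u_prev,     1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &reqs[nreq++]);
  }
  if (rank < size-1) {   // exchange one value with the next process
    MPI_Isend(&u_loc[n-1], 1, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD, &reqs[nreq++]);
    MPI_Irecv(&u_next,     1, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD, &reqs[nreq++]);
  }

  // Wait for all pending sends and receives before using u_prev and u_next.
  MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}

After this call, process $i$ can compute its $n$ rows of $b$ using only u_loc, u_prev and u_next.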

You can test both approaches on a tridiagonal or pentadiagonal matrix. You can use MPI_Wtime to measure the computing time while varying the matrix size $N$ and the number of MPI processes $p$.
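For example, a measurement around one product, reusing the illustrative names from the sketches above (A_loc, u_loc, rank), could look like:

// MPI_Wtime returns wall-clock seconds as a double; the barrier keeps the
// processes roughly synchronized before the measurement starts.
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();
std::vector<double> b_loc = MatVecAllgather(A_loc, u_loc);  // or the Isend/Irecv version
double elapsed = MPI_Wtime() - t0;
if (rank == 0) std::cout << "N = " << A_loc.NbCol() << ", time = " << elapsed << " s" << std::endl;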
