Domain Decomposition
Parallel programs must break the problem into pieces that can be distributed among the different processes or threads. In our previous discussion we talked generally about task parallelism and data parallelism, but we have seen few concrete examples applied to programming. One of the most common approaches is a form of data decomposition called domain decomposition. It is typically used for data that can be represented on some form of grid: we break the grid into subgrids and assign each subgrid to a process rank. There is no hard and fast rule for numbering the subgrids, and usually several orderings are possible.
For a concrete example, consider a two-dimensional grid of points with a function defined at each point. Consider the simple case of four processes. Two obvious possible rank assignments are left-to-right or cyclic.
How to assign the numbers to the subdomains may depend on the problem. MPI provides some utilities that may help with this, but using them requires creating a new communicator group, which we have not discussed, so for now we will do our ordering manually.
Let us consider the “brute force” maximum problem of Project Set 1. In that example, we randomly evaluated a function over a two-dimensional set of $x$ and $y$ values and used gathers to find an approximation to the value and location of the maximum. Another approach is to divide up the grid and perform our random searches independently over each subgrid, then collect results from each subdomain and compare them to find the maximum and its location.
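The final step of that approach, gathering each rank's local result and comparing at the root, has a simple structure. Below is a minimal mpi4py sketch of just that step; the per-rank values are dummy placeholders standing in for the results of the local searches, and the names local_xmax, local_ymax, and local_zmax are illustrative only, not part of the solutions later in this section.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# Dummy stand-ins for the maximum each rank would find over its subdomain
local_xmax, local_ymax, local_zmax = float(rank), float(rank), float(rank)

# Pack the local result; root gathers one triple from every rank
local_result = np.array([local_xmax, local_ymax, local_zmax])
all_results = np.empty((nprocs, 3)) if rank == 0 else None
comm.Gather(local_result, all_results, root=0)

if rank == 0:
    winner = np.argmax(all_results[:, 2])   # index of the largest z value
    g_xmax, g_ymax, g_zmax = all_results[winner]
    print(f"Maximum {g_zmax} at x,y = {g_xmax}, {g_ymax}")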
A little thought and experimentation suggests that the simplest way to map ranks to subdomains is to divide the overall grid into $np$ regions, where $np$ is the number of processes, and label each subdomain by its row and column number, counting from $0$. The number of processes must then be expressible as a product of two integers, $n_{rows} \times n_{cols}$. In our example we will assume it is a perfect square, so the process grid is $\sqrt{np}$ subdomains on each side. This is not required, but for a more general breakdown the program must be given the number of rows and columns, and it should confirm that the number of processes equals $n_{rows} \times n_{cols}$. Each process rank then computes its location in the grid according to
col_num=rank%ncols
row_num=rank//nrows
using Python syntax. The column number is the remainder when the rank is divided by the number of columns, and the row number is the integer quotient of the rank divided by the number of rows (here nrows and ncols are equal; for a general $n_{rows} \times n_{cols}$ layout the row number would be rank//ncols). This layout corresponds to a rank assignment such as the following, for nine processes:
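         col 0   col 1   col 2
row 0:     0       1       2
row 1:     3       4       5
row 2:     6       7       8

Rank 4, for example, computes 4 % 3 = 1 for its column and 4 // 3 = 1 for its row.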
With this decomposition, each rank evaluates the same number of random points regardless of how many processes are used, so the total work grows in proportion to the number of processes; this is an example of weak scaling.
Exercise
If you have not worked the example from Project Set 1, download the sample serial codes referenced there.
Implement the domain decomposition method to solve this problem. The basic method does not take too many lines of code. The example solutions also check that the number of processes is a perfect square, but for a first pass you may simply assume that it is. For correct statistical properties you should also use different random seeds on different ranks; be sure not to end up with a seed of $0$ on rank $0$.
Hint: From $xlo$, $xhi$, and $ncols$, compute the increment for each block of $x$ values. Repeat for the $y$ values using $ylo$, $yhi$, and $nrows$. Compute the location in rows and columns for each rank:
col_num=rank%ncols
row_num=rank//nrows
You can now compute the starting and ending $x$ and $y$ values for each rank.
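One possible translation of the hint into mpi4py follows. It is only a sketch, under the assumptions that the number of processes is a perfect square and that a time-plus-rank offset is an acceptable way to give every rank a distinct, nonzero seed.

from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# Search domain from the original problem
xlo, xhi = -10.0*np.pi, 10.0*np.pi
ylo, yhi = -10.0*np.pi, 10.0*np.pi

# Assume a perfect square; a general code would read nrows and ncols
# and verify that nrows*ncols == nprocs
nrows = int(np.sqrt(nprocs) + 0.5)
ncols = nrows

# Increment (width) of each block in x and in y
x_subrange = (xhi - xlo)/ncols
y_subrange = (yhi - ylo)/nrows

# This rank's position in the process grid
col_num = rank % ncols
row_num = rank // nrows        # nrows == ncols for a square grid

# Starting and ending x and y values for this rank
my_xlo = xlo + col_num*x_subrange
my_xhi = my_xlo + x_subrange
my_ylo = ylo + row_num*y_subrange
my_yhi = my_ylo + y_subrange

# One simple way to give every rank a different, nonzero seed
rng = np.random.default_rng(int(time.time()) + rank + 1)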
Example Solutions
C++
#include <iostream>
#include <iomanip>
#include <string>
#include <sstream>
#include <random>
#include <cmath>
#include "mpi.h"
using namespace std;
#define urand ((double)rand()/(double)RAND_MAX);
float surface(float x, float y) {
float pi=4.0*atan(1.0);
float mu1, mu2, sig1, sig2;
float a, b;
float z1, z2;
mu1=sqrt(2.0);
mu2=sqrt(pi);
sig1=3.1;
sig2=1.4;
z1=0.1*sin(x)*sin(x*y);
a=pow((x-mu1),2)/(2*pow(sig1,2));
b=pow((y-mu2),2)/(2*pow(sig2,2));
z2=exp(-(a+b))/(sig1*sig2*sqrt(2.0*pi));
return z1+z2;
}
int main(int argc, char **argv) {
float xval, yval, zval;
int nprocs, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// For this example we'll use perfect square nprocs. Otherwise we have
// to tell it how to divide up the processes into rows and columns.
float rows=pow(nprocs,0.5);
int nrows=int(rows);
int ncols=nrows;
int nsamps;
bool shutdown=false;
if (rank==0) {
if ( argc != 2 ) {
cout <<"Usage: number of samples to test\n";
shutdown=true;
}
else if ( pow(int(rows+0.5),2) != nprocs ) {
cout << "Number of processes must be a perfect square\n";
shutdown=true;
}
else {
nsamps=stoi(argv[1]);
}
}
MPI_Bcast(&shutdown,1,MPI_CXX_BOOL,0,MPI_COMM_WORLD);
if (shutdown) {
MPI_Finalize();
exit(1);
}
MPI_Bcast(&nsamps,1,MPI_INT,0,MPI_COMM_WORLD);
// Define the parameters to test
float pi=4.0*atan(1.0);
float xlo=-10.*pi;
float xhi=10.*pi;
float ylo=-10.*pi;
float yhi=10.*pi;
float x_subrange=(xhi-xlo)/ncols;
float y_subrange=(yhi-ylo)/nrows;
// Calculate the range over which I will be working
int col_num=rank%ncols;
int row_num=rank/nrows;
float my_xlo=xlo+col_num*x_subrange;
float my_xhi=my_xlo+x_subrange;
float my_ylo=ylo+row_num*y_subrange;
float my_yhi=my_ylo+y_subrange;
srand(time(0)+rank);
float xmax=0.;
float ymax=0.;
float zmax=0.;
for (int i=0;i<nsamps;++i) {
for (int j=0;j<nsamps;++j) {
xval=my_xlo+(my_xhi-my_xlo)*urand;
yval=my_ylo+(my_yhi-my_ylo)*urand;
zval=surface(xval,yval);
if (zval>zmax) {
zmax=zval;
xmax=xval;
ymax=yval;
}
}
}
float *xresult=new float[nprocs];
float *yresult=new float[nprocs];
float *zresult=new float[nprocs];
MPI_Gather(&xmax,1,MPI_FLOAT,xresult,1,MPI_FLOAT,0,MPI_COMM_WORLD);
MPI_Gather(&ymax,1,MPI_FLOAT,yresult,1,MPI_FLOAT,0,MPI_COMM_WORLD);
MPI_Gather(&zmax,1,MPI_FLOAT,zresult,1,MPI_FLOAT,0,MPI_COMM_WORLD);
if (rank==0) {
float g_zmax=zresult[0];
float g_xmax=xresult[0];
float g_ymax=yresult[0];
for (int i=1;i<nprocs;++i) {
if (zresult[i] > g_zmax) {
g_zmax=zresult[i];
g_xmax=xresult[i];
g_ymax=yresult[i];
}
}
cout<<"Result is "<<g_zmax<<" at x,y="<<g_xmax<<", "<<g_ymax<<endl;
}
MPI_Finalize();
}
Fortran
module random
implicit none
! Author: Katherine Holcomb
! Shortened version of a longer module
! Module incorporated into this file for convenience; typically it would be in its
! own file.
! Comment out one or the other when the module is incorporated into a code.
! Single precision
integer, parameter :: rk = kind(1.0)
! Double precision
!integer, parameter :: rk = kind(1.0d0)
contains
subroutine set_random_seed(seed)
! Sets all elements of the seed array
integer, optional, intent(in) :: seed
integer :: isize
integer, dimension(:), allocatable :: iseed
integer, dimension(8) :: idate
integer :: icount, i
call random_seed(size=isize)
if ( .not. present(seed) ) then
call system_clock(icount)
allocate(iseed(isize),source=icount+370*[(i,i=0,isize-1)])
else
allocate(iseed(isize))
iseed=seed
endif
call random_seed(put=iseed)
end subroutine set_random_seed
function urand(lb,ub,seed)
! Returns a uniformly-distributed random number in the range lb to ub.
real(rk) :: urand
real(rk), optional, intent(in) :: lb,ub
real(rk), optional, intent(in) :: seed
integer :: iseed
real(rk) :: rnd
real(rk) :: lower,upper
if ( present(seed) ) then
iseed=int(seed)
call set_random_seed(iseed)
endif
if ( present(lb) ) then
lower=lb
else
lower=0.0_rk
endif
if ( present(ub) ) then
upper = ub
else
upper = 1.0_rk
endif
call random_number(rnd)
urand = lower+(upper-lower)*rnd
return
end function urand
end module
program find_max
use random
use mpi
implicit none
real :: pi=4.0*atan(1.0)
real :: xlo, xhi, ylo, yhi
real :: zval, zmax
real :: xval, xincr, xmax
real :: yval, yincr, ymax
real, dimension(:), allocatable :: xvals, yvals
real, dimension(:,:), allocatable :: zvals
integer :: i, j, nargs
integer :: nsamps
character(len=12) :: n
integer :: rank, nprocs, ierr
integer, dimension(:), allocatable:: seed
integer :: iseed, icount
integer, dimension(:), allocatable:: maxinds
real, dimension(:), allocatable:: xresult,yresult,zresult
real :: g_zmax
real :: rows
real :: x_subrange,y_subrange
real :: my_xlo,my_xhi,my_ylo,my_yhi
integer :: nrows,ncols
integer :: row_num, col_num
logical :: shutdown=.false.
interface
real function surface(x,y)
implicit none
real, intent(in) :: x, y
end function
end interface
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD,rank,ierr)
call MPI_Comm_size(MPI_COMM_WORLD,nprocs,ierr)
! For this example we'll use perfect square nprocs. Otherwise we have to
! tell it how to divide up the processes into rows and columns.
rows=sqrt(real(nprocs))
nrows=int(rows)
ncols=nrows
if (rank==0) then
nargs=command_argument_count()
if ( nargs .ne. 1 ) then
shutdown=.true.
write(*,*) "Usage: number of samples to test"
else if ( int(rows+0.5)**2 /= nprocs ) then
write(*,*) "Number of processes must be a perfect square"
shutdown=.true.
else
call get_command_argument(1,n)
read(n,'(i12)') nsamps
endif
endif
call MPI_Bcast(shutdown,1,MPI_LOGICAL,0,MPI_COMM_WORLD,ierr)
if (shutdown) then
call MPI_Finalize(ierr)
stop
endif
call MPI_Bcast(nsamps,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
! Define the parameters to test
xlo=-10.*pi; xhi=10.*pi
ylo=-10.*pi; yhi=10.*pi
x_subrange=(xhi-xlo)/ncols
y_subrange=(yhi-ylo)/nrows
! Calculate the range over which I will be working
col_num=mod(rank,ncols)
row_num=rank/nrows   ! nrows == ncols for a square process grid
my_xlo=xlo+col_num*x_subrange
my_xhi=my_xlo+x_subrange
my_ylo=ylo+row_num*y_subrange
my_yhi=my_ylo+y_subrange
!For correct statistical properties, we set random seeds for each rank
call system_clock(icount)
iseed=icount+rank
call set_random_seed(iseed)
xmax=0.
ymax=0.
zmax=0.
do i=1,nsamps
do j=1,nsamps
xval=urand(my_xlo,my_xhi)
yval=urand(my_ylo,my_yhi)
zval=surface(xval,yval)
if (zval>zmax) then
zmax=zval
xmax=xval
ymax=yval
endif
enddo
enddo
! If not allgather, only root needs to allocate
if (rank==0) then
allocate(maxinds(nprocs))
allocate(xresult(nprocs),yresult(nprocs),zresult(nprocs))
endif
call MPI_Gather(xmax,1,MPI_REAL,xresult,1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_Gather(ymax,1,MPI_REAL,yresult,1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
call MPI_Gather(zmax,1,MPI_REAL,zresult,1,MPI_REAL,0,MPI_COMM_WORLD,ierr)
if (rank==0) then
g_zmax=maxval(zresult)
maxinds=maxloc(zresult)
write(*,'(a,f0.6,a,f0.6,a,f0.6)') &
"Result is ",g_zmax," at x,y=",xresult(maxinds(1)),", ",yresult(maxinds(1))
endif
call MPI_Finalize(ierr)
end program
real function surface(x,y)
implicit none
real, intent(in) :: x, y
real :: pi=4.0*atan(1.0)
real :: mu1, mu2, sig1, sig2
real :: a, b
real :: z1, z2
mu1=sqrt(2.0)
mu2=sqrt(pi)
sig1=3.1
sig2=1.4
z1=0.1*sin(x)*sin(x*y)
a=(x-mu1)**2/(2*sig1**2)
b=(y-mu2)**2/(2*sig2**2)
z2=exp(-(a+b))/(sig1*sig2*sqrt(2.0*pi))
surface=z1+z2
end function
Python
from mpi4py import MPI
import numpy as np
import sys
def surface(x,y):
    """This is the main processing function. We use a ufunc for MPI."""
    mu1=np.sqrt(2.0)
    mu2=np.sqrt(np.pi)
    sig1=3.1
    sig2=1.4
    z1=0.1*np.sin(x)*np.sin(x*y)
    a=(x-mu1)**2/(2*sig1**2)
    b=(y-mu2)**2/(2*sig2**2)
    z2=np.exp(-(a+b))/(sig1*sig2*np.sqrt(2.0*np.pi))
    z=z1+z2
    return z

if __name__ == '__main__':

    comm=MPI.COMM_WORLD
    rank=comm.Get_rank()
    nprocs=comm.Get_size()

    shutdown=None
    nsamps=0

    # For this example we'll use perfect square nprocs. Otherwise we have
    # to tell it how to divide up the processes into rows and columns.
    nrows=int(np.sqrt(nprocs)+0.5)
    ncols=nrows

    if rank==0:
        if nrows**2 != nprocs:
            print("Number of processes must be a perfect square")
            shutdown=True
        elif len(sys.argv)==2:
            nsamps=int(float(sys.argv[1]))
            shutdown=False
        else:
            print("Usage: number of samples")
            shutdown=True

    shutdown=comm.bcast(shutdown,root=0)
    if shutdown:
        MPI.Finalize()
        sys.exit(1)

    nsamps=comm.bcast(nsamps,root=0)

    # default_rng() with no argument is seeded from OS entropy, so each
    # rank gets a different stream.
    rng = np.random.default_rng()

    # Define the parameters to test
    xlo=-10.*np.pi; xhi=10.*np.pi
    ylo=-10.*np.pi; yhi=10.*np.pi

    x_subrange=(xhi-xlo)/ncols
    y_subrange=(yhi-ylo)/nrows

    # Calculate the range over which I will be working
    col_num=rank%ncols
    row_num=rank//nrows
    my_xlo=xlo+col_num*x_subrange
    my_xhi=my_xlo+x_subrange
    my_ylo=ylo+row_num*y_subrange
    my_yhi=my_ylo+y_subrange

    tic=MPI.Wtime()

    # We would usually have a lot of x,y points so cannot create z as a function
    # of x and y, so have to use a loop
    xmax=0.
    ymax=0.
    zmax=0.
    for i in range(nsamps):
        for j in range(nsamps):
            xval=rng.uniform(my_xlo,my_xhi)
            yval=rng.uniform(my_ylo,my_yhi)
            zval=surface(xval,yval)
            if zval>zmax:
                zmax=zval
                xmax=xval
                ymax=yval

    my_xresult=np.array([xmax])
    my_yresult=np.array([ymax])
    my_zresult=np.array([zmax])

    xresult=np.empty(nprocs)
    yresult=np.empty(nprocs)
    zresult=np.empty(nprocs)

    # The arrays are float64, so the gathers use MPI.DOUBLE
    comm.Gather([my_xresult,MPI.DOUBLE],[xresult,MPI.DOUBLE])
    comm.Gather([my_yresult,MPI.DOUBLE],[yresult,MPI.DOUBLE])
    comm.Gather([my_zresult,MPI.DOUBLE],[zresult,MPI.DOUBLE])

    toc=MPI.Wtime()

    if rank==0:
        g_zmax=zresult.max()
        maxind=np.argmax(zresult)
        g_xmax=xresult[maxind]
        g_ymax=yresult[maxind]
        print(f"Result is {g_zmax:.4f} at x,y= {g_xmax:.4f},{g_ymax:.4f} ")

    MPI.Finalize()