MPI Project Set 3: Jacobi Implementation

Project 7

Implement a serial Jacobi code to solve the two-dimensional Poisson equation with boundary conditions as described in our example, namely

u[0,:]=0.
u[nr+1,:]=100.
u[:,0]=100.
u[:,nc+1]=100.

Do not assume a square grid, but you may use one for testing, such as 500x500. Write your output to a file in your preferred format.
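
If you have no preferred format, plain text with one grid row per line is easy to produce in any of the three languages and can be read back directly by the plotting script below with np.loadtxt. A minimal Python sketch (the array u, its size, and the file name solution.txt are purely illustrative):

import numpy as np

# Sketch only: u stands in for the computed solution and is assumed to have
# shape (nr+2, nc+2), i.e. it includes the boundary cells.
nr, nc = 500, 500
u = np.zeros((nr + 2, nc + 2))

# Write the interior points, one grid row per line of text.
np.savetxt("solution.txt", u[1:nr+1, 1:nc+1], fmt="%.6g")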

Suggestions.

It is usually not necessary to check for convergence at every iteration. Make the checking interval a variable and experiment with different values, and examine whether it affects the time required for a run.
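
For example, in Python the checking interval might be handled along the lines of the sketch below; the grid size and tolerance are deliberately small, and diff_interval is the value to experiment with.

import time
import numpy as np

# Sketch of checking for convergence only every diff_interval iterations.
epsilon = 0.1
diff_interval = 10                 # the value to experiment with
nr = nc = 50
u = np.zeros((nr + 2, nc + 2))
u[nr+1, :] = 100.                  # boundary conditions as described above
u[:, 0] = 100.
u[:, nc+1] = 100.
w = u.copy()

diff = 10. * epsilon               # anything >= epsilon so we enter the loop
iterations = 0
t0 = time.time()
while diff >= epsilon:
    for i in range(1, nr + 1):
        for j in range(1, nc + 1):
            w[i, j] = 0.25*(u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
    if iterations % diff_interval == 0:
        # diff keeps its previous value on the other iterations
        diff = np.max(np.abs(w[1:nr+1, 1:nc+1] - u[1:nr+1, 1:nc+1]))
    u[1:nr+1, 1:nc+1] = w[1:nr+1, 1:nc+1]
    iterations += 1
print(f"{iterations} iterations in {time.time() - t0:.2f} s")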

Python programmers: once your code is working, try to use NumPy array operations with appropriate ranges to replace the double for loop. It should speed up the code substantially.
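
As a sketch of that replacement, the double i, j loop above collapses into a single statement over the interior slice; this is the same form used in the example Python solution below.

import numpy as np

# Same u and w shapes as in the sketch above.
nr = nc = 50
u = np.zeros((nr + 2, nc + 2))
w = np.zeros_like(u)

# One whole-array statement replaces the double loop over i and j.
w[1:-1, 1:-1] = 0.25*(u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])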

Fortran programmers: when writing doubly nested do loops over array indices, arrange them whenever possible so that the leftmost index varies in the innermost loop (outer loop over the rightmost index), for cache efficiency. Correct loop ordering can be roughly a factor of 50 faster than incorrect ordering for a double loop.

Use whatever plotting package you know to make a contour plot of the result. If you do not have a preference, you may use the contour.py script below. It assumes a single output file; with the -f option it transposes the result for Fortran output written by column rather than by row, so the orientation is the same.

When plotting, the top of an array (row 0) is the bottom of the plot.

Python script to contour output

import argparse
import numpy as np
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--fortran", help="Fortran ordering", action="store_true")
parser.add_argument("filename", help="output file name")
args = parser.parse_args()
base = args.filename

data=np.loadtxt(args.filename,unpack=False)

if args.fortran:
    data = data.T

print(data.size, data.shape)

fig=plt.figure()
plt.contourf(data)
plt.colorbar()
plt.show()

Example solutions

C++

#include <iostream>
#include <cmath>
#include <cstdio>
#include <ctime>
#include <string>
#include <sstream>
#include <fstream>
#include <iomanip>

#define MAX_ITER 10000000

void set_bcs(double **u, int nr, int nc);

using namespace std; 

int main (int argc, char *argv[]) {
  int i, j;
  double diff;            // Change in value 
  double mean;            //Average boundary value 
  int N, M;
  int iterations = 0;
  int diffInterval;
  double epsilon;

/* check number of parameters and read in epsilon, filename, optionally size */
  if (argc < 3) {
    printf ("USAGE:  %s epsilon output-file <N> <M>\n", argv[0]);
    return 1;
  }
  else {
     stringstream ssepsilon;
     ssepsilon<<argv[1];
     if (ssepsilon.fail()) {
         printf ("Error reading in epsilon value\n");
         return 2;
     }
     ssepsilon>>epsilon;

     if (argc==3) {
        N=500;
	M=500;
     }
     if (argc==4) {
        stringstream ssnr;
	ssnr<<argv[3];
        if (ssnr.fail()) {
            printf ("Error converting row dimension \n");
            return 2;
	}
        ssnr>>N;
	M=N;
     }
     if (argc==5) {
        stringstream ssnr;
	ssnr<<argv[3];
        if (ssnr.fail()) {
            printf ("Error converting row dimension \n");
            return 2;
	}
        ssnr>>N;
	stringstream ssnc;
	ssnc<<argv[4];
        if (ssnc.fail()) {
            printf ("Error converting row dimension \n");
            return 2;
	}
        ssnc>>M;
     }
  }

  cout<<"Running until the difference is <="<<epsilon<<" with size "<<N<<"x"<<M<<"\n";

  int nr=N;
  int nc=M;

  int nrows=nr+2;
  int ncols=nc+2;
  double **u=new double*[nrows];  // Old values
  double **w=new double*[nrows];  // New values
  double **diffs=new double*[nrows];  // Diffs

  double *uptr=new double[(nrows)*(ncols)];
  double *wptr=new double[(nrows)*(ncols)];
  double *dptr=new double[(nrows)*(ncols)];

  for (i=0;i<nrows;++i,uptr+=ncols)
     u[i] = uptr;
  for (i=0;i<nrows;++i,wptr+=ncols)
     w[i] = wptr;
  for (i=0;i<nrows;++i,dptr+=ncols)
     diffs[i] = dptr;

  mean = 0.0;

  set_bcs(u, nr, nc);

  //Initialize to something like the mean boundary. Not important, can use zeros
  //but this may speed convergence a bit.
  for ( i = 1; i <= nr; i++ ) {
    mean += u[i][0] + u[i][nc+1];
  }
  for ( j = 1; j <= nc; j++ ) {
    mean += u[0][j] + u[nr+1][j];
  }
  mean /= (2.0 * (nr + nc));

  // Initialize interior values
  for ( i = 1; i <= nr; i++ ) {
      for (j = 1; j <= nc; j++ ) {
          u[i][j] = mean;
      }
  }

  //Arbitrary, just to make sure we enter loop
   diff = 10.*epsilon;

  diffInterval=1;

  clock_t startTime=clock();

  // Compute steady-state solution 
  // We are making a tradeoff between memory and speed here by creating a diffs
  // array. Especially for parallel, this should be reasonable.

  while ( diff >= epsilon) {
     for (i=1; i<=nr;i++) {
        for (j=1;j<=nc;j++) {
            w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])/4.0;
            diffs[i][j] = abs(w[i][j] - u[i][j]);
        }
     }

     if (iterations%diffInterval==0) {
        diff=0.;   // reset only when we test, so we find the current max abs diff
        for (i=1; i<=nr;i++) {
           for (j=1;j<=nc;j++) {
               if (diff<diffs[i][j]) {
                   diff=diffs[i][j];
               }
           }
        }
        if (diff <= epsilon)
            break;
     }

     for (i=1; i<=nr;i++) {
        for (j=1;j<=nc;j++) {
            u[i][j] = w[i][j];
        }
     }

     if (iterations >= MAX_ITER) {
        cout<<"Warning: maximum iterations exceeded\n";
        break;
     }

     iterations++;
  } 

  double totalTime=double(clock()-startTime)/CLOCKS_PER_SEC;
  cout << "Completed "<<iterations<<" iterations; time "<<totalTime<<endl;
  // Write solution to output file
  ofstream fp;
  fp.open(argv[2],ios::out);
  for (i = 1; i <= nr; i++) {
     for (j = 1; j <= nc; j++) {
        fp<<u[i][j]<<" ";
     }
     fp<<"\n";
  }
  fp.close();

  // All done!
  cout<<"wrote output file "<<argv[2]<<"\n";
  return 0;
}

void set_bcs(double **u, int nr, int nc) {
  /* Set boundary values.
   * This has an ice bath on the top edge.
   * Note: when plotting, 0,0 is the bottom.
   */

  double topBC=0;
  double bottomBC=100.;
  double edgeBC=100.;

  for (int i=0;i<=nc+1;++i){
        u[0][i]=topBC;
  }
  for (int i=0;i<=nc+1;++i){
        u[nr+1][i]=bottomBC;
  }

  for (int i=0;i<=nr+1;++i){
      u[i][0]=edgeBC;
      u[i][nc+1]=edgeBC;
  }
}


Fortran

program heatedplate
  implicit none

  integer, parameter:: maxiter=10000000
  integer           :: N, M, nr, nc, lb
  integer           :: numargs, i, j, diff_interval, iterations = 0
  double precision  :: eps, mean, diff = 1.0
  character(len=80) :: arg, filename
  double precision, allocatable, dimension(:,:) :: u, w
  real              :: time0, time1

  interface
    subroutine set_bcs(lb,nr,nc,u)
       implicit none
       integer, intent(in)  :: lb, nr, nc
       double precision, dimension(lb:,lb:), intent(inout) :: u
    end subroutine
  end interface


  ! check number of parameters and read in epsilon, filename, optionally size
  numargs=command_argument_count()
  if (numargs .lt. 2) then
     stop 'USAGE: epsilon output-file <N> <M>'
  else
     call get_command_argument(1,arg)
     read(arg,*) eps
     call get_command_argument(2,filename)

     if (numargs==2) then
        N=500
        M=500
     endif
     if (numargs==3) then
        call get_command_argument(3,arg)
        read(arg,*) N
        M=N
     endif
     if (numargs==4) then
        call get_command_argument(3,arg)
        read(arg,*) N
        call get_command_argument(4,arg)
        read(arg,*) M
     endif
  endif

  nr=N; nc=M

  ! Set boundary values and compute mean boundary value. 
  ! This has the ice bath on the top edge.

  ! Allocate arrays
  lb=0
  allocate(u(lb:nr+1,lb:nc+1),w(lb:nr+1,lb:nc+1))

  u=0.d0
  w=0.d0

  call set_bcs(lb,nr,nc,u)

  ! Initialize interior values to the boundary mean, more or less.
  ! Doesn't matter much, starting from all zeroes works too. May speed
  ! up convergence a bit.
  mean=sum(u(:,0))+sum(u(:,nc+1))+sum(u(nr+1,:))+sum(u(0,:))
  mean = mean / (2.0 * (nr+nc+4))

  u(1:nr,1:nc)=mean

  !Arbitrary, just to make sure we enter loop
  diff=10.*eps

  diff_interval=1

  call cpu_time(time0)

  write (*,'(a,es15.8,a,i6,a,i6)') 'Running until the difference is <= ', eps,&
                                   ' for size ',nr,'x',nc

  ! Compute steady-state solution
  do while ( diff >= eps )
     do j=1,nc
        do i=1,nr
           w(i,j) = (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))/4.0
        enddo
     enddo

     if (mod(iterations,diff_interval)==0) then
           diff=maxval(abs(w(1:nr,1:nc)-u(1:nr,1:nc)))
           if (diff <= eps) then
              exit
           endif
     endif

     ! Don't overwrite BCs
     u(1:nr,1:nc)=w(1:nr,1:nc)

     if (iterations>maxiter) then
           write(*,*) "Warning: maximum iterations exceeded"
          exit
     endif
     iterations = iterations + 1
  enddo

  call cpu_time(time1)

  write (*,*) 'completed; iterations = ', iterations,"in ",time1-time0," sec"

  ! Write solution to output file by row

  !Some compilers (Intel) like to add EOLs to line up columns. The Fortran
  ! standard permits this with * IO, but it complicates plotting.  Use some
  ! F08 features to fix this.
  open (unit=10,file=filename)
  do i=1,nr
     write (10,'(*(g0,1x))') u(i,1:nc)
  enddo

  ! All done!
  write (*,'(a,a)') "wrote output file ", filename

end program heatedplate

subroutine set_bcs(lb,nr,nc,u)
implicit none
integer, intent(in)                                 :: lb, nr, nc
double precision, dimension(lb:,lb:), intent(inout) :: u
double precision                                    :: topBC, bottomBC, edgeBC

  ! Set boundary values 
  ! This has the ice bath on the bottom edge.
  ! Note: when plotting, 0,0 is the bottom.

  topBC=100.d0
  bottomBC=0.d0
  edgeBC=100.d0

  !Set boundary conditions
  u(1:nr,0)=edgeBC
  u(1:nr,nc+1)=edgeBC
  u(nr+1,1:nc)=topBC
  u(0,1:nc)=bottomBC

end subroutine set_bcs

Python

import sys
import time
import numpy as np
# check number of parameters and read in parameters

#Better to use argparse but this is closer to compiled language code and simpler
#if one hasn't used argparse.
argv=sys.argv
if  len(argv) > 2:
   try:
      epsilon=float(argv[1])
      filename=argv[2]
   except ValueError:
      print("Illegal input")
      sys.exit(1)
   if len(argv) == 3:
       N=500
       M=500
   if len(argv) == 4:
       try:
          N=int(argv[3])
          M=N
       except ValueError:
           print("Cannot convert grid dimension to integer")
           sys.exit(2)
   if len(argv) == 5:
       try:
           N=int(argv[3])
           M=int(argv[4])
       except ValueError:
           print("Cannot convert grid dimension to integer")
           sys.exit(2)
else:
   print('USAGE: epsilon output-file <N> <M>')
   sys.exit(1)

print(f'Running until the difference is < {epsilon:} with size {N:}x{M:}')

#Set max iterations. 10 million should do it.
maxiter=10000000

#Set size of arrays
nr=N
nc=M

u=np.zeros((nr+2,nc+2))
w=np.zeros((nr+2,nc+2))

# Set boundary values and compute mean boundary value. 
#This has the ice bath on the top edge (row 0), not the bottom edge.
u[0,:]=0.
u[:,0]=100.
u[:,nc+1]=100.
u[nr+1,:]=100.

#Initialize interior values to the boundary mean.  This value really
#doesn't matter much, it just affects the convergence rate somewhat.
#Using all zeroes would be fine.
mean = sum(u[:,0]) + sum(u[:,nc+1]) + sum(u[0,:]) + sum(u[nr+1,:])
mean = mean / (2.0 * (N+M+4))

#Be sure not to overwrite the boundary values
u[1:nr+1,1:nc+1]=mean

# Compute steady-state solution
iterations=0
diff_interval=1

#Arbitrary, just to make sure we enter loop
diff=10*epsilon

start_time=time.time()

while ( diff >= epsilon ):
#   for i in range(1,nr+1):
#       for j in range(1,nc+1):
#           w[i,j] = 0.25*(u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])
#  It is approximately a factor of 100 faster to use numpy array operations.
   w[1:-1,1:-1]=0.25*(u[:-2,1:-1]+u[2:,1:-1]+u[1:-1,:-2]+u[1:-1,2:])

   if iterations%diff_interval==0:
       diff=np.max(np.abs(w[1:-1,1:-1]-u[1:-1,1:-1]))
       if diff<=epsilon:
           break
   if iterations>maxiter:
           print("Warning: maximum iterations exceeded")
           break
   u[1:nr+1,1:nc+1]=w[1:nr+1,1:nc+1].copy()
   iterations+=1

print(f'completed in {iterations:} iterations with time {time.time()-start_time:.2f}')

# Write solution to output file
fout = open (filename,'w')
for i in range(1,N+1):
    line=" ".join(map(str,list(u[i,1:M+1])))
    row=line+"\n"
    fout.write(row)
fout.close()

# All done!
print(f"wrote output file {filename:}")

Project 8

Use the halo exchange code we have already seen, plus at least one collective communication, to parallelize your heated-plate code. Do not declare a single large global array; break up the domain as we have learned (row-wise for C++ and Python, column-wise for Fortran). For output, have each rank write its own set of rows or columns, labeling the file name with its rank ID.
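
As a rough sketch of the decomposition bookkeeping, shown here in Python with mpi4py (the complete programs are in the example solutions below, and this assumes the number of processes divides the row count evenly):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

N, M = 500, 500                    # global grid (illustrative values)
nrl = N // nprocs                  # local number of rows on this rank
ncl = M
u = np.zeros((nrl + 2, ncl + 2))   # local block plus ghost rows and columns

# Neighbors for the halo exchange; MPI.PROC_NULL makes the exchange a no-op
# at the physical top and bottom boundaries.
up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < nprocs - 1 else MPI.PROC_NULL

print(f"rank {rank}: global rows {rank*nrl + 1} to {(rank + 1)*nrl}, "
      f"neighbors {up} and {down}")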

Example syntax for labeling the file name with the rank:

C++

  fname=argv[2]+to_string(rank);

Fortran

  write(fname,'(a,i4.4)') filename(1:len_trim(filename)),rank

Python

filename = filename + str( rank )

To plot the results you will have to stitch the files together appropriately. If you have no other preference you may use the contour_ranks.py script below. Add the -f command-line option for Fortran output. The script must be given an argument that is the “base” of the filenames, and all the output files must be in the same folder, named with that base followed by the rank number.

Python script to merge output files and contour

import argparse
import glob
import numpy as np
import matplotlib.pyplot as plt

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--fortran", help="Fortran ordering", action="store_true")
parser.add_argument("filename", help="base name of the output files")
args = parser.parse_args()
base = args.filename
endpos=len(base)

files=glob.glob(base+"*")
numbers=[]
suffixes=[]
for file in files:
    suffix=file[endpos:]
    numbers.append(int(suffix))
    suffixes.append(suffix)

pvec = np.argsort(numbers)

arrays=[]
for n in pvec:
    fname=files[n]
    arrays.append(np.loadtxt(fname,unpack=False))

if args.fortran:
    data=np.concatenate(arrays,axis=1)
else:
    data=np.concatenate(arrays,axis=0)

fig=plt.figure()
plt.contourf(data)
plt.colorbar()
plt.show()

Feel free to use the example solutions provided as your basis, but we recommend that you attempt to write your own serial version first.

Hints

  1. You will need to modify the stopping criterion for the update loop. There are several options; one is to use the maximum number of iterations as the loop criterion, then break out of the loop when the maximum difference over all the ranks is less than the specified tolerance (see the sketch after this list).
  2. Physical boundary conditions generally must be reapplied at each iteration to the “solution” matrix u. A subprogram to do this might be a good idea.
  3. You need to copy the entire “updated” solution w back into the “current” solution u at each iteration, including the ghost rows, which is why the BCs must be reapplied.
  4. Check that your program works for a single process.
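
A skeleton of the stopping criterion described in hint 1, in Python with mpi4py, might look like the following. The Jacobi update, halo exchange, and boundary reapplication are left as comments because they appear in full in the example solutions below; the array sizes are illustrative.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

epsilon = 1.0e-2
max_iterations = 10000000
diff_interval = 1

nrl, ncl = 100, 100                       # local sizes (illustrative)
u = np.zeros((nrl + 2, ncl + 2))
w = np.zeros_like(u)

my_diff = np.zeros(1)
global_diff = np.zeros(1)

iterations = 0
while iterations <= max_iterations:
    # ... Jacobi update of w from u, then halo exchange of w ...
    if iterations % diff_interval == 0:
        my_diff[0] = np.max(np.abs(w[1:-1, 1:-1] - u[1:-1, 1:-1]))
        # Maximum difference over all the ranks
        comm.Allreduce(my_diff, global_diff, op=MPI.MAX)
        if global_diff[0] <= epsilon:
            break
    np.copyto(u, w)
    # ... reapply the physical boundary conditions to u (hint 2) ...
    iterations += 1
else:
    if rank == 0:
        print("Warning: maximum iterations exceeded")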

Example solutions

C++

#include <iostream>
#include <cmath>
#include <cstdio>
#include <string>
#include <sstream>
#include <fstream>
#include <iomanip>
#include <mpi.h>

#define MAX_ITER 10000000

void set_bcs(double **u, int nr, int nc, int rank, int nprocs, double &bc1, double &bc2, double &bc3, double &bc4);

using namespace std; 

int main (int argc, char *argv[]) {
  int i, j;
  double diff;            // Change in value 
  int N, M;
  int iterations = 0;
  int diffInterval;
  double epsilon;
  double bc1, bc2, bc3, bc4;

  // Added for MPI
  int nrl, ncl;
  int rank, nprocs;
  MPI_Status status;
  int root=0, tag=0;
  int up, down;
  double gdiff;

  //Initialize MPI
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);

/* check number of parameters and read in epsilon, filename, optionally size */
  if (argc < 3) {
    printf ("USAGE:  %s epsilon output-file <N> <M>\n", argv[0]);
    MPI_Finalize();
    return 1;
  }
  else {
     stringstream ssepsilon;
     ssepsilon<<argv[1];
     if (ssepsilon.fail()) {
         printf ("Error reading in epsilon value\n");
         MPI_Finalize();
         return 2;
     }
     ssepsilon>>epsilon;

     if (argc==3) {
        N=500;
	M=500;
     }
     if (argc==4) {
        stringstream ssnr;
	ssnr<<argv[3];
        if (ssnr.fail()) {
            printf ("Error converting row dimension \n");
            MPI_Finalize();
            return 2;
	}
        ssnr>>N;
	M=N;
     }
     if (argc==5) {
        stringstream ssnr;
	ssnr<<argv[3];
        if (ssnr.fail()) {
            printf ("Error converting row dimension \n");
            MPI_Finalize();
            return 2;
	}
        ssnr>>N;
	stringstream ssnc;
	ssnc<<argv[4];
        if (ssnc.fail()) {
            printf ("Error converting row dimension \n");
            MPI_Finalize();
            return 2;
	}
        ssnc>>M;
     }
  }

  //Weak scaling
  //nrl=N;

  //Strong scaling
  if (N%nprocs!=0) {
     MPI_Finalize();
     cout<<"Not an even division of the arrays.\n";
     return 3;
  }
  else {
     nrl=N/nprocs;
  }

  //Both
  ncl=M;

  //Find my neighbors
  if (rank==0) {
      up=MPI_PROC_NULL;
  }
  else {
      up=rank-1;
  }

  if (rank==nprocs-1) {
      down=MPI_PROC_NULL;
  }
  else {
      down=rank+1;
  }

  int nrows=nrl+2;
  int ncols=ncl+2;
  double **u=new double*[nrows];  // Old values
  double **w=new double*[nrows];  // New values
  double **diffs=new double*[nrows];  // Diffs

  double *uptr=new double[(nrows)*(ncols)];
  double *wptr=new double[(nrows)*(ncols)];
  double *dptr=new double[(nrows)*(ncols)];

  for (i=0;i<nrows;++i,uptr+=ncols)
     u[i] = uptr;
  for (i=0;i<nrows;++i,wptr+=ncols)
     w[i] = wptr;
  for (i=0;i<nrows;++i,dptr+=ncols)
     diffs[i] = dptr;

  set_bcs(u, nrl, ncl, rank, nprocs, bc1, bc2, bc3, bc4);

  //Initialize to something like the mean boundary. Not important, can use zeros
  //but this may speed convergence a bit.
  double mean = 0.25*(bc1+bc2+bc3+bc4);

  // Initialize interior values
  for (i = 1; i <= nrl; i++ ) {
      for (j = 1; j <= ncl; j++ ) {
          u[i][j] = mean;
      }
  }

  diffInterval=1;

  double time0=MPI_Wtime();

  // Compute steady-state solution 
  // The diffs array is mostly to make it simpler to extend the testing interval

  if (rank==0) {
      cout<<"Running until the difference is <="<<epsilon<<" with size "<<N<<"x"<<M<<"\n";
  }

  while ( iterations<=MAX_ITER ) {
     // reset diff each time through to get max abs diff
     // max_element doesn't work great for twod arrays and is often slow
     diff=.8*epsilon;  
     for (i=1; i<=nrl;i++) {
        for (j=1;j<=ncl;j++) {
            w[i][j] = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1])/4.0;
     	    diffs[i][j] = abs(w[i][j] - u[i][j]);
         }
     }

      //Exchange halo values (one ghost row each side)
      MPI_Sendrecv(&w[1][1],ncl, MPI_DOUBLE,up,tag,&w[nrl+1][1],
                            ncl, MPI_DOUBLE,down,tag,MPI_COMM_WORLD,&status);

      MPI_Sendrecv(&w[nrl][1],ncl,MPI_DOUBLE,down,tag,&w[0][1],
                              ncl,MPI_DOUBLE,up,tag,MPI_COMM_WORLD,&status);


     if (iterations%diffInterval==0) {
        for (i=1; i<=nrl;i++) {
           for (j=1;j<=ncl;j++) {
	       if (diff<diffs[i][j]) {
		   diff=diffs[i][j];
	       }
	   }
	 }
         //Find max of diff in all the processors.
         MPI_Allreduce(&diff,&gdiff,1,MPI_DOUBLE,MPI_MAX,MPI_COMM_WORLD);
         if (gdiff <= epsilon) 
             break;
     }

     for (i=0; i<=nrl+1;i++) {
        for (j=0;j<=ncl+1;j++) {
            u[i][j] = w[i][j];
        }
     }

     set_bcs(u, nrl, ncl, rank, nprocs, bc1, bc2, bc3, bc4);

     iterations++;
  } //end of computation 

  if (iterations>MAX_ITER) {
     if (rank==0) {
        cout<<"Warning: maximum iterations exceeded\n";
     }
  }

  double totalTime=(MPI_Wtime()-time0);
  if (rank==0) {
      cout << "Completed "<<iterations<<" iterations; time "<<totalTime<<endl;
  }
  // Write solution to output file
  ofstream fp;
  string fname=argv[2]+to_string(rank);
  fp.open(fname,ios::out);
  for (i = 1; i <= nrl; i++) {
     for (j = 1; j <= ncl; j++) {
        fp<<u[i][j]<<" ";
     }
     fp<<"\n";
  }
  fp.close();

  // All done!
  MPI_Finalize();

  return 0;
}

void set_bcs(double **u, int nr, int nc, int rank, int nprocs, double &bc1, double &bc2, double &bc3, double &bc4) {

  /* Set boundary values.
   * This has an ice bath on the top edge.
   * Note: when plotting, 0,0 is the bottom.
   */

  double topBC=0;
  double bottomBC=100.;
  double edgeBC=100.;

  bc1=topBC;
  bc2=bottomBC;
  bc3=edgeBC;
  bc4=edgeBC;

  if (rank==0) {
     for (int i=0;i<=nc+1;++i){
        u[0][i]=topBC;
     }
  }
  if (rank==nprocs-1) {
      for (int i=0;i<=nc+1;++i){
        u[nr+1][i]=bottomBC;
      }
  }

  for (int i=0;i<=nr+1;++i){
      u[i][0]=edgeBC;
      u[i][nc+1]=edgeBC;
  }

}


Fortran

program heatedplate
  use mpi
  implicit none

  integer, parameter:: maxiter=10000000
  integer           :: N, M, lb
  integer           :: numargs, i, j, diff_interval, iterations = 0
  double precision  :: eps, mean, diff = 1.0
  character(len=80) :: arg, filename
  double precision, allocatable, dimension(:,:) :: u, w
  double precision  :: bc1, bc2, bc3, bc4
  real              :: time0, time1
  logical           :: strong_scaling

  !Add for MPI
  integer            :: nrl, ncl
  integer            :: rank, nprocs
  integer            :: errcode, ierr
  integer, dimension(MPI_STATUS_SIZE) :: mpi_stat
  integer, parameter :: root=0, tag=0
  integer            :: left, right
  double precision   :: gdiff
  character(len=24)  :: fname

  interface
    subroutine set_bcs(lb,nr,nc,rank,nprocs,bc1,bc2,bc3,bc4,u)
       implicit none
       integer, intent(in)  :: lb, nr, nc, rank, nprocs
       double precision, intent(out) :: bc1,bc2,bc3,bc4
       double precision, dimension(lb:,lb:), intent(inout) :: u
    end subroutine
  end interface

  !Initialize MPI, get the local number of columns
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)

  ! check number of parameters and read in epsilon, filename, optionally size
  ! all ranks do this
  numargs=command_argument_count()
  if (numargs .lt. 2) then
     stop 'USAGE: epsilon output-file <N> <M>'
  else
     call get_command_argument(1,arg)
     read(arg,*) eps
     call get_command_argument(2,filename)

     if (numargs==2) then
        N=500
        M=500
     endif
     if (numargs==3) then
        call get_command_argument(3,arg)
        read(arg,*) N
        M=N
     endif
     if (numargs==4) then
        call get_command_argument(3,arg)
        read(arg,*) N
        call get_command_argument(4,arg)
        read(arg,*) M
     endif
  endif

  strong_scaling=.true.

  if (strong_scaling) then
     ! strong scaling
     if (mod(M,nprocs).ne.0) then
         call MPI_FINALIZE(ierr)
         stop 'Not an even division of the array'
      else
          ncl = M/nprocs
      endif
  else
  ! weak scaling
      ncl = M
      M = M*nprocs  !for printing size
  endif

  ! Both
  nrl=N

  ! Find my neighbors
  if ( rank .eq. root ) then
      left = MPI_PROC_NULL
  else
      left = rank - 1
  endif

  if ( rank .eq. nprocs-1 ) then
      right = MPI_PROC_NULL
  else
      right = rank + 1
  endif

  ! Set boundary values and compute mean boundary value. 
  ! This has the ice bath on the top edge.

  ! Allocate arrays)
  lb=0
  allocate(u(lb:nrl+1,lb:ncl+1),w(lb:nrl+1,lb:ncl+1))

  u=0.d0
  w=0.d0

  call set_bcs(lb,nrl,ncl,rank,nprocs,bc1,bc2,bc3,bc4,u)

  ! Initialize interior values to the boundary mean, more or less.
  ! Doesn't matter much, starting from all zeroes works too. May speed
  ! up convergence a bit.
  mean = 0.25d0*(bc1+bc2+bc3+bc4)
  u(1:nrl,1:ncl)=mean

  diff_interval=1

  ! Walltime from process 0 is good enough for me
  if (rank .eq. 0) then
    time0=MPI_WTIME()
  endif

  if (rank==0) then
      write (*,'(a,es15.8,a,i6,a,i6)') 'Running until the difference is <= ',  &
                                              eps,' for global size ',N,'x',M
  endif

  ! Compute steady-state solution
  do while ( iterations<=maxiter )
     do j=1,ncl
        do i=1,nrl
           w(i,j) = 0.25*(u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        enddo
     enddo

     ! Exchange halo values
     call MPI_SENDRECV(w(1:nrl,1)    ,nrl,MPI_DOUBLE_PRECISION,left,tag,       &
                       w(1:nrl,ncl+1),nrl,MPI_DOUBLE_PRECISION,right,tag,       &
                                     MPI_COMM_WORLD,mpi_stat,ierr)

     call MPI_SENDRECV(w(1:nrl,ncl)  ,nrl,MPI_DOUBLE_PRECISION,right,tag,      &
                       w(1:nrl,0)    ,nrl,MPI_DOUBLE_PRECISION,left,tag,         &
                                     MPI_COMM_WORLD,mpi_stat,ierr)

     ! Setting diff_interval to -1 disables the convergence test
     if (diff_interval/=-1 .and. mod(iterations,diff_interval)==0) then
           diff=maxval(abs(w(1:nrl,1:ncl)-u(1:nrl,1:ncl)))
           call MPI_ALLREDUCE(diff,gdiff,1,MPI_DOUBLE_PRECISION,MPI_MAX,       &
                                                   MPI_COMM_WORLD, ierr)

           if (gdiff <= eps) then
              exit
           endif
     endif

     u = w

     ! Reset physical boundaries (the full copy from w overwrites them)
     call set_bcs(lb,nrl,ncl,rank,nprocs,bc1,bc2,bc3,bc4,u)

     iterations = iterations + 1

     !If this happens we will exit at next conditional test.
     if (iterations>maxiter) then
           if (rank==0) then
               write(*,*) "Warning: maximum iterations exceeded"
           endif
     endif
  enddo

  if (rank==root) then
    time1 = MPI_WTIME()
    write (*,*) 'completed; iterations = ', iterations,"in ",time1-time0," sec"
  endif

  ! Write solution to output file by row. Must use correct stitching to get
  ! overall result, such as a numpy concatenate on axis=1.
  ! It would also work to output by columns, in which case stitching is more
  ! straightforward.

  !Some compilers (Intel) like to add EOLs to line up columns in output. The 
  ! Fortran standard permits this with * IO, but it complicates plotting.  
  ! Use some F08 features to fix this.

  write(fname,'(a,i4.4)') filename(1:len_trim(filename)),rank

  open (unit=10,file=fname)
  do i=1,nrl
     write (10,'(*(g0,1x))') u(i,1:ncl)
  enddo

  ! All done!
  call MPI_Finalize(ierr)

end program heatedplate

subroutine set_bcs(lb,nr,nc,rank,nprocs,bc1,bc2,bc3,bc4,u)
implicit none
integer, intent(in)                                 :: lb, nr, nc, rank, nprocs
double precision, intent(out)                       :: bc1,bc2,bc3,bc4
double precision, dimension(lb:,lb:), intent(inout) :: u
double precision                                    :: topBC, bottomBC, edgeBC

  ! Set boundary values 
  ! This has the ice bath on the bottom edge.
  ! Note: when plotting, 0,0 is the bottom.

  topBC=100.d0
  bottomBC=0.d0
  edgeBC=100.d0
  bc1=topBC
  bc2=bottomBC
  bc3=edgeBC
  bc4=edgeBC

  !Set boundary conditions
  if (rank==0) then
     u(:,0)=edgeBC
  endif
  if (rank==nprocs-1) then
     u(:,nc+1)=edgeBC
  endif
  u(nr+1,:)=topBC
  u(0,:)=bottomBC


end subroutine set_bcs

Python

import sys
import time
import numpy as np
from mpi4py import MPI

def set_bcs(u,nrl,ncl,rank):
    # Set physical boundary values and return the four boundary values.
    # NumPy arrays are mutable, so the changes to u are visible to the
    # caller; this acts like a Fortran subroutine.

    topBC=0.
    bottomBC=100.
    edgeBC = 100.

    bc1=topBC
    bc2=bottomBC
    bc3=edgeBC
    bc4=edgeBC

    if rank==0:
        u[0,:] = topBC
    if rank==nprocs-1:
        u[nrl+1,:]=bottomBC

    u[:,0]=edgeBC
    u[:,ncl+1]=edgeBC

    return bc1,bc2,bc3,bc4

# check number of parameters and read in parameters

#Better to use argparse but this is closer to compiled language code and simpler
#if one hasn't used argparse.
argv=sys.argv
if  len(argv) > 2:
   try:
      epsilon=float(argv[1])
      filename=argv[2]
   except ValueError:
      print("Illegal input")
      sys.exit(1)
   if len(argv) == 3:
       N=500
       M=500
   if len(argv) == 4:
       try:
          N=int(argv[3])
          M=N
       except ValueError:
           print("Cannot convert grid dimension to integer")
           sys.exit(2)
   if len(argv) == 5:
       try:
           N=int(argv[3])
           M=int(argv[4])
       except ValueError:
           print("Cannot convert grid dimension to integer")
           sys.exit(2)
else:
   print('USAGE: epsilon output-file <N> <M>')
   sys.exit(1)

# Initializing MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()
root = 0
tag = 0

#Set max iterations. 10 million should do it.
max_iterations=10000000

#Set sizes of the local arrays

#Strong scaling
if N%nprocs!=0:
    print("Number of rows must be evenly divisible by number of processes")
    sys.exit(3)
else:
    nrl = N//nprocs

#Weak scaling
#nrl=N

#Both
ncl=M

u=np.zeros((nrl+2,ncl+2))
w=np.zeros((nrl+2,ncl+2))

bc1,bc2,bc3,bc4=set_bcs(u,nrl,ncl,rank)

#Initialize interior values to the boundary mean.  This value really
#doesn't matter much, it just affects the convergence rate somewhat.
#Using all zeroes would be fine.
mean = 0.25*(bc1+bc2+bc3+bc4)

#Be sure not to overwrite the boundary values
u[1:nrl+1,1:ncl+1]=mean

#Set up the up and down rank for each process
if rank == 0 :
    up = MPI.PROC_NULL
else :
    up = rank - 1

if rank == nprocs - 1 :
    down = MPI.PROC_NULL
else :
    down = rank + 1

# Compute steady-state solution
iterations=0
diff_interval=1

start_time=MPI.Wtime()
my_diff = np.zeros(1)
global_diff=np.zeros(1)

if rank==root:
    print(f'Running until the difference is < {epsilon:}, global size {N:}x{M:}')

while iterations<=max_iterations:

    w[1:-1,1:-1]=0.25*(u[:-2,1:-1]+u[2:,1:-1]+u[1:-1,:-2]+u[1:-1,2:])

    # Send up and receive down
    comm.Sendrecv([w[1,1:ncl+1], MPI.DOUBLE], up, tag, [w[nrl+1,1:ncl+1], MPI.DOUBLE], down, tag)
    # Send down and receive up
    comm.Sendrecv([w[nrl, 1:ncl+1], MPI.DOUBLE], down, tag, [w[0,1:ncl+1], MPI.DOUBLE], up, tag)

    if iterations%diff_interval==0:
        my_diff[0]=np.max(np.abs(w[1:-1,1:-1]-u[1:-1,1:-1]))
        comm.Allreduce(my_diff,global_diff,op=MPI.MAX)

        if global_diff[0]<=epsilon:
            break

    np.copyto(u,w)
    #Reapply physical boundary conditions
    set_bcs(u,nrl,ncl,rank)

    iterations+=1
#The "else" clause of a Python while loop runs only if the loop ends without
#a break. Our stopping criterion is now exceeding max_iterations, so we do
#not need to break here, but we do need to warn that it happened.
else:
     if rank==0:
         print("Warning: maximum iterations exceeded")

total_time=MPI.Wtime()-start_time

if rank==0:
    print(f'completed in {iterations:} iterations with time {total_time:.2f}')

# Write solution to output file
filename = filename + str(rank)
fout = open (filename,'w')
for i in range(1,nrl+1):
    line=" ".join(map(str,list(u[i,1:ncl+1])))
    row=line+"\n"
    fout.write(row)
fout.close()

# All done!
#print(f"wrote output file {filename:}")

Project 9

Scaling Studies

We have discussed weak and strong scaling. If you have not already done so, add timing to your heated plate solution in the language of your choice. Use MPI_Wtime() since it will generate uniform timings regardless of your programming language. It is sufficient to instrument the while loop since nearly all the time will be spent there.

  1. The example solutions show how to do strong or weak scaling. You can do only one at a time. Weak scaling will increase the size of the global domain in a way that will make it rectangular rather than square, but that doesn’t matter for our study.

  2. Start with strong scaling, dividing up the same amount of work over an increasing number of processes. Choose a value for the number of rows or columns that will be evenly divisible by 8. Run your code for 1, 2, 4, and 8 processes. Make a plot of time versus number of processes. Compute the parallel efficiency for each run (a sketch of the efficiency calculation follows at the end of this project).

  3. Repeat for weak scaling.

Make sure you are running on otherwise idle cores for this test, or your timings will not reflect the time actually spent in your processes. If you have access to a high-performance computing cluster with a resource manager (queueing system), submit each test case as a separate job. Are the results what you expected?
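
As a reminder of the bookkeeping: for strong scaling the speedup on p processes is T1/Tp and the parallel efficiency is T1/(p*Tp); for weak scaling the efficiency is simply T1/Tp, since the work per process is constant. A small Python sketch (the timings below are placeholders; substitute your measured MPI_Wtime values):

# Placeholder timings in seconds; replace these with your own measurements.
strong_times = {1: 100.0, 2: 52.0, 4: 27.0, 8: 15.0}
weak_times = {1: 100.0, 2: 104.0, 4: 110.0, 8: 121.0}

for p in (1, 2, 4, 8):
    speedup = strong_times[1] / strong_times[p]
    strong_eff = speedup / p                      # T1 / (p * Tp)
    weak_eff = weak_times[1] / weak_times[p]      # T1 / Tp
    print(f"{p:2d} processes: speedup {speedup:5.2f}, "
          f"strong efficiency {strong_eff:4.2f}, weak efficiency {weak_eff:4.2f}")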
