GPU acceleration
Certain tasks can be greatly accelerated if run on a graphics processing unit (GPU). A GPU can be regarded as a device that runs hundreds or thousands of threads. The memory available per thread is usually fairly limited but has very high bandwidth. Data must be moved between the host computer’s memory and the GPU’s memory.
GPU programming is an advanced topic; here we mention only a few basics and point to some resources for the interested reader.
In what follows, the host is the computer in which the GPU is installed, and the GPU itself is the device. The GPU has its own memory system and cannot access the RAM of the host; similarly the host cannot directly access GPU memory. Data must be copied back and forth between them.
In most cases it is advisable to set up a separate environment for each Python+CUDA package.
CuPy
CuPy is an implementation of many of the features of NumPy and SciPy that takes advantage of the GPU. It is one of the simplest introductions to GPU programming.
It can be installed with conda through the conda-forge channel. To install from the command line (a terminal on macOS or Linux, or a Miniforge or Anaconda prompt on Windows), use
conda install -c conda-forge cupy
You can also use pip to install CuPy. Alternatively, use a Docker container.
You must set the CUDA_PATH environment variable for CuPy to be able to accelerate your code properly. If you are working on your own computer, CUDA is installed from NVIDIA packages. For example, on a local Linux workstation NVIDIA installs into /usr/local/cuda, so you should set this as your CUDA_PATH:
export CUDA_PATH=/usr/local/cuda
Refer to NVIDIA’s instructions for other operating systems.
On a system such as UVA’s HPC environment, the CUDA module will set the CUDA_PATH environment variable.
module load cuda
Methods invoked through the CuPy module are carried out on the GPU, while the corresponding NumPy methods are processed by the CPU as usual. Data transfer between host and device happens through streams; the null stream is the default.
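As a minimal sketch of moving data explicitly between host and device (cp.asarray and cp.asnumpy are CuPy’s standard conversion functions; the array contents here are arbitrary):
import numpy as np
import cupy as cp
x_host = np.arange(10.0)      # lives in host (CPU) memory
x_dev = cp.asarray(x_host)    # copy host -> device
y_dev = cp.sqrt(x_dev)        # computed on the GPU
y_host = cp.asnumpy(y_dev)    # copy device -> host as a NumPy array
print(y_host)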
CuPy provides several subpackages. The example below compares NumPy on the CPU with CuPy on the GPU for a vector norm, array creation, and an FFT.
import numpy as np
import cupy as cp
import time
x_cpu = np.array([1, 2, 3])
x_gpu = cp.array([1, 2, 3])
l2_cpu = np.linalg.norm(x_cpu)
l2_gpu = cp.linalg.norm(x_gpu)
print("Norm output using Numpy: ", l2_cpu)
print("Norm output Using Cupy: ", l2_gpu)
print()
print("Setting up arrays on host and device")
s = time.time()
x_cpu = np.ones((1000,1000,100))
e = time.time()
print("CPU:",e - s)
s = time.time()
x_gpu = cp.ones((1000,1000,100))
cp.cuda.Stream.null.synchronize()
e = time.time()
print("GPU:",e - s)
print()
N=10000000
print("FFT on arrays on host and device")
### NumPy and CPU
s = time.time()
a = np.random.random(N).astype(complex)
b = np.fft.fft(a)
e = time.time()
print("CPU:",e - s)
### CuPy and GPU
s = time.time()
a = cp.random.random(N).astype(complex)
b = cp.fft.fft(a)
cp.cuda.Stream.null.synchronize()
e = time.time()
print("GPU:",e - s)
Exercise
Providing enough work to fill the GPU’s threads is critical to overcoming the overhead of copying data.
Reduce the length of the array in the example to 10000, then increase it until you find a size for which the GPU is faster.
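One way to carry out this exercise is to loop over several array lengths and time each version; the sketch below assumes the same environment as the example above, and the particular sizes are arbitrary.
import time
import numpy as np
import cupy as cp
for N in (10_000, 100_000, 1_000_000, 10_000_000):
    # NumPy on the CPU
    a = np.random.random(N).astype(complex)
    s = time.time()
    b = np.fft.fft(a)
    cpu_time = time.time() - s
    # CuPy on the GPU; synchronize so the timing includes the kernel
    a_gpu = cp.random.random(N).astype(complex)
    s = time.time()
    b_gpu = cp.fft.fft(a_gpu)
    cp.cuda.Stream.null.synchronize()
    gpu_time = time.time() - s
    print(N, cpu_time, gpu_time)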
PyCUDA
CUDA is NVIDIA’s toolkit for programming its GPUs through the C programming language. With appropriate bindings it can be called from other languages. PyCUDA is a package that provides CUDA bindings for Python.
Like CuPy, it is available through conda-forge.
conda install -c conda-forge pycuda
On Linux the PATH environment variable must include the location of the nvcc compiler. If you have your own Linux workstation you must first locate nvcc; it should be in the bin subdirectory of the folder indicated by the CUDA_PATH variable.
ls $CUDA_PATH/bin
Then add this location to your PATH, e.g.
export PATH=$CUDA_PATH/bin:$PATH
Example
This script is copied directly from PyCUDA’s examples.
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))
print(dest-a*b)
Much as we saw when discussing using compiled code, we must define our function in C style. This block of code to be executed on the device is called a kernel. PyCUDA compiles the kernel, uses its interface with NumPy to allocate memory on the device, copies the ndarrays, carries out the computation, and then copies the result from the device into the dest array.
Notice that the ndarrays are declared with dtype=float32. Many GPUs, particularly consumer-grade cards, have little or no double-precision (float or float64 in Python) hardware, so using doubles may be much slower than single precision on the device.
Numba
When used for GPU acceleration, Numba relies on the conda package cudatoolkit.
conda install cudatoolkit
If you use pip instead, you must also install the NVIDIA CUDA SDK.
Numba can be used with PyCUDA, so it may be advisable to add it to the PyCUDA environment, which should already contain cudatoolkit. This example is from the PyCUDA tutorial.
# Copyright 2008-2021 Andreas Kloeckner
# Copyright 2021 NVIDIA Corporation
from numba import cuda
import pycuda.driver as pycuda
# We use autoprimaryctx instead of autoinit because Numba can only operate on a
# primary context
import pycuda.autoprimaryctx # noqa
import pycuda.gpuarray as gpuarray
import numpy
# Create a PyCUDA gpuarray
a_gpu = gpuarray.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
print("original array:")
print(a_gpu)
# A standard Numba kernel that doubles its input array
@cuda.jit
def double(x):
    i, j = cuda.grid(2)
    if i < x.shape[0] and j < x.shape[1]:
        x[i, j] *= 2
# Call the Numba kernel on the PyCUDA gpuarray, using the CUDA Array Interface
# transparently
double[(4, 4), (1, 1)](a_gpu)
print("doubled with numba:")
print(a_gpu)
Numba Vectorization
Numba can “vectorize” a scalar function, compiling it into a universal function (ufunc) that runs on the GPU. Vectorization is implemented through a decorator.
For best performance, the signatures of the function arguments should be specified. If more than one type is to be supported, all of the signatures must be passed as a list.
Example
From the Numba documentation:
import math
from numba import vectorize, cuda
import numpy as np
@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='cuda')
def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)
N = 10000
dtype = np.float32
# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)
D = cu_discriminant(A, B, C)
print(D) # print result
The run may emit a warning about under-utilization:
Grid size (1) < 2 * SM count (40) will likely result in GPU under utilization due to low occupancy.
This is because efficient use of a GPU requires that as many of the device’s threads as possible be kept busy. If too many threads are idle, GPU code can be slower than the equivalent CPU code.
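As an illustrative sketch, you can query the device from Numba to see how many streaming multiprocessors (SMs) it has and judge whether your problem provides enough work; the attribute names below follow Numba’s device-attribute interface and should be verified on your own installation.
from numba import cuda
dev = cuda.get_current_device()
# Number of streaming multiprocessors (SMs) on the device
print("SM count:", dev.MULTIPROCESSOR_COUNT)
# Hardware limit on the number of threads in one block
print("Max threads per block:", dev.MAX_THREADS_PER_BLOCK)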
Another requirement for efficient GPU utilization is memory management. Calling a ufunc with NumPy arrays causes the inputs to be copied from host memory to device memory, and the result to be copied back, on every call.
We can instead declare arrays that reside in the GPU memory. The cuda.to_device function copies a host array to the device, and cuda.device_array allocates an output array on the device for the result. When we are done with our calculations, we explicitly copy the result back to the host memory with copy_to_host.
import math
from numba import vectorize, cuda
import numpy as np
@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='cuda')
def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)
N = 10000
dtype = np.float32
# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)
a_device = cuda.to_device(A)
b_device = cuda.to_device(B)
c_device = cuda.to_device(C)
#sets up but doesn't initialize
d_device = cuda.device_array(shape=A.shape, dtype=np.float32)
#"hidden" keyword argument instead of regular return
cu_discriminant(a_device,b_device,c_device,out=d_device)
#copy result back and print
D = d_device.copy_to_host()
print(D)
If you time these two scripts, you may see a small speedup even for this relatively low-efficiency problem when copying is minimized. Of course, we should aim for a large amount of work on the device arrays before we return the output to the host, and we should strive to make the problem large enough to fill the GPU’s threads.
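A minimal way to make this comparison is to time each variant with time.perf_counter; the sketch below assumes the cu_discriminant ufunc and the arrays A, B, and C defined above, and calls cuda.synchronize() so the GPU work has finished before the clock stops.
import time
from numba import cuda
# Version 1: NumPy arrays; inputs and output are copied on the call
s = time.perf_counter()
D = cu_discriminant(A, B, C)
cuda.synchronize()
print("NumPy arrays:", time.perf_counter() - s)
# Version 2: inputs already on the device; only the compute and the copy back are timed
a_d = cuda.to_device(A)
b_d = cuda.to_device(B)
c_d = cuda.to_device(C)
d_d = cuda.device_array(shape=A.shape, dtype=A.dtype)
s = time.perf_counter()
cu_discriminant(a_d, b_d, c_d, out=d_d)
cuda.synchronize()
D = d_d.copy_to_host()
print("Device arrays:", time.perf_counter() - s)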
Numba GPU Kernels
Numba has a cuda.jit decorator that can be used like its jit equivalent, but which defines a CUDA kernel. Some accommodations for the CUDA programming model must be made. The threads launched on the device are grouped into thread blocks; for the best performance the block size must be chosen carefully, and the optimal value may depend on the device. Similarly to PyCUDA, Numba over CUDA makes use of NumPy arrays.
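Before the full example, here is a minimal sketch of a kernel launch to show the block/grid configuration; the array length and the block size of 128 are arbitrary choices for illustration.
import numpy as np
from numba import cuda
@cuda.jit
def add_one(x):
    i = cuda.grid(1)        # global thread index in a 1-D grid
    if i < x.size:
        x[i] += 1.0
x = np.zeros(1000, dtype=np.float32)
threads_per_block = 128
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
add_one[blocks_per_grid, threads_per_block](x)   # launch: kernel[blocks, threads](args)
print(x[:5])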
Example
This example of a matrix-multiplication kernel is taken from the Numba CUDA documentation.
from numba import cuda, float32
import math
import numpy as np
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
# TPB should not be larger than 32 in this example
TPB = 16
@cuda.jit
def fast_matmul(A, B, C):
    """
    Perform matrix multiplication of C = A * B using CUDA shared memory.
    Reference: https://stackoverflow.com/a/64198479/13697228 by @RobertCrovella
    """
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid
    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]
        # Wait until all threads finish preloading
        cuda.syncthreads()
        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]
        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h @ y_h)
RAPIDS
RAPIDS is a set of libraries released by NVIDIA for its GPU architectures (Pascal or later). It builds upon CuPy and introduces a GPU DataFrame library, cuDF, and a package, cuML, that mostly replicates scikit-learn.
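As a small sketch of the style of these packages (assuming a working RAPIDS installation; the column names and data below are arbitrary), cuDF largely mirrors the pandas API and cuML follows scikit-learn’s fit/predict pattern.
import cudf
from cuml.linear_model import LinearRegression
# A cuDF DataFrame lives in GPU memory and mirrors much of the pandas API
df = cudf.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})
print(df["y"].mean())
# cuML estimators follow the scikit-learn fit/predict pattern
model = LinearRegression()
model.fit(df[["x"]], df["y"])
print(model.predict(df[["x"]]))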
See our RAPIDS workshop to learn more about using RAPIDS.