GPU acceleration

Certain tasks can be greatly accelerated if run on a graphics processing unit (GPU). A GPU can be regarded as a device that runs hundreds or thousands of threads. The memory per thread is usually fairly limited but has very high bandwidth. Data must be moved between the host computer’s memory and the GPU’s memory.

GPU programming is an advanced topic; here we will mention only a few basics, along with some resources for the interested reader.

In what follows, the host is the computer in which the GPU is installed, and the GPU itself is the device. The GPU has its own memory system and cannot access the RAM of the host; similarly the host cannot directly access GPU memory. Data must be copied back and forth between them.

It is generally advisable to set up a separate environment for each Python+CUDA package.
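For example, you might create and activate a dedicated environment before installing a GPU package (the environment name here is arbitrary):

conda create -n gpu-env python
conda activate gpu-env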

CuPy

CuPy is an implementation of many of the features of NumPy and SciPy that takes advantage of the GPU. It is one of the simplest introductions to GPU programming. It can be installed with conda through the conda-forge channel. If using the Anaconda Navigator GUI, add the conda-forge channel, then switch to it and install CuPy through the interface. To install from the command line, use

conda install -c conda-forge cupy

You can also use pip to install CuPy. Alternatively, use a Docker container.

You must set the CUDA_PATH environment variable so that CuPy can locate your CUDA installation.

export CUDA_PATH=/usr/local/cuda

Methods invoked through the CuPy module are carried out on the GPU, while the corresponding NumPy methods are processed by the CPU as usual. Work and data transfers on the device are issued to streams; the null stream is the default.
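As a brief illustration of moving data between host and device (a minimal sketch; the variable names are our own):

import numpy as np
import cupy as cp

a_host = np.arange(10, dtype=np.float64)   # ordinary NumPy array in host RAM
a_dev = cp.asarray(a_host)                 # copy host -> device
b_dev = cp.sqrt(a_dev)                     # computed on the GPU
cp.cuda.Stream.null.synchronize()          # wait for the null stream to finish
b_host = cp.asnumpy(b_dev)                 # copy device -> host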

CuPy provides several subpackages mirroring their NumPy and SciPy counterparts. The following example compares array creation, the vector norm, and the FFT on the CPU and on the GPU.

import numpy as np
import cupy as cp
import time

x_cpu = np.array([1, 2, 3])
x_gpu = cp.array([1, 2, 3])
  
l2_cpu = np.linalg.norm(x_cpu)
l2_gpu = cp.linalg.norm(x_gpu)
  
print("Using Numpy: ", l2_cpu)
print("\nUsing Cupy: ", l2_gpu)

print("Setting up arrays on host and device")
s = time.time()
x_cpu = np.ones((1000,1000,100))
e = time.time()
print("CPU:",e - s)
s = time.time()
x_gpu = cp.ones((1000,1000,100))
cp.cuda.Stream.null.synchronize()
e = time.time()
print("GPU:",e - s)

print()
N = 10000000
print("FFT on arrays on host and device")
### Numpy and CPU
s = time.time()
a = np.random.random(N).astype(complex)
b = np.fft.fft(a) 
e = time.time()
print("CPU:",e - s)
s = time.time()
a = cp.random.random(N).astype(complex)
b = cp.fft.fft(a) 
cp.cuda.Stream.null.synchronize()
e = time.time()
print("GPU:",e - s)

Exercise

Providing enough work to fill the GPU’s threads is critical to overcoming the overhead of copying data.

Reduce the length of the array in the example to 10000, then increase it until you find a size for which the GPU is faster.
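One possible way to structure the timing loop (a sketch, not part of the original exercise):

import time
import numpy as np
import cupy as cp

for N in (10_000, 100_000, 1_000_000, 10_000_000):
    s = time.time()
    np.fft.fft(np.random.random(N).astype(complex))
    cpu_time = time.time() - s

    s = time.time()
    cp.fft.fft(cp.random.random(N).astype(complex))
    cp.cuda.Stream.null.synchronize()   # wait for the GPU work to complete
    gpu_time = time.time() - s

    print(N, cpu_time, gpu_time)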

PyCUDA

CUDA is NVIDIA’s toolkit for programming its GPUs through extensions to the C/C++ languages. With appropriate bindings it can be called from other languages; PyCUDA is a package that provides those bindings for Python.

Like CuPy, it is available through conda-forge.

conda install -c conda-forge pycuda

On Linux the PATH variable must include the location of the nvcc compiler.

export PATH=/usr/local/cuda/bin:$PATH

Example

This script is copied directly from PyCUDA’s examples.

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print(dest-a*b)


Much as we saw when discussing compiled code, we must define our function in C style. This block of code to be executed on the device is called a kernel. PyCUDA compiles the kernel, uses its interface with NumPy to allocate memory on the device, copies the ndarrays to it, carries out the computation, and then copies the result from the device into the dest array.

Notice that the ndarrays are declared with dtype=float32. Most consumer-grade GPUs have limited double-precision (float or float64 in Python) hardware, so using doubles can be considerably slower than single precision on those devices.
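The drv.In and drv.Out wrappers handle the transfers for us. For illustration, here is a sketch of the same allocate/copy/compute/copy-back steps done explicitly (the kernel name double_them and the variable names are our own):

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void double_them(float *a)
{
  const int i = threadIdx.x;
  a[i] = 2.0f * a[i];
}
""")
double_them = mod.get_function("double_them")

a = numpy.random.randn(400).astype(numpy.float32)
a_gpu = drv.mem_alloc(a.nbytes)           # allocate device memory
drv.memcpy_htod(a_gpu, a)                 # copy host -> device
double_them(a_gpu, block=(400,1,1), grid=(1,1))
result = numpy.empty_like(a)
drv.memcpy_dtoh(result, a_gpu)            # copy device -> host
print(result - 2*a)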

Numba

When used for GPU acceleration, Numba relies on the conda package cudatoolkit.

conda install cudatoolkit

If you use pip instead, you must also install the NVIDIA CUDA SDK.

Numba Vectorization

Numba can “vectorize” a scalar function into a universal function (ufunc), compiling it so that it runs on the GPU. Vectorization is implemented through a decorator.

For best performance, the signature of the function arguments must be specified. If more than one set of types should be supported, the signatures must be passed as a list.

Example

From the Numba documentation:

import math
from numba import vectorize, cuda
import numpy as np

@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='cuda')
def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

N = 10000
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)

D = cu_discriminant(A, B, C)

print(D)  # print result


You may ignore the deprecation warning. The run may also emit a warning about underutilization:

Grid size (1) < 2 * SM count (40) will likely result in GPU under utilization due to low occupancy.

This is because efficient use of a GPU requires that the device threads be as filled as possible. If too many threads are idle, GPU code can be slower than the equivalent CPU code.
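If you are curious about the SM count referenced in the warning, Numba can query the device. A short sketch (attribute names follow Numba’s device API):

from numba import cuda

dev = cuda.get_current_device()
print("Streaming multiprocessors:", dev.MULTIPROCESSOR_COUNT)
print("Max threads per block:", dev.MAX_THREADS_PER_BLOCK)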

Another requirement for efficient GPU utilization is memory management. Calling a ufunc with NumPy arrays causes the data to be copied from host memory to device memory before the computation and copied back afterwards, and this traffic can easily dominate the run time. We can instead declare arrays that reside in GPU memory: cuda.to_device copies a host array to the device, and cuda.device_array allocates an output array on the device for the result. When we are done with our calculations, we explicitly copy the result back to host memory.

import math
from numba import vectorize, cuda
import numpy as np

@vectorize(['float32(float32, float32, float32)',
            'float64(float64, float64, float64)'],
           target='cuda')
def cu_discriminant(a, b, c):
    return math.sqrt(b ** 2 - 4 * a * c)

N = 10000
dtype = np.float32

# prepare the input
A = np.array(np.random.sample(N), dtype=dtype)
B = np.array(np.random.sample(N) + 10, dtype=dtype)
C = np.array(np.random.sample(N), dtype=dtype)

a_device = cuda.to_device(A)
b_device = cuda.to_device(B)
c_device = cuda.to_device(C)

# allocates the output array on the device but does not initialize it
d_device = cuda.device_array(shape=A.shape, dtype=np.float32)

# the "out" keyword argument takes the place of the regular return value
cu_discriminant(a_device, b_device, c_device, out=d_device)

#copy result back and print
D = d_device.copy_to_host()
print(D)



If you time these two scripts, you may see a small speedup even for this relatively low-efficiency problem when copying is minimized. Of course, we should aim for a large amount of work on the device arrays before we return the output to the host, and we should strive to make the problem large enough to fill the GPU’s threads.
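For instance, continuing from the script above, intermediate results can stay on the device and be fed back into the ufunc, with a single copy at the end (an illustrative sketch only):

# another device array for the next result
e_device = cuda.device_array(shape=A.shape, dtype=np.float32)

# reuse d_device (still resident on the GPU) as an input without copying it back
cu_discriminant(d_device, b_device, c_device, out=e_device)

# one transfer to the host at the very end
E = e_device.copy_to_host()
print(E)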

Numba GPU Kernels

Numba provides a cuda.jit decorator that can be used like the jit decorator but that defines a CUDA kernel. Some accommodations for the CUDA programming model must be made: for the best performance, the threads must be organized into thread blocks, and the optimal block size may depend on the device. Similarly to PyCUDA, Numba’s CUDA support works with NumPy arrays.
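Before the full matrix-multiplication example below, a minimal element-wise kernel may make the launch configuration clearer (a sketch; the kernel and variable names are our own):

from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)          # global index of this thread
    if i < x.size:            # guard threads beyond the end of the array
        out[i] = x[i] + y[i]

n = 100000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

threads_per_block = 128
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks_per_grid, threads_per_block](x, y, out)   # NumPy arrays are copied implicitly
print(out[:5])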

Example

This example of a matrix-multiplication kernel is taken from the Numba CUDA documentation.

from numba import cuda, float32
import math
import numpy as np

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
# TPB should not be larger than 32 in this example
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    """
    Perform matrix multiplication of C = A * B using CUDA shared memory.
    Reference: https://stackoverflow.com/a/64198479/13697228 by @RobertCrovella
    """

    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

x_h = np.arange(16).reshape([4, 4])
y_h = np.ones([4, 4])
z_h = np.zeros([4, 4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h @ y_h)


RAPIDS

RAPIDS is a set of libraries released by NVIDIA for its GPU architectures from Pascal onward. It builds upon CuPy and introduces cuDF, a GPU DataFrame, and cuML, a package that largely replicates scikit-learn. See our RAPIDS workshop to learn more about using RAPIDS.
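As a small taste of cuDF (a minimal sketch; it assumes a working RAPIDS installation):

import cudf

# a small DataFrame that lives in GPU memory
df = cudf.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
print(df.groupby("key").mean())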
