CUDA kernel launch
mat_user()
__constant__ MatParams cmat;

void kernelMaterial(UserMatHost host, UserMatDevice device, cudaStream_t stream)
{
    // Copy the material parameters from host memory to the __constant__ symbol cmat
    cudaMemcpyToSymbolAsync(cmat, host.p_material, sizeof(double) * 200, 0,
                            cudaMemcpyHostToDevice, stream);

    // Launch enough blocks of block_size threads to cover all tasks
    // (num_tasks assumed to live on the host-side struct)
    const unsigned int block_size = 128;
    const unsigned int num_blocks = calcNumBlocks(host.num_tasks, block_size);
    kernel_material<<<num_blocks, block_size, 0, stream>>>(device);

    // Check for errors and synchronize the stream
    kernel_error_check(stream, __FILE__, __LINE__);
}
Copying the material parameters array to constant memory
To use the material parameters in the kernel, we first copy the data from the host to the device, more specifically into the __constant__ memory space. This is done with the cudaMemcpyToSymbolAsync function (see the CUDA Runtime API documentation for details).
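As a minimal, self-contained illustration (the c_params symbol and uploadParams helper below are hypothetical, not part of the code above), note that the symbol itself is passed as the first argument, not a pointer to it:

__constant__ double c_params[200];

// Hypothetical helper: upload 200 doubles into the constant memory symbol.
void uploadParams(const double* h_params, cudaStream_t stream)
{
    // Arguments: symbol, source, byte count, offset, direction, stream.
    cudaMemcpyToSymbolAsync(c_params, h_params, sizeof(double) * 200, 0,
                            cudaMemcpyHostToDevice, stream);
}

Keep in mind that an asynchronous copy only overlaps with host execution when the source buffer is page-locked (pinned); with pageable memory the runtime may perform the copy synchronously.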
Launching the kernel
Number of blocks and threads per block
Note: When launching a CUDA kernel, the total number of threads launched is the product of the block size and the number of blocks. Because execution is configured in whole blocks rather than individual threads, more threads are usually launched than there are data elements to process, so the kernel must check that a thread's index is within range before doing any work. For example, 1,000 tasks with 128-thread blocks require 8 blocks, i.e. 1,024 threads, of which 24 do nothing.
const unsigned int block_size = 128;
const unsigned int num_blocks = calcNumBlocks(host.num_tasks, block_size);
This code sets up the launch configuration for the GPU kernel that performs the parallel computation. In CUDA, threads are launched in blocks, each of a specified size. The block size should be a multiple of 32 (the warp size), with a maximum of 1024 threads per block. Here, we use 128 threads per block and rely on the calcNumBlocks macro to calculate the number of blocks required from the block size and the number of tasks to be processed; a plausible implementation is sketched below.
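calcNumBlocks is not defined in this section; a typical implementation is a ceiling division, sketched here under that assumption:

// Plausible sketch of calcNumBlocks: round up so that
// num_blocks * block_size >= num_tasks.
#define calcNumBlocks(num_tasks, block_size) \
    (((num_tasks) + (block_size) - 1) / (block_size))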
Kernel launch
Note: CUDA kernels are launched asynchronously, meaning that the host does not wait for the kernel to finish before continuing. To ensure that the kernel has finished before the application proceeds, synchronization is required. This can be done by calling the cudaStreamSynchronize function (see the CUDA Runtime API documentation for details).
kernel_material<<<num_blocks, block_size, 0, stream>>>(device);
kernel_material<<<num_blocks, block_size, 0, stream>>>(device); is the kernel launch statement. The syntax <<<...>>> is called the execution configuration, and it specifies how the kernel should be executed on the GPU. Any call to a __global__ function must specify the execution configuration.
To specify the execution configuration, you insert an expression of the form <<< gridDim, blockDim, sharedMem, stream >>> immediately after the function name and before the opening parenthesis of the argument list, where:

- gridDim: the number of blocks in the grid (calculated by the calcNumBlocks macro);
- blockDim: the number of threads in each block (128 in this example);
- sharedMem: the number of bytes of shared memory that are dynamically allocated per block;
- stream: a CUDA stream object that manages the execution of the kernel. stream is an optional argument which defaults to 0.
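On the device side, blockIdx, blockDim and threadIdx combine into a global thread index, and, as noted above, the kernel guards against indices beyond the number of tasks. The following is only a sketch of what kernel_material might look like; the num_tasks field and the process_task routine are assumptions, not part of the actual code:

__global__ void kernel_material(UserMatDevice device)
{
    // Global thread index across the entire grid.
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // The last block may contain threads with no task assigned.
    if (tid >= device.num_tasks)
        return;

    // Each thread handles one task, reading material parameters
    // from the __constant__ symbol cmat.
    process_task(device, tid);  // hypothetical per-task routine
}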
Error checking and synchronization
kernel_error_check(stream, __FILE__, __LINE__);
Finally, we perform error checking using the kernel_error_check wrapper function. It checks for CUDA error codes and prints an error message with the file and line number if one occurs. This function also invokes stream synchronization, i.e. it waits until the kernel has finished before continuing.
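The wrapper itself is not shown in this section; a sketch of what such a function might do, combining cudaGetLastError with cudaStreamSynchronize, is given below (the exact behavior of the real kernel_error_check may differ):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch of a kernel_error_check-style wrapper.
void kernel_error_check(cudaStream_t stream, const char* file, int line)
{
    // Catch launch-time errors (e.g. an invalid execution configuration).
    cudaError_t err = cudaGetLastError();

    // Wait for the kernel to finish; this also surfaces errors
    // that occur during kernel execution.
    if (err == cudaSuccess)
        err = cudaStreamSynchronize(stream);

    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error at %s:%d: %s\n",
                file, line, cudaGetErrorString(err));
}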