Memory access

General memory access guidelines

The preferred strategy is to load data from global memory at the beginning of a kernel, operate on it locally (in registers or shared memory), and write the results back to global memory at the end. This is mainly due to the significant difference in access speed between the different types of memory.
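
As a minimal sketch of this pattern (the kernel name scale_repeatedly, the loop count, and the data layout are illustrative assumptions, not part of the original text), each thread below performs exactly one global read and one global write, with all intermediate work kept in a register:

```cuda
// Sketch of the load-compute-store pattern. Each thread loads its
// element once, computes in a register, and writes back once.
__global__ void scale_repeatedly(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];          // load from global memory once
        for (int k = 0; k < 8; ++k)
            v *= factor;            // compute entirely in a register
        data[i] = v;                // write back to global memory once
    }
}
```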

Global memory latency

Global memory is the GPU's main, off-chip memory, and it has high latency, typically around 400 to 800 clock cycles per access. If every thread in a CUDA kernel accessed global memory frequently, the accumulated latency would slow the kernel down significantly.
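
For contrast, here is a hypothetical anti-pattern version of the earlier sketch that touches global memory on every loop iteration, so each iteration pays a global read and write instead of register-speed work. (In practice the compiler can sometimes cache such accesses in registers on its own, but relying on that is fragile.)

```cuda
// Anti-pattern sketch (hypothetical kernel): data[i] is read from and
// written to global memory on every iteration, so each iteration pays
// the full global-memory latency.
__global__ void scale_repeatedly_slow(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int k = 0; k < 8; ++k)
            data[i] *= factor;      // global read + global write each time
    }
}
```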

Shared memory and register speed

Shared memory and registers are much faster than global memory. Shared memory is on-chip, and in the absence of bank conflicts its access speed approaches that of registers, whereas a global memory access costs hundreds of cycles. (Note that in CUDA terminology, "local memory" is per-thread spill storage that actually resides in off-chip device memory, so despite its name it is as slow as global memory.) By loading data from global memory into shared memory or registers at the beginning of a kernel, threads can then operate on that data at on-chip speeds.
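
The sketch below stages data through shared memory for a hypothetical three-point moving-average kernel; the name smooth, the block size of 256, and the treatment of array edges are assumptions for illustration. Each thread reads its element from global memory once into an on-chip tile, the block synchronizes, and all neighbor accesses then hit shared memory:

```cuda
#define BLOCK 256  // must match the launch configuration's blockDim.x

__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK + 2];            // one halo element per side

    int i = blockIdx.x * BLOCK + threadIdx.x;    // global index
    int t = threadIdx.x + 1;                     // tile index, past left halo

    if (i < n)
        tile[t] = in[i];                         // one global read per thread
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;    // left halo
    if (threadIdx.x == BLOCK - 1)
        tile[BLOCK + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;  // right halo

    __syncthreads();                             // wait until the tile is full

    // Interior elements only; the first and last elements are left unwritten.
    if (i > 0 && i + 1 < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}
```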

By working with data locally (in registers or shared memory) for most of the kernel's execution, we minimize repeated trips to global memory. This not only avoids repeatedly paying global-memory latency but also reduces contention and traffic on the memory bus. Deferring the write-back to the end of the kernel also concentrates the stores into a single, predictable pass over global memory.
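
For completeness, here is a host-side sketch of the overall round trip, reusing the hypothetical scale_repeatedly kernel from the earlier sketch (assumed to be defined in the same file); error handling is omitted for brevity:

```cuda
#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // ... copy input from a host buffer with cudaMemcpy ...

    int block = 256;
    int grid = (n + block - 1) / block;
    scale_repeatedly<<<grid, block>>>(d_data, 1.5f, n);  // one structured pass
    cudaDeviceSynchronize();

    // ... copy results back to the host with cudaMemcpy ...

    cudaFree(d_data);
    return 0;
}
```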

Further reading

For more information on memory access, see section 7.2, Variable Memory Space Specifiers, in the CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/