Global load transaction count when in coalesced memory access - cuda

I've created a simple kernel to test coalesced memory access by observing the transaction counts on an NVIDIA GTX 980 card. The kernel is:
__global__
void copy_coalesced(float * d_in, float * d_out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    d_out[tid] = d_in[tid];
}
When I run this with the following kernel configuration:
#define BLOCKSIZE 32
int data_size = 10240; // always a multiple of BLOCKSIZE
int gridSize = data_size / BLOCKSIZE;
copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);
Since the data access in the kernel is fully coalesced and the data type is float (4 bytes), the expected number of load/store transactions can be worked out as follows:
Load Transaction Size = 32 bytes
Number of floats that can be loaded per transaction = 32 bytes / 4 bytes = 8
Number of transactions needed to load 10240 floats = 10240 / 8 = 1280 transactions
The same number of transactions is expected for writing the data.
But when I observe the nvprof metrics, I get the following results:
gld_transactions 2560
gst_transactions 1280
gld_transactions_per_request 8.0
gst_transactions_per_request 4.0
I cannot figure out why loading the data takes twice the transactions it should need, yet the load and store efficiency metrics both report 100%.
What am I missing here?

I reproduced your results on Linux:
1 gld_transactions Global Load Transactions 2560
1 gst_transactions Global Store Transactions 1280
1 l2_tex_read_transactions L2 Transactions (Texture Reads) 1280
1 l2_tex_write_transactions L2 Transactions (Texture Writes) 1280
However, on Windows using the NSIGHT Visual Studio edition, I get values that appear to be better.
You may want to contact NVIDIA as it could simply be a display issue in nvprof.
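For anyone reproducing this, the counters above can be collected with a command along these lines (a sketch; the executable name ./copy_coalesced is assumed):
$ nvprof --metrics gld_transactions,gst_transactions,gld_transactions_per_request,gst_transactions_per_request ./copy_coalesced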


Why memory prefetch has no impact when transferring from device to host?

I have the following setup:
constexpr uint32_t N{512};
constexpr uint32_t DATA_SIZE{sizeof(float) * N * N};

__managed__ float ma[N * N];
__managed__ float mb[N * N];
__managed__ float mc[N * N];

__global__ void kernel()
{
    for (uint32_t i{0}; i < N * N; ++i)
    {
        mc[i] = ma[i] + mb[i];
    }
}
int main(int argc, char *[])
{
    for (uint32_t i{0}; i < N * N; ++i)
    {
        ma[i] = 1.0f;
        mb[i] = 2.0f;
    }

    int deviceId{};
    gpuErrchk(cudaGetDevice(&deviceId));
    gpuErrchk(cudaMemPrefetchAsync(ma, DATA_SIZE, deviceId, nullptr));
    gpuErrchk(cudaMemPrefetchAsync(mb, DATA_SIZE, deviceId, nullptr));

    kernel<<<1, 1>>>();
    gpuErrchk(cudaPeekAtLastError());

    gpuErrchk(cudaMemPrefetchAsync(mc, DATA_SIZE, cudaCpuDeviceId, nullptr));
    gpuErrchk(cudaDeviceSynchronize());

    float result{0.0f};
    for (uint32_t i{0}; i < N * N; ++i)
    {
        result += mc[i];
    }
    return static_cast<int>(result);
}
I compile the code with -O3 optimizations. Profiling it with nvprof ./test gives me the following (only the memory part):
==29300== Unified Memory profiling result:
Device "Quadro P1000 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
2 1.0000MB 1.0000MB 1.0000MB 2.000000MB 164.9620us Host To Device
20 153.60KB 4.0000KB 1.0000MB 3.000000MB 266.0500us Device To Host
19 - - - - 551.9440us Gpu page fault groups
Total CPU Page faults: 9
The first line, HtoD, is straightforward: there were 2 prefetches, one each for the ma and mb arrays, 1MB apiece.
The second line is strange for two reasons:
Prefetching was ignored (well, not completely, more on this later)
The total transferred size is 3MB, even though the array is only 1MB and cudaMemPrefetchAsync was also asked to prefetch 1MB.
If I run the same code with prefetching commented out I have the following results:
==30051== Unified Memory profiling result:
Device "Quadro P1000 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
20 102.40KB 4.0000KB 508.00KB 2.000000MB 189.9230us Host To Device
29 105.93KB 4.0000KB 512.00KB 3.000000MB 278.4960us Device To Host
24 - - - - 1.311533ms Gpu page fault groups
Total CPU Page faults: 14
As seen in the table, prefetching has an impact on the number of transfers: without it, HtoD goes from 2 transfers to 20, and DtoH from 20 to 29. It also has an impact on performance, but that impact is not major, especially when compared with the third variation of the same code, where I use cudaMalloc instead of managed memory:
Type Time(%) Time Calls Avg Min Max Name
0.00% 164.80us 2 82.401us 82.209us 82.593us [CUDA memcpy HtoD]
0.00% 81.665us 1 81.665us 81.665us 81.665us [CUDA memcpy DtoH]
I am running on an NVIDIA Quadro P1000 laptop, Ubuntu 18.04, CUDA 11.8.
To summarize, here are my questions:
Why does prefetching the memory to the host have almost no impact (29 migrations vs 20 migrations)?
Why is more memory transferred to the host than requested (3MB instead of the requested 1MB)?
Why, even with prefetching, is the managed memory an order of magnitude slower than device memory allocated with cudaMalloc?
As Robert Crovella mentioned in the comments, the behavior is caused by the fact that the initial location of managed memory is unspecified.
By default, the devices of compute capability lower than 6.x allocate
managed memory directly on the GPU. However, the devices of compute
capability 6.x and greater do not allocate physical memory when
calling cudaMallocManaged(): in this case physical memory is populated
on first touch and may be resident on the CPU or the GPU.
In my case, the memory was allocated on the GPU. That explains why 3MB were transferred from the device to the host, as well as the number of transfers themselves. If I remove the initialization loop, the number of HtoD transfers becomes zero (despite the two calls to cudaMemPrefetchAsync) and DtoH becomes one.
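One way to take the unspecified initial placement out of the picture (a sketch, not part of the original post) is to give the driver explicit residency hints with cudaMemAdvise before first touch, then prefetch as before; ma, mb, DATA_SIZE and deviceId are the names from the question:

// Hint that ma/mb should live on the CPU while they are initialized there,
// then prefetch them to the GPU before the kernel runs (sketch only).
gpuErrchk(cudaMemAdvise(ma, DATA_SIZE, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
gpuErrchk(cudaMemAdvise(mb, DATA_SIZE, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId));
// ... host initialization loop ...
gpuErrchk(cudaMemPrefetchAsync(ma, DATA_SIZE, deviceId, nullptr));
gpuErrchk(cudaMemPrefetchAsync(mb, DATA_SIZE, deviceId, nullptr));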

Is memory operation for L2 cache significantly faster than global memory for NVIDIA GPU?

Modern GPU architectures have both an L1 cache and an L2 cache. It is well known that the L1 cache is much faster than global memory. However, the speed of the L2 cache is less clear in the CUDA documentation. I looked it up, but could only find that a global memory operation has a latency of about 300-500 cycles while an L1 cache operation takes only about 30 cycles. Can anyone give the speed of the L2 cache? Such information would be very useful, since it is not worth optimizing for L2 cache use if it is not much faster than global memory. If the speed differs between architectures, I just want to focus on recent ones, such as the NVIDIA RTX 3090 (Compute Capability 8.6) or the NVIDIA Tesla V100 (Compute Capability 7.0).
Thank you!
There are at least 2 figures of merit commonly used when discussing GPU memory: latency and bandwidth. From a latency perspective, this number is not published by NVIDIA (that I know of) and the usual practice is to discover it with careful microbenchmarking.
From a bandwidth perspective, AFAIK this number is also not published by NVIDIA (for L2 cache), but it should be fairly easy to discover it with a fairly simple test case of a copy kernel. We can estimate the bandwidth of global memory simply by ensuring that our copy kernel uses a copy footprint that is much larger than the published L2 cache size (6MB for V100), whereas we can estimate the bandwidth of L2 by keeping our copy footprint smaller than that.
Such a code (IMO) is fairly trivial to write:
$ cat t44.cu
template <typename T>
__global__ void k(volatile T * __restrict__ d1, volatile T * __restrict__ d2, const int loops, const int ds){

    for (int i = 0; i < loops; i++)
        for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
            if (i&1) d1[j] = d2[j];
            else d2[j] = d1[j];
}

const int dsize = 1048576*128;
const int iter = 64;
int main(){

    int *d;
    cudaMalloc(&d, dsize);
    // case 1: 32MB copy, should exceed L2 cache on V100
    int csize = 1048576*8;
    k<<<80*2, 1024>>>(d, d+csize, iter, csize);
    // case 2: 2MB copy, should fit in L2 cache on V100
    csize = 1048576/2;
    k<<<80*2, 1024>>>(d, d+csize, iter, csize);
    cudaDeviceSynchronize();
}
$ nvcc -o t44 t44.cu
$ nvprof ./t44
==53310== NVPROF is profiling process 53310, command: ./t44
==53310== Profiling application: ./t44
==53310== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 6.9032ms 2 3.4516ms 123.39us 6.7798ms void k<int>(int volatile *, int volatile *, int, int)
API calls: 89.47% 263.86ms 1 263.86ms 263.86ms 263.86ms cudaMalloc
4.45% 13.111ms 8 1.6388ms 942.75us 2.2322ms cuDeviceTotalMem
3.37% 9.9523ms 808 12.317us 186ns 725.86us cuDeviceGetAttribute
2.34% 6.9006ms 1 6.9006ms 6.9006ms 6.9006ms cudaDeviceSynchronize
0.33% 985.49us 8 123.19us 85.864us 180.73us cuDeviceGetName
0.01% 42.668us 8 5.3330us 1.8710us 22.553us cuDeviceGetPCIBusId
0.01% 34.281us 2 17.140us 6.2880us 27.993us cudaLaunchKernel
0.00% 8.0290us 16 501ns 256ns 1.7980us cuDeviceGet
0.00% 3.4000us 8 425ns 217ns 876ns cuDeviceGetUuid
0.00% 3.3970us 3 1.1320us 652ns 2.0020us cuDeviceGetCount
$
Based on the profiler output, we can estimate global memory bandwidth as:
2*64*32MB/6.78ms = 604GB/s
(the factor of 2 accounts for each element being read and then written, and 64 is the number of loop iterations). We can estimate L2 bandwidth as:
2*64*2MB/123us = 2.08TB/s
Both of these are rough measurements (I'm not doing careful benchmarking here), but bandwidthTest on this V100 GPU reports a device memory bandwidth of ~700GB/s, so I believe the 600GB/s number is "in the ballpark". If we take that as evidence that the L2 cache measurement is also in the ballpark, then we might guess that the L2 cache may be ~3-4x faster than global memory in some circumstances.

How to properly add in global memory in CUDA?

I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.
Here I get the x and y index of the current thread and use them to index the pixel this thread handles in the image arrays. I then calculate the absolute difference, wait for all the threads to finish that calculation, and, if the current thread lies within the image block we care about, add its absolute difference to the sum in global memory with an atomicAdd to avoid a collision during the write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id  = idx * cuda_Blocksize + idy;
    int AD  = abs( cuda_curBlock[id] - cuda_refBlock[id] );

    __syncthreads();

    if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
        atomicAdd( cuda_SAD, AD );
    }
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?
Your grid and block size settings look odd.
Usually, for image pixels, we use settings similar to the following:
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
You could refer to the following section in the CUDA programming guide for more information on how to distribute workloads to CUDA threads:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
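With a grid sized like that, the kernel itself also needs to mask off the threads that fall outside the ROI before they touch memory. A rough sketch, keeping the question's names and indexing (this is not the poster's actual code):

__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // Abort out-of-range threads before any global memory read
    if (idx >= cuda_Blocksize || idy >= cuda_Blocksize) return;

    int id = idx * cuda_Blocksize + idy;
    int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
    // The __syncthreads() from the original kernel is dropped here: no shared
    // memory is used, and it must not be called after a divergent early return.
    atomicAdd( cuda_SAD, AD );
}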
You really should show all the code and identify the GPU you are running on, at least the portion that calls the kernel and allocates data for GPU use. Are you doing proper cuda error checking on all cuda API calls and kernel calls?
Probably your kernel is not running at all because your threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.
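For reference, a typical error-checking pattern looks roughly like this (a sketch, not the poster's code; the launch parameters and argument names are borrowed from the question):

#include <cstdio>
#include <cstdlib>

#define cudaCheck(call)                                             \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

gpuCounterKernel<<<blocksInGrid, threadsInBlock>>>(cuda_curBlock, cuda_refBlock, cuda_SAD, Blocksize);
cudaCheck(cudaGetLastError());        // catches invalid launch configurations
cudaCheck(cudaDeviceSynchronize());   // catches errors raised while the kernel runs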

Memory coalescing while implementing FDTD equations

I was trying to implement the FDTD equations on the GPU. I initially implemented a kernel that used global memory. The memory coalescing wasn't that great, so I then implemented another kernel that uses shared memory to load the values. I am working on a grid of 1024x1024.
The code is below:
__global__ void update_Hx(float *Hx, float *Ez, float *coef1, float *coef2){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    __shared__ float Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];

    /*int top = offset + x_index_dim;*/
    if(threadIdx.y == (blockDim.y - 1)){
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
        Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];
    }
    else{
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
    }
}
The constants BLOCKSIZE_HX = 16 and BLOCKSIZE_HY = 16.
When I run the visual profiler, it still says that the memory is not coalesced.
EDIT:
I am using a GT 520 graphics card with CUDA compute capability 2.1.
My Global L2 transactions / Access ratio is 7.5, i.e. there are 245,760 L2 transactions for 32,768 executions of the line
Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
Global memory load efficiency is 50%.
Global memory load efficiency = 100 * gld_requested_throughput/ gld_throughput
I am not able to figure out why there are so many memory accesses, even though my threads are looking at 16 consecutive values. Can somebody point out what I am doing wrong?
EDIT: Thanks for all the help.
Your memory access pattern is the problem here. You are getting only 50% efficiency (for both L1 and L2) because you are accessing consecutive regions of 16 floats, that is 64 bytes, but the L1 transaction size is 128 bytes. This means that for every 64 bytes requested, 128 bytes must be loaded into L1 (and consequently also into L2).
You also have a problem with shared memory bank conflicts but that is currently not negatively affecting your global memory load efficiency.
You could solve the load efficiency problem in several ways. The easiest would be to change the x dimension of the block size to 32. If that is not an option, you could change the global memory data layout so that every two consecutive blockIdx.y values ([0,1], [2,3], etc.) map to a contiguous memory block. If even that is not an option and you only have to load the global data once anyway, you could use non-cached global memory loads to bypass the L1; that would help because L2 uses 32-byte transactions, so your 64 bytes would be loaded in two L2 transactions without overhead.
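A minimal sketch of the first option (not from the original post): with a 32-wide block, each warp reads 32 consecutive floats, i.e. 128 bytes, which fills exactly one 128-byte L1 transaction. The y dimension is shrunk to keep a reasonable thread count, and the shared array dimensions follow the new constants; the value 8 for BLOCKSIZE_HY is an assumed choice.

#define BLOCKSIZE_HX 32   // x dimension = one full warp
#define BLOCKSIZE_HY 8    // 32 * 8 = 256 threads per block (assumed)

dim3 threadsPerBlock(BLOCKSIZE_HX, BLOCKSIZE_HY);
dim3 numBlocks(1024 / BLOCKSIZE_HX, 1024 / BLOCKSIZE_HY);   // the 1024x1024 grid from the question
update_Hx<<<numBlocks, threadsPerBlock>>>(Hx, Ez, coef1, coef2);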

cuda threads and blocks

I posted this on the NVIDIA forums; I thought I would get a few more eyes to help.
I'm having trouble trying to expand my code to handle multiple cases. I have been developing with the most common case in mind; now it's time for testing, and I need to ensure that it all works for the different cases. Currently my kernel is executed within a loop (there are reasons why we aren't doing one kernel call to do the whole thing) to calculate a value across the row of a matrix. The most common case is 512 columns by 512 rows. I need to consider matrices of size 512 x 512, 1024 x 512, 512 x 1024, and other combinations, but the largest will be a 1024 x 1024 matrix. I have been using a rather simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works fine for the common 512 x 512 and 512 x 1024 (columns, rows respectively) cases, but not for the 1024 x 512 case. That case requires 1024 threads to execute. In my naivety I have been trying different versions of the simple kernel call to launch 1024 threads.
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
I believe my problem has something to do with my lack of understanding of threads and blocks.
Here is the output of deviceQuery; as you can see, I can have a maximum of 1024 threads per block:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 40 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 15 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am using only the Tesla C2050 device
Here is a stripped-down version of my kernel, so you have an idea of what it is doing:
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999
__global__ void calcRx4CPP4
(
    const float *array1,
    const double *array2,
    const float scalar1,
    const float scalar2,
    const float scalar3,
    const float scalar4,
    const float scalar5,
    const float scalar6,
    const int scalar7,
    const int scalar8,
    float *outputArray1,
    float *outputArray2)
{
    float scalar9;
    int idx;
    double scalar10;
    double scalar11;
    float sumReal, sumImag;
    float real, imag;
    float coeff1, coeff2, coeff3, coeff4;

    sumReal = 0.0;
    sumImag = 0.0;

    // kk loop 1 .. 512 (scalar7)
    idx = (blockIdx.x * blockDim.x) + threadIdx.x;

    /* Declare the shared memory parameters */
    __shared__ float SharedArray1[MaxSize];
    __shared__ double SharedArray2[MaxSize];

    /* populate the arrays on shared memory */
    SharedArray1[idx] = array1[idx];   // first 512 elements
    SharedArray2[idx] = array2[idx];
    if (idx + blockDim.x < MaxSize){
        SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
        SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
    }
    __syncthreads();

    // input scalars used here.
    scalar10 = ...;
    scalar11 = ...;

    for (int kk = 0; kk < scalar8; kk++)
    {
        /* some calculations */
        // SharedArray1, SharedArray2 and scalar9 used here
        sumReal = ...;
        sumImag = ...;
    }

    /* calculation of the exponential of a complex number */
    real = ...;
    imag = ...;

    coeff1 = (sumReal * real);
    coeff2 = (sumReal * imag);
    coeff3 = (sumImag * real);
    coeff4 = (sumImag * imag);

    outputArray1[idx] = (coeff1 - coeff4);
    outputArray2[idx] = (coeff2 + coeff3);
}
Because my max threads per block is 1024, I thought I would be able to continue using the simple kernel launch. Am I wrong?
How do I successfully launch each kernel with 1024 threads?
You don't want to vary the number of threads per block. You should get the optimal number of threads per block for your kernel by using the CUDA Occupancy Calculator. After you have that number, you simply launch the number of blocks that are required to get the total number of threads that you need. If the number of threads that you need for a given case is not always a multiple of the threads per block, you add code in the top of your kernel to abort the unneeded threads. (if () return;). Then, you pass in the dimensions of your matrix either with extra parameters to the kernel or by using x and y grid dimensions, depending on which information is required in your kernel (I haven't studied it).
My guess is that the reason you're having trouble with 1024 threads is that, even though your GPU supports that many threads in a block, there is another limiting factor to the number of threads you can have in each block based on resource usage in your kernel. The limiting factor can be shared memory or register usage. The Occupancy Calculator will tell you which, though that information is only important if you want to optimize your kernel.
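As a rough sketch of that approach (the block size of 256, the variable numThreadsNeeded, and passing it into the kernel are illustrative assumptions, not taken from your code):

int threadsPerBlock = 256;                 // whatever the Occupancy Calculator suggests for your kernel
int numThreadsNeeded = 1024;               // e.g. the number of columns for the current case
int numBlocks = (numThreadsNeeded + threadsPerBlock - 1) / threadsPerBlock;
launchKernel<<<numBlocks, threadsPerBlock>>>(................);

// and at the top of the kernel:
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= numThreadsNeeded) return;       // abort the unneeded threads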
If you use one block with 1024 threads you will have problems, since MaxSize is only 999, resulting in out-of-bounds accesses and wrong data.
Let's simulate it for the last thread, #1023:
__shared__ float SharedArray1[999];
__shared__ double SharedArray2[999];

/* populate the arrays on shared memory */
SharedArray1[1023] = array1[1023];
SharedArray2[1023] = array2[1023];
if (2047 < MaxSize)
{
    SharedArray1[2047] = array1[2047];
    SharedArray2[2047] = array2[2047];
}
__syncthreads();
The unconditional writes at index 1023 already fall outside arrays of size 999, so if you then use all those elements in your calculation this cannot work correctly. (Your calculation code is not shown, so this is an assumption.)
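A minimal way to guard those shared loads (a sketch only, keeping the names from the question; whether the rest of the kernel still makes sense depends on the calculation code that is not shown):

/* populate the arrays in shared memory, but never past MaxSize */
if (idx < MaxSize) {
    SharedArray1[idx] = array1[idx];
    SharedArray2[idx] = array2[idx];
}
if (idx + blockDim.x < MaxSize) {
    SharedArray1[idx + blockDim.x] = array1[idx + blockDim.x];
    SharedArray2[idx + blockDim.x] = array2[idx + blockDim.x];
}
__syncthreads();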