Increasing achieved occupancy doesn't enhance computation speed linearly - cuda

I had a CUDA program in which kernel registers were limiting maximum theoretical achieved occupancy to %50. So I decided to use shared memory instead of registers for those variables that were constant between block threads and were almost read-only throughout kernel run. I cannot provide source code here; what I did was conceptually like this:
My initial program:
__global__ void GPU_Kernel (...) {
__shared__ int sharedData[N]; //N:maximum amount that doesn't limit maximum occupancy
int r_1 = A; //except for this first initialization, these registers don't change anymore
int r_2 = B;
...
int r_m = Y;
... //rest of kernel;
}
I changed above program to:
__global__ void GPU_Kernel (...) {
__shared__ int sharedData[N-m];
__shared__ int r_1, r_2, ..., r_m;
if ( threadIdx.x == 0 ) {
r_1 = A;
r_2 = B;
...
r_m = Y; //last of them
}
__syncthreads();
... //rest of kernel
}
Now threads of warps inside a block perform broadcast reads to access newly created shared memory variables. At the same time, threads don't use too much registers to limit achieved occupancy.
The second program has maximum theoretical achieved occupancy equal to %100. In actual runs, the average achieved occupancy for the first programs was ~%48 and for the second one is around ~%80. But the issue is enhancement in net speed up is around %5 to %10, much less than what I was anticipating considering improved gained occupancy. Why isn't this correlation linear?
Considering below image from Nvidia whitepaper, what I've been thinking was that when achieved occupancy is %50, for example, half of SMX (in newer architectures) cores are idle at a time because excessive requested resources by other cores stop them from being active. Is my understanding flawed? Or is it incomplete to explain above phenomenon? Or is it added __syncthreads(); and shared memory accesses cost?

Why isn't this correlation linear?
If you are already memory bandwidth bound or compute bound, and either one of those bounds is near the theoretical performance of the device, improving occupancy may not help much. Improving occupancy usually helps when niether of these are the limiters to performance for your code (i.e. you are not at or near peak memory bandwidth utilization or peak compute). Since you haven't provided any code or any metrics for your program, nobody can tell you why it didn't speed up more. The profiling tools can help you find the limiters to performance.
You might be interested in a couple webinars:
CUDA Optimization: Identifying Performance Limiters by Dr Paulius Micikevicius
CUDA Warps and Occupancy Considerations+ Live with Dr Justin Luitjens, NVIDIA
In particular, review slide 10 from the second webinar.

Related

Why does this kernel not achieve peak IPC on a GK210?

I decided that it would be educational for me to try to write a CUDA kernel that achieves peak IPC, so I came up with this kernel (host code omitted for brevity but is available here)
#define WORK_PER_THREAD 4
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
i *= WORK_PER_THREAD;
if (i < n)
{
#pragma unroll
for(int j=0; j<WORK_PER_THREAD; j++)
y[i+j] = a * x[i+j] + y[i+j];
}
}
I ran this kernel on a GK210, with n=32*1000000 elements, and expected to see an IPC of close to 4, but ended up with a lousy IPC of 0.186
ubuntu#ip-172-31-60-181:~/ipc_example$ nvcc saxpy.cu
ubuntu#ip-172-31-60-181:~/ipc_example$ sudo nvprof --metrics achieved_occupancy --metrics ipc ./a.out
==5828== NVPROF is profiling process 5828, command: ./a.out
==5828== Warning: Auto boost enabled on device 0. Profiling results may be inconsistent.
==5828== Profiling application: ./a.out
==5828== Profiling result:
==5828== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: saxpy_parallel(int, float, float*, float*)
1 achieved_occupancy Achieved Occupancy 0.879410 0.879410 0.879410
1 ipc Executed IPC 0.186352 0.186352 0.186352
I was even more confused when I set WORK_PER_THREAD=16, resulting in less threads launched, but 16, as opposed to 4, independent instructions for each to execute, the IPC dropped to 0.01
My two questions are:
What is the peak IPC I can expect on a GK210? I think it is 8 = 4 warp schedulers * 2 instruction dispatches per cycle, but I want to be sure.
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
What is the peak IPC I can expect on a GK210?
The peak IPC per SM is equal to the number of warp schedulers in an SM times the issue rate of each warp scheduler. This information can be found in the whitepaper for a particular GPU. The GK210 whitepaper is here. From that document (e.g. SM diagram on p8) we see that each SM has 4 warp schedulers capable of dual issue. Therefore the peak theoretically achievable IPC is 8 instructions per clock per SM. (however as a practical matter even for well-crafted codes, you're unlikely to see higher than 6 or 7).
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
Your kernel requires global transactions at nearly every operation. Global loads and even L2 cache loads have latency. When everything you do is dependent on those, there is no way to avoid the latency, so your warps are frequently stalled. The peak observable IPC per SM on a GK210 is somewhere in the vicinity of 6, but you won't get that with continuous load and store operations. Your kernel does 2 loads, and one store (12 bytes total moved), for each multiply/add. You won't be able to improve it. (Your kernel has high occupancy because the SMs are loaded up with warps, but low IPC because those warps are frequently stalled, unable to issue an instruction, waiting for latency of load operations to expire.) You'll need to find other useful work to do.
What might that be? Well if you do a matrix multiply operation, which has considerable data reuse and a relatively low number of bytes per math op, you're likely to see better measurements.
What about your code? Sometimes the work you need to do is like this. We'd call that a memory-bound code. For a kernel like this, the figure of merit to use for judging "goodness" is not IPC but achieved bandwidth. If your kernel requires a particular number of bytes loaded and stored to perform its work, then if we compare the kernel duration to just the memory transactions, we can get a measure of goodness. Stated another way, for a pure memory bound code (i.e. your kernel) we would judge goodness by measuring the total number of bytes loaded and stored (profiler has metrics for this, or for a simple code you can compute it directly by inspection), and divide that by the kernel duration. This gives the achieved bandwidth. Then, we compare that to the achievable bandwidth based on a proxy measurement. A possible proxy measurement tool for this is bandwidthTest CUDA sample code.
As the ratio of these two bandwidths approaches 1.0, your kernel is doing "well", given the memory bound work it is trying to do.

Are needless write operations in multi-thread kernels in CUDA inefficient?

I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:
__global__ void calc(float * globalmem)
{
__shared__ float sharemem; //initialize shared memory
sharemem = 0; //set it to initial value
__syncthreads();
//do various calculations on the shared memory
//for example I use atomicAdd() to add each thread's
//result to sharedmem...
__syncthreads();
*globalmem = sharedmem;//write shared memory to global memory
}
The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?
At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.
However I would expect a possibly measurable performance difference by having multiple warps in a threadblock do the redundant read or write (vs. a single thread).
It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:
__global__ void calc(float * globalmem)
{
__shared__ float sharemem; //initialize shared memory
if (!threadIdx.x) sharemem = 0; //set it to initial value
__syncthreads();
//do various calculations on the shared memory
//for example I use atomicAdd() to add each thread's
//result to sharedmem...
__syncthreads();
if (!threadIdx.x) *globalmem = sharemem;//write shared memory to global memory
}
Although you didn't ask about it, using atomics within a threadblock on shared memory may possibly be replaceable (for possibly better performance) by a shared memory reduction method.

Very large instruction replay overhead for random memory access on Kepler

I am studying the performance of random memory access on a Kepler GPU, K40m. The kernel I use is pretty simple as follows,
__global__ void scatter(int *in1, int *out1, int * loc, const size_t n) {
int globalSize = gridDim.x * blockDim.x;
int globalId = blockDim.x * blockIdx.x + threadIdx.x;
for (unsigned int i = globalId; i < n; i += globalSize) {
int pos = loc[i];
out1[pos] = in1[i];
}
}
That is, I will read an array in1 as well as a location array loc. Then I permute in1 according to loc and output to the array out1. Generally, out1[loc[i]] = in1[i]. Note that the location array is sufficiently shuffled and each element is unique.
And I just use the default nvcc compilation setting with -O3 flag opened. The L1 dcache is disabled. Also I fix my # blocks to be 8192 and block size of 1024.
I use nvprof to profile my program. It is easy to know that most of the instructions in the kernel should be memory access. For an instruction of a warp, since each thread demands a discrete 4 Byte data, the instruction should be replayed multiple times (at most 31 times?) and issue multiple memory transactions to fulfill the need of all the threads within the warp. However, the metric "inst_replay_overhead" seems to be confusing: when # tuples n = 16M, the replay overhead is 13.97, which makes sense to me. But when n = 600M, the replay overhead becomes 34.68. Even for larger data, say 700M and 800M, the replay overhead will reach 85.38 and 126.87.
The meaning of "inst_replay_overhead", according to document, is "Average number of replays for each instruction executed". Is that mean when n = 800M, on average each instruction executed has been replayed 127 times? How comes the replay time much larger than 31 here? Am I misunderstanding something or am I missing other factors that will also contribute greatly to the replay times? Thanks a lot!
You may be misunderstanding the fundamental meaning of an instruction replay.
inst_replay_overhead includes the number of times an instruction was issued, but wasn't able to be completed. This can occur for various reasons, which are explained in this answer. Pertinent excerpt from the answer:
If the SM is not able to complete the issued instruction due to
constant cache miss on immediate constant (constant referenced in the instruction),
address divergence in an indexed constant load,
address divergence in a global/local memory load or store,
bank conflict in a
shared memory load or store,
address conflict in an atomic or
reduction operation,
load or store operation require data to be
written to the load store unit or read from a unit exceeding the
read/write bus width (e.g. 128-bit load or store), or
load cache miss
(replay occurs to fetch data when the data is ready in the cache)
then
the SM scheduler has to issue the instruction multiple times. This is
called an instruction replay.
I'm guessing this happens because of scattered reads in your case. This concept of instruction replay also exists on the CPU side of things. Wikipedia article here.

Bank conflict in CUDA when reading from the same location

I have a CUDA kernel where there is a point where each thread is reading the same value from the global memory. So something like:
__global__ void my_kernel(const float4 * key_pts)
{
if (key_pts[blockIdx.x] < 0 return;
}
The kernel is configured as follows:
dim3 blocks(16, 16);
dim3 grid(2000);
my_kernel<<<grid, blocks, 0, stream>>>(key_pts);
My question is whether this will lead to some sort bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.
I was thinking I could do something like the following in case we have sub-optimal access:
__global__ void my_kernel(const float4 * key_pts)
{
__shared__ float x;
if (threadIdx.x == 0 && threadIdx.y == 0)
x = key_pts[blockIdx.x];
__syncthreads();
if (x < 0) return;
}
Doing some timing though, I do not see any difference between the two but so far my tests are with limited data.
bank conflicts apply to shared memory, not global memory.
Since all threads need (ultimately) the same value to make their decision, this won't yield sub-optimal access on global memory because there is a broadcast mechanism so that all threads in the same warp, requesting the same location/value from global memory, will retrieve that without any serialization or overhead. All threads in the warp can be serviced at the same time:
Note that threads can access any words in any order, including the same words.
Furthermore, assuming your GPU has a cache (cc2.0 or newer) the value retrieved from global memory for the first warp encountering this will likely be available in the cache for subsequent warps that hit this point.
I wouldn't expect much performance difference between the two cases.

Why is global + shared faster than global alone

I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I'm not understanding how the use of shared mem is giving faster performance over the non-shared memory version. i.e. If I add a few more index calculation steps and use add another Rd/Wr cycle to access the shared mem, how can this be faster than just using global mem alone? The same number or Rd/Wr cycles access global mem in either case. The data is still access only once per kernel instance. Data still goes in/out using global mem. The num of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster. (We are not subtracting any process steps.) Essentially we are doing more work, and it is getting done faster.
Shared mem access is much faster than global, but it is not zero, (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
int inOffset = blockDim.x * blockIdx.x;
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int in = inOffset + threadIdx.x;
int out = outOffset + (blockDim.x - 1 - threadIdx.x);
d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
extern __shared__ int s_data[];
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;
// Load one element per thread from device memory and store it
// *in reversed order* into temporary shared memory
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
// Block until all threads in the block have written their data to shared mem
__syncthreads();
// write the data from shared memory in forward order,
// but to the reversed block offset as before
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market 2009, so the writer of the article clearly talked about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
this is because the shared memory is closer to the computing units, hence the latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication)
But most importantly, the top reason is that a lot of the numbers in the tile are being reused by many threads. So if you access from global you are retrieving those numbers multiple times. Writing them once to shared memory will eliminate that wasted bandwidth usage
When looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both read and writes forwards. I think the fast code if faster because the cache hierarchy is optimized in, some way, for accessing the global memory in descending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.