random access gpgpu performance drop? - cuda

I've heard there is a drop in performance when performing computations on arrays with random access on a gpu.
My question is how severe is this performance drop?
Searching around, some comments seemed to imply the code ran faster on the CPU. But given the vast difference in integer and floating-point throughput between GPUs and CPUs, it seems hard to believe performance would drop that badly.

I think it is related to cache misses. The GPU also has L1 and L2 caches, and if you hit random memory locations you are far more likely to miss in them. GPUs also rely on a special memory access pattern called memory coalescing, in which the threads of a warp access a wide, contiguous range of memory in one transaction. That is why GPUs are so fast on SIMD-friendly code. But if you access random memory locations, coalescing breaks down and each access can turn into its own transaction. I think it would be good to read the CUDA documentation to see how the GPU memory system works.
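To make the difference concrete, here is a minimal sketch (hypothetical kernel and array names) contrasting a coalesced read with a gather through a random index array; the two kernels do the same amount of arithmetic, so any gap you measure between them is purely the memory access pattern:

    // Coalesced: thread i reads element i, so each warp touches one
    // contiguous 128-byte region that the hardware can service with a
    // small number of transactions.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Random access: thread i reads in[idx[i]], where idx is a shuffled
    // permutation. Each thread in a warp may land on a different cache
    // line, so one warp-wide load can expand into up to 32 transactions.
    __global__ void copy_gather(const float *in, const int *idx,
                                float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[idx[i]];
    }

How much slower the gather version runs depends on the card and on how badly the indices scatter, but several-fold slowdowns are common, which is usually still far from "slower than the CPU".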

Related

"Global Load Efficiency" over 100%

I have a CUDA program in which threads of a block read elements of a long array in several iterations and memory accesses are almost fully coalesced. When I profile, Global Load Efficiency is over 100% (between 119% and 187% depending on the input). Description for Global Load Efficiency is "Ratio of global memory load throughput to required global memory load throughput." Does it mean that I'm hitting L2 cache a lot and my memory accesses are benefiting from it?
My GPU is GeForce GTX 780 (Kepler architecture).
I asked this question on the NVIDIA forum here. I quote the answer I got:
"Global Load Efficiency and Global Store Efficiency describe how well the coalescing of DRAM-accesses and (L2?)Cache-accesses works. If they're 100 percent then you've got perfect coalescing. Since efficiencies above 100 percent don't make any sense (you cannot be better than optimal) this has to be an error.
This error is caused by the Visual Profiler, which counts hardware events to calculate some abstract metrics. But the GPU doesn't have the "correct" events to exactly calculate all those metrics, thus Visual Profiler has to estimate those metrics by using some complex formula and "wrong" events. There are some metrics which are just rough estimations and Global Load Efficiency and Global Store Efficiency are two of them. Thus if such an efficiency is bigger than 100 percent it is an estimation error. As far as I observed the Global Load Efficiency and Global Store Efficiency both increased above 100 percent in some of my register spilling kernels. That's why i assume that the Visual-Profiler uses some events, which also may be caused by local memory accesses, to calculate those two efficiencies. Furthermore GPUs just uses 32 Bit Counters. Thus long running kernel tend to overflow those counters, which also causes the Visual Profiler to display wrong metrics."

Which is faster in CUDA, global memory or host memory?

I read in CUDA by Example, chapter 9.4, that when atomic operations on GPU global memory are used improperly, the program's performance can be worse than if it were executed purely on the CPU, because of memory access contention.
In the worst case, the program executed on the GPU is highly serialized and no threads execute in parallel, which is just the way a single-threaded program runs on the CPU. So the key question is how fast the program accesses memory.
Considering the example in the book I mentioned, it seems that the CPU accesses host memory faster than the GPU accesses global memory on the device.
Is that so? Or are there any other factors that influence the performance of the program under the circumstance I just described?
I think you're misreading things slightly. Yes, it's saying that single-threaded code on the GPU is typically slower than on the CPU. But that's not because of raw memory bandwidth; it's because a CPU is much more powerful than a GPU when running a single thread. For example, a CPU has deep pipelining and sophisticated branch prediction to pre-load data from memory, while a GPU is designed to switch contexts to another thread while waiting for data. The CPU is tuned for the single-threaded case, while the GPU is tuned for many threads.
If you want to know which memory is fastest, look at the technical specs for your card and motherboard, but that's not really what the book is talking about.
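For reference, the kind of contention the book warns about looks roughly like this (a sketch, not the book's exact code): when every thread in the grid hammers the same handful of global counters with atomicAdd, the atomics are serialized and the "parallel" kernel degenerates towards single-threaded behaviour. The usual remedy is to accumulate into a per-block shared-memory copy first:

    // Worst-case pattern: only 256 bins shared by every thread in the
    // grid, so colliding atomicAdd() calls on hot bins are serialized.
    __global__ void histogram_naive(const unsigned char *data, int n,
                                    unsigned int *bins /* 256 entries */)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (; i < n; i += stride)
            atomicAdd(&bins[data[i]], 1u);
    }

    // Per-block shared-memory histogram, merged into global memory at the
    // end; contention is now limited to the threads of one block.
    __global__ void histogram_shared(const unsigned char *data, int n,
                                     unsigned int *bins)
    {
        __shared__ unsigned int local[256];
        for (int b = threadIdx.x; b < 256; b += blockDim.x)
            local[b] = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (; i < n; i += stride)
            atomicAdd(&local[data[i]], 1u);
        __syncthreads();

        for (int b = threadIdx.x; b < 256; b += blockDim.x)
            atomicAdd(&bins[b], local[b]);
    }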

What happens when all threads of a warp read the same global memory address?

I want to know what happens when all threads of a warp read the same 32-bit address in global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card and the programming environment is CUDA 4.0.
Besides, can anybody explain the concept of bus utilization? What is the difference between a caching load and a non-caching load? I saw the concepts in http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
All threads in warp accessing same address in global memory
I could answer your questions off the top of my head for AMD GPUs. For Nvidia, googling found the answers quickly enough.
I want to know what happens when all threads of a warp read the same 32-bit address of global memory. How many memory requests are there? Is there any serialization? The GPU is a Fermi card, the programming environment is CUDA 4.0.
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_Best_Practises_For_OpenCL_Programming.pdf, dated 2009, says:
Coalescing:
Global memory latency: 400-600 cycles. The single most important performance consideration!
Global memory access by threads of a half-warp can be coalesced to one transaction for words of size 8-bit, 16-bit, 32-bit, or 64-bit, or two transactions for 128-bit.
Global memory can be viewed as composed of aligned segments of 16 and 32 words.
Coalescing in Compute Capability 1.0 and 1.1: the k-th thread in a half-warp must access the k-th word in a segment; however, not all threads need to participate.
Coalescing in Compute Capability 1.2 and 1.3: coalescing for any pattern of access that fits into a segment size.
So it sounds like having all threads of a warp access the same 32-bit address of global memory will work as well as could be hoped for on anything >= Compute Capability 1.2, but not on 1.0 and 1.1.
Your card is okay.
I must admit that I have not tested this for Nvidia. I have tested it for AMD.
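As a quick way to convince yourself on an Nvidia card, a sketch like the following (hypothetical names) makes every thread load the same 32-bit word; on Fermi the line is fetched once and the value is broadcast to the warp, so this should time about the same as a fully coalesced read:

    // Every thread in a warp reads the same 32-bit word. On Fermi the
    // warp is serviced from a single cache-line fetch and the value is
    // broadcast, so no per-thread serialization is expected.
    __global__ void broadcast_read(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[0];   // same address for all threads
    }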
Difference between caching and non-caching loads
To start off, look at slide 4 of the presentation you refer to, http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf.
I.e. the slide titled "Differences between CPU & GPUs", which says that CPUs have huge caches and GPUs don't.
A few years ago such a slide might have said that GPUs don't have any caches at all. However, GPUs have begun to add more and more cache, and/or to switch more and more local memory over to cache.
I am not sure if you understand what a "cache" is in computer architecture. It's a big topic, so I will only give a short answer.
Basically, a cache is like local memory. Both cache and local memory are closer to the processor or GPU than DRAM main memory, whether that is the GPU's private DRAM or the CPU's system memory. DRAM main memory is what Nvidia calls global memory. Slide 9 illustrates this.
Both cache and local memory are closer to the GPU than DRAM global memory: on slide 9 they are drawn as being inside the same chip as the GPU, whereas the DRAMs are separate chips. This can have several good effects, on latency, throughput, power - and, yes, bus utilization (related to bandwidth).
Latency: global memory is 400-800 cycles away. This means that if you only had one warp in your application, it would only execute one memory operation every 400-800 cycles. This means that, in order not to slow down, you need many threads/warps producing memory requests that can be run in parallel, i.e. that have high MLP (Memory Level Parallelism). Fortunately graphics usually does this. The caches are closer, so will have lower latency. Your slides do not say what it is, but other places say 50-200 cycles, 4-8X faster than global memory. This translates to needing fewer threads&warps to avoid slowing down.
Throughput/Bandwidth: there is typically more bandwidth to local memory and/or cache than to DRAM global memory. Your slides say 1+ TB/s versus 177 GB/s - i.e. cache and local memory is more than 5X faster. This higher bandwidth could translate to significantly higher framerates.
Power: you save a lot of power by going to cache or local memory rather than to DRAM global memory. This may not matter to a desktop gaming PC, but it matters to a laptop or a tablet. Actually, it matters even to a desktop gaming PC, because less power means it can be (over)clocked faster.
OK, so local and cache memory are similar in the above? What's the difference?
Basically, it is easier to program a cache than it is to manage local memory. Very good, expert, ninja programmers are needed to manage local memory properly, copying data in from global memory as needed and flushing it out again. Cache memory is much easier to manage, because you just do a cached load and the data is placed in the cache automatically, where it will be accessed faster the next time around.
But caches have a downside as well.
First, they actually burn a bit more power than local memory - or they would, if there were actually separate local and global memories. However, on Fermi the local memory may be configured as cache, and vice versa. (For years GPU folks said "we don't need no stinking cache - cache tags and other overhead are just wasteful.")
More importantly, caches tend to operate on cache lines - but not all programs do. This leads to the bus utilization issue you mention. If a warp accesses all words in a cache line, great. But if a warp only accesses 1 word in a cache line, i.e. one 4-byte word, and then skips 124 bytes, then 128 bytes of data are transferred over the bus but only 4 bytes are used. I.e. >96% of the bus bandwidth is being wasted. This is what low bus utilization means.
Whereas the very next slide shows that a non-caching load, such as one might use to load data into local memory, would transfer only 32 bytes, so "only" 28 bytes out of 32 are wasted. In other words, the non-caching loads could be 4X more efficient, 4X faster, than the cached loads.
Then why not use non-caching loads everywhere? Because they are harder to program - it requires expert ninja programmers. And caches work pretty well much of the time.
So, instead of paying your expert ninja programmers to spend a lot of time optimizing all of the code to use non-cache loads and hand-managed local memory - instead you do the easy stuff using cached loads, and you let the highly paid expert ninja programmers concentrate on the stuff that the cache does not work well for.
Besides: nobody likes admitting it, but oftentimes the cache does better than the expert ninja programmers.
Hope this helps.
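To make the cached-versus-hand-managed distinction concrete, here is a small sketch (hypothetical names) of the two styles for the same neighbour-reuse pattern: the first version just loads through L1/L2 and lets the cache exploit the reuse, while the "ninja" version stages a tile into shared memory by hand. (If I remember the option correctly, on Fermi you can also compile with -Xptxas -dlcm=cg to make ordinary loads bypass L1, which corresponds to the non-caching loads the slides describe.)

    // Cached style: read the neighbours through the cache and let L1
    // exploit the reuse between adjacent threads.
    __global__ void blur_cached(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }

    // Hand-managed style: stage a tile plus halo into shared memory once,
    // then read the neighbours from on-chip memory. Launch with
    // (blockDim.x + 2) * sizeof(float) bytes of dynamic shared memory.
    __global__ void blur_shared(const float *in, float *out, int n)
    {
        extern __shared__ float tile[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;

        if (i < n)
            tile[t] = in[i];
        if (threadIdx.x == 0 && i > 0)
            tile[0] = in[i - 1];                    // left halo
        if (threadIdx.x == blockDim.x - 1 && i < n - 1)
            tile[t + 1] = in[i + 1];                // right halo
        __syncthreads();

        if (i > 0 && i < n - 1)
            out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }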

Can you predict the runtime of a CUDA kernel?

To what degree can one predict / calculate the performance of a CUDA kernel?
Having worked a bit with CUDA, this seems non-trivial.
But a colleague of mine, who does not work with CUDA, told me that it can't be hard if you know the memory bandwidth, the number of processors, and their clock speed.
What he said doesn't seem consistent with what I have read. Below is what I could imagine might work. What do you think?
Memory processed
------------------ = runtime for memory bound kernels ?
Memory bandwidth
or
Flops
------------ = runtime for computation bound kernels?
Max GFlops
Such a calculation will rarely give a good prediction. There are many factors that hurt performance, and those factors interact with each other in extremely complicated ways. So your calculation will give an upper bound on the performance, which in most cases is far away from the actual performance.
For example, among memory-bound kernels, those with a lot of cache misses behave differently from those with mostly hits, and likewise for kernels with divergence, or with barriers...
I suggest you to read this paper, which might give you more ideas on the problem: "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness".
Hope it helps.
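For what it's worth, the two ratios from the question are trivial to compute as an upper-bound sanity check; a sketch in host code (the peak figures and kernel numbers below are placeholders to be replaced with your card's specs and your kernel's actual traffic) might look like:

    #include <stdio.h>

    int main(void)
    {
        /* Placeholder device peaks -- substitute your card's datasheet values. */
        const double peak_bandwidth_gbs = 177.0;    /* GB/s    */
        const double peak_gflops        = 1500.0;   /* GFLOP/s */

        /* Hypothetical kernel: moves 2 GB of data and does 10 GFLOP of work. */
        const double bytes_moved = 2.0e9;
        const double flops_done  = 10.0e9;

        double t_memory  = bytes_moved / (peak_bandwidth_gbs * 1.0e9);
        double t_compute = flops_done  / (peak_gflops * 1.0e9);

        /* The kernel cannot run faster than its slower bound. */
        double t_best = t_memory > t_compute ? t_memory : t_compute;

        printf("memory-bound estimate:  %.3f ms\n", t_memory * 1e3);
        printf("compute-bound estimate: %.3f ms\n", t_compute * 1e3);
        printf("best-case runtime:      %.3f ms\n", t_best * 1e3);
        return 0;
    }

As the answer above says, treat the result strictly as a lower bound on the runtime (an upper bound on performance), not as a prediction.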
I think you can predict a best-case with a bit of work. Like you said, with instruction counts, memory bandwidth, input size, etc.
However, predicting the actual or worst-case is much trickier.
First off, there are factors like memory access patterns. E.g.: with older CUDA-capable cards, you had to pay attention to distributing your global memory accesses so that they wouldn't all contend for a single memory bank. (The newer CUDA cards use a hash between logical and physical addresses to resolve this.)
Secondly, there are non-deterministic factors like: how busy is the PCI bus? How busy is the host kernel? Etc.
I suspect the easiest way to get close to actual run-times is basically to run the kernel on subsets of the input and see how long it actually takes.
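A minimal sketch of that measure-it-yourself approach, using CUDA events to time a kernel launch on a subset of the input (the kernel and sizes are placeholders):

    #include <cstdio>

    // Placeholder kernel -- substitute your real one.
    __global__ void my_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;                    // subset size to test
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        my_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d elements: %.3f ms\n", n, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

Timing a few subset sizes and extrapolating usually gets you closer to the real run-time than any formula.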

CUDA: Can I find out if I have global memory coalescence?

I'm using a GeForce GTX 580 (compute capability 2.0).
In my program I suspect that the bottleneck is access to global memory in the kernel. I suspect this because all the calculations involve numbers obtained by indexing an array stored in global memory, and because switching from double precision to single precision only improves performance by about 10%. (AFAIK it should be twice as fast on a Fermi device if the floating-point operations were the bottleneck?)
So to attack this bottleneck I thought about memory coalescence. The problem is that I don't know whether I have achieved it or not. Either I already have it, and this is as good as it gets (25 times faster than the sequential version on an Intel i7), or I might get it to run much faster by somehow rewriting the code to get coalescence.
But is there a way to know? Can I somehow "turn off" coalescence to find out, or find out some other way?
The CUDA Visual profiler will show you the load/store efficiency of each kernel in the summary table; Grizzly gave a good answer about how this has changed in the newer cards here: Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)
No, memory coalescence is not something you turn on or off; it is something you achieve by using the correct memory access patterns and alignment. I am not sure, as I have never used it (I don't work on Windows), but I think Nvidia's Parallel Nsight can tell you whether your memory accesses are coalesced or not.
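As a rough illustration of the kind of rewrite that usually buys coalescing (hypothetical struct and kernel names): if each thread reads one field out of an array of structs, consecutive threads stride through memory, whereas a struct-of-arrays layout puts the values that consecutive threads need right next to each other.

    // Array-of-structs: thread i reads p[i].x, so consecutive threads are
    // sizeof(Particle) bytes apart and a warp's loads span many segments.
    struct Particle { float x, y, z, w; };

    __global__ void read_x_aos(const Particle *p, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = p[i].x;
    }

    // Struct-of-arrays: the x values are contiguous, so a warp's 32 reads
    // fall into one or two 128-byte segments and coalesce fully.
    __global__ void read_x_soa(const float *x, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = x[i];
    }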