I'm using a GeForce GTX 580 (compute capability 2.0).
In my program I suspect that the bottleneck is access to global memory in the kernel. I suspect this because all the calculations involve numbers obtained by indexing an array stored in global memory, and because switching from double precision to single precision only improves performance by about 10%. (As far as I know, it should be roughly twice as fast on a Fermi device if the floating-point operations are the bottleneck?)
So, to address this bottleneck, I thought about memory coalescing. The problem is that I don't know whether I have achieved it or not. Either I already have it, and this is as good as it gets (25 times faster than the sequential version on an Intel i7), or I might get it to run much faster by somehow rewriting the code to achieve coalescing.
But is there a way to know? Can I somehow "turn off" coalescing to find out, or find it out in another way?
The CUDA Visual Profiler will show you the load/store efficiency of each kernel in the summary table; Grizzly gave a good answer about how this has changed on the newer cards here: Compute Prof's fields for incoherent and coherent gst/gld? (CUDA/OpenCL)
No, memory coalescing is not something you turn on or off; it is something you achieve by using the correct memory access patterns and alignment. I am not sure, as I have never used it (I don't work on Windows), but I think NVIDIA's Parallel Nsight can tell you whether your memory accesses are coalesced or not.
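For intuition, here is a minimal sketch (the kernel names and the stride parameter are hypothetical, not from the question) of the kind of access-pattern difference that the profiler's load/store efficiency numbers reflect:

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // consecutive threads touch consecutive elements:
                          // the warp's loads combine into few transactions
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(size_t)i * stride % n];   // neighbouring threads touch
                                               // far-apart elements: many
                                               // separate transactions
}

Running both through the profiler and comparing their global load efficiency makes the difference very visible, which is usually a quicker diagnosis than trying to "turn coalescing off".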
I want to use cudaMallocManaged, but is it possible to force it to allocate memory on a specific GPU id (e.g. via cudaSetDevice) on a multi-GPU system?
The reason is that I need to allocate several arrays on the GPU, and I know which sets of these arrays need to work together, so I want to manually make sure they are on the same GPU.
I searched the CUDA documentation but didn't find any info related to this. Can someone help? Thanks!
No, you can't do this directly via cudaMallocManaged. The idea behind managed memory is that the allocation migrates to whatever processor it is needed on.
If you want to manually make sure a managed allocation is "present" on (migrated to) a particular GPU, you would typically use cudaMemPrefetchAsync. Some examples are here and here. This is generally recommended for good performance if you know which GPU the data will be needed on, rather than using "on-demand" migration.
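A minimal sketch of that approach (the device id and size are hypothetical, and error checking is omitted):

#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;          // hypothetical array size
    float *data = nullptr;

    cudaMallocManaged(&data, n * sizeof(float));   // not tied to any GPU yet

    int targetDevice = 1;              // the GPU you want the data resident on
    cudaSetDevice(targetDevice);

    // Migrate the pages to that GPU up front instead of relying on
    // on-demand (fault-driven) migration when the kernel first touches them.
    cudaMemPrefetchAsync(data, n * sizeof(float), targetDevice);

    // ... launch kernels on targetDevice that use data ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}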
Some blogs on managed memory/unified memory usage are here and here, and some recorded training is available here, session 6.
From N.2.1.1. Explicit Allocation Using cudaMallocManaged() (emphasis mine):
By default, the devices of compute capability lower than 6.x allocate managed memory directly on the GPU. However, the devices of compute capability 6.x and greater do not allocate physical memory when calling cudaMallocManaged(): in this case physical memory is populated on first touch and may be resident on the CPU or the GPU.
So for any recent architecture it works like NUMA nodes on the CPU: the allocation says nothing about where the memory will physically end up. That is instead decided on "first touch", i.e. at initialization. So as long as the first write to these locations comes from the GPU where you want the memory to be resident, you are fine.
Therefore I also don't think a feature request will find support. In this memory model, allocation and placement are simply independent operations.
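To make the first-touch behaviour concrete, a minimal sketch (the kernel and function names are hypothetical):

__global__ void firstTouch(float *p, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = 0.0f;            // the first write populates the page on this GPU
}

void placeOnGpu(float *p, size_t n, int device)
{
    cudaSetDevice(device);      // decide where the first touch happens
    firstTouch<<<(unsigned)((n + 255) / 256), 256>>>(p, n);
    cudaDeviceSynchronize();
}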
In addition to explicit prefetching as Robert Crovella described, you can give more information about which devices will access which memory locations in which way (reading/writing) by using cudaMemAdvise (see N.3.2. Data Usage Hints).
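A minimal sketch of such hints (the pointer, size and device ids are hypothetical):

#include <cuda_runtime.h>

void adviseForTwoGpus(float *data, size_t bytes, int gpuA, int gpuB)
{
    // Preferred physical location of the pages.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, gpuA);

    // Another GPU will also touch the data, so keep a mapping for it
    // instead of migrating the pages back and forth on every access.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, gpuB);
}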
The idea behind all this is that you can start off by just using cudaMallocManaged and not caring about placement etc. during fast prototyping. Later you profile your code and optimize the parts that are slow using hints and prefetching, getting (almost) the same performance as with explicit memory management and copies. The final code may not be much easier to read or less complex than with explicit management (e.g. cudaMemcpy gets replaced with cudaMemPrefetchAsync), but the big difference is that you pay for certain mistakes with worse performance rather than with a buggy application, e.g. corrupted data that might go unnoticed.
In Multi-GPU applications this idea of not caring about placement at the start is probably not applicable, but NVIDIA seems to want cudaMallocManaged to be as uncomplicated as possible for this type of workflow.
I am using CUDA 6.0 and the OpenCL implementation that comes bundled with the CUDA SDK. I have two identical kernels for each platform (they differ in the platform specific keywords). They only read and write global memory, each thread different location. The launch configuration for CUDA is 200 blocks of 250 threads (1D), which corresponds directly to the configuration for OpenCL - 50,000 global work size and 250 local work size.
The OpenCL code runs faster. Is this possible, or am I timing it wrong? My understanding is that NVIDIA's OpenCL implementation is based on the one for CUDA. I get around 15% better performance with OpenCL.
It would be great if you could suggest why I might be seeing this and perhaps some differences between CUDA and OpenCL as implemented by NVIDIA?
Kernels executing on a modern GPU are almost never compute bound, and are almost always memory bandwidth bound. (Because there are so many compute cores running compared to the available path to memory.)
This means that the performance of a given kernel usually depends largely on the memory access patterns exhibited by the given algorithm.
In practice this makes it very difficult to predict (or even understand) what performance to expect ahead of time.
The differences you observed are likely due to subtle differences in the memory access patterns between the two kernels that result from different optimizations made by the OpenCL vs CUDA toolchain.
To learn how to optimize your GPU kernels it pays to learn the details of the memory caching hardware available to you, and how to use it to best advantage. (e.g., making strategic use of "local" memory caches vs always going directly to "global" memory in OpenCL.)
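As a hedged illustration of that last point (a deliberately simple 3-point smoothing kernel; the name and tile size are hypothetical), this is the CUDA counterpart of using OpenCL "local" memory: each block stages its slice of global memory in __shared__ memory once and then reuses it:

#define TILE 256   // launch with TILE threads per block

__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];                  // +2 for a one-element halo

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    // Stage this block's slice of global memory in fast on-chip memory.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;             // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;   // right halo
    __syncthreads();

    // Each output reuses values that are already on chip.
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}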
I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.
I tried rewriting my program with the ArrayFire Free version. It is indeed faster than my CPU routine with multi-threading enabled, but not to the degree I expected (that is, less than a 100% speedup), and the returned results are not quite right (less than 1% error compared to the CPU routine, assuming the CPU routine's results are correct).
My task is mainly element-wise float-32 maths operations on large matrices (300 MB to 500 MB in size), with few if-thens/switch-cases etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory, since there is a lot of data reading, etc. The GPU I tested is a GeForce GTX 580 with 3 GB of video memory.
Is there still significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems that there are some memory-access tricks there for faster data access and for reducing bank conflicts. Does ArrayFire use these general tricks automatically or not?
Thanks for the post. Glad to hear initial results were giving some speedup. I work on ArrayFire and can chime in here on your questions.
First and foremost, code is really required here for anyone to help with specificity. Can you share the code you wrote?
Second, you should think about CUDA and ArrayFire in the following way: CUDA is a way to program the GPU that gives you the ability to write any GPU code you want. But there is a huge difference between naive CUDA code (often slower than the CPU) and expert, painstakingly hand-optimized CUDA code. ArrayFire (and some other GPU libraries like CUBLAS) have many man-years of optimizations poured into them, and are typically going to give better results than most people will have time to achieve on their own. However, there is also variability in how well someone uses ArrayFire (or any other library). There are variables that can and should be tweaked in the usage of ArrayFire library calls to get the best performance. If you post your code, we can help share some of those here.
Third, ArrayFire uses CUBLAS in the functions that rely on BLAS, so you're not likely to see much difference using CUBLAS directly.
Fourth, yes, ArrayFire uses the optimizations that are available in the NVIDIA CUDA Programming Guide (e.g. faster data transfer and reducing memory bank conflicts, as you mention). That's where the bulk of ArrayFire development is focused: optimizing those sorts of things.
Finally, the data discrepancies you noticed are likely due to the nature of CPU vs GPU computing. Since they are different devices, you will often see slightly different results. It's not that the CPU gives better results than the GPU; rather, they both work with finite precision in slightly different ways. If you're using single precision instead of double, you might consider that. Posting code will let us help with that too.
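As a small, self-contained illustration of the finite-precision point (the numbers are hypothetical and this is ordinary host code): merely changing the order in which floating-point additions happen changes the result, and CPUs and GPUs rarely evaluate expressions in exactly the same order.

#include <cstdio>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float leftToRight = (a + b) + c;   // 1.0f
    float reordered   = a + (b + c);   // 0.0f: b + c rounds back to b in float

    printf("%f vs %f\n", leftToRight, reordered);
    return 0;
}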
Happy to expand my answer once code is posted.
I've tested empirically with several values for the number of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I gather that threads in a block may have access to specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated over blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could this be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
If you use shared memory, you might want to consider it first, because it is a very limited resource and kernels often have very specific needs that constrain the many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (at constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for larger (48 KB) L1 caches by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1 caching of global accesses can be disabled with the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel already handles its global memory accesses carefully.
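If only some of your kernels benefit from the larger L1, note that there is also a per-kernel variant of the same call; a minimal sketch with a hypothetical kernel name:

__global__ void stencilKernel(const float *in, float *out, int n) { /* ... */ }

void configureCaches()
{
    // Prefer 48 KB L1 / 16 KB shared memory for this kernel only,
    // instead of changing the device-wide default.
    cudaFuncSetCacheConfig(stencilKernel, cudaFuncCachePreferL1);
}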
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs: register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It is theoretically possible to find optimal values from within an application; however, having the client code adjust optimally to both different devices and launch parameters can be nontrivial and will require recompilation, or different variants of the kernel to be deployed for every target device architecture.
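For completeness: CUDA toolkits newer than the ones discussed in this thread (6.5 and later) expose an occupancy API that can suggest a block size at runtime; a minimal sketch with a hypothetical kernel:

__global__ void occupancyKernel(const float *in, float *out, int n) { /* ... */ }

void launchAutoTuned(const float *in, float *out, int n)
{
    int minGridSize = 0;   // minimum grid size needed to reach full occupancy
    int blockSize   = 0;   // suggested threads per block

    // The runtime picks a block size from the kernel's register and
    // shared-memory usage on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, occupancyKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;
    occupancyKernel<<<gridSize, blockSize>>>(in, out, n);
}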
I believe that automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature already.
The reason is that the optimal configuration depends on the implementation and on the kind of algorithm you are implementing. It requires profiling and experimentation to get the best performance.
Here are some constraints you can consider:
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to better performance. The best performance is obtained by reaching the right occupancy in your application and keeping the GPU cores busy all the time.
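To check the first two points without guessing, you can query the kernel's resource usage at runtime; a minimal sketch (hypothetical kernel name):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void queryKernel(float *data, int n) { /* ... */ }

void printKernelLimits()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, queryKernel);

    // Registers and static shared memory per block limit how many blocks can
    // be resident on a multiprocessor, i.e. the achievable occupancy.
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}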
There is a quite good answer here; in short, computing the optimal distribution of work over blocks and threads is a difficult problem.
To what degree can one predict / calculate the performance of a CUDA kernel?
Having worked a bit with CUDA, this seems non-trivial to me.
But a colleague of mine, who does not work with CUDA, told me that it can't be hard if you know the memory bandwidth, the number of processors and their speed.
What he said doesn't seem consistent with what I have read. Below is what I imagine could work. What do you think?
Memory processed
------------------ = runtime for memory-bound kernels?
Memory bandwidth

or

Floating-point operations
-------------------------- = runtime for compute-bound kernels?
Peak FLOP/s
Such a calculation will rarely give a good prediction. There are many factors that hurt performance, and those factors interact with each other in an extremely complicated way. So your calculation will give an upper bound on the performance (i.e. a lower bound on the runtime), which in most cases is far away from the actual performance.
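To give a sense of how such a bound looks in practice (hypothetical, round numbers): a card with roughly 190 GB/s of peak memory bandwidth that has to move 1 GB of data cannot finish in less than about 1/190 s, roughly 5.3 ms, no matter how cheap the arithmetic is; the measured time will typically be noticeably higher.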
To give a few examples of such factors: memory-bound kernels with a lot of cache misses behave very differently from those whose accesses mostly hit the cache, and the same goes for kernels with divergence, barriers, and so on.
I suggest you read this paper, which might give you more ideas on the problem: "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness".
Hope it helps.
I think you can predict a best case with a bit of work, like you said, from instruction counts, memory bandwidth, input size, etc.
However, predicting the actual or worst-case is much trickier.
First off, there are factors like memory access patterns. For example, with older CUDA-capable cards, you had to pay attention to distributing your global memory accesses so that they wouldn't all contend for a single memory bank. (Newer CUDA cards use a hash between logical and physical addresses to resolve this.)
Secondly, there are non-deterministic factors like: how busy is the PCI bus? How busy is the host kernel? Etc.
I suspect the easiest way to get close to actual run-times is basically to run the kernel on subsets of the input and see how long it actually takes.
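A minimal sketch of that approach (the kernel name, launch configuration and subset sizes are hypothetical):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void workKernel(const float *in, float *out, int n) { /* ... */ }

float timeKernelMs(const float *dIn, float *dOut, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    workKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Time n = N/8, N/4, N/2 and check that the runtime scales roughly linearly
// before trusting an extrapolation to the full problem size N.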