How to estimate relative performance of CUDA GPUs? - cuda

How can I estimate the CUDA performance of cards that I don't own, i.e. new cards?
For instance, I found an incomplete CUDA example whose author wrote that it takes 0.7 s on his GeForce 8600 GT. On my Quadro it takes 1.7 s.
My question is: is the code I wrote to fill the gaps faulty, or is the GF 8600 really twice as fast?
The kernel is memory bound, but my card has a higher memory bandwidth. I don't know what conclusions to draw from this.
Name                 Quadro FX 580   GeForce 8600 GT
CUDA Cores           32              32
Core clock (MHz)     450             540
Memory clock (MHz)   400             700
Memory BW (GB/s)     25.6            22.4
Shader Clock (MHz)   ????            1180

Just want to provide you with some pointers to possible sources of error. Firstly, use CUDA events to time your code, not the CUDA profiler, as events are more accurate. Secondly, check what the author is measuring: is he only talking about the computation time, or is he also including the time to transfer data to and from the GPU? Are you measuring the same thing?
Thirdly, the CUDA architecture changes quite fast. For example, on cards with compute capability 1.x it is suggested to use shared memory to get better performance, whereas cards with compute capability 2.x have an L1 cache per multiprocessor that makes global memory accesses quite fast. So you may also want to compare the architectures and compute capabilities of the two cards.
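For the timing itself, measuring just the kernel with CUDA events looks roughly like this (a minimal sketch using Numba's event bindings from Python; the kernel and launch configuration are placeholders, not the code from the example in question):

from numba import cuda
import numpy as np

@cuda.jit
def my_kernel(a):                 # placeholder kernel: doubles each element
    i = cuda.grid(1)
    if i < a.size:
        a[i] *= 2

a = np.ones(1 << 20, dtype=np.float32)
d_a = cuda.to_device(a)           # keep host<->device copies out of the timed region

my_kernel[256, 256](d_a)          # warm-up launch so JIT compilation is not timed

start, end = cuda.event(), cuda.event()
start.record()
my_kernel[256, 256](d_a)
end.record()
end.synchronize()
print("kernel time: %.3f ms" % cuda.event_elapsed_time(start, end))

Timing the kernel on both cards this way (and the transfers separately) shows whether the gap comes from the kernel itself or from the copies.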

Related

Maximum number of GPU Threads on Hardware and used memory

I have already read several threads about the capacity of the GPU and understood that the concept of blocks and threads has to be separated from the physical hardware. Although the maximum number of threads per block is 1024, there is no limit on the number of blocks one can use. However, as the number of streaming processors is finite, there has to be a physical limit. After writing a GPU program, I would like to evaluate how much of my GPU's capacity is used. To do this, I have to know how many threads I could theoretically run at one time on the hardware. My graphics card is an NVIDIA GeForce 1080 Ti, so I have 3584 CUDA cores. As far as I understand, each CUDA core executes one thread, so in theory I would be able to execute 3584 threads per cycle. Is this correct?
Another question concerns memory. I installed and used nvprof to get some insight into the kernels used; what is displayed there is, for example, the number of registers used. I transfer my arrays to the GPU using cuda.to_device (in Python Numba), and as far as I understand, the arrays then reside in global memory. How do I find out how big this global memory is? Is it equivalent to the DRAM size?
Thanks in advance
I'll focus on the first part of the question. The second should really be its own separate question.
CUDA cores do not map 1-to-1 to threads. They are more like ports in a superscalar CPU: multiple threads can issue instructions to the same CUDA core in different clock cycles, somewhat like hyperthreading in a CPU.
You can see the relation and the numbers in the documentation by comparing chapter K, Compute Capabilities, with Table 3, Throughput of Native Arithmetic Instructions. For your card (compute capability 6.1) that is, for example, 2048 resident threads per SM and 128 32-bit floating point operations per clock cycle. That means you have 128 CUDA cores shared by a maximum of 2048 threads.
Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs). TechPowerUp's excellent GPU database documents 28 SMs for your card, which should give you 28 * 2048 = 57,344 resident threads, unless I did something wrong.
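If you want to check this on your own card from Python, a quick sketch with Numba (the 2048 threads-per-SM figure comes from the compute capability table rather than from a query):

from numba import cuda

dev = cuda.get_current_device()
sm_count = dev.MULTIPROCESSOR_COUNT     # 28 on a GTX 1080 Ti
threads_per_sm = 2048                   # from the compute capability table for cc 6.1
print("max resident threads:", sm_count * threads_per_sm)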

What does it mean to say a GPU is under-utilized due to low occupancy?

I am using Numba and CuPy for GPU coding. I have now switched my code from an NVIDIA V100 card to an A100, and I get the following warnings:
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
NumbaPerformanceWarning:Host array used in CUDA kernel will incur copy overhead to/from device.
Does anyone know what the two warnings really suggest? How should I improve my code?
NumbaPerformanceWarning: Grid size (27) < 2 * SM count (216) will likely result in GPU under utilization due to low occupancy.
A GPU is subdivided into SMs. Each SM can hold a complement of threadblocks (which is like saying it can hold a complement of threads). In order to "fully utilize" the GPU, you would want each SM to be "full", which roughly means each SM has enough threadblocks to fill its complement of threads. An A100 GPU has 108 SMs. If your kernel launch (i.e. the grid) has fewer than 108 threadblocks, then your kernel will not be able to fully utilize the GPU; some SMs will be empty. A threadblock cannot be resident on 2 or more SMs at the same time. Even 108 (one per SM) may not be enough: an A100 SM can hold 2048 threads, which is at least two threadblocks of 1024 threads each. Anything less than 2*108 threadblocks in your kernel launch may not fully utilize the GPU. When you don't fully utilize the GPU, your performance may not be as good as possible.
The solution is to expose enough parallelism (enough threads) in your kernel launch to fully "occupy" or "utilize" the GPU. 216 threadblocks of 1024 threads each is sufficient for an A100. Anything less may not be.
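To make that concrete, here is a sketch (Numba, with a made-up element-wise kernel) that sizes the grid from the SM count instead of hard-coding a small number of blocks:

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    # grid-stride loop: any grid size covers the whole array
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.size, stride):
        x[i] *= factor

dev = cuda.get_current_device()
threads_per_block = 1024
blocks = 2 * dev.MULTIPROCESSOR_COUNT    # 216 on an A100 (108 SMs)

d_x = cuda.to_device(np.ones(1 << 24, dtype=np.float32))
scale[blocks, threads_per_block](d_x, 2.0)

With a grid-stride loop the grid size becomes a tuning knob rather than a correctness requirement.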
For additional understanding here, I recommend the first 4 sections of this course.
NumbaPerformanceWarning:Host array used in CUDA kernel will incur copy overhead to/from device.
One of the cool things about a numba kernel launch is that I can pass to it a host data array:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel[blocks, threads](a)
and numba will "do the right thing". In the above example it will:
1. create a device array for storage of a in device memory, let's call this d_a
2. copy the data from a to d_a (Host->Device)
3. launch your kernel, where the kernel is actually using d_a
4. when the kernel is finished, copy the contents of d_a back to a (Device->Host)
That's all very convenient. But what if I were doing something like this:
a = numpy.ones(32, dtype=numpy.int64)
my_kernel1[blocks, threads](a)
my_kernel2[blocks, threads](a)
Numba will perform steps 1-4 above for the launch of my_kernel1, and then perform steps 1-4 again for the launch of my_kernel2. In most cases this is probably not what you want as a Numba CUDA programmer.
The solution in this case is to "take control" of data movement:
a = numpy.ones(32, dtype=numpy.int64)
d_a = numba.cuda.to_device(a)
my_kernel1[blocks, threads](d_a)
my_kernel2[blocks, threads](d_a)
a = d_a.copy_to_host()
This eliminates unnecessary copying and will generally make your program run faster. (For trivial examples involving a single kernel launch, there will probably be no difference.)
For additional understanding, probably any online tutorial such as this one, or just the numba cuda docs, will be useful.

Strange results for CUDA SDK Bandwidth Test

I have a CUDA application that is data-movement bound (i.e. large memcopies from host to device with relatively little computation done in the kernel). On older GPUs (e.g. the Quadro FX 5800) I was compute bound, but with the Fermi and Kepler architectures that is no longer the case (for my optimized code).
I just moved my code to a GTX 680 and was impressed with the increased compute performance, but perplexed that the bandwidth between the host and GPU seems to have dropped (relative to my Fermi M2070).
In short when I run the canned SDK bandwidth test I get ~5000 MB/sec on the GTX 680 versus ~5700 MB/sec on the M2070. I recognize that the GTX is "only a gamer card", but the specs for the GTX 680 seem more impressive than for the M2070, WITH THE EXCEPTION OF THE BUS WIDTH.
From wikipedia:
M2070: 102.4 GB/sec, GDDR3, 512 bit bus width
GTX 680: 192 GB/sec, GDDR5, 256 bit bus width
I'm running the canned test with "--wc --memory=pinned" to use write-combined memory.
The improved results I get with this test are mirrored by the results I am getting with my optimized CUDA code.
Unfortunately, I can't run the test on the same machine (and just switch video cards), but I have tried the GTX 680 on older and newer machines and get the same "degraded" results (relative to what I get on the M2070).
Can anyone confirm that they are able to achieve higher-throughput memcopies with the Tesla M2070 than with the GTX 680? Doesn't the bandwidth spec take the bus width into consideration? The other possibility is that I'm not doing the memcopies correctly/optimally on the GTX 680, but in that case, is there a patch for the bandwidth test so that it will also show that I'm transferring data faster to the 680 than to the M2070?
Thanks.
As Robert Crovella has already commented, your bottleneck is the PCIe bandwidth, not the GPU memory bandwidth.
Your GTX 680 can potentially outperform the M2070 by a factor of two here as it supports PCIe 3.0 which doubles the bandwidth over the PCIe 2.0 interface of the M2070. However you need a mainboard supporting PCIe 3.0 for that.
The bus width of the GPU memory is not a concern in itself, even for programs that are GPU memory bandwidth bound. Nvidia managed to substantially increase the frequencies used on the memory bus of the GTX 680, which more than compensates for the reduced bus width relative to the M2070.
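If you want a sanity check outside the SDK sample, the effective host-to-device rate can be measured directly (a rough sketch with Numba's pinned_array; the transfer size and repetition count are arbitrary choices):

import time
import numpy as np
from numba import cuda

n_bytes = 256 * 1024 * 1024                     # 256 MiB per transfer
a = cuda.pinned_array(n_bytes, dtype=np.uint8)  # page-locked host buffer, like --memory=pinned
a[:] = 1
d_a = cuda.device_array_like(a)

reps = 10
cuda.synchronize()
t0 = time.perf_counter()
for _ in range(reps):
    d_a.copy_to_device(a)                       # host -> device copy
cuda.synchronize()
t1 = time.perf_counter()

print("H2D: %.1f GB/s" % (reps * n_bytes / 1e9 / (t1 - t0)))

This should land in the same range as the numbers discussed above: roughly 5-6 GB/s on a PCIe 2.0 x16 link, and roughly double that on PCIe 3.0.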

Scalability Analysis on GPU

I am trying to do a scalability analysis on my Quadro FX 5800, which has 240 cores: how does the run time scale with the number of cores? This is a classic study in parallel computing.
I was wondering how the definition of a core fits into this.
And how can I run on different core settings, say 8, 16, 32, 64, 128, 240 cores?
My test case is the simple matrix multiplication.
Scalability on the GPU should not be measured in terms of CUDA cores but in terms of SM utilization. IPC is probably the best single measure of SM utilization. When developing an algorithm, you want to partition your work such that you can distribute sufficient work to all SMs, such that on every cycle the warp scheduler has at least one warp eligible to issue an instruction. In general this means you have to have sufficient warps on each SM to hide instruction and memory latency, and to provide a variety of instruction types to fill the execution pipeline.
If you want to test scaling across CUDA cores (meaningless), then you can launch thread blocks containing 1, 2, 3, ... 32 threads per block. Launching a non-multiple of WARP_SIZE (=32) threads per thread block will result in only a subset of the cores being used; the rest are basically wasted execution slots.
If you want to test scaling in terms of SMs, you can scale your algorithm from 1 thread block to 1000s of thread blocks. In order to understand scaling, you can artificially limit the thread blocks per SM by configuring the shared memory per thread block when you launch.
Re-writing matrix multiply to optimally scale in each of these directions is likely to be frustrating. Before you undertake that project, I would recommend understanding how distributing a simple parallel computation, such as summing from 0 to 100000 or calculating a factorial, scales across multiple thread blocks. These algorithms are only a few lines of code, and the aforementioned scaling can be tried by varying the launch configuration (GridDim, BlockDim, SharedMemoryPerBlock) and one or two kernel parameters. You can time the different launches using the CUDA profiler, Visual Profiler, Parallel Nsight, or CUDA events.
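As an illustration of the shared-memory trick mentioned above, a sketch with Numba: the launch configuration takes a dynamic shared-memory size as its fourth element, and requesting a large allocation per block caps how many blocks can be resident on an SM at once. The kernel and sizes here are made up, and the right byte count depends on your GPU's shared memory per SM:

import numpy as np
from numba import cuda

@cuda.jit
def touch(x):
    # trivial made-up kernel; only the launch configuration matters here
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

d_x = cuda.to_device(np.zeros(1 << 20, dtype=np.float32))

# Launch config is [grid, block, stream, dynamic shared memory in bytes].
# On a GPU with 48 KB of shared memory per SM, asking for ~25 KB per block
# leaves room for only one resident block per SM, throttling occupancy
# without changing the kernel code.
shared_bytes = 25 * 1024
touch[4096, 256, 0, shared_bytes](d_x)

Sweeping shared_bytes downward lets progressively more blocks become resident per SM, which gives you the occupancy axis of the scaling study.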
Assuming you are using CUDA or OpenCL as the programming model: one simple way to restrict utilization to M multiprocessors (SMs) is to launch your kernel with an execution configuration of M blocks (of threads). If each SM is composed of N cores, in this way you can test scalability across N, 2N, 4N, ... cores.
For example, if the GPU has 4 SMs, each SM having 32 cores, then by running kernels of 1, 2 and 4 blocks, your kernel will utilize 32, 64 and 128 cores of the GPU.
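A sketch of that sweep (Numba, with a made-up vector-add kernel and event timing; block counts, block size, and array size are arbitrary):

import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, out):
    # grid-stride loop, so the same kernel is correct for any block count
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, a.size, stride):
        out[i] = a[i] + b[i]

n = 1 << 22
d_a = cuda.to_device(np.random.rand(n).astype(np.float32))
d_b = cuda.to_device(np.random.rand(n).astype(np.float32))
d_out = cuda.device_array_like(d_a)

vec_add[1, 256](d_a, d_b, d_out)          # warm-up launch, keeps JIT compile out of the timings

for blocks in (1, 2, 4):                  # roughly 1, 2, 4 SMs in use
    start, end = cuda.event(), cuda.event()
    start.record()
    vec_add[blocks, 256](d_a, d_b, d_out)
    end.record()
    end.synchronize()
    print(blocks, "block(s):", cuda.event_elapsed_time(start, end), "ms")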

GTX 295 vs other NVIDIA cards for CUDA development

What is the best NVIDIA video card for CUDA development? A single GTX 295 has 2 GPUs; is it possible to have two GTX 295s and use the 4 GPUs in my CUDA code?
Is it better to get two 480 cards rather than two 295s? Would a Fermi be better than both cards?
What is the best NVIDIA video card for CUDA development?
Whatever fits in your budget and suits your needs. I know this is a bit vague, but after all it really is as simple as that ;)
A single GTX 295 has 2 GPUs; is it possible to have two GTX 295s and use the 4 GPUs in my CUDA code?
Sure, it is. The only drawback is that the 2 GPUs on the GTX 295 share a single PCIe connection. Whether this is relevant for you or not depends on whether the application needs intensive communication with the host.
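For what it's worth, driving multiple GPUs from one program means selecting a device per host thread or process; in Python with Numba that looks roughly like this (a sketch; the two chips of a GTX 295 show up as two separate CUDA devices):

from numba import cuda

cuda.detect()           # prints every CUDA device the driver can see
cuda.select_device(0)   # kernels launched from this host thread now run on device 0
# ... allocate memory and launch kernels for device 0 here ...
cuda.close()            # release the context before this thread selects another device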
Is it better to get two 480 cards rather than two 295s? Would a Fermi be better than both cards?
From the point of view of raw peak performance, a GTX 295 (which is almost 2x a GTX 280, not considering the shared PCIe connection) is better than a 480. However, the GF10x series architecture improved on many points compared to the GT200; for details see the "Fermi whitepaper" and the "Fermi Tuning Guide".
If you're planning to use double precision, the GF10x series has much improved double precision support, but it's good to know that on GeForce cards this is capped to 1/8th of the single precision performance (normally it's about half).
Therefore, I would suggest that unless you have a strong reason to get lots of GFlops (Folding@home?) in the form of soon-to-be-outdated hardware, get a GTX 480, or a 470 if you want to save ~25%.
Direct answer: I would go with one or maybe two GTX 480s. But I think my reasoning is a bit different from @bobince's or @pszilard's.
Background: I just made the same decision you're facing, but our situations may be vastly different.
I'm a statistics graduate student in a department with minimal funding for GPU computing resources. The campus does have one Fermi box hooked up to two nodes that I have access to, but those run Linux -- which I love -- and I really want to use nSight to benchmark and tune my code, so I need Windows. So I decided to purchase a development box that I dual boot: Ubuntu x64 for production runs, and Win 7 with VS 2010 (a battle which I'm presently fighting) and nSight 1.5 for development. That said, back to the reason why I bought two GTX 480s (EVGA is awesome!!) and not two GTX 285s or 295s.
I've spent the past two years developing a couple of CUDA kernels. The trickiest part of the development, for me, is the memory management. I spent the better part of three months trying to squeeze a Cholesky decomposition & back substitution into 16 single-precision registers -- the most you can use before either the GTX 285 or 295 incurs a 50% performance penalty (literally 3 weeks going from 17 to 16 registers). For me, the fact that all Fermi architectures have double the registers means that those three months would've gained me about a 10% improvement on a GTX 480 instead of 50% on a GTX 285, and hence were probably not worth my time -- in truth it's a bit more subtle than that, but you get the drift.
If you're fairly new to CUDA -- which you probably are since you're asking -- I would say 32 registers is HUGE. Second, I think the L1 cache of the Fermi architecture can directly translate to faster global memory accesses -- of course it does, but I haven't measured the impact directly yet. If you don't need the global memory as much, you can trade the bigger L1 cache for triple the shared memory -- which was also a tight squeeze for me as the matrix sizes increased.
Then I would agree with @pszilard that if you need double precision, Fermi is definitely the way to go -- though I'd still write your code in single precision first, tune it, and then migrate to double.
I don't think that concurrent kernel execution will matter for you -- it's really cool, the delays to kernel completion can be orders of magnitude less -- but you're probably going to focus on one kernel first, not parallel kernels. If you want to do streaming or parallel kernels, then you need Fermi -- the 285 / 295's simply can't do it.
And lastly, the drawback of going with the 295s is that you have to write two layers of parallelism: (1) one to distribute blocks (or kernels?) across the cards and (2) the GPU kernel itself. If you're just starting out, it's much easier to keep the parallelism in one place (on a single card) as opposed to fighting two battles at once.
Ps. If you haven't written your kernels yet, you might consider getting only one card and waiting six months to see if the landscape changes again -- though I have no idea when the next cards are to be released.
PPs. I absolutely loved running my cuda kernel on the GTX 480 which I had debugged / designed on a Tesla C1070 and instantly realizing a 2x speed improvement. Money well spent.
Is it possible to have two GTX 295s and use the 4 GPUs in my CUDA code?
Yes. Or quad, if you're totally insane.
Is it better to get two 480 cards rather than two 295s?
Arguable. 295 as a dual-gpu has slightly more raw oomph, but 480 as a 40nm-process card without the dual-gpu overhead may use its resources better. Benchmarks vary. Of course the Fermi 4xx range has more modern feature support (3D, DirectX, OpenCL etc).
But dual-295 is going to have seriously huge PSU and cooling requirements. And dual-480 runs almost as hot. Not to mention the expense. What are you working on that you think you're going to need this? Have you considered the more mainstream parts, eg 460, which is generally considered to offer a better price/performance than the troubled 470–480 (GF100) part?