Estimating increase in speed when changing NVIDIA GPU model - cuda

I am currently developing a CUDA application that will most certainly be deployed on a GPU much better than mine. Given another GPU model, how can I estimate how much faster my algorithm will run on it?

You're going to have a difficult time, for a number of reasons:
Clock rate and memory speed only have a weak relationship to code speed, because there is a lot more going on under the hood (e.g., thread context switching) that gets improved/changed for almost all new hardware.
Caches have been added to new hardware (e.g., Fermi) and unless you model cache hit/miss rates, you'll have a tough time predicting how this will affect the speed.
Floating point performance in general is very dependent on model (e.g.: Tesla C2050 has better performance than the "top of the line" GTX-480).
Register usage per device can change for different devices, and this can also affect performance; occupancy will be affected in many cases.
Performance can be improved by targeting specific hardware, so even if your algorithm is perfect for your GPU, it could be better if you optimize it for the new hardware.
Now, that said, you can probably make some predictions if you run your app through one of the profilers (such as the NVIDIA Compute Profiler) and look at your occupancy and your SM utilization. If your GPU has 2 SMs and the one you will eventually run on has 16 SMs, then you will almost certainly see an improvement, but don't expect it to scale with the SM count alone.
So, unfortunately, it isn't easy to make the type of predictions you want. If you're writing something open source, you could post the code and ask others to test it with newer hardware, but that isn't always an option.

This can be very hard to predict for certain hardware changes and trivial for others. Highlight the differences between the two cards you're considering.
For example, the change could be as trivial as -- if I had purchased one of those EVGA water-cooled behemoths, how much better would it perform over a standard GTX 580? This is just an exercise in computing the differences in the limiting clock speed (memory or gpu clock). I've also encountered this question when wondering if I should overclock my card.
If you're going to a similar architecture, GTX 580 to Tesla C2070, you can make a similar case of differences in clock speeds, but you have to be careful of the single/double precision issue.
If you're doing something much more drastic, say going from a mobile card -- GTX 240M -- to a top of the line card -- Tesla C2070 -- then you may not get any performance improvement at all.
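For the same-architecture cases above (an overclocked card, or GTX 580 to Tesla C2070), the clock-ratio estimate might look like the following minimal sketch. The spec numbers are made-up placeholders, not real card data, and the function name and dictionary keys are hypothetical. It assumes you already know from profiling whether your kernel is limited by the core clock or the memory clock.

```python
# Hypothetical spec sheets: placeholder numbers, not real card data.
def clock_ratio_estimate(old, new, bound):
    """Return the new/old ratio of the clock that limits the kernel.

    bound is "compute" (core clock is the limit) or
    "memory" (memory clock is the limit).
    """
    key = {"compute": "core_mhz", "memory": "mem_mhz"}[bound]
    return new[key] / old[key]

stock = {"core_mhz": 800.0, "mem_mhz": 2000.0}
overclocked = {"core_mhz": 880.0, "mem_mhz": 2100.0}

compute_gain = clock_ratio_estimate(stock, overclocked, "compute")
memory_gain = clock_ratio_estimate(stock, overclocked, "memory")
print(f"compute-bound kernel: about {compute_gain:.2f}x")
print(f"memory-bound kernel:  about {memory_gain:.2f}x")
```

This only holds when the architecture stays the same. Across architectures (the mobile-card case) the ratio tells you very little.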
Note: Chris is very correct in his answer, but I wanted to stress this caution because I envision this common work path:
One says to the boss:
So I've heard about this CUDA thing... I think it could make function X much more efficient.
Boss says you can have 0.05% of work time to test out CUDA -- hey we already have this mobile card, use that.
One year later... So CUDA could get us a three fold speedup. Could I buy a better card to test it out? (A GTX 580 only costs $400 -- less than that intern fiasco...)
You spend the $$, buy the card, and your CUDA code runs slower.
Your boss is now upset. You've wasted time and money.
So what happened? Developing on an old card (think 8800, 9800, or even the mobile GTX 2XX with something like 30 cores) leads you to design and optimize your algorithm very differently from how you would to efficiently utilize a card with 512 cores. Caveat emptor: you get what you pay for. Those awesome cards are awesome, but your code may not run faster.
Warning issued, what's the walk away message? When you get that nicer card, be sure to invest time in tuning, testing, and possibly redesigning your algorithm from the ground up.
OK, so that said, rule of thumb? GPUs get twice as fast every six months. So if you're moving from a card that's two years old to a card that's top of the line, claim to your boss that it will run between 4 and 8 times faster (and if you get the full 16-fold improvement, bravo!!)
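The arithmetic behind that rule of thumb, as a small sketch. The six-month doubling period is just the rule of thumb stated above, not a measured figure, and the function name is made up for illustration.

```python
def rule_of_thumb_speedup(age_months, doubling_months=6):
    """Ideal speedup if performance doubles every `doubling_months`."""
    return 2 ** (age_months / doubling_months)

ideal = rule_of_thumb_speedup(24)   # two-year-old card -> 2**4
print(ideal)                        # 16.0
```

Two years is four doublings, so 16x is the ideal case. Promising 4x to 8x leaves room for code that was tuned for the old card.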

Related

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to make to people who have (almost) no clue how a GPU works. I think saying that a GPU has a thousand cores where a CPU only has four to eight of them is nonsense. But I want to give my audience an element of comparison.
After a few months working with NVIDIA's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if Intel has a name for that). Is it fair? After all, when looking at the assembly level, those programming models have much in common (at least with GCN; take a look at p2-6 of the ISA manual).
This article states that a Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVIDIA parlance, how many CUDA cores does this processor have? I would say 8 per CPU core for 32-bit operations, but this is just a guess based on the SIMD width.
Of course there are many other things to take into account when comparing CPU and GPU hardware, but this is not what I'm trying to do. I just have to explain how the thing works.
PS: All pointers to CPU hardware documentation or CPU/GPU presentations are greatly appreciated!
EDIT:
Thanks for your answers. Sadly, I had to choose only one of them. I marked Igor's answer because it sticks most closely to my initial question and gave me enough information to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very cautious about making this kind of comparison. After all, even in the GPU world the term "core" means really different capabilities depending on the context: the new AMD GCN is quite different from the old VLIW4 one, which itself is quite different from the CUDA core one.
Besides that, you will bring more puzzlement than understanding to your audience if you make just one small comparison with the CPU and that's it. If I were you I'd still go for a more detailed (it can still be quick) comparison. For instance, someone used to CPUs and with little knowledge of GPUs might wonder how come a GPU can have so many registers though they are so expensive (in the CPU world). An explanation to that question is given at the end of this post, as well as some more comparison of GPU vs CPU.
This other article gives a nice comparison between these two kinds of processing units by explaining how GPUs work, but also how they evolved, and showing the differences with CPUs. It addresses topics like data flow and memory hierarchy, but also what kinds of applications a GPU is useful for. After all, the power a GPU can develop is accessible (efficiently) only for some types of problems.
And personally, if I had to make a presentation about GPUs and had the possibility to make only one reference to CPUs, it would be this: presenting the problems a GPU can solve efficiently vs those a CPU can handle better.
As a bonus, even though it's not related directly to your presentation, here is an article that puts GPGPU in perspective, showing that some speedups claimed by some people are overrated (this is linked to my last point btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
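The accounting above, written out as a quick sketch. The constants are the ones given in this answer; the variable names are just for illustration, and the "Haswell cores per SMX" figure is only the loose per-cycle comparison discussed here, not a performance claim.

```python
# Per-core, per-cycle accounting for Haswell (figures from the answer above).
AVX_LANES_SP = 8          # 256-bit vector / 32-bit floats
AVX_ISSUE_PER_CYCLE = 2   # two AVX/AVX2 instructions per cycle
FMA_OPS = 2               # a multiply and an add count as two operations

haswell_sp_ops_per_cycle = AVX_LANES_SP * AVX_ISSUE_PER_CYCLE * FMA_OPS

# CUDA cores also do FMA, so drop that factor for a like-for-like count.
haswell_cuda_core_equivalents = AVX_LANES_SP * AVX_ISSUE_PER_CYCLE

KEPLER_CORES_PER_SMX = 192
smx_in_haswell_cores = KEPLER_CORES_PER_SMX / haswell_cuda_core_equivalents

print(haswell_sp_ops_per_cycle)        # 32
print(haswell_cuda_core_equivalents)   # 16
print(smx_in_haswell_cores)            # 12.0
```

Clock speeds differ a lot between the two, so per-cycle counts like these are only the first half of the story.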
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.
One way I like to compare CPUs and GPUs is by the number of operations/sec that they can perform. But of course don't compare one CPU core to a multi-core GPU.
A Sandy Bridge core can perform 2 AVX ops/cycle, that is, crunch 8 double-precision numbers per cycle. Hence, a computer with 16 Sandy Bridge cores clocked at 2.6 GHz has a peak performance of 333 Gflops.
A K20 computing module (GK110) has a peak of 1170 Gflops, that is 3.5 times more. This is a fair comparison in my opinion, and it should be emphasized that the peak performance is much easier to reach on a CPU (some applications reach 80%-90% of peak) than on a GPU (best cases I know are less than 50% of peak).
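Here is the same arithmetic as a small sketch, using only the figures quoted above. The "achievable" line is an illustration that plugs in the rough efficiency numbers from this answer (80% on CPU, 50% as an optimistic ceiling on GPU); it is not a measurement.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak = cores x clock (GHz) x floating-point ops per cycle per core."""
    return cores * ghz * flops_per_cycle

cpu_peak = peak_gflops(cores=16, ghz=2.6, flops_per_cycle=8)
gpu_peak = 1170.0                      # K20 peak as quoted above

peak_ratio = gpu_peak / cpu_peak

# Rough "achievable" ratio using the efficiency figures from the answer.
achievable_ratio = (0.50 * gpu_peak) / (0.80 * cpu_peak)

print(f"CPU peak: {cpu_peak:.1f} Gflops")
print(f"peak ratio: {peak_ratio:.2f}x, achievable: {achievable_ratio:.2f}x")
```

Once realistic efficiencies are factored in, the gap shrinks from about 3.5x to a bit over 2x.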
So to summarize, I would not go into architecture details, but rather state some sheer numbers, with the perspective that the peak is often far from reach on GPUs.
It's fairer to compare GPUs to vectorized CPU units. However, if your audience has zero idea of how GPUs work, it seems fair to assume that they have a similar knowledge of vectorized SSE instructions.
For audiences such as these it's important to point out the high level differences, like how blocks of "cores" on the gpu share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

Utilizing GPU worth it?

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have fewer than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother with a GPU.
Using a multi-core CPU would be much easier to program, and you avoid translating all your operations into GPU operations.
Also, as you already suggested, the performance gain from using a GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
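To make the one-thread-per-pair layout above concrete, here is a plain Python emulation of a single time step's force calculation. It is not CUDA code. The linear spring force is a hypothetical stand-in for your real potential, and the function and variable names are made up for illustration.

```python
from itertools import combinations

def step_forces(positions, k=1.0):
    """Emulate one block: pair threads compute forces, particle threads sum."""
    n = len(positions)
    pairs = list(combinations(range(n), 2))   # one "thread" per pair

    # Phase 1: each pair thread writes its force to a scratch buffer
    # (this buffer would live in shared memory on the GPU).
    pair_force = [k * (positions[j] - positions[i]) for i, j in pairs]

    # A __syncthreads() barrier would go here in the real kernel.

    # Phase 2: one thread per particle sums the pair forces involving it.
    total = [0.0] * n
    for p in range(n):
        for (i, j), f in zip(pairs, pair_force):
            if i == p:
                total[p] += f      # force on i from j
            elif j == p:
                total[p] -= f      # equal and opposite force on j
    return pairs, total

positions = [float(i * i) for i in range(20)]   # 1D positions, N = 20
pairs, forces = step_forces(positions)
print(len(pairs))   # 190
```

With N = 20 you get 190 pair threads per potential setting, which fits comfortably in one block. Whether that is enough work to keep the GPU busy depends on how many potential settings you run at once.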

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: How can I compare my CUDA program with another CUDA program reported in a paper (both run on different GPUs, and I cannot access the source code)?
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per processor core will help normalize large differences between, say, a 200 core and a 2000 core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
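A small sketch of the reporting suggested above: throughput with and without transfer time, then normalized per core. The timings and the core count are made-up placeholders, and the function name is hypothetical.

```python
def throughput(units, kernel_s, transfer_s=0.0):
    """Units of work per second, optionally counting host<->device transfer."""
    return units / (kernel_s + transfer_s)

units = 1_000_000        # placeholder work size
kernel_s = 2.0           # placeholder: measured kernel time, seconds
transfer_s = 0.5         # placeholder: measured copy-in plus copy-out time
cores = 2000             # placeholder core count for the device

kernel_only = throughput(units, kernel_s)
end_to_end = throughput(units, kernel_s, transfer_s)
per_core = end_to_end / cores

print(f"kernel only: {kernel_only:.0f} units/s")
print(f"end to end:  {end_to_end:.0f} units/s")
print(f"per core:    {per_core:.1f} units/s/core")
```

Reporting both the kernel-only and the end-to-end number lets readers see how much of the cost is data transfer.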
The speed of a CUDA program on different GPUs depends on many factors of the GPU, like memory bandwidth, core clock speed, number of cores, and the number of threads/registers/shared memory available. So it is difficult to compare performance across different GPUs.

GTX 295 vs other nvidia cards for cuda development

What is the best NVIDIA video card for CUDA development? A single GTX 295 has 2 GPUs; is it possible to have 2 GTX 295 and use the 4 GPUs in my CUDA code?
Is it better to get two 480 cards rather than two 295? Would a Fermi be better than both cards?
What is the best NVIDIA video card for CUDA development?
Whatever fits in your budget and suits your needs. I know this is a bit vague, but after all it really is as simple as that ;)
A single GTX 295 has 2 GPUs; is it possible to have 2 GTX 295 and use the 4 GPUs in my CUDA code?
Sure, it is. The only drawback is that the 2 GPUs on the GTX 295 share a single PCIe connection. Whether this is relevant for you depends on whether the application needs intensive communication with the host.
Is it better to get two 480 cards rather than two 295? Would a Fermi be better than both cards?
From the point of view of raw peak performance, a GTX 295 (which is almost 2x GTX 280, not considering the shared PCIe connection) is better than a 480. However, the GF10x series architecture improved on many points compared to the GT200; for details see the "Fermi whitepaper" and the "Fermi Tuning Guide".
If you're planning to use double precision, the GF10x series has much improved double precision support, but it's good to know that this is capped on GeForce cards at 1/8 of the single precision performance (normally it's about half).
Therefore, I would suggest that unless you have a strong reason to get lots of GFlops (Folding@Home?) in the form of soon-to-be-outdated hardware, get a GTX 480, or a 470 if you want to save ~25%.
Direct answer: I would go with one or maybe two GTX 480's. But I think my reasoning is a bit different from @bobince or @pszilard.
Background: I just made the same decision you're facing, but our situations may be vastly different.
I'm a statistics graduate student in a department with minimal funding for GPU computing resources. The campus does have one Fermi box hooked up to two nodes that I have access to, but these run Linux. I love Linux, but I really want to use nSight to benchmark and tune my code, so I need Windows. So I decided to purchase a development box which I dual boot: Ubuntu x64 for production runs, and Win 7 with VS 2010 (a battle which I'm presently fighting) and nSight 1.5 for development. That said, back to the reason why I bought two GTX 480's (EVGA is awesome!!) and not two GTX 285's or 295's.
I've spent the past two years developing a couple of CUDA kernels. The trickiest part of the development, for me, is the memory management. I spent the better part of three months trying to squeeze a Cholesky decomposition & back substitution into 16 single-precision registers -- the max you can use before either the GTX 285 or 295 incur a 50% performance penalty (literally 3 weeks going from 17 to 16 registers). For me, the fact that all Fermi architectures have double the registers means that those three months would've gained me about 10% improvement on a GTX 480 instead of 50% on GTX 285 and hence, probably not worth my time -- in truth a bit more subtle than that, but you get the drift.
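A simplified sketch of why one extra register can cost that much. The per-SM limits below are the register file sizes and resident-thread limits as I recall them from NVIDIA's compute capability tables, and the 512-thread block size is my assumption, not something stated above. The model also ignores register allocation granularity and shared memory limits, so treat it as an illustration only.

```python
def resident_threads(regs_per_sm, max_threads_per_sm, regs_per_thread, block_size):
    """Threads resident on one SM when limited by registers and thread count."""
    blocks_by_regs = regs_per_sm // (regs_per_thread * block_size)
    blocks_by_threads = max_threads_per_sm // block_size
    return min(blocks_by_regs, blocks_by_threads) * block_size

# Assumed per-SM limits (from memory; check them against NVIDIA's tables).
GT200 = dict(regs_per_sm=16384, max_threads_per_sm=1024)   # GTX 285 / 295
FERMI = dict(regs_per_sm=32768, max_threads_per_sm=1536)   # GTX 480

occupancy = {}
for name, sm in (("GT200", GT200), ("Fermi", FERMI)):
    for regs in (16, 17):
        threads = resident_threads(regs_per_thread=regs, block_size=512, **sm)
        occupancy[(name, regs)] = threads / sm["max_threads_per_sm"]
        print(name, regs, "regs/thread ->", occupancy[(name, regs)])
```

Under these assumptions, going from 16 to 17 registers on GT200 drops from two resident blocks to one, which halves occupancy. On Fermi the same kernel still fits three blocks.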
If you're fairly new to CUDA -- which you probably are since you're asking -- I would say 32 registers is HUGE. Second, I think the L1 cache of the Fermi architecture can directly translate to faster global memory accesses -- of course it does, but I haven't measured the impact directly yet. If you don't need the global memory as much, you can trade the bigger L1 cache for triple the shared memory -- which was also a tight squeeze for me as the matrix sizes increased.
Then I would agree with @pszilard that if you need double precision, Fermi is definitely the way to go -- though I'd still write your code in single precision first, tune it, and then migrate to double.
I don't think that concurrent kernel execution will matter for you -- it's really cool, the delays to kernel completion can be orders of magnitude less -- but you're probably going to focus on one kernel first, not parallel kernels. If you want to do streaming or parallel kernels, then you need Fermi -- the 285 / 295's simply can't do it.
And lastly, the drawback of going with the 295's is that you have to write two layers of parallelism: (1) one to distribute blocks (or kernels?) across the cards and (2) the gpu kernel itself. If you're just starting out, it's much easier to keep the parallelism in one place (on a single card) as opposed to fighting two battles at once.
Ps. If you haven't written your kernels yet, you might consider getting only one card and waiting six months to see if the landscape changes again -- though I have no idea when the next cards are to be released.
PPs. I absolutely loved running my cuda kernel on the GTX 480 which I had debugged / designed on a Tesla C1070 and instantly realizing a 2x speed improvement. Money well spent.
Is it possible to have 2 GTX 295 and use the 4 GPUs in my CUDA code?
Yes. Or quad, if you're totally insane.
Is it better to get two 480 cards rather than two 295?
Arguable. The 295, as a dual-GPU card, has slightly more raw oomph, but the 480, as a 40nm-process card without the dual-GPU overhead, may use its resources better. Benchmarks vary. Of course the Fermi 4xx range has more modern feature support (3D, DirectX, OpenCL etc).
But dual-295 is going to have seriously huge PSU and cooling requirements. And dual-480 runs almost as hot. Not to mention the expense. What are you working on that you think you're going to need this? Have you considered the more mainstream parts, e.g. the 460, which is generally considered to offer better price/performance than the troubled 470–480 (GF100) part?

Feasibility of GPU as a CPU? [closed]

What do you think the future is for GPU-as-a-CPU initiatives like CUDA? Do you think they are going to become mainstream and be the next adopted fad in the industry? Apple is building a new framework for using the GPU to do CPU tasks, and there has been a lot of success with NVIDIA's CUDA project in the sciences. Would you suggest that a student commit time to this field?
Commit time if you are interested in scientific and parallel computing. Don't think of CUDA as making a GPU appear as a CPU. It only allows a more direct method of programming GPUs than older GPGPU programming techniques.
General purpose CPUs derive their ability to work well on a wide variety of tasks from all the work that has gone into branch prediction, pipelining, superscalar execution, etc. This makes it possible for them to achieve good performance on a wide variety of workloads, while making them suck at high-throughput, memory-intensive floating point operations.
GPUs were originally designed to do one thing, and do it very, very well. Graphics operations are inherently parallel. You can calculate the colour of all pixels on the screen at the same time, because there are no data dependencies between the results. Additionally, the algorithms needed did not have to deal with branches, since nearly any branch that would be required could be achieved by setting a coefficient to zero or one. The hardware could therefore be very simple. It is not necessary to worry about branch prediction, and instead of making a processor superscalar, you can simply add as many ALUs as you can cram on the chip.
With programmable texture and vertex shaders, GPU's gained a path to general programmability, but they are still limited by the hardware, which is still designed for high throughput floating point operations. Some additional circuitry will probably be added to enable more general purpose computation, but only up to a point. Anything that compromises the ability of a GPU to do graphics won't make it in. After all, GPU companies are still in the graphics business and the target market is still gamers and people who need high end visualization.
The GPGPU market is still a drop in the bucket, and to a certain extent will remain so. After all, "it looks pretty" is a much lower standard to meet than "100% guaranteed and reproducible results, every time."
So in short, GPU's will never be feasible as CPU's. They are simply designed for different kinds of workloads. I expect GPU's will gain features that make them useful for quickly solving a wider variety of problems, but they will always be graphics processing units first and foremost.
It will always be important to match the problem you have with the most appropriate tool you have to solve it.
Long-term I think that the GPU will cease to exist, as general purpose processors evolve to take over those functions. Intel's Larrabee is the first step. History has shown that betting against x86 is a bad idea.
Study of massively parallel architectures and vector processing will still be useful.
First of all, I don't think this question really belongs on SO.
In my opinion the GPU is a very interesting alternative whenever you do vector-based float mathematics. However this translates to: It will not become mainstream. Most mainstream (Desktop) applications do very few floating-point calculations.
It has already gained traction in games (physics engines) and in scientific calculations. If you consider either of those two as "mainstream", then yes, the GPU will become mainstream.
I would not consider these two as mainstream, and I therefore think the GPU will not rise to be the next adopted fad in the mainstream industry.
If you, as a student have any interest in heavily physics based scientific calculations, you should absolutely commit some time to it (GPUs are very interesting pieces of hardware anyway).
GPU's will never supplant CPU's. A CPU executes a set of sequential instructions, and a GPU does a very specific type of calculation in parallel. These GPU's have great utility in numerical computing and graphics; however, most programs can in no way utilize this flavor of computing.
You will soon begin seeing new processors from Intel and AMD that include GPU-esque floating point vector computations as well as standard CPU computations.
I think it's the right way to go.
Considering that GPUs have been tapped to create cheap supercomputers, it appears to be the natural evolution of things. With so much computing power and R&D already done for you, why not exploit the available technology?
So go ahead and do it. It will make for some cool research, as well as a legit reason to buy that high-end graphic card so you can play Crysis and Assassin's Creed on full graphic detail ;)
It's one of those things that you see 1 or 2 applications for, but soon enough someone will come up with a 'killer app' that figures out how to do something more generally useful with it, at superfast speeds.
Pixel shaders can apply routines to large arrays of float values, so maybe we'll see some GIS coverage applications or, well, I don't know. If you don't devote more time to it than I have, then you'll have the same level of insight as me, i.e. little!
I have a feeling it could be a really big thing, as do Intel and S3. Maybe it just needs 1 little tweak added to the hardware, or someone with a lightbulb above their head.
With so much untapped power I cannot see how it would go unused for too long. The question is, though, how the GPU will be used for this. CUDA seems to be a good guess for now, but other technologies are emerging on the horizon which might make it more approachable for the average developer.
Apple has recently announced OpenCL, which they claim is much more than CUDA, yet quite simple. I'm not sure what exactly to make of that, but the Khronos Group (the people working on the OpenGL standard) is working on the OpenCL standard and is trying to make it highly interoperable with OpenGL. This might lead to a technology which is better suited for normal software development.
It's an interesting subject and, incidentally, I'm about to start my master's thesis on the subject of how best to make the GPU's power available to average developers (if possible), with CUDA as the main focus.
A long time ago, it was really hard to do floating point calculations (thousands/millions of cycles of emulation per instruction on terribly performing (by today's standards) CPUs like the 80386). People that needed floating point performance could get an FPU (for example, the 80387). The old FPUs were fairly tightly integrated into the CPU's operation, but they were external. Later on they became integrated, with the 80486 having an FPU built in.
The old-time FPU is analogous to GPU computation. We can already get it with AMD's APUs. An APU is a CPU with a GPU built into it.
So, I think the actual answer to your question is, GPU's won't become CPUs, instead CPU's will have a GPU built in.