Feasibility of GPU as a CPU? [closed] - cuda

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
What do you think the future of GPU as a CPU initiatives like CUDA are? Do you think they are going to become mainstream and be the next adopted fad in the industry? Apple is building a new framework for using the GPU to do CPU tasks and there has been alot of success in the Nvidias CUDA project in the sciences. Would you suggest that a student commit time into this field?

Commit time if you are interested in scientific and parallel computing. Don't think of CUDA and making a GPU appear as a CPU. It only allows a more direct method of programming GPUs than older GPGPU programming techniques.
General purpose CPUs derive their ability to work well on a wide variety of tasks from all the work that has gone into branch prediction, pipelining, superscaler, etc. This makes it possible for them to achieve good performance on a wide variety of workloads, while making them suck at high-throughput memory intensive floating point operations.
GPUs were originally designed to do one thing, and do it very, very well. Graphics operations are inherently parallel. You can calculate the colour of all pixels on the screen at the same time, because there are no data dependencies between the results. Additionally, the algorithms needed did not have to deal with branches, since nearly any branch that would be required could be achieved by setting a co-efficient to zero or one. The hardware could therefore be very simple. It is not necessary to worry about branch prediction, and instead of making a processor superscaler, you can simply add as many ALU's as you can cram on the chip.
With programmable texture and vertex shaders, GPU's gained a path to general programmability, but they are still limited by the hardware, which is still designed for high throughput floating point operations. Some additional circuitry will probably be added to enable more general purpose computation, but only up to a point. Anything that compromises the ability of a GPU to do graphics won't make it in. After all, GPU companies are still in the graphics business and the target market is still gamers and people who need high end visualization.
The GPGPU market is still a drop in the bucket, and to a certain extent will remain so. After all, "it looks pretty" is a much lower standard to meet than "100% guaranteed and reproducible results, every time."
So in short, GPU's will never be feasible as CPU's. They are simply designed for different kinds of workloads. I expect GPU's will gain features that make them useful for quickly solving a wider variety of problems, but they will always be graphics processing units first and foremost.
It will always be important to always match the problem you have with the most appropriate tool you have to solve it.

Long-term I think that the GPU will cease to exist, as general purpose processors evolve to take over those functions. Intel's Larrabee is the first step. History has shown that betting against x86 is a bad idea.
Study of massively parallel architectures and vector processing will still be useful.

First of all I don't think this questions really belongs on SO.
In my opinion the GPU is a very interesting alternative whenever you do vector-based float mathematics. However this translates to: It will not become mainstream. Most mainstream (Desktop) applications do very few floating-point calculations.
It has already gained traction in games (physics-engines) and in scientific calculations. If you consider any of those two as "mainstream", than yes, the GPU will become mainstream.
I would not consider these two as mainstream and I therefore think, the GPU will raise to be the next adopted fad in the mainstream industry.
If you, as a student have any interest in heavily physics based scientific calculations, you should absolutely commit some time to it (GPUs are very interesting pieces of hardware anyway).

GPU's will never supplant CPU's. A CPU executes a set of sequential instructions, and a GPU does a very specific type of calculation in parallel. These GPU's have great utility in numerical computing and graphics; however, most programs can in no way utilize this flavor of computing.
You will soon begin seeing new processers from Intel and AMD that include GPU-esque floating point vector computations as well as standard CPU computations.

I think it's the right way to go.
Considering that GPUs have been tapped to create cheap supercomputers, it appears to be the natural evolution of things. With so much computing power and R&D already done for you, why not exploit the available technology?
So go ahead and do it. It will make for some cool research, as well as a legit reason to buy that high-end graphic card so you can play Crysis and Assassin's Creed on full graphic detail ;)

Its one of those things that you see 1 or 2 applications for, but soon enough someone will come up with a 'killer app' that figures out how to do something more generally useful with it, at superfast speeds.
Pixel shaders to apply routines to large arrays of float values, maybe we'll see some GIS coverage applications or well, I don't know. If you don't devote more time to it than I have then you'll have the same level of insight as me - ie little!
I have a feeling it could be a really big thing, as do Intel and S3, maybe it just needs 1 little tweak adding to the hardware, or someone with a lightbulb above their head.

With so much untapped power I cannot see how it would go unused for too long. The question is, though, how the GPU will be used for this. CUDA seems to be a good guess for now but other techologies are emerging on the horizon which might make it more approachable by the average developer.
Apple have recently announced OpenCL which they claim is much more than CUDA, yet quite simple. I'm not sure what exactly to make of that but the khronos group (The guys working on the OpenGL standard) are working on the OpenCL standard, and is trying to make it highly interoperable with OpenGL. This might lead to a technology which is better suited for normal software development.
It's an interesting subject and, incidentally, I'm about to start my master thesis on the subject of how best to make the GPU power available to the average developers (if possible) with CUDA as the main focus.

A long time ago, it was really hard to do floating point calculations (thousands/millions of cycles of emulation per instruction on terribly performing (by today's standards) CPUs like the 80386). People that needed floating point performance could get an FPU (for example, the 80387. The old FPU were fairly tightly integrated into the CPU's operation, but they were external. Later on they became integrated, with the 80486 having an FPU built-in.
The old-time FPU is analagous to GPU computation. We can already get it with AMD's APUs. An APU is a CPU with a GPU built into it.
So, I think the actual answer to your question is, GPU's won't become CPUs, instead CPU's will have a GPU built in.

Related

Explanation of GPGPU energy efficiency relative to CPU? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
I've heard the statement that for many applications GPUs are more energy efficient than multi-core CPUs, particularly when the graphics hardware is well utilized. I'm having trouble finding papers, articles, or anything describing the specific architectural features that result in that claim, or a scientific study directly comparing the energy consumption of GPUs to CPUs on a set of benchmarks. Could anyone provide more insight into the backing of this claim, or point me to some studies that show evidence for it?
If I had to guess, I would say that it mostly stems from the lower frequencies of GPU clocks. Additionally, this paper:
http://accel.cs.vt.edu/sites/default/files/paper/huang-hppac09-gpu.pdf
suggests that it's partially a result of GPUs just getting the problem done quicker, so even though the peak power consumption is higher for GPUs the time they spend at that level is much shorter than CPUs (again, for the right problems). Can anyone add anything more?
Thanks!
TL;DR answer: more of the transistors in a gpu are actually working on the computation than in a cpu.
The big power efficiency-killer of today's cpus is a trade-off to allow general computation on the chip. Whether it is a RISC, x86, or other cpu architecture, there is extra hardware dedicated to the general purpose usage of the cpu. These transistors require electricity, although they are not doing any actual math.
Fast cpus require advanced branch prediction hardware and large cache memory to be able to avoid lengthy processing which could be discarded later in the pipeline. For the most part, cpus execute their instruction one at a time (per cpu core, SIMD helps out cpus as well...), and handle conditions extremely well. Gpus rely on doing the same operation on many pieces of data (SIMD/vector operation), and suffer greatly with simple conditions found in 'if' and 'for' statements.
There is also a lot of hardware used to fetch, decode, and schedule instructions -- this is true for cpus and gpus. This big difference being that the ratio of fetch+decode+schedule transistors to computating transistors tends to be much higher for a gpu.
Here is an AMD presentation (2011) about how their gpus have changed over time, but this really applies to most gpus in general. PDF link. It helped me understand the power advantage of gpus by knowing a bit of the history behind how gpus got to be so good at certain computations.
I gave an answer to a similar question a while ago. SO Link.
Usually, these claims are backed by comparing the GFLOPs performance and estimating the power per floating point operation as shown in this post. But this is essentially what you wrote in your last sentence.
You also have to take into account that the CPU and GPU architectures target different problems. Whereas a CPU core (at least on x86) has deep pipelines, a large instruction set and very sophisticated caching strategies to cater to a wide array of problems, a GPU core is rather simple and thus draws a lot less power. To make up for this, there are many more computing cores in a GPU than in a CPU. But you probably know that already.

OpenCL for GPU vs. FPGA

I read recently about OpenCL/CUDA for FPGA vs. GPU
As I understood FPGA wins in power criteria.
The explanation for that ,I`ve found in some article:
Reconfigurable devices can have much lower power consumption from peak
values since only configured portions of the chip are active
Based on said above I have a question - does it mean that ,if some CU [Compute Unit] doen`t execute any work-item,it still consumes power? (and if yes - what for it consumes power?)
Yes, idle circuitry still consumes power. It doesn't consume as much, but it still consumes some. The reason for this is down to how transistors work, and how CMOS logic gates consume power.
Classically, CMOS logic (the type on all modern chips) only consumes power when it switches state. This made is very low power when compared to the technologies that came before it which consumed power all the time. Even so, every time a clock edge occurs, some logic changes state even if there's no work to do. The higher the clock rate, the more power used. GPUs tend to have high clock rates so they can do lots of work; FPGAs tend to have low clock rates. That's the first effect, but it can be mitigated by not clocking circuits that have no work to do (called 'clock gating')
As the size of transistors became smaller and smaller, the amount of power used when switching became smaller, but other effects (known as leakage) became more significant. Now we're at a point where the leakage power is very significant, and it's multiplied up by the number of gates you have in a design. Complex designs have high leakage power; Simple designs have low leakage power (in very basic terms). This is a second effect.
Hence, for a simple task it may be more power efficient to have a small dedicated low speed FPGA rather than a large complex, but high speed / general purpose CPU/GPU.
As always, it depends on the workload. For workloads that are well-supported by native GPU hardware (e.g. floating point, texture filtering), I doubt an FPGA can compete. Anecdotally, I've heard about image processing workloads where FPGAs are competitive or better. That makes sense, since GPUs are not optimized to operate on small integers. (For that reason, GPUs often are uncompetitive with CPUs running SSE2-optimized image processing code.)
As for power consumption, for GPUs, suitable workloads generally keep all the execution units busy, so it's a bit of an all-or-nothing proposition.
Based on my research on FPGAs and the way they work, these devices can be designed to be very power efficient and really fine-tuned for one special task (e.g., an algorithm) and use the smallest resources possible (therefore the lower amount of energy consumption among all possible choices except ASIC)
When implementing turning-complete algorithms using FPGAs, the designers have the option of either unrolling their algorithms to use the maximum parallelism offered or use a compact sequential design. Each method has its own cost-benefits; the former helps maximizing performance at the cost of higher resource consumption, and the latter helps minimizing area and resource consumption by reusing hardware at the cost of minimizing the performance.
This level of control over implementation of algorithms doesn’t exist when developing for GPUs. The developers have the control to use the most efficient algorithms yet they are not the one determining the final precise hardware implementation of their algorithms. Unlike FPGA designers who even count “nano-seconds” when calculating their design’s hardware implementation (using post-layout tools), GPU developers rely on available frameworks to enhance all implementation details for them automatically. They develop at much higher levels compared to FPGA designers.
So the well known topic of trade-offs pops up here too; you want exact control over the hardware implementation at the cost of longer development times? Choose FPGAs. You want parallelism, yet have made up your mind to give up exact control over hardware implementation and want to develop using your existing software skills? use OpenCL.
Kudos to #hamzed, but OpenCL is not taking control away from the designer of OpenCL on FPGAs. It actually gives the best of the both worlds: full programmability of FPGA with all custom parallel algorithm benefits as well as much better design closure speed vs. RTL. By being clever about your algorithm moving and not moving data you can get to near theoretical performance of FPGAs. Please see the last chart in this reference: https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf

Is GPGPU a hack?

I had started working on GPGPU some days ago and successfully implemented cholesky factorization with good performacne and I attended a conference on High Performance Computing where some people said that "GPGPU is a Hack".
I am still confused what does it mean and why they were saying it hack. One said that this is hack because you are converting your problem into a matrix and doing operations on it. But still I am confused that does people think it is a hack or if yes then why?
Can anyone help me, why they called it a hack while I found nothing wrong with it.
One possible reason for such opinion is that the GPU was not originally intended for general purpose computations. Also programming a GPU is less traditional and more hardcore and therefore more likely to be perceived as a hack.
The point that "you convert the problem into a matrix" is not reasonable at all. Whatever task you solve with writing code you choose reasonable data structures. In case of GPU matrices are likely the most reasonable datastructures and it's not a hack but just a natural choice to use them.
However I suppose that it's a matter of time for GPGPU becoming widespread. People just have to get used to the idea. After all who cares which unit of the computer runs the program?
On the GPU, having efficient memory access is paramount to achieving optimal performance. This often involves restructuring or even choosing entirely new algorithms and data structures. This is reason why GPU programming can be perceived as a hack.
Secondly, adapting an existing algorithm to run on the GPU is not in and of itself science. The relatively low scientific contribution of some GPU algorithm-related papers has led to a negative perception of GPU programming as strictly "engineering".
Obviously, only the person who said that can say for certain why he said it, but, here's my take:
A "Hack" is not a bad thing.
It forces people to learn new programming languages and concepts. For people who are just trying to model the weather or protein folding or drug reactions, this is an unwelcome annoyance. They didn't really want to learn FORTRAN (or whatever) in the first place, and now the have to learn another programming system.
The programming tools are NOT very mature yet.
The hardware isn't as reliable as CPUs (yet) so all of the calculations have to be done twice to make sure you've got the right answer. One reason for this is that GPUs don't come with error-correcting memory yet, so if you're trying to build a supercomputer with thousands of processors, the probability of a cosmic ray flipping a bit in you numbers approaches certainty.
As for the comment "you are converting your problem into a matrix and doing operations on it", I think that shows a lot of ignorance. Virtually ALL of high-performance computing fits that description!
One of the major problems in GPGPU for the past few years and probably for the next few is that programming them for arbitrary tasks was not very easy. Up until DX10 there was no integer support among GPUs and branching is still very poor. This is very much a situation where in order to get maximum benefit you have to write your code in a very awkward manner to extract all sorts of efficiency gains from the GPU. This is because you're running on hardware that is still dedicated to processing polygons and textures, rather than abstract parallel tasks.
Obviously, thats my take on it and YMMV
The GPGPU harks back to the days of the math co-processor. A hack is a shortcut to solving a long winded problem. GPGPU is a hack just like NAT on top of IPV4 is a hack. Computational problems just like networks are getting bigger as we try to do more, GPGPU is an useful interim solution, whether it stays outside the core CPU chip and has separate cranky API or gets sucked into the CPU via API or manufacture is up to the path finders.
I suppose he meant that using GPGPU forced you to restructure your implementation, so that it fitted the hardware, not the problem domain. Elegant implementation should fit the latter.
Note, that the word "hack" may have several different meanings:
http://www.urbandictionary.com/define.php?term=hack

Have you successfully used a GPGPU? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?
I have been doing gpgpu development with ATI's stream SDK instead of Cuda.
What kind of performance gain you will get depends on a lot of factors, but the most important is the numeric intensity. (That is, the ratio of compute operations to memory references.)
A BLAS level-1 or BLAS level-2 function like adding two vectors only does 1 math operation for each 3 memory references, so the NI is (1/3). This is always run slower with CAL or Cuda than just doing in on the cpu. The main reason is the time it takes to transfer the data from the cpu to the gpu and back.
For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000 it will likely be faster to do it on the gpu; If N is small, say 1,000 it will almost certainly be slower.
For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O( N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N is a few score, this will still be faster to do on the cpu, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound and the performance increase on the gpu rises very quickly.
Anything involving complex arithemetic has more computations than scalar arithmetic, which usually doubles the NI and increases gpu performance.
(source: earthlink.net)
Here is the performance of CGEMM -- complex single precision matrix-matrix multiplication done on a Radeon 4870.
I have written trivial applications, it really helps if you can parallize floating point calculations.
I found the following course cotaught by a University of Illinois Urbana Champaign professor and an NVIDIA engineer very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).
I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).
IMO, there are three typical stages when porting an algorithm to CUDA:
Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency, turning the code inside out to reduce the number of used registers, etc. You can spend several weeks with this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
This is very similar to optimizing a code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than for CPUs.
I have been using GPGPU for motion detection (Originally using CG and now CUDA) and stabilization (using CUDA) with image processing.
I've been getting about a 10-20X speedup in these situations.
From what I've read, this is fairly typical for data-parallel algorithms.
While I haven't got any practical experiences with CUDA yet, I have been studying the subject and found a number of papers which document positive results using GPGPU APIs (they all include CUDA).
This paper describes how database joins can be paralellized by creating a number of parallel primitives (map, scatter, gather etc.) which can be combined into an efficient algorithm.
In this paper, a parallel implementation of the AES encryption standard is created with comparable speed to discreet encryption hardware.
Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combination logic, dynamic programming and data mining.
I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation. (Comparing a GeForce 8800GT to a Q6600 here). It is well know that Monte Carlo problems are embarrassingly parallel though.
Major issues encountered involves the loss of precision due to G8x and G9x chip's limitation to IEEE single precision floating point numbers. With the release of the GT200 chips this could be mitigated to some extent by using the double precision unit, at the cost of some performance. I haven't tried it out yet.
Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
I implemented a Genetic Algorithm on the GPU and got speed ups of around 7.. More gains are possible with a higher numeric intensity as someone else pointed out. So yes, the gains are there, if the application is right
I wrote a complex valued matrix multiplication kernel that beat the cuBLAS implementation by about 30% for the application I was using it for, and a sort of vector outer product function that ran several orders of magnitude than a multiply-trace solution for the rest of the problem.
It was a final year project. It took me a full year.
http://www.maths.tcd.ie/~oconbhup/Maths_Project.pdf
I have implemented Cholesky Factorization for solving large linear equation on GPU using ATI Stream SDK. My observations were
Got performance speedup upto 10 times.
Working on same problem to optimize it more, by scaling it to multiple GPUs.
Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA api.
It is fairly easy, since it's a filter that must be run in parallel given an input image. I haven't encountered many difficulties on this, since it just required a simple kernel. The speedup was at about 300x. This was my final project on CS. The project can be found here (it's written in Portuguese thou).
I have tried writing the Mumford&Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in the beginning and so lots of strange things happen. I have even seen a performance improvement by adding a if (false){} in the code O_O.
The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it's a CPU, a different approach that yelded the same results could be taken). It's still a work in progress, but unfortunaly I left the lab I was working on, so maybe someday I might finish it.

raytracing with CUDA [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 months ago.
The community reviewed whether to reopen this question 8 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I'm currently implementing a raytracer. Since raytracing is extremely computation heavy and since I am going to be looking into CUDA programming anyway, I was wondering if anyone has any experience with combining the two. I can't really tell if the computational models match and I would like to know what to expect. I get the impression that it's not exactly a match made in heaven, but a decent speed increasy would be better than nothing.
One thing to be very wary of in CUDA is that divergent control flow in your kernel code absolutely KILLS performance, due to the structure of the underlying GPU hardware. GPUs typically have massively data-parallel workloads with highly-coherent control flow (i.e. you have a couple million pixels, each of which (or at least large swaths of which) will be operated on by the exact same shader program, even taking the same direction through all the branches. This enables them to make some hardware optimizations, like only having a single instruction cache, fetch unit, and decode logic for each group of 32 threads. In the ideal case, which is common in graphics, they can broadcast the same instruction to all 32 sets of execution units in the same cycle (this is known as SIMD, or Single-Instruction Multiple-Data). They can emulate MIMD (Multiple-Instruction) and SPMD (Single-Program), but when threads within a Streaming Multiprocessor (SM) diverge (take different code paths out of a branch), the issue logic actually switches between each code path on a cycle-by-cycle basis. You can imagine that, in the worst case, where all threads are on separate paths, your hardware utilization just went down by a factor of 32, effectively killing any benefit you would've had by running on a GPU over a CPU, particularly considering the overhead associated with marshalling the dataset from the CPU, over PCIe, to the GPU.
That said, ray-tracing, while data-parallel in some sense, has widely-diverging control flow for even modestly-complex scenes. Even if you manage to map a bunch of tightly-spaced rays that you cast out right next to each other onto the same SM, the data and instruction locality you have for the initial bounce won't hold for very long. For instance, imagine all 32 highly-coherent rays bouncing off a sphere. They will all go in fairly different directions after this bounce, and will probably hit objects made out of different materials, with different lighting conditions, and so forth. Every material and set of lighting, occlusion, etc. conditions has its own instruction stream associated with it (to compute refraction, reflection, absorption, etc.), and so it becomes quite difficult to run the same instruction stream on even a significant fraction of the threads in an SM. This problem, with the current state of the art in ray-tracing code, reduces your GPU utilization by a factor of 16-32, which may make performance unacceptable for your application, especially if it's real-time (e.g. a game). It still might be superior to a CPU for e.g. a render farm.
There is an emerging class of MIMD or SPMD accelerators being looked at now in the research community. I would look at these as logical platforms for software, real-time raytracing.
If you're interested in the algorithms involved and mapping them to code, check out POVRay. Also look into photon mapping, it's an interesting technique that even goes one step closer to representing physical reality than raytracing.
It can certainly be done, has been done, and is a hot topic currently among the raytracing and Cuda gurus. I'd start by perusing http://www.nvidia.com/object/cuda_home.html
But it's basically a research problem. People who are doing it well are getting peer-reviewed research papers out of it. But well at this point still means that the best GPU/Cuda results are approximately competitive with best-of-class solutions on CPU/multi-core/SSE. So I think that it's a little early to assume that using Cuda is going to accelerate a ray tracer. The problem is that although ray tracing is "embarrassingly parallel" (as they say), it is not the kind of "fixed input and output size" problem that maps straightforwardly to GPUs -- you want trees, stacks, dynamic data structures, etc. It can be done with Cuda/GPU, but it's tricky.
Your question wasn't clear about your experience level or the goals of your project. If this is your first ray tracer and you're just trying to learn, I'd avoid Cuda -- it'll take you 10x longer to develop and you probably won't get good speed. If you're a moderately experienced Cuda programmer and are looking for a challenging project and ray tracing is just a fun thing to learn, by all means, try to do it in Cuda. If you're making a commercial app and you're looking to get a competitive speed edge -- well, it's probably a crap shoot at this point... you might get a performance edge, but at the expense of more difficult development and dependence on particular hardware.
Check back in a year, the answer may be different after another generation or two of GPU speed, Cuda compiler development, and research community experience.