What is the difference between SPMD and SIMD? - terminology

I just can't understand the difference between them...
Is SPMD at the programming level and SIMD at the hardware level?
An example would be good!
Thanks

SIMD is vectorization at the instruction level - each CPU instruction processes multiple data elements.
SPMD is a much higher-level abstraction in which processes or programs are split across multiple processors and each operates on a different subset of the data.
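A minimal sketch contrasting the two (assuming an AVX-capable host CPU and a CUDA toolchain; names are just illustrative, and the host part needs AVX enabled in the compiler flags):

```cuda
#include <immintrin.h>    // AVX intrinsics (host side)
#include <cuda_runtime.h>

// SIMD: a single instruction operates on 8 packed floats at once.
void simd_add(const float* a, const float* b, float* c) {
    __m256 va = _mm256_loadu_ps(a);              // load 8 floats
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));  // one add instruction, 8 results
}

// SPMD: the same program runs as many independent threads,
// each one picking its own element of the data.
__global__ void spmd_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
```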

Related

When does it make sense to use a GPU?

I have code that does a lot of operations on objects which can be represented as arrays.
When does it make sense to use GPGPU environments (like CUDA) in an application? Can I predict performance gains before writing real code?
Whether it pays off depends on a number of factors. Elementwise independent operations on large arrays/matrices are a good candidate.
For your particular problem (machine learning/fuzzy logic), I would recommend reading some related documents, such as
Large Scale Machine Learning using NVIDIA CUDA
and
Fuzzy Logic-Based Image Processing Using Graphics Processor Units
to get a feel for the speedups achieved by other people.
As already mentioned, you should specify your problem. However, if large parts of your code involve operations on your objects that are independent in the sense that object n does not have to wait for the results of the operations on objects 0 to n-1, GPUs may improve performance.
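As a rough illustration of what "elementwise independent" means in practice (a sketch with made-up names, not tuned code): each thread transforms its own object without ever looking at another thread's result.

```cuda
#include <cuda_runtime.h>

// Each thread updates one object (here: one float) independently of all
// the others, so no thread ever waits on another thread's result.
__global__ void scale_and_shift(float* objects, int n, float scale, float shift) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) objects[i] = objects[i] * scale + shift;
}

// Hypothetical launch: one thread per object.
// scale_and_shift<<<(n + 255) / 256, 256>>>(d_objects, n, 2.0f, 1.0f);
```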
You could go to the CUDA Zone to get a general idea of what CUDA can do, and can do better than a CPU.
https://developer.nvidia.com/category/zone/cuda-zone
CUDA already provides lots of performance libraries, tools, and an ecosystem to reduce development difficulty. These can also help you understand what kinds of operations CUDA is good at.
https://developer.nvidia.com/cuda-tools-ecosystem
Furthermore, CUDA provides benchmark reports on some of the most common and representative operations, so you can check whether your code could benefit.
https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/CUDA_5.0_Math_Libraries_Performance.pdf

Is it fair to compare SSE/AVX units to GPU cores?

I have a presentation to make to people who have (almost) no clue how a GPU works. I think saying that a GPU has a thousand cores while a CPU only has four to eight of them is nonsense. But I want to give my audience a point of comparison.
After a few months working with NVIDIA's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if they have a name for that at Intel). Is that fair? After all, when looking at the assembly level, those programming models have much in common (at least with GCN; take a look at p2-6 of the ISA manual).
This article states that a Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things going on to achieve that rate. In NVIDIA parlance, how many CUDA cores does this processor have? I would say 8 per CPU core for 32-bit operations, but this is just a guess based on the SIMD width.
Of course there are many other things to take into account when comparing CPU and GPU hardware, but that is not what I'm trying to do. I just have to explain how the thing works.
PS: Any pointers to CPU hardware documentation or CPU/GPU presentations are greatly appreciated!
EDIT:
Thanks for your answers; sadly I could only accept one of them. I marked Igor's answer because it sticks closest to my initial question and gave me enough information to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.
I'd be very cautious about making this kind of comparison. After all, even in the GPU world the term "core", depending on the context, refers to really different capabilities: the new AMD GCN core is quite different from the old VLIW4 one, which itself is quite different from a CUDA core.
Besides that, you will bring more confusion than understanding to your audience if you make just one small comparison with the CPU and leave it at that. If I were you I'd still go for a more detailed (though still quick) comparison. For instance, someone used to CPUs with little knowledge of GPUs might wonder how a GPU can have so many registers when they are so expensive (in the CPU world). An explanation of that question is given at the end of this post, along with some more GPU vs CPU comparisons.
This other article gives a nice comparison between these two kinds of processing units by explaining how GPUs work and how they evolved, and by showing the differences from CPUs. It addresses topics like data flow and memory hierarchy, but also what kinds of applications a GPU is useful for. After all, the power a GPU can deliver is accessible (efficiently) only for some types of problems.
And personally, if I had to make a presentation about GPUs and could make only one reference to CPUs, it would be this: presenting the problems a GPU can solve efficiently vs. those a CPU can handle better.
As a bonus, even though it's not directly related to your presentation, here is an article that puts GPGPU in perspective, showing that some of the speedups claimed by some people are overstated (this ties into my last point, btw :))
Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.
It may help if I explain why Haswell can perform 32 single-precision operations per cycle.
8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.
2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.
A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.
This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/
In the total accounting, a Haswell core can perform 8 * 2 * 2 = 32 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that FMA factor of 2 when comparing CUDA cores to Haswell cores, which is how you arrive at roughly 16 CUDA-core equivalents per Haswell core.
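To make the first and third points concrete, here is a hedged host-side sketch (assuming an FMA-capable Haswell-class CPU and a compiler with AVX2/FMA enabled; the function name is illustrative): a single _mm256_fmadd_ps call performs a multiply and an add on 8 packed floats, i.e. 16 FLOPs in one instruction.

```cuda
#include <immintrin.h>  // AVX2/FMA intrinsics (host code)

// One FMA instruction on 8 packed single-precision floats:
// 8 multiplies + 8 additions = 16 FLOPs per instruction.
// With two FMA-capable ports, a Haswell core can issue two of
// these per cycle, which gives the 32 FLOPs/cycle figure above.
void fma8(const float* a, const float* b, float* acc) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(acc);
    _mm256_storeu_ps(acc, _mm256_fmadd_ps(va, vb, vc));  // acc = a*b + acc
}
```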
A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/
If I was putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus you have multiple cores and possibly multiple processors. And, I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
PS.: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.
I completely agree with CaptainObvious, especially with the idea of presenting the problems a GPU can solve efficiently vs. those a CPU can handle better.
One way I like to compare CPUs and GPUs is by the number of operations per second that they can perform. But of course, don't compare one CPU core to a many-core GPU.
A Sandy Bridge core can perform 2 AVX operations per cycle, that is, crunch 8 double-precision numbers per cycle. Hence, a machine with 16 Sandy Bridge cores clocked at 2.6 GHz has a peak of 333 GFLOPS.
A K20 compute module (GK110) has a peak of 1170 GFLOPS, that is, 3.5 times more. This is a fair comparison in my opinion, and it should be emphasized that peak performance is much easier to reach on a CPU (some applications reach 80%-90% of peak) than on a GPU (the best cases I know of are below 50% of peak).
So to summarize, I would not go into architecture details, but rather state some sheer numbers, with the caveat that the peak is often far out of reach on GPUs.
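For completeness, the arithmetic behind those two peak figures (using only the numbers quoted in this answer; a back-of-the-envelope sketch, not a benchmark):

```cuda
#include <cstdio>

// peak = cores * clock (GHz) * double-precision FLOPs per core per cycle
int main() {
    double sandy_bridge = 16 * 2.6 * 8;   // 16 cores, 2.6 GHz, 2 AVX ops * 4 DP lanes
    double k20_gk110    = 1170.0;         // vendor peak figure quoted above
    printf("Sandy Bridge box: %.0f GFLOPS\n", sandy_bridge);        // ~333 GFLOPS
    printf("K20 (GK110):      %.0f GFLOPS (%.1fx)\n",
           k20_gk110, k20_gk110 / sandy_bridge);                    // ~3.5x
    return 0;
}
```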
It's fairer to compare a GPU to vectorized CPU units; however, if your audience has zero idea of how GPUs work, it seems fair to assume that they have similarly little knowledge of vectorized SSE instructions.
For audiences like this it's important to point out the high-level differences, like how blocks of "cores" on the GPU share a scheduler and register file.
I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like.
This is also a reasonably graspable comparison between the two if you want to stick to the "gpu core" idea.

How is parallelism on a single thread/core possible?

Modern programming languages provide parallelism and concurrency mechanisms as first class citizens to their users. I understand how parallel algorithms are programmed and can well imagine how two threads on a multi-core CPU can run in parallel.
Yet, most of these platforms also support running parallel processes on a single thread.
Do these processes really run in parallel?
How, at the assembly level, can two different routines be executed simultaneously on a single thread?
TL;DR: parallelism (in the sense of true simultaneous execution) on a single, non-hyperthreaded CPU core is NOT possible.
Hardware (<- EDIT) parallelism can be achieved at several levels, ordered by decreasing granularity:
1. multi-host
2. multi-processor
3. multi-core
4. multi-threads ("Hyper-Threading", i.e. "HT")
(EDIT: I deliberately omit the case of vectorized computations, where several ALUs can be driven by the same core.)
Your question relates to running two software threads in case 3 (when HT is unavailable/disabled) or case 4.
In both cases, the processes actually do NOT run in parallel. The user has an impression of simultaneity due to the extremely fast context switches performed at the CPU level, which allocate the physical core's (resp. hardware thread's) time sequentially to one or the other software thread.
In both cases, those routines are simply not executed simultaneously, but sequentially.
The relative priority allocated to each of those 2 routines can be set on various OSes via the "priority" you give to the process; this is handled by the OS's scheduler, which in turn allocates CPU time.
HTH.
To run experiments and better understand this topic, you may want to google "CPU affinity". This lets you pin a two-threaded process to a single physical core of a multi-core CPU, and measure the time taken by each of the threads while modifying their priority, etc.
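A minimal, Linux-specific sketch of that experiment (host-only code, assuming glibc's pthread affinity extensions; compile with -pthread): both worker threads are pinned to core 0, so they can only ever be time-sliced, never truly parallel.

```cuda
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void* worker(void* arg) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; ++i) x += i;  // busy work
    printf("thread %ld done\n", (long)arg);
    return NULL;
}

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                               // only core 0 is allowed

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void*)1L);
    pthread_create(&t2, NULL, worker, (void*)2L);
    pthread_setaffinity_np(t1, sizeof(set), &set);  // both threads share core 0,
    pthread_setaffinity_np(t2, sizeof(set), &set);  // so they are time-sliced
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```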
Yes, there is parallelism in each thread and you get it for free, no matter which programming language you use (although the amount of parallelism may vary).
It's called instruction-level parallelism. The details are quite complex and differ between different processor micro-architectures.
Computer Architecture: A Quantitative Approach is a brilliant book which includes a chapter on instruction-level parallelism and the book's examples teach how to think rationally about engineering.
Check out the following links for more information:
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/Instruction_pipelining
http://en.wikipedia.org/wiki/Out-of-order_execution
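A tiny, purely illustrative example of instruction-level parallelism: the first two additions below have no data dependency on each other, so a superscalar, out-of-order core can execute them in the same cycle even though the program is single-threaded.

```cuda
// Independent operations: a superscalar core can issue both adds in one cycle.
// Dependent operations: the final add must wait for both earlier results.
void ilp_demo(int b, int c, int e, int f, int* out) {
    int a = b + c;   // independent of the next line
    int d = e + f;   // can execute in parallel with the line above
    *out  = a + d;   // depends on both results, so it must come last
}
```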

Question on CUDA programming model

Hi, I am new to CUDA programming and I have 2 questions on the CUDA programming model.
In brief, the model defines a hierarchy of threads, blocks, and then grids, with a corresponding memory hierarchy. Threads within a block have shared memory and are able to communicate with each other easily, but cannot communicate if they are in different blocks. There is also a global memory on the GPU device.
My questions are:
(1) Why do we need such a hierarchy of threads and then blocks?
Without it, any two threads could communicate with each other if needed, which would probably simplify the programming effort.
(2) Why is there a restriction that thread configurations can only go up to 3 dimensions and not beyond?
Thank you.
1) This allows you to have a generalized programming model that supports hardware with different numbers of processors. It is also a reflection of the underlying GPU hardware, which treats threads within a block differently from threads in different blocks with respect to memory access and synchronization.
Threads can communicate via global memory, or via shared memory depending on their block affinity. You can also use synchronization primitives like __syncthreads (see the sketch after this answer).
2) This is part of the programming model. I suspect it is largely due to user demand for data decomposition of 3-dimensional problems, and there was little demand for support of further dimensions.
The CUDA Programming Guide covers a lot of this sort of thing. There are also a couple of books available. There's a good discussion in Programming Massively Parallel Processors: A Hands-on Approach that goes into why GPU hardware is the way it is and how that has been reflected in the programming model.
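To make the block-level communication point concrete, here is a hedged sketch (illustrative names, launch with 256 threads per block): threads of the same block stage data in shared memory and synchronize with __syncthreads before reading each other's values.

```cuda
#include <cuda_runtime.h>

// Each thread writes one value into shared memory, synchronizes with the
// rest of its block, then reads a neighbour's value. This only works within
// a block; threads in different blocks would have to go through global
// memory (and separate kernel launches or atomics) instead.
__global__ void neighbour_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];                 // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads load a dummy value so that every thread in the
    // block still reaches the barrier below.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all writes now visible to the block

    int next = (threadIdx.x + 1) % blockDim.x;  // neighbour within the same block
    if (i < n) out[i] = tile[threadIdx.x] + tile[next];
}
```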
(1) Local memory is used to store local values that don't fit into registers. Shared memory is used to store common data that is shared by the threads of a block. Local memory plus registers make up the execution context of a thread, while shared memory is storage for the data to be processed.
(2) You can easily use 1D to represent any number of dimensions. For example, if you have a 1D index you can convert it to 2D space using x = i % width, y = i / width, and the inverse is i = y*width + x. 2D and 3D configurations were added for convenience. It is pretty much the same way N-D arrays are implemented in C++.
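A short sketch of that index arithmetic inside a kernel (names are illustrative): a 1D grid of threads addressing a logical 2D array.

```cuda
#include <cuda_runtime.h>

// A 1D launch covering a logical width x height image: each thread recovers
// its 2D coordinates from its flat index and writes back the re-flattened
// index, showing that the two representations are equivalent.
__global__ void flat_to_2d(int* out, int width, int height) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width * height) return;

    int x = i % width;        // column
    int y = i / width;        // row
    out[y * width + x] = i;   // y*width + x == i, so out[i] == i
}
```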

Different threads on different multiprocessors

Is it possible to run different threads on different multiprocessors, similar to CPU cores?
Suppose I have 2 large arrays a and b, and I want to compute both their sum and their difference. Let's say I have 2 multiprocessors on my device. Is it possible to run both kernel functions (one computing the sum, the other the difference) concurrently on 2 different multiprocessors?
Using your example of computing both the sum and difference, the best performance is probably going to be achieved if you compute both at the same time (i.e. in the same kernel).
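A hedged sketch of that fused approach (illustrative names): one kernel produces both outputs, so the whole GPU works on the problem at once and a and b are only read once.

```cuda
#include <cuda_runtime.h>

// One kernel computes both results: each thread reads a[i] and b[i] once
// and writes both the sum and the difference.
__global__ void sum_and_diff(const float* a, const float* b,
                             float* sum, float* diff, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i], y = b[i];
        sum[i]  = x + y;
        diff[i] = x - y;
    }
}
```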
Assuming that is not possible for some reason, then if your arrays are very large the best performance may come from using the whole GPU (i.e. multiple multiprocessors) for each computation, in which case it doesn't matter too much that you do one after the other.
For both of the above cases I would strongly recommend you check out the reduction sample in the SDK, which walks you from a naive implementation up to a pretty fast version, with good documentation.
Having said all of that, if the amount of work is sufficiently small that you would not be fully utilising the whole GPU for one of your computations then there are two ways to run different computations on different multiprocessors:
Use "Concurrent Kernels", where multiple kernels run on the same GPU at the same time. See the CUDA Programming Guide for more information and check out the concurrentKernels sample in the SDK, in essence you manual tell the scheduler that there is no dependency between the two (by using CUDA streams) which allows thenm to be executed simultaneously.
Have a switch on the blockIdx to select which code to execute.
The first of these is far preferable if your hardware supports it (you will need compute capability 2.0 or greater), since it is far simpler to read and maintain.
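A hedged sketch of the streams approach (assuming a compute capability 2.0+ device, and that each kernel is small enough that the GPU has idle multiprocessors to share): each kernel is launched into its own stream, so the scheduler is free to run them concurrently.

```cuda
#include <cuda_runtime.h>

__global__ void vec_sum(const float* a, const float* b, float* s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) s[i] = a[i] + b[i];
}

__global__ void vec_diff(const float* a, const float* b, float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = a[i] - b[i];
}

// d_a, d_b, d_sum, d_diff are assumed to be device pointers already
// allocated and populated; n is the element count.
void launch_concurrently(const float* d_a, const float* d_b,
                         float* d_sum, float* d_diff, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vec_sum <<<blocks, threads, 0, s1>>>(d_a, d_b, d_sum,  n);  // stream 1
    vec_diff<<<blocks, threads, 0, s2>>>(d_a, d_b, d_diff, n);  // stream 2

    cudaDeviceSynchronize();   // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```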
Yes, using Fermi devices and multiple streams.