So I'm taking an algorithms class, and for my final project I decided to take some of the CUDA material I learned at work, put together some GPGPU sorting algorithms, and evaluate their growth over different objects. But then I've gone and confused myself over how many threads actually run at the same time... Please let me know if I am understanding this incorrectly.
So I have a GeForce GT 650M, a CUDA compute capability 3.0 card.
It has 2 multiprocessors, so each processor takes 1 block at once. That part I get...but beyond that is where it starts to get fuzzy for me.
The largest number of threads running concurrently is 384: 2 MPUs * 192 cores/MPU, or 2 MPUs * 6 warps/MPU * 32 threads/warp.
Each MPU has a set of 192 CUDA cores, which means that each processor can do up to 192 operations at once (yes? no?) regardless of the maintained thread count. So the parallelization factor is <= 384, correct?
Each block runs n <= 1024 threads at once, and the warp scheduler chooses which of them take up the 192 cores. A warp (of 32 threads, I believe, but I could be wrong) is the group that executes together.
When a kernel is called, the GPU distributes the blocks equally. If you have an odd number of blocks, there will be a period of time where you only have 192 threads running.
However, if a single thread in a warp finishes early, then it must wait until all other threads in the warp finish before it skips on to the next warp.
A block of warps will finish before moving to the next block. Up to 16 blocks are allowed to run at the same time on an MPU. (Why on earth would this happen, btw?) However, all blocks must finish before the next kernel is called.
Is this right?
It's preferred that you ask one question per question. Furthermore, there are many questions like this on SO. You might try searching and reading some of those.
each processor takes 1 block at once.
That could be true for a specific code, but it is not generally true. An SM (MPU) can have multiple threadblocks "open" at once and, on a cycle-by-cycle basis, it selects warps from any of them to schedule for execution.
each processor can do up to 192 operations at once
It depends on the operation. Single precision floating point add/multiply operations, probably yes. Others, probably not.
regardless of the maintained thread count
What? No. If you're not running a full complement of threads, and in fact usually if you are not oversubscribed on threads (i.e. warps), the machine will likely not run at full capacity.
So the parallelization factor <= 384, correct?
Do you want to define "parallelization factor"? It's not entirely obvious. We've already established that, for some types of operations, you could get for example 384 SP floating-point operations retired in a single clock cycle. But your mileage may vary, depending on the operation. (Integer ops will typically be less.)
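As a side note, the relevant hardware limits can be queried at run time rather than reasoned out from the spec sheet; here is a minimal sketch using the CUDA runtime API (the exact values printed depend on your GPU):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Per-SM core counts are not reported directly, but these fields are:
    printf("Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```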
(why on earth would this happen, btw?).
Because, in fact, we generally want to oversubscribe the SMs. If an SM has 192 CUDA "cores", that does not mean we want to think about exactly 192 threads (or 6 warps) for that SM. This is a common misconception in GPU programming. A GPU hides latency by doing rapid context switching from a warp that is stalled (perhaps due to a memory reference) to a warp that is not stalled. If there are no other (un-stalled) warps available, then the SM will stall waiting for a warp to become ready to execute, and your performance will suffer. Having lots of "extra" warps ready to go helps prevent an SM stall.
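To make that concrete, here is a minimal sketch of the usual pattern: launch far more threads than there are cores and let the hardware swap stalled warps for ready ones. The kernel name and launch sizes are illustrative only:

```
__global__ void scale(float *x, float a, int n) {
    // Grid-stride loop: each thread handles multiple elements, so the
    // launch can oversubscribe the SMs regardless of the core count.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        x[i] *= a;   // while one warp waits on the load of x[i], another warp runs
    }
}

// e.g. scale<<<64, 256>>>(d_x, 2.0f, n);  // 16384 threads in flight on a 384-"core" GPU
```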
Is this right?
Some of your assertions are correct. I've tried to address the ones that seemed incorrect, but it seems overall your understanding is not clear (as you say, "fuzzy"). In my opinion, your question is poorly written. It's nice to have one or a very small number of "crisp" questions to answer. This question feels like you want a dialog or a treatise, and SO is not designed for that. If you want a comprehensive intro to CUDA, read the available documentation or take some of the available webinars.
I know almost nothing about GPU computing. I have already seen articles written about GPU computing, say Fast minimum spanning tree for large graphs on the GPU or All-pairs shortest-paths for large graphs on the GPU. It sounds like the GPU has some restrictions in computing that the CPU doesn't have. I need to know what kind of computations a GPU can do.
thanks.
Well, I'm a CUDA rookie with some experience, so I think I may help with a response from one beginner to another.
A very short answer to your question is:
It can do the very same things as a CPU, but it has different features that can make it deliver the desired result faster or slower (if you take into account the same cost in hardware).
The CPU, even a multicore one, seeks lower latency, and that leads to a set of demands in its construction. In the opposite direction, the GPU assumes that you have so much independent data to process that, when you apply a single instruction across the data entries, the result from one entry is not needed by the next instruction until the current instruction has been processed for everything (structuring code this way is kind of hard to achieve, and a considerable amount of experience in parallel development is required). Thus, the GPU's construction does not take processing latency into account with the same intensity the CPU does, because latency can be "hidden" by the bulk processing; it also does not worry that much about clock frequency, since that can be compensated by the number of processors.
So I would not dare to say that the GPU has restrictions over the CPU; I would say that it has a more specific processing purpose, like a sound card for example, and its construction takes advantage of this specificity. Comparing the two is like comparing a snowmobile to a bike: it does not make real sense.
But one thing can be stated: if a highly parallel approach is possible, the GPU can provide more efficiency at a lower cost than the CPU. Just remember that CPU stands for Central Processing Unit, and "central" can be understood as meaning it must be more general than the peripheral ones.
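As a concrete example of a "highly parallel approach", here is the canonical vector addition, sketched with illustrative names: every output element is independent, which is exactly the shape of problem a GPU is built for.

```
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one independent element per thread
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side, assuming d_a, d_b, d_c were already allocated and copied to the device:
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```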
First of all, your code should consist of many loops, so that the scheduler can switch between them when it can't find enough resources to complete one. After that, you should make sure that your code doesn't face one of the following limitations (a short sketch after the list illustrates the first two):
1. Divergence: if your code has long if statements, then it is likely to be divergent on the GPU. Every 32 threads are grouped together and one instruction is issued to all of them at once. So when the if branch is executed on some threads, the others that fall into the else branch have to wait, and vice versa, which drops performance.
2. Uncoalesced memory access: the other thing is the memory access pattern. If you access global memory in order, you can utilize the maximum memory bandwidth, but if your accesses to global memory are disordered, memory access becomes a bottleneck. So if your code is very cache-friendly, don't go for the GPU, as the ratio of ALU to cache on a GPU is much lower than on a CPU.
3. Low occupancy: if your code consumes many registers, much shared memory, lots of loads/stores, or special math functions (like trigonometric ones), then it's likely you'll run short of resources, which prevents you from reaching the full computational capacity of the GPU.
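A minimal sketch of the first two pitfalls (the kernel names and the stride parameter are just for illustration):

```
// Divergent: threads in the same warp take different branches,
// so the warp executes both paths one after the other.
__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)  x[i] *= 2.0f;   // even lanes run while odd lanes wait
    else             x[i] += 1.0f;   // then odd lanes run while even lanes wait
}

// Uncoalesced: adjacent threads read addresses 'stride' elements apart,
// so one warp touches many memory segments instead of one.
__global__ void strided(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];         // coalesced only when stride == 1
}
```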
I read that the number of threads in a warp can be 32 or more. Why is that? If the number is less than 32 threads, does that mean the resources go underutilized, or that we will not be able to tolerate memory latency?
Your question needs clarification - perhaps you are confusing the CUDA "warp" and "block" concepts?
Regarding warps, it's important to remember that warps and their size are a property of the hardware. Warps are a grouping of hardware threads that execute the same instruction (these days) every cycle. In other words, the warp size indicates the SIMD-style execution width, something the programmer cannot change. In CUDA you launch blocks of threads which, when mapped to the hardware, get executed in warp-sized bunches. If you start blocks with a thread count that is not divisible by the warp size, the hardware will simply execute the last warp with some of the threads "masked out" (i.e. they do have to execute, but without any effect on the state of the GPU/memory).
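A small illustration of that last point (names are arbitrary): a launch whose block size is not a multiple of 32 still occupies whole warps, and the usual bounds check just keeps the extra lanes from doing anything visible.

```
__global__ void copyN(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            // lanes past n are masked out, but their warp still issues
        out[i] = in[i];
}

// copyN<<<1, 48>>>(d_in, d_out, 48);  // occupies 2 full warps; 16 lanes of the
//                                     // second warp execute with no effect
```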
For more details I recommend reading carefully the hardware and execution-related sections of the CUDA programming guide.
The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid execution of unused threads: returning and executing a second kernel with a smaller block size.
What I'm asking is, provided unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting in two consecutive kernel execution?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
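A hedged sketch of that idea (the compaction step itself is omitted; valid_count is assumed to be the number of surviving elements):

```
__global__ void processCompacted(float *data, int valid_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // When valid_count is a multiple of 32, the threads retired here form
    // complete warps, which then stop consuming execution resources.
    // This is safe only because no __syncthreads() appears after the return.
    if (i >= valid_count)
        return;

    data[i] = sqrtf(data[i]);   // work only on the compacted elements
}
```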
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
Now that we have GPGPUs with languages like CUDA and OpenCL, do the multimedia SIMD extensions (SSE/AVX/NEON) still serve a purpose?
I read an article recently about how SSE instructions could be used to accelerate sorting networks. I thought this was pretty neat but when I told my comp arch professor he laughed and said that running similar code on a GPU would destroy the SIMD version. I don't doubt this because SSE is very simple and GPUs are large highly-complex accelerators with a lot more parallelism, but it got me thinking, are there many scenarios where the multimedia SIMD extensions are more useful than using a GPU?
If GPGPUs make SIMD redundant, why would Intel be increasing their SIMD support? SSE was 128 bits, now it's 256 bits with AVX, and next year it will be 512 bits. If GPGPUs are better at processing code with data parallelism, why is Intel pushing these SIMD extensions? They might be able to put the equivalent resources (research and area) into a larger cache and branch predictor, thus improving serial performance.
Why use SIMD instead of GPGPUs?
Absolutely SIMD is still relevant.
First, SIMD can more easily interoperate with scalar code, because it can read and write the same memory directly, while GPUs require the data to be uploaded to GPU memory before it can be accessed. For example, it's straightforward to vectorize a function like memcmp() via SIMD, but it would be absurd to implement memcmp() by uploading the data to the GPU and running it there. The latency would be crushing.
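For instance, a 16-bytes-at-a-time equality check with SSE2 intrinsics (host-side code, a minimal sketch that assumes the length is a multiple of 16) operates directly on the buffers in host memory, with no transfer at all:

```
#include <cstddef>
#include <emmintrin.h>   // SSE2 intrinsics

// Returns true if the two buffers are equal; len must be a multiple of 16 here.
bool equal16(const unsigned char *a, const unsigned char *b, size_t len) {
    for (size_t i = 0; i < len; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + i));
        __m128i eq = _mm_cmpeq_epi8(va, vb);      // 0xFF in each byte position that matches
        if (_mm_movemask_epi8(eq) != 0xFFFF)      // any mismatch in this 16-byte block?
            return false;
    }
    return true;
}
```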
Second, both SIMD and GPUs are bad at highly branchy code, but SIMD is somewhat less bad. This is because GPUs group multiple threads (a "warp") under a single instruction dispatcher. So what happens when threads need to take different paths, when an if branch is taken in one thread and the else branch in another? This is called "branch divergence" and it is slow: all the "if" threads execute while the "else" threads wait, and then the "else" threads execute while the "if" threads wait. CPU cores, of course, do not have this limitation.
The upshot is that SIMD is better for what might be called "intermediate workloads:" workloads up to intermediate size, with some data-parallelism, some unpredictability in access patterns, some branchiness. GPUs are better for very large workloads that have predictable execution flow and access patterns.
(There's also some peripheral reasons, such as better support for double precision floating point in CPUs.)
The GPU has controllable dedicated caches, while the CPU has better branching. Other than that, compute performance relies on SIMD width, integer core density, and instruction-level parallelism.
Another important parameter is how far the data is from the CPU or GPU. (Your data could be an OpenGL buffer on a discrete GPU, and you may need to download it to RAM before computing with the CPU; the same effect is seen when a host buffer sits in RAM and needs to be computed on a discrete GPU.)
I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.
Could anybody please clarify the following:
An 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.
I was viewing Stanford University's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?
Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of being able to run 768 threads concurrently with a max of 512 threads?
A more general question: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that the threads in a block are divided into warps (32 threads), and that the SPs execute warps.
You should check out the webinars on the NVIDIA website; you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars; they will really help, as you can see the diagrams and have them explained at the same time.
When you execute a function (a kernel) on a GPU, it executes as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
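A one-kernel illustration of how the two identifiers combine into a global data index (1D case, placeholder body):

```
__global__ void fill(float *data, int n) {
    // Global index: which element of 'data' this particular thread owns.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        data[gid] = 0.0f;   // placeholder work
}
```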
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU; however, to get performance you need to understand the hardware too, which means SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads: if you have 128 threads in a block, for example, then threads 0-31 will be in one warp, 32-63 in the next, and so on. Warps are very important for a few reasons, the most important being the following (a short sketch after this list shows how warp and lane indices fall out of threadIdx, together with a coalesced access):
Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else', then actually all 32 threads will go down both sides. Functionally there is no problem, those threads which should not have taken the branch are disabled so you will always get the correct result, but if both sides are long then the performance penalty is important.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from the memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction and if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
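For illustration, here is a minimal sketch of how warp and lane indices relate to threadIdx, together with a coalesced read (consecutive lanes touching consecutive addresses); the arithmetic at the end is just a placeholder:

```
__global__ void warpDemo(const float *in, float *out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // position within the warp (0..31)
    int warp = threadIdx.x / 32;   // which warp of the block this thread belongs to

    // Coalesced: lane k of a warp reads in[base + k], so the warp's 32 loads
    // fall into the same memory segment(s).
    float v = in[tid];

    out[tid] = v + lane + warp;    // placeholder arithmetic
}
```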
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives the GPUs the scalability - if you have one SM then all blocks run on the same SM on one big queue, if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why a SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
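On newer toolkits the same answer can also be obtained programmatically; a hedged sketch using the runtime occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor, which post-dates the spreadsheet mentioned above), with an illustrative kernel:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x) { /* ... */ }

int main() {
    int blocksPerSM = 0;
    // How many blocks of 256 threads (with no dynamic shared memory)
    // can be resident on one SM for this particular kernel?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```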
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4 way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4 cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.