I'm struggling to understand what the value given by the multiProcessorCount property represents, because I'm having trouble grasping the CUDA architecture.
I'm sorry if some of the following statements appear to be naive. From what I understood so far, here are the hardware "layers":
A CUDA processor is a grid of building blocks.
A building block is composed of two or more streaming multiprocessors.
A streaming multiprocessor is composed of many streaming processors, also called cores.
A streaming processor is "massively" threaded, meaning that it implements many hardware-managed threads. One streaming processor, i.e. one core, can really compute only one thread at a time, but it has many "hardware threads" that can load data while waiting for their turn to be computed by the SP.
On the software side:
A block is composed of threads, and is executed by a streaming multiprocessor
If one launched more blocks than the number of streaming multiprocessors on the card, I guess blocks wait in some sort of queue, to be executed.
Software threads are distributed to streaming processors, which distribute them to their hardware threads. And similar to the previous case, if one launches more threads than the streaming processors can handle with their hardware threads, software threads wait in a queue.
In both cases, the maximum number of threads and blocks that one is allowed to launch is independent of the number of streaming multiprocessors, streaming processors, and hardware threads per streaming processor that actually exist on the card. Those notions are software!
Am I at least close to the reality?
With that being said, what does the multiProcessorCount property give? On my 610M, it says I only have one multiprocessor... Does that mean that I only have one streaming multiprocessor? I would have a building block composed of only one streaming multiprocessor? That seems impossible to me. And that would mean that I can only execute one block at a time!
Besides, when the specifications of my card say that I have 48 CUDA cores, are they talking about streaming processors?
Perhaps this answer will help. It's a little out of date now since it refers to old architectures, but the principles are the same.
It is entirely possible for a GPU to consist of a single SM (streaming multiprocessor), especially if it is a mobile GPU. That single SM, which is composed of multiple CUDA cores, can accommodate multiple thread blocks (up to 16 on the latest Kepler-generation GPUs).
In your case, your 610M GPU has one Streaming Multiprocessor (SM), composed of 48 CUDA cores (aka Streaming Processors, SPs).
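For what it's worth, you can read that value yourself through the CUDA runtime API. Here is a minimal sketch using the standard cudaGetDeviceProperties call (nothing specific to the 610M):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("name: %s\n", prop.name);
    printf("multiProcessorCount (number of SMs): %d\n", prop.multiProcessorCount);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}

On a 610M this should report 1 SM; the 48 CUDA cores inside that SM are not reported directly by the device properties.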
I am trying to understand the basic architecture of a GPU. I have gone through a lot of material, including this very good SO answer, but I am still confused and not able to get a good picture of it.
My Understanding:
A GPU contains two or more Streaming Multiprocessors (SM) depending upon the compute capability value.
Each SM consists of Streaming Processors (SP) which are actually responsible for the execution of instructions.
Each block is processed by SPs in the form of warps (32 threads).
Each block has access to a shared memory. A different block cannot access the data of some other block's shared memory.
Confusion:
In the following image, I am not able to understand which one is the Streaming Multiprocessor (SM) and which one is the SP. I think that Multiprocessor-1 represents a single SM and Processor-1 (up to M) represents a single SP. But I am not sure about this, because I can see that each Processor (in blue) has been provided a Register, and as far as I know, registers are provided per thread.
It would be very helpful to me if you could provide some basic overview w.r.t this image or any other image.
First, some comments on the "My understanding" portion of the question:
The number of SMs depends on GPU model - there are low-end models with just one SM, and high-end ones with as many as 30! Compute capability defines what those SMs are capable of, but not how many SMs there are in a GPU.
Each thread block is assigned to an SM, not SP. There can be multiple thread blocks running on a given SM, subject to its resource limitations.
On to the diagram:
Orange boxes are indeed SMs, just as they are labeled. Each SM has a shared memory pool, divided between all thread blocks running on this SM.
Blue boxes are SPs. Since SP is a scalar lane, it runs one thread, and each thread is provided with its own set of registers, again, just like the diagram shows.
Addressing the follow-up question:
Each SM can have multiple resident thread blocks. The maximum number of thread blocks resident on an SM is determined by the compute capability. The achieved number can be lower than the maximum when it is limited by the number of registers or the amount of shared memory consumed by each thread block.
The SM will then schedule instructions from all warps resident on it, picking among warps that have instructions ready for execution - and those warps may come from any thread block resident on this SM. You generally want to have many warps resident, so that at any given moment the SPs can be kept busy running instructions from whatever warps are ready.
Number of cores per SM is not a very useful metric, and you need not think too much about it at this point.
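If you want to see the resident-block limit for a concrete kernel, the runtime has an occupancy query for it. A minimal sketch (the kernel here is just an illustrative placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    int blocks_per_sm = 0;
    // How many 256-thread blocks of dummy_kernel can be resident on one SM,
    // given its register usage and 0 bytes of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy_kernel, 256, 0);
    printf("resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}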
Assume I have an Nvidia K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, only use 400 cores, for example). Is it possible? Is it logical to do this, either?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something, for example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional at the start of the kernel. You would have to have only 1 block per Streaming Multiprocessor (SM) and then have each block return early based on which kernel you want on which SMs.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware- and CUDA-version dependent. However, since you asked, and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly...
https://gist.github.com/allanmac/4751080
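For illustration only, here is roughly what that looks like (a sketch along the lines of the gist above; the kernel and its work-distribution scheme are hypothetical, and %smid values are not guaranteed to be stable across architectures):

// Read the SM index the current block is running on (inline PTX, no .ptx file needed).
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Blocks that land on an SM outside the allowed range retire immediately.
// The surviving blocks pull work from a global counter so nothing is skipped.
__global__ void restricted_double(float *data, unsigned int n,
                                  unsigned int num_allowed_sms,
                                  unsigned int *work_counter) {
    if (get_smid() >= num_allowed_sms) return;

    __shared__ unsigned int base;
    for (;;) {
        if (threadIdx.x == 0)
            base = atomicAdd(work_counter, blockDim.x);  // grab the next tile
        __syncthreads();
        if (base >= n) break;
        unsigned int i = base + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
        __syncthreads();  // keep base stable until every thread has used it
    }
}

You would launch this with at least one block per SM, e.g. restricted_double<<<num_sms, 256>>>(d_data, n, 4, d_counter), accepting that blocks landing on disallowed SMs do nothing.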
Not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but would like to learn about them.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, with many kernels, some of which can be executed in parallel, you want to load the GPU fully and most effectively. But it seems that, on its own, the GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. This kernel will indeed be faster (maybe 1.5x against 4 blocks of 256 threads per SM), but it will not be effective when you have other work.
The GPU can't know whether we are going to run another kernel after this one with 30 blocks or not, i.e. whether it would be more effective to spread it onto all SMs or not. So some manual way to say this should exist.
As to question 3, I suppose the GPU profiling tools should show this: the Visual Profiler and the newer Nsight tools (Nsight Systems, Nsight Compute). But I didn't try. This will not be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary,
@ChristianSarofeen, I can't find mentions that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVidia advertised it in the following way. Let's say you made a game that runs some heavy kernels (say, for graphics rendering). Then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one. This greatly improves the response time of that high-priority kernel.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least in the case of a stream of kernels where you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all kernels while the first kernel is executing on the GPU. So I believe NVidia means that it runs several kernels in parallel and performs some smart load-balancing between SMs.
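For context, here is a minimal stream-capture sketch of what that quote describes (the kernel and sizes are placeholders; the cudaGraphInstantiate signature shown is the CUDA 12 one, older toolkits take extra error-reporting arguments):

#include <cuda_runtime.h>

__global__ void step(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record a short chain of kernel launches into a graph
    // instead of submitting them one by one.
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 3; ++k)
        step<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);

    // The whole three-kernel workflow is now relaunched with a single call.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}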
I'm studying OpenCL concepts as well as the CUDA architecture for a small project, and there is one thing that is unclear to me: the necessity for Warps.
I know a lot of questions have been asked on this subject; however, after having read some articles I still don't get the "meaning" of warps.
As far as I understand (speaking for my GPU card, which is a Tesla, but I guess this easily translates to other boards):
A work-item is linked to a CUDA thread, several of which can be executed by a Streaming Processor (SP). BTW, does an SP treat those work-items in parallel?
Work-items are grouped into work-groups. Work-groups operate on a Streaming Multiprocessor and cannot migrate. However, work-items in a work-group can collaborate via shared memory (a.k.a. local memory). One or more work-groups may be executed by a Streaming Multiprocessor. BTW, does an SM treat those work-groups in parallel?
Work-items are executed in parallel inside a work-group. However, synchronization is NOT guaranteed; that's why you need concurrent programming primitives, such as barriers.
As far as I understand, all of this is rather a logical view than a 'physical', hardware perspective.
If all of the above is correct, can you help me on the following. Is that true to say that:
1 - Warps execute 32 threads or work-items simultaneously. Thus, they will 'consume' parts of a work-group. And that's why in the end you need stuff like memory fences to synchronize work-items in work groups.
2 - The warp scheduler allocates the registers for the 32 threads of a warp when it becomes active.
3 - Also, are the threads executed in a warp synchronized at all?
Thanks for any input on Warps, and especially why they are necessary in the CUDA architecture.
My best analogy is that a warp is the vector that gets processed in parallel, not unlike an AVX or SSE vector on an Intel CPU. This makes an SM a 32-lane vector processor.
Then, to your questions:
Yes, all 32 elements will be run in parallel. Note also that a GPU takes hyperthreading to the extreme: a work-group will consist of multiple warps, which are all run more-or-less in parallel. You will need memory fences to synchronise all that.
Yes, typically all 32 work-items (CUDA: threads) in a warp will work in parallel. Note that you will typically have multiple registers per work-item.
Not guaranteed, AFAIK.
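To make the barrier point concrete, here is a small CUDA sketch (illustrative names): a block-level reduction where __syncthreads() is still needed even once the active threads all fit inside a single warp, because lockstep execution within a warp is not guaranteed on Volta and later.

__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // barrier over the whole work-group

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                   // needed even when stride < warpSize
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

Launched as block_sum<<<num_blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n), each block produces one partial sum.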
In my current understanding, the hardware hierarchy of the CUDA model is GPU card -> Streaming Multiprocessors (SMs) -> cores, and the program hierarchy is kernel -> grid -> block -> warp -> single thread. I want to know the correspondence between the hardware and program hierarchies. That is: Is a kernel in general composed of several grids? Is a grid contained in the GPU card or in an SM? If a grid is contained in the GPU card, can the GPU card contain only one grid or multiple grids? Does a block correspond to an SM? Can an SM contain only one block or multiple blocks? Can a block span several SMs? Can a core execute only one thread or multiple threads? etc.
A kernel is a function that runs on the GPU.
The grid is all threadblocks associated with a kernel launch. A kernel launch creates a single grid.
A grid can run on the entire GPU device (all SMs in the GPU).
A grid is composed of threadblocks.
Threadblocks are groups of threads. Threads are grouped into warps (32 threads) for execution purposes, so we can also say threadblocks are groups of warps.
Threadblocks (the warps they contain) execute on an SM. Once a threadblock begins executing on a particular SM, it stays on that SM and will not migrate to another SM.
SMs are composed of cores. Each core executes one thread. The core execution engine may have the ability to handle multiple instructions at a time, so it can actually handle more than one thread, but not from the same warp. This part gets complicated and it's not essential to a good beginner's understanding of how a GPU works, so it's convenient and useful to think of a core as handling only one thread at any given instant (instruction cycle).
An SM can handle multiple blocks simultaneously.
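As a concrete (purely illustrative) example of the software side of that hierarchy:

#include <cstdio>

// One launch = one grid. Here the grid has 4 thread blocks of 256 threads,
// i.e. 8 warps per block. Which SM each block runs on is chosen by the
// hardware scheduler, not by this code.
__global__ void hello_hierarchy() {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_in_block = threadIdx.x / warpSize;   // warpSize is 32
    if (threadIdx.x % warpSize == 0)
        printf("block %d, warp %d, first global id %d\n",
               blockIdx.x, warp_in_block, global_id);
}

int main() {
    hello_hierarchy<<<4, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}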
Please don't post questions that contain many questions.
Questions on SO should show some research effort.
Good research effort for questions like these would be to take some basic webinars from the NVIDIA webinar page, which will only require a couple hours of study.
Try these two first:
GPU Computing using CUDA C – An Introduction (2010)
An introduction to the basics of GPU computing using CUDA C. Concepts will be illustrated with walkthroughs of code samples. No prior GPU Computing experience required
GPU Computing using CUDA C – Advanced 1 (2010)
First level optimization techniques such as global memory optimization, and processor utilization. Concepts will be illustrated using real code examples
I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper [1], application-level context switching is less than 25 microseconds, but the launches are effectively serialized as they launch on the GPU -- so Fermi wouldn't work well for this. According to the Kepler white paper [2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.
My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
In response to the first question, NVIDIA has published some Hyper-Q results in a blog here. The blog points out that the developers who were porting CP2K were able to get to accelerated results more quickly because Hyper-Q allowed them to use the application's MPI structure more or less as-is and run multiple ranks on a single GPU, getting higher effective GPU utilization that way. As mentioned in the comments, this (Hyper-Q) feature is only available on K20 processors currently, as it is dependent on the GK110 GPU.
I've run simultaneous kernels on the Fermi architecture; it works wonderfully and, in fact, is often the only way to get high occupancy from your hardware. I used OpenCL, and you need to run a separate command queue from a separate CPU thread in order to do this. Hyper-Q is the ability to dispatch new data-parallel kernels from within another kernel. This is only on Kepler.
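For the within-process case, here is a minimal CUDA sketch of what "concurrent kernels" means (the kernel is a placeholder; actual overlap depends on the hardware and on each kernel leaving resources free):

__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            x[i] = x[i] * 1.0001f + 0.0001f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different streams are candidates for overlap.
    // Across *processes*, the same effect additionally needs Hyper-Q / MPS.
    busy<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    busy<<<(n + 255) / 256, 256, 0, s2>>>(b, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}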