I learned that test-and-set can work on single processor, but does it also work on multiprocessor systems?
The primary purpose of the test-and-set operation is precisely to allow implementing spinlocks for mutual exclusion on multiprocessor systems, so yes, it has to work there.
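To make the idea concrete, here is a minimal sketch of a test-and-set spinlock on the host side, using C++'s std::atomic_flag (whose test_and_set member is exactly this primitive); the counter, thread count and iteration count are purely illustrative:

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;  // the shared lock word
    long counter = 0;                               // protected by the lock

    void critical_increment(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            // test_and_set atomically sets the flag and returns its previous
            // value; spin until we observe "previously clear", i.e. we won.
            while (lock_flag.test_and_set(std::memory_order_acquire)) { }
            ++counter;                                   // critical section
            lock_flag.clear(std::memory_order_release);  // release the lock
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t)
            workers.emplace_back(critical_increment, 100000);
        for (auto &w : workers) w.join();
        std::printf("counter = %ld (expected 400000)\n", counter);
        return 0;
    }

This works on a multiprocessor precisely because the test and the set happen as one atomic operation, so two cores can never both observe the flag as clear and enter the critical section together.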
I have read the documentation carefully but am still confused by the large amount of information for different CUDA versions.
Is there only one default stream on the entire device, or is there one default stream per process on the host CPU? If the answer depends on the CUDA version, could you also list the situation for the different CUDA versions?
By default, CUDA has a per-process default stream. There is a compiler flag --default-stream per-thread which changes the behaviour to per-host-thread default stream, see the documentation.
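As a rough illustration (the kernel and its launch configuration are made up for this example), the same source behaves differently depending on which way it is compiled:

    // nvcc stream_demo.cu -o stream_demo
    //     -> one legacy default stream shared by the whole process
    // nvcc --default-stream per-thread stream_demo.cu -o stream_demo
    //     -> each host thread gets its own default stream
    #include <cuda_runtime.h>
    #include <thread>

    __global__ void dummy_kernel() {
        // Busy-wait a little so any overlap is visible in a profiler.
        for (volatile int i = 0; i < 1000000; ++i) { }
    }

    void launch_from_host_thread() {
        // No stream argument, so this goes to the default stream. With the
        // legacy build the two host threads share one default stream and the
        // kernels serialize; with per-thread default streams they may overlap.
        dummy_kernel<<<1, 32>>>();
        cudaDeviceSynchronize();
    }

    int main() {
        std::thread t1(launch_from_host_thread);
        std::thread t2(launch_from_host_thread);
        t1.join();
        t2.join();
        return 0;
    }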
Note that streams and host threads are programming-level abstractions for hardware details. Even with a single process, there is a limited number of streams you can use concurrently, depending on the hardware. For example, on the Fermi architecture, all streams were multiplexed into a single hardware queue, but since Kepler there are 32 separate hardware queues (see CUDA Streams: Best Practices and Common Pitfalls).
Since the programming guide does not talk about multiple processes in this part, I believe these abstractions do not define the behaviour of multi-process scenarios. For multiple processes, the relevant term is "CUDA context", which is created for each process and even for each host thread (when using the runtime API). As for how many contexts can be active on a device at the same time, the guide says in 3.4 Compute modes that in the default mode "Multiple host threads can use the device". Since the following exclusive-process mode talks about CUDA contexts instead, I assume that the description of the default mode covers even multiple host threads from multiple processes.
For more info about multi-process concurrency see e.g. How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?, Unleash legacy MPI codes with Kepler's Hyper-Q and CUDA Streams: Best Practices and Common Pitfalls.
Finally, note that multi-process concurrency works this way since the Kepler architecture, which is the oldest supported architecture nowadays. Since the Pascal architecture there is support for compute preemption (see 3.4 Compute modes for details).
I have some experience with NVIDIA CUDA and am now thinking about learning OpenCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as NVIDIA (multiprocessors, SIMT structure, global memory, local memory, registers, caches, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc) and include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
To complicate matters in identifying hardware differences, the terminology used by CUDA is different from that used by OpenCL; for example, the following are roughly equivalent in meaning:
CUDA:              OpenCL:
Thread             Work-item
Thread block       Work-group
Global memory      Global memory
Constant memory    Constant memory
Shared memory      Local memory
Local memory       Private memory
More comparisons and discussion can be found here.
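To see the mapping in code rather than in a table, here is a small CUDA kernel annotated with the rough OpenCL equivalents; vector_add, the 256-thread block size and the tile array are only illustrative:

    // Launch with 256 threads per block, e.g. vector_add<<<blocks, 256>>>(...).
    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        // threadIdx.x                            ~ OpenCL get_local_id(0)
        // blockIdx.x * blockDim.x + threadIdx.x  ~ OpenCL get_global_id(0)
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // CUDA __shared__ memory                 ~ OpenCL __local memory
        __shared__ float tile[256];

        // a, b and c live in global memory in both models.
        if (i < n) tile[threadIdx.x] = a[i];
        __syncthreads();                       // ~ OpenCL barrier(CLK_LOCAL_MEM_FENCE)
        if (i < n) c[i] = tile[threadIdx.x] + b[i];
    }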
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global mem, local mem, streaming multiprocessors, etc...
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts have different names (for example: CUDA "warp" == OpenCL "wavefront").
I hardly understand what the value given by the multiProcessorCount property represents, because I have difficulties grasping the CUDA architecture.
I'm sorry if some of the following statements appear to be naive. From what I understood so far, here are the hardware "layers":
A CUDA processor is a grid of building blocks.
A building block is composed of two or more streaming multiprocessors.
A streaming multiprocessor is composed of many streaming processors, also called cores.
A streaming processor is "massively" threaded, meaning that it implements many hardware-managed threads. One streaming processor (one core) can really compute only one thread at a time, but it has many "hardware threads" that can load data while waiting for their turn to be computed by the SP.
On the software side:
A block is composed of threads, and is executed by a streaming multiprocessor.
If one launches more blocks than there are streaming multiprocessors on the card, I guess the blocks wait in some sort of queue to be executed.
Software threads are distributed to streaming processors, which distribute them to their hardware threads. And similar to the previous case, if one launches more threads than the streaming processors can handle with their hardware threads, the software threads wait in a queue.
In both cases, the maximum number of threads and blocks that one is allowed to launch is independent of the number of streaming multiprocessors, streaming processors, and hardware threads per streaming processor that actually exist on the card. Those notions are software!
Am I at least close to the reality?
With that being said, what does the multiProcessorCount property give? On my 610M, it says I only have one multiprocessor... Does that mean that I only have one streaming multiprocessor? That I have a building block composed of only one streaming multiprocessor? That seems impossible to me. And that would mean that I can only execute one block at a time!
Besides, when the specifications of my card say that I have 48 CUDA cores, are they talking about streaming processors?
Perhaps this answer will help. It's a little out of date now since it refers to old architectures, but the principles are the same.
It is entirely possible for a GPU to consist of a single SM (streaming multiprocessor), especially if it is a mobile GPU. That single SM, which is composed of multiple CUDA cores, can accommodate multiple thread blocks (up to 16 on the latest Kepler-generation GPUs).
In your case, your 610M GPU has one Streaming Multiprocessor (SM), composed of 48 CUDA cores (aka Streaming Processors, SPs).
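If you want to check these numbers yourself, a minimal query of the runtime API looks roughly like this (device 0 is assumed to be the 610M):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

        std::printf("Device: %s (compute capability %d.%d)\n",
                    prop.name, prop.major, prop.minor);
        std::printf("multiProcessorCount        : %d\n", prop.multiProcessorCount);
        std::printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
        // The number of CUDA cores per SM is architecture-dependent (48 for a
        // compute-capability-2.1 Fermi SM such as the 610M's) and is not
        // reported directly by the runtime API.
        return 0;
    }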
Assume I have 8 thread blocks and my GPU has 8 SMs. How does the GPU issue these thread blocks to the SMs?
I have found that some programs and articles suggest a breadth-first manner, that is, in this example each SM runs one thread block.
However, according to a few documents, increasing occupancy may be a good idea if GPU kernels are latency-limited. From that one might infer that the 8 thread blocks would run on 4 or fewer SMs if possible.
I wonder which one is the reality.
Thanks in advance.
It's hard to tell what the GPU is doing exactly. If you have a specific kernel you're interested in, you could try reading and storing the %smid register for each block.
An example of how to do this is given here.
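For reference, a minimal sketch of that technique (the kernel and array names are made up here) reads the %smid special register with inline PTX and records, for each block, which SM it ran on:

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ unsigned int get_smid() {
        unsigned int smid;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        return smid;
    }

    __global__ void record_smid(unsigned int *block_to_sm) {
        if (threadIdx.x == 0)
            block_to_sm[blockIdx.x] = get_smid();
    }

    int main() {
        const int num_blocks = 8;
        unsigned int *d_map, h_map[num_blocks];
        cudaMalloc((void **)&d_map, num_blocks * sizeof(unsigned int));

        record_smid<<<num_blocks, 128>>>(d_map);
        cudaMemcpy(h_map, d_map, num_blocks * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);

        for (int b = 0; b < num_blocks; ++b)
            std::printf("block %d ran on SM %u\n", b, h_map[b]);
        cudaFree(d_map);
        return 0;
    }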
You're asking the wrong question: you shouldn't worry about how the hardware allocates thread blocks to SMs. That's the GPU's responsibility. In fact, since the programming model makes no assumptions as to which blocks will run on which SMs, you get scalability across a pool of computing devices and across future generations.
Instead, you should try to feed the GPU with the optimal number of thread blocks. That's non-trivial, since it's subject to many restrictions.
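One practical way to approach it, sketched below with a placeholder my_kernel and an assumed 256-thread block size, is to let the occupancy API report how many blocks fit on one SM and then scale by the SM count:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel() { /* placeholder workload */ }

    int main() {
        int device = 0, num_sms = 0, blocks_per_sm = 0;
        const int block_size = 256;

        cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
        // How many blocks of my_kernel (at this block size, with no dynamic
        // shared memory) can be resident on one SM at the same time.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel,
                                                      block_size, 0);

        int grid_size = blocks_per_sm * num_sms;  // enough blocks to fill the GPU
        std::printf("Launching %d blocks of %d threads\n", grid_size, block_size);
        my_kernel<<<grid_size, block_size>>>();
        cudaDeviceSynchronize();
        return 0;
    }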
Yet another bandwidth-related question. I expected the plot of device-to-host bandwidth and that of host-to-device bandwidth to be similar, but I see that there is a significant difference between the two. Considering that both follow the same route, the effective bandwidth should be the same, shouldn't it? The testbed consists of a total of 12 Intel Westmere CPUs on two sockets and 4 Tesla C2050 GPUs in 4 PCI Express Gen2 slots, using the bandwidthTest program from the NVIDIA code samples.
What are the overheads of doing a cudaMemcpy from the host vs. from the device?
First, I would say those two curves are similar. I can honestly say that I've never seen symmetric PCI-e bandwidth on any system I have used -- and that includes both CUDA and graphics (OpenGL/D3D) tests, so I don't think it's something (especially this small difference) that should concern you.
As with your other PCI-e bandwidth question, the answer is similar -- the driver may use different strategies for different types and sizes of transfers, attempting to get the highest throughput possible.
Actual throughput depends on many factors, including the type of GPU, and especially on the host chipset in use.
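If you want to reproduce the asymmetry outside of bandwidthTest, a minimal timing sketch with pinned host memory and CUDA events looks roughly like this (the 64 MiB transfer size is arbitrary, and a real test would average many repetitions):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;  // 64 MiB per transfer
        float *h_buf, *d_buf;
        cudaMallocHost((void **)&h_buf, bytes);  // pinned host memory
        cudaMalloc((void **)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float ms;

        // Host -> Device
        cudaEventRecord(start);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("H2D: %.2f GB/s\n", bytes / (ms * 1.0e6));

        // Device -> Host
        cudaEventRecord(start);
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("D2H: %.2f GB/s\n", bytes / (ms * 1.0e6));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }

Running the two directions back to back on the same pinned buffer keeps everything except the transfer direction identical, which makes any remaining H2D/D2H difference easier to attribute to the driver and the host chipset.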