Jetson TK1 Multiple Streams parallel execution - cuda

Considering that Tk1 has single SM, is it really possible to run streams concurrently ? I have been unable to do so, even with latest vesions of cuda libraries.
So is it really possible ? any sample code would be great. The sample code under cuda Blas also runs sequential as show on visual profiler.
Also a better insight into what "Streams" are good for in a Single SM ?
[Already asked on nvidia dev forum, the forum isnt very active i think]

With a single Kepler SM, it is not possible to run several streams concurrently. Kepler does not support preemption. This is not related to a CUDA version, rather related to the capability of the SM. Something related to preemption has be discussed for Pascal at GTC 2016, but nothing before.
Regarding the actual use of streams with single SM, some async functions may behave slightly differently between stream 0 and other streams. Hence, I assume some corner case of async memcopy and execution might benefit from streams with single SM - as TK1 device query reads that it has concurrent copy and exec with 1 copy engine. (even though it might be that ZeroCopy is a better approach on TK1).

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have Nvidia K40, and for some reason, I want my code only uses portion of the Cuda cores(i.e instead of using all 2880 only use 400 cores for examples), is it possible?is it logical to do this either?
In addition, is there any way to see how many cores are being using by GPU when I run my code? In other words, can we check during execution, how many cores are being used by the code, report likes "task manger" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for cuda. Not to say it couldn't be useful for something. For example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
How you could do this, would be to access the ptx identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to only have 1 block per Streaming Multiprocessor (SM) and then return each kernel based on which kernel you want on which SM's.
I would warn that this method should be reserved for very experienced cuda programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a threadblock could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and cuda version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTS register for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
Not sure, whether it works with the K40, but for newer Ampere GPUs there is the MIG Multi-Instance-GPU feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know such methods, but would like to get to know.
As to question 2, I suppose sometimes this can be useful. When you have complicated execution graphs, many kernels, some of which can be executed in parallel, you want to load GPU fully, most effectively. But it seems on its own GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with 30-blocks grid and 30 SMs, this kernel can occupy entire GPU. I believe I saw such effect. Really this kernel will be faster (maybe 1.5x against 4 256-threads blocks per SM), but this will not be effective when you have another work.
GPU can't know whether we are going to run another kernel after this one with 30 blocks or not - whether it will be more effective to spread it onto all SMs or not. So some manual way to say this should exist
As to question 3, I suppose GPU profiling tools should show this, Visual Profiler and newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be Task manager, but a statistics for kernels that were executed by your program instead.
As to possibility to move thread blocks between SMs when necessary,
#ChristianSarofeen, I can't find mentions that this is possible. Quite the countrary,
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such thing as preemption. As I remember NVidia advertised it in the following way. Let's say you made a game that run some heavy kernels (say for graphics rendering). And then something unusual happened. You need to execute some not so heavy kernel as fast as possible. With preemption you can unload somehow running kernels and execute this high priority one. This increases execution time (of this high pr. kernel) a lot.
I also found such thing:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernels invocation take a lot of time (of course in case of a stream of kernels and if you don't await for results in between). If you call several kernels, it seems possible to send all necessary data for all kernels while the first kernel is executing on GPU. So I believe NVidia means that it runs several kernels in parallel and perform some smart load-balancing between SMs.

Concurrent GPU kernel execution from multiple processes

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context, targeting the same GPU. According to the Fermi white paper[1], application-level context switching is less then 25 microseconds, but the launches are effectively serialized as they launch on the GPU -- so Fermi wouldn't work well for this. According to the Kepler white paper[2], there is something called Hyper-Q that allows for up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.
My questions: Has anyone tried this on a Kepler GPU and verified that its kernels are run concurrently when scheduled from distinct processes? Is this just a CUDA feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?
[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
In response to the first question, NVIDIA has published some hyper-Q results in a blog here. The blog is pointing out that the developers who were porting CP2K were able to get to accelerated results more quickly because hyper-Q allowed them to use the application's MPI structure more or less as-is and run multiple ranks on a single GPU, and get higher effective GPU utilization that way. As mentioned in the comments, this (hyper-Q) feature is only available on K20 processors currently, as it is dependent on the GK110 GPU.
I've run simultaneous kernels from Fermi architecture it works wonderfully and in fact, is often the only way to get high occupancy from your hardware. I used OpenCL and you need to run a separate command queue from a separate cpu thread in order to do this. Hyper-Q is the ability to dispatch new data parallel kernels from within another kernel. This is only on Kepler.

CUDA development on different cards?

I'm just starting to learn how to do CUDA development(using version 4) and was wondering if it was possible to develop on a different card then I plan to use? As I learn, it would be nice to know this so I can keep an eye out if differences are going to impact me.
I have a mid-2010 macbook pro with a Nvidia GeForce 320M graphic cards(its a pretty basic laptop integrated card) but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if its possible to develop locally on my laptop and then run it on EC2 with minimal changes(I'm doing this for a personal project and don't want to spend $2.4 for development).
A specific question is, I heard that recursions are supported in newer cards(and maybe not in my laptops), what if I run a recursion on my laptop gpu? will it kick out an error or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is kind of the what I'm getting at).
If this is going to be a problem, is there emulators for features not avail in my current card? or will the SDK emulate it for me?
Sorry if this question is too basic.
Yes, it's a pretty common practice to use different GPUs for development and production. nVidia GPU generations are backward-compatible, so if your program runs on older card (that is if 320M (CC1.3)), it would certainly run on M2070 (CC2.0)).
If you want to get maximum performance, you should, however, profile your program on same architecture you are going to use it, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provide much worse view of what's going on than running on no-matter-how-old GPU.
Regarding recursion: an attempt to compile a program with obvious recursion for 1.3 architecture produces compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is in detecting recursions), but certainly won't work: in 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend you to avoid recursion at any cost: it goes against GPGPU programming paradigm, and would certainly lead to very poor performance. Most algorithms are easily rewritten without the use of recursion, and it is much more preferable way to utilize them, not only on GPU, but on CPU as well.
The Cuda Version at first is not that important. More important are the compute capabilities of your card.
If you programm your kernels using cc 1.0 and they are scalable for the future you won't have any problems.
Choose yourself your minimum cc level you need for your application.
Calculate necessary parameters using properties and use ptx jit compilation:
If your kernel can handle arbitrary input sized data and your kernel launch configuration scales across thousands of threads it will scale across future versions.
In my projects all my kernels used a fixed number of threads per block which was equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor to reach 100% occupancy.
Some kernels need a multiple of two number of threads per block so I handled this case also since not for all cc versions the above equation guaranteed a multiple of two block size.
Some kernels used shared memory and its size was also deducted by the cc level properties.
This data was received using (cudaGetDeviceProperties) in a utility class and using ptx jit compiling my kernels worked without any changes on all devices. I programmed on a cc 1.1 device and ran tests on latest cuda cards without any changes!
All kernels were programmed to work with 64-bit length input data and utilizing all dimensions of the 3D Grid. (I am pretty sure in a year I will continue working on this project so this was necessary)
All my kernels except one did not exceeded the cc 1.0 register limit while having 100% occ. So if the used card cc was below 1.2 I added a maxregcount command to my kernel to still enforce 100% occ.
This does not guarantees best possible performance!
For possible best performance each kernel should be analyzed regarding its parameters and resources.
This maybe is not practicable for all applications and requirements
The NVidia Kepler K20 GPU available in Q4 2012 with CUDA 5 will support recursive algorithms.

Multi-GPU programming strategies using CUDA

I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am hinging on embarassingly parallel ones) on a Multi-GPU node using CUDA 4.0 by following the strategies listed below. The intention is to profile the node, by launching kernels in different strategies that CUDA provide on a multi-GPU environment.
Single host thread - multiple devices (shared context)
Single host thread - concurrent execution of kernels on a single device (shared context)
Multiple host threads - (Equal) Multiple devices (independent contexts)
Single host thread - Sequential kernel execution on one device
Multiple host threads - concurrent execution of kernels on one device (independent contexts)
Multiple host threads - sequential execution of kernels on one device (independent contexts)
Am I missing out any categories? What is your opinion about the test categories that I have chosen and any general advice w.r.t multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so modified it.
Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you would call cuCtxPopCurrent()/cuCtxPushCurrent() to change the context that is current to a given thread. But starting with CUDA 4.0, you can just call cudaSetDevice() to set the current context to correspond to a given device.
Your option 1) is a misnomer, though, because there is no "shared context" - the GPU contexts are still separate and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.
Multiple host threads - equal multiple devices, independent contexts is a winner if you can get away with it. This is assuming that you can get truly independent units of work. This should be true since your problem is embarassingly parallel.
Caveat emptor: I have not personally built a large scale multi-GPU system. I have built a successful single GPU system w/ 3 orders of magnitude acceleration relative to CPUs. Thus, the advice is generalization of the synchronization costs I've seen as well as discussion with my colleagues who have built multi-GPU systems.

Different threads on different multiprocessors

Is it possible to run different threads on different multiprocessors? similar to CPU cores?
Suppose I have 2 large arrays a, b and I want to compute both sum and difference. Lets say I have 2 multiprocessors on my device. Is it possible to run both kernel functions (which compute sum and difference) concurrently on 2 different multiprocessors?
Using your example of computing both the sum and difference, the best performance is probably going to be achieved if you compute both at the same time (i.e. in the same kernel).
Assuming this is not possible for some reason, then if your arrays are very large then the best performance may be to use the whole GPU (i.e. multiple multiprocessors) to compute the result in which case it doesn't matter too much that you do one after the other.
For both of the above cases I would strongly recommend you check out the reduction sample in the SDK which walks you through a naive implementation up to a pretty quick version with good documentation.
Having said all of that, if the amount of work is sufficiently small that you would not be fully utilising the whole GPU for one of your computations then there are two ways to run different computations on different multiprocessors:
Use "Concurrent Kernels", where multiple kernels run on the same GPU at the same time. See the CUDA Programming Guide for more information and check out the concurrentKernels sample in the SDK, in essence you manual tell the scheduler that there is no dependency between the two (by using CUDA streams) which allows thenm to be executed simultaneously.
Have a switch on the blockIdx to select which code to execute.
The first of these is far more preferable if your hardware supports it (you will need Compute Capability 2.0 or greater) since it is far simpler to read and maintain.
yes, using Fermi devices and multiple streams.