What is static and dynamic scheduling on GPUs?

GTX 4xx, 5xx (Fermi) had dynamic scheduling and GTX 6xx (Kepler) switched to static scheduling.
What is static and dynamic scheduling in the context of GPUs?
How does the design choice of static vs. dynamic affect the performance of real world compute workloads?
Is there anything that can be done in code to optimize an algorithm for static or dynamic scheduling?

I assume you're referring to static/dynamic instruction scheduling in hardware.
Dynamic instruction scheduling means that the processor may re-order the individual instructions at runtime. This usually involves some bit of hardware that will try to predict the best order for whatever is in the instruction pipeline. On the GPUs you mentioned, this refers to the re-ordering of instructions for each individual warp.
The reason for switching from a dynamic scheduler back to a static scheduler is described in the GK110 Architecture Whitepaper as follows:
We also looked for opportunities to optimize the power in the SMX warp
scheduler logic. For example, both Kepler and Fermi schedulers contain
similar hardware units to handle the scheduling function, including:
Register scoreboarding for long latency operations (texture and
load)
Inter‐warp scheduling decisions (e.g., pick the best warp to go
next among eligible candidates)
Thread block level scheduling (e.g., the GigaThread engine)
However, Fermi’s scheduler also contains a complex hardware stage to
prevent data hazards in the math datapath itself. A multi‐port
register scoreboard keeps track of any registers that are not yet
ready with valid data, and a dependency checker block analyzes
register usage across a multitude of fully decoded warp instructions
against the scoreboard, to determine which are eligible to issue.
For Kepler, we recognized that this information is deterministic (the
math pipeline latencies are not variable), and therefore it is
possible for the compiler to determine up front when instructions will
be ready to issue, and provide this information in the instruction
itself. This allowed us to replace several complex and power‐expensive
blocks with a simple hardware block that extracts the pre‐determined
latency information and uses it to mask out warps from eligibility at
the inter‐warp scheduler stage.
So basically, they're trading some efficiency for lower chip complexity, i.e. a simpler scheduler. But that potentially lost efficiency is now picked up by the compiler, which can predict the best order, at least for the math pipeline.
As for your final question, i.e. what can be done in code to optimize an algorithm for static or dynamic scheduling, my personal recommendation would be to not use any inline assembler and just let the compiler/scheduler do its thing.
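To make the whitepaper's argument a bit more concrete, here is a minimal toy model (not Nvidia's actual hardware or compiler). It assumes a single warp, in-order issue, and a fixed math latency of 4 cycles, which is an illustrative number. The names Instr, dynamic_issue_cycles, compile_stall_counts and static_issue_cycles are made up for this sketch. It shows that when latencies are fixed, a "compiler" can precompute the stall counts that a runtime scoreboard would otherwise discover, and simple hardware that only honors those counts issues at the same cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dst: str        # destination register
    srcs: tuple     # source registers
    latency: int    # fixed pipeline latency in cycles (assumed known)


def dynamic_issue_cycles(program):
    """Runtime scoreboard: check operand readiness before every issue."""
    ready_at = {}   # register -> cycle at which its value becomes valid
    cycle = 0
    issued = []
    for ins in program:
        # Wait until every source register holds valid data.
        start = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        issued.append(start)
        ready_at[ins.dst] = start + ins.latency
        cycle = start + 1   # in-order, at most one issue per cycle
    return issued


def compile_stall_counts(program):
    """'Compiler' pass: the same analysis, done once ahead of time.

    Returns the number of cycles to wait before each instruction.
    """
    ready_at = {}
    cycle = 0
    stalls = []
    for ins in program:
        start = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        stalls.append(start - cycle)
        ready_at[ins.dst] = start + ins.latency
        cycle = start + 1
    return stalls


def static_issue_cycles(stalls):
    """Simple hardware: no dependency check, just honor the encoded stalls."""
    cycle = 0
    issued = []
    for s in stalls:
        cycle += s
        issued.append(cycle)
        cycle += 1
    return issued


program = [
    Instr("r1", ("r0",), 4),
    Instr("r2", ("r1",), 4),        # depends on r1
    Instr("r3", ("r0",), 4),        # independent of r1 and r2
    Instr("r4", ("r2", "r3"), 4),   # depends on r2 and r3
]

stalls = compile_stall_counts(program)
print("encoded stalls:", stalls)
print("dynamic issue :", dynamic_issue_cycles(program))
print("static issue  :", static_issue_cycles(stalls))
```

The model does not reorder instructions. A real compiler would additionally try to move independent work (like the r3 instruction here) into the stall slots, and variable-latency operations such as loads still need runtime tracking.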

Related

How exactly does CUDA handle a memory access?

I would like to know how the CUDA hardware/runtime system handles the following case.
If a warp (warp1 in the following) issues an instruction that accesses global memory (load/store), the runtime system schedules the next ready warp for execution.
When the new warp is executed:
1. Will the "memory access" of warp1 be conducted in parallel, i.e. while the new warp is running?
2. Will the runtime system put warp1 into a memory access waiting queue and, once the memory request is completed, move the warp into the runnable queue?
3. Will the instruction pointer of warp1 be incremented automatically, in parallel with the execution of the new warp, to indicate that the memory request has completed?
For instance, consider this pseudo code: output=input+array[i]; where output and input are both scalar variables mapped into registers, whereas array is stored in global memory.
To run the above instruction, we need to load the value of array[i] into a (temporary) register before updating output; i.e. the above instruction can be translated into 2 macro assembly instructions: load reg, reg=&array[i], output_register=input_register+reg.
I would like to know how the hardware and runtime system handle the execution of the above 2 macro assembly instructions, given that the load can't return immediately.
I am not sure I understand your questions correctly, so I'll just try to answer them as I read them:
1. Yes, while a memory transaction is in flight, further independent instructions will continue to be issued. There isn't necessarily a switch to a different warp though: while instructions from other warps will always be independent, the following instructions from the same warp might be independent as well, so the same warp may keep running (i.e. further instructions may be issued from the same warp).
2. No. As explained under 1., the warp can and will continue executing instructions until either the result of the load is needed by a dependent instruction, or a memory fence / barrier instruction requires it to wait for the effect of the store to be visible to other threads.
This can go as far as issuing further (independent) load or store instructions, so that multiple memory transactions can be in flight for the same warp at the same time. So the status of a warp after issuing a load/store doesn't change fundamentally and it is not halted until necessary.
3. The instruction pointer will always be incremented automatically (there is no situation where you ever do this manually, nor are there instructions that allow you to do so). However, as 2. implies, this doesn't necessarily indicate that the memory access has been performed; there is separate hardware to track the progress of memory accesses.
Please note that the hardware implementation is completely undocumented by Nvidia. You might find some indications of possible implementations if you search through Nvidia's patent applications.
GPUs up to the Fermi generation (compute capability 2.x) tracked outstanding memory transactions completely in hardware. While undocumented by Nvidia, the common mechanism to track (memory) transactions in flight is scoreboarding.
GPUs from newer generations starting with Kepler (compute capability 3.x) use some assistance in the form of control words embedded in the shader assembly code. While again undocumented, Scott Gray has reverse engineered these for his Maxas Maxwell assembler. He found that (amongst other things) the control words contain barrier instructions for tracking memory transactions and was kind enough to document his findings on his Control-Codes wiki page.
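The behaviour described in points 1 and 2 can be sketched with a small toy model of a single warp. This is only an illustration under assumptions: the actual hardware is undocumented, the load latency of 400 cycles is a made-up figure, math instructions are treated as completing immediately, and run_warp is a hypothetical helper name. The sketch shows the warp continuing to issue independent instructions (including a second load) after the first load, and waiting only when it reaches the add that needs the loaded value.

```python
def run_warp(program, load_latency):
    """Return the issue trace [(cycle, op, dst), ...] for one in-order warp.

    program: list of (op, dst, srcs) tuples.
    Only loads have latency in this model; other ops are treated as
    producing their result immediately (a deliberate simplification).
    """
    pending = {}   # dst register -> cycle at which its load data arrives
    cycle = 0
    trace = []
    for op, dst, srcs in program:
        # Stall only if a source register is still waiting on a load.
        cycle = max([cycle] + [pending.get(r, 0) for r in srcs])
        trace.append((cycle, op, dst))
        if op == "load":
            pending[dst] = cycle + load_latency
        cycle += 1
    return trace


# output = input + array[i], surrounded by some independent work.
program = [
    ("load", "tmp", ("addr",)),             # tmp = array[i]
    ("mul", "a", ("x", "y")),               # independent of the load
    ("load", "tmp2", ("addr2",)),           # second load in flight
    ("add", "output", ("input", "tmp")),    # first use of tmp: must wait
]

trace = run_warp(program, load_latency=400)
for cycle, op, dst in trace:
    print(f"cycle {cycle:3d}: issue {op} -> {dst}")
```

In a real SM, other warps would issue during the gap before the add, which is how the memory latency gets hidden.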

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40 and, for some reason, I want my code to use only a portion of the CUDA cores (e.g. only 400 instead of all 2880). Is this possible? Does it even make sense to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores the code is using, in a report like "Task Manager" in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. That's not to say it couldn't be useful for something, for example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think that for 99% of cases manual shared memory methods would be better).
The way to do this would be to read the PTX special registers %nsmid and %smid and put a conditional on them at the start of the kernels. You would have to have only 1 block per Streaming Multiprocessor (SM) and then return early from each kernel based on which kernel you want on which SMs.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could depend on the hardware and CUDA version. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for SM index and number of SMs...
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use it in a cuda kernel without writing ptx directly...
https://gist.github.com/allanmac/4751080
I'm not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can be executed in parallel, you want to load the GPU fully and as effectively as possible. But it seems that, left on its own, the GPU can occupy all SMs with single blocks of one kernel. I.e. if you have a kernel with a 30-block grid and 30 SMs, this kernel can occupy the entire GPU. I believe I saw such an effect. This kernel really will be faster (maybe 1.5x compared to 4 blocks of 256 threads per SM), but this will not be effective when you have other work.
The GPU can't know whether or not we are going to run another kernel after this one with 30 blocks, i.e. whether or not it will be more effective to spread it across all SMs. So some manual way to specify this should exist.
As to question 3, I suppose the GPU profiling tools should show this: Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I didn't try. This will not be like Task Manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary,
@ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although starting from some architecture there is such a thing as preemption. As I remember, NVidia advertised it in the following way. Let's say you made a game that runs some heavy kernels (say for graphics rendering). Then something unusual happens and you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one. This improves the execution time (of this high-priority kernel) a lot.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least with a stream of kernels where you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all of them while the first kernel is executing on the GPU. So I believe NVidia means that it runs several kernels in parallel and performs some smart load balancing between SMs.

MIPS Architecture : NOP (No-Operation) Vs Data Forwarding in Hazard Prevention

I learnt in a computer architecture course that a data hazard can be prevented by placing several arbitrary, independent nop instructions between two mutually dependent instructions. This can be done at the assembly level by the compiler.
The alternative way to avoid data hazards is to use data forwarding.
I am a bit confused about how these two alternatives differ as far as performance, speed and hardware are concerned, because as far as I know data forwarding has to be implemented at the hardware level, whereas nops can be inserted at the assembly level.
Could anybody please explain which approach is better if we consider factors such as performance, speed, hardware etc.?
Thanks.
Obviously, having the compiler insert nops into the code stream to fill pipeline slots allows hardware to be simplified which can reduce the duration of a pipeline stage or the depth of the pipeline, reduce design effort (time to market, project risk, design cost), or allow a full processor core to fit on a single chip (which helps performance). However, this benefit is tiny compared to the loss of performance from not using forwarding. Higher latency for dependent instructions is very bad for typical programs.
The MIPS R2000, which had both delayed branches and delayed loads, provided result forwarding. (MIPS is an acronym for "Microprocessor without Interlocked Pipeline Stages"). Delayed loads were soon removed from MIPS (which was possible because such did not affect binary compatibility of correct code). The use of delayed instructions was partially from a belief that most delay slots could be filled by the compiler with useful instructions and partially from believing that the increase in code size was not important relative to the simplification of hardware.
Reducing the latency of a load operation was not practical, so the pipeline would need to be stalled for a cycle anyway. The cost of a nop is in cache and memory capacity effects (i.e., the effect of lower code density), and in some cases a single load delay slot could be filled.
Exposing the pipeline organization also has implications for binary compatibility. Later binary compatible implementations must accommodate the ISA designed for the original pipeline organization. A single delayed branch slot works reasonably well for a simple 5-stage scalar implementation (it can be filled with a useful instruction most of the time and allows zero-effective-delay branches [i.e., no stall to resolve the branch or prediction and flushing the pipeline on misprediction]), but when the pipeline is deepened (or made wider) prediction or stalling becomes necessary anyway.
If sufficient parallelism exists in the targeted workloads, hardware simplicity is sufficiently important, and binary compatibility is not a problem, then exposing a pipeline with minimal support for dynamically detecting and handling stall conditions may be sensible. (There are also ways of encoding nops that avoid most of the code size expansion issues.) Having reliably sufficient parallelism (whether instruction-level or thread-level) allows nops to be avoided, either by compiler scheduling with instruction-level parallelism or by hardware thread interleaving with thread-level parallelism.
Hardware simplicity tends to reduce energy per unit of work (as well as chip area), and many modern designs are limited by power use. It also makes sense to perform optimizations at compile time (when they are less latency critical and can be done once rather than each time the code is executed) if the storage and communication cost of additional information is not too expensive (assuming information necessary to perform the optimization is available at compile time [dynamic branch prediction is a classic example of where dynamic information is helpful]).
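To put rough numbers on the nop-versus-forwarding trade-off, here is a small sketch that counts the nops a compiler would have to insert. It assumes the textbook 5-stage pipeline (IF, ID, EX, MEM, WB) with a register file that is written in the first half of a cycle and read in the second half. Under that assumption a consumer must sit 3 slots after its producer without forwarding, 1 slot after an ALU producer with forwarding, and 2 slots after a load with forwarding. The function names and the tiny instruction format are made up for illustration.

```python
def min_distance(producer_op, forwarding):
    """Minimum issue-slot distance between a producer and its consumer."""
    if not forwarding:
        return 3    # consumer's ID must coincide with producer's WB
    return 2 if producer_op == "lw" else 1   # load data arrives after MEM


def insert_nops(program, forwarding):
    """Return the opcode stream with the nops needed to avoid data hazards.

    program: list of (op, dst, srcs); dst may be None (e.g. for a store).
    """
    out = []
    last_writer = {}   # register -> (slot index, op) of its latest producer
    for op, dst, srcs in program:
        earliest = len(out)
        for reg in srcs:
            if reg in last_writer:
                slot, producer_op = last_writer[reg]
                earliest = max(earliest,
                               slot + min_distance(producer_op, forwarding))
        out.extend(["nop"] * (earliest - len(out)))
        if dst is not None:
            last_writer[dst] = (len(out), op)
        out.append(op)
    return out


program = [
    ("lw", "r1", ("r2",)),
    ("add", "r3", ("r1", "r4")),   # uses the loaded value immediately
    ("sub", "r5", ("r3", "r1")),   # uses the ALU result immediately
]

without_fwd = insert_nops(program, forwarding=False)
with_fwd = insert_nops(program, forwarding=True)
print(without_fwd)   # four nops in total
print(with_fwd)      # one nop: the load-use slot
```

For this dependent chain, forwarding cuts the padding from four nops to one, and the remaining one is the load delay slot that forwarding cannot remove.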
Well, basically, since hardware is optimised with forwarding, one would expect explicitly inserted software NOPs to be unnecessary. But that's not the case.
Although forwarding helps reduce data hazards, some hazards simply cannot be dealt with by forwarding.
E.g.
beq R1,R5,label
instruction 2nd
Here the 2nd instruction will not be fetched until instruction 1 has completed its execution stage and decided whether or not to branch. Until then the 2nd instruction has to be stalled (for 2 cycles). This is done in software by inserting NOPs.
With improvements in technology and hardware optimizations, the beq instruction can resolve the branch in its register fetch/decode stage by inserting a comparator in that stage. Even so, the 2nd instruction will be stalled (for 1 cycle now). Again a NOP is needed.

How to adjust the number of CUDA blocks and threads to get optimal performance

I've tested empirically with several values for the number of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure that the threads in a block may have a specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions in N parts, which are allocated to blocks/threads.
My goal would be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Would that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very much limited resource and it's not unlikely for kernels to have very specific needs that constrain those many variables controlling parallelism.
You either have blocks with many threads sharing larger regions or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor you might want to opt for larger (48KB) L1-caches calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application, however, having the client code adjust optimally to both different device and launch parameters can be nontrivial and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe that automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration depends on the implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider:
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
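To see how register usage and shared memory limit occupancy, here is a rough back-of-the-envelope estimator. It is only a sketch: the per-SM limits in LIMITS are illustrative assumptions rather than values read from a device, allocation granularity is ignored, and the function names are made up. On a real system you would query the limits with cudaGetDeviceProperties, or let the runtime suggest a block size with cudaOccupancyMaxPotentialBlockSize. As the note above says, the highest occupancy is not automatically the fastest configuration.

```python
# Illustrative per-SM limits (assumed for this sketch, not read from a device).
LIMITS = {
    "warp_size": 32,
    "max_threads": 2048,
    "max_blocks": 16,
    "registers": 65536,
    "shared_mem": 48 * 1024,   # bytes
}


def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  limits=LIMITS):
    """Resident blocks per SM: the tightest of the resource limits wins."""
    warp = limits["warp_size"]
    # Threads are allocated in whole warps, so round the block size up.
    threads = -(-threads_per_block // warp) * warp
    candidates = [
        limits["max_blocks"],
        limits["max_threads"] // threads,
        limits["registers"] // (regs_per_thread * threads),
    ]
    if smem_per_block > 0:
        candidates.append(limits["shared_mem"] // smem_per_block)
    return min(candidates)


def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              limits=LIMITS):
    """Resident warps divided by the maximum number of warps per SM."""
    warp = limits["warp_size"]
    warps_per_block = -(-threads_per_block // warp)
    resident_warps = warps_per_block * blocks_per_sm(
        threads_per_block, regs_per_thread, smem_per_block, limits)
    return resident_warps / (limits["max_threads"] // warp)


print(occupancy(256, regs_per_thread=32, smem_per_block=0))          # 1.0
print(occupancy(256, regs_per_thread=64, smem_per_block=0))          # 0.5
print(occupancy(256, regs_per_thread=32, smem_per_block=16 * 1024))  # 0.375
```

Doubling the registers per thread halves the occupancy in this model, and 16KB of shared memory per block limits the SM to three resident blocks.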
I've got quite a good answer here. In short, computing the optimal distribution of blocks and threads is a difficult problem.

How is parallelism on a single thread/core possible?

Modern programming languages provide parallelism and concurrency mechanisms as first class citizens to their users. I understand how parallel algorithms are programmed and can well imagine how two threads on a multi-core CPU can run in parallel.
Yet, most of these platforms also support running parallel processes on a single thread.
Do these processes really run in parallel?
How, on an assembly level can two different routines be executed simultaneously on a single thread?
TL;DR: parallelism (in the sense of true simultaneous execution) on a single, non-hyperthreaded CPU core is NOT possible.
Hardware (<- EDIT) parallelism can be achieved at several levels, ordered by decreasing granularity:
1. multi-host
2. multi-processor
3. multi-core
4. multi-threads ("Hyper-Threading", i.e. "HT")
(EDIT: I deliberately omit the case of vectorized computations where several ALUs can be driven by the same core)
Your question relates to running two software threads in cases 3. (in case HT is unavailable / disabled) or 4.
In both cases, the processes actually do NOT run in parallel. The user has an impression of simultaneity due to the extremely fast context switches performed at the CPU level, which allocate the physical core (resp. thread) time sequentially to one or the other software thread.
In both cases, those routines are simply not executed simultaneously, but sequentially.
The relative priority allocated to each of those 2 routines can be set on various OSes by the "Priority" you give to the process, that will be handled by the OS's scheduler, which in turn will allocate CPU time.
HTH.
To perform tests to better understand this topic, you may want to google "cpu affinity". This will let you run a two-threaded process on a single physical core of a multi-core CPU and measure the time taken by each of the threads while modifying their priority, etc.
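The interleaving described above can be imitated with a tiny round-robin scheduler. This is a simplified sketch: real OS schedulers preempt threads on timer interrupts and save/restore registers, whereas here each routine is a Python generator that voluntarily yields after every step. The names routine and round_robin are made up for illustration. It shows that two routines can both make progress on a single thread while only one of them runs at any instant.

```python
from collections import deque


def routine(name, steps):
    """A routine that does `steps` units of work, pausing after each one."""
    for i in range(steps):
        yield f"{name}{i}"   # each yield is a possible context switch point


def round_robin(tasks):
    """Run the tasks one step at a time on a single thread.

    Returns the order in which the steps were actually executed.
    """
    order = []
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            order.append(next(task))   # run one time slice
            queue.append(task)         # not finished: back of the queue
        except StopIteration:
            pass                       # finished: drop the task
    return order


order = round_robin([routine("A", 3), routine("B", 2)])
print(order)   # ['A0', 'B0', 'A1', 'B1', 'A2']
```

The steps of A and B alternate, which looks simultaneous from the outside, but the execution order is strictly sequential.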
Yes, there is parallelism in each thread and you get it for free, no matter which programming language you use (although the amount of parallelism may vary).
It's called instruction-level parallelism. The details are quite complex and differ between different processor micro-architectures.
Computer Architecture: A Quantitative Approach is a brilliant book which includes a chapter on instruction-level parallelism and the book's examples teach how to think rationally about engineering.
Check out the following links for more information:
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/Instruction_pipelining
http://en.wikipedia.org/wiki/Out-of-order_execution
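As a rough illustration of how much instruction-level parallelism a piece of straight-line code offers, the sketch below compares the instruction count with the length of the longest dependency chain. It is an idealized model under assumptions: every instruction takes one cycle, the processor has unlimited execution units, and each register is written only once. The function name ilp and the tuple format are made up for this example.

```python
def ilp(program):
    """Return (instruction count, critical path length, ideal ILP).

    program: list of (dst, srcs) in program order, each dst written once.
    """
    depth = {}     # register -> length of the dependency chain producing it
    longest = 0
    for dst, srcs in program:
        # An instruction can start one step after its latest input is ready.
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dst] = d
        longest = max(longest, d)
    return len(program), longest, len(program) / longest


program = [
    ("a", ("x", "y")),   # a = x + y
    ("b", ("z", "w")),   # b = z * w   (independent of a)
    ("c", ("a", "b")),   # c = a + b   (needs both a and b)
    ("d", ("u", "v")),   # d = u - v   (independent of everything above)
]

print(ilp(program))   # (4, 2, 2.0)
```

Four instructions with a critical path of two steps mean that an ideal superscalar core could finish this snippet in two cycles instead of four, all within a single thread.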