When can threads of a warp get scheduled independently on Volta+?

Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the
same instruction for all active threads in a warp just as before, retaining the execution efficiency
of previous architectures.
From my understanding, this implies that if there is no divergence within the threads of a warp (i.e. all threads of the warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without any of the __syncwarp() calls.
This seems to contradict the statement quoted above.
I would appreciate it if someone could clarify this confusion.

From my understanding, this implies that if there is no divergence within the threads of a warp (i.e. all threads of the warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
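For illustration, here is a minimal sketch (my own, not taken from the whitepaper; the kernel name and data layout are made up) of what "active threads" means when a warp diverges:

__global__ void divergenceExample(int *data)
{
    if (threadIdx.x % 2 == 0)
        data[threadIdx.x] = 1;   // only the even lanes of each warp are active here
    else
        data[threadIdx.x] = 2;   // only the odd lanes of each warp are active here
    // On any given cycle the warp still executes a single instruction, but only
    // for the lanes that are active; the other lanes are masked off (divergent).
}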
For the remainder of your question, I guess instead of this:
I would appreciate it if someone could clarify this confusion.
You are asking:
Why is __syncwarp() needed?
Two reasons:
As stated near the top of that post:
Thread synchronization: synchronize threads in a warp and provide a memory fence (__syncwarp).
A memory fence is needed in this case to prevent the compiler from "optimizing" shared memory locations into registers; a sketch illustrating this is included after the second reason below.
The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution (for correctness), and you don't provide for it explicitly in your source code as a programmer, your code is broken. Regardless of the way it behaves or what results it produces.
The Volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
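To make the first reason concrete, here is a hedged sketch of how the quoted listing might sit inside a complete kernel. The kernel name, the 64-thread block size, and the initial reduction step are my assumptions, not part of the blog post; the point is that each __syncwarp() acts as a synchronization point and memory fence, so the shared-memory values are not kept in registers.

__global__ void warpReduce(const int *in, int *out)
{
    __shared__ int shmem[64];            // assumes blockDim.x == 64
    unsigned tid = threadIdx.x;

    shmem[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    if (tid < 32) {                      // final steps run within a single warp
        int v = shmem[tid] + shmem[tid + 32]; __syncwarp();
        shmem[tid] = v;       __syncwarp();
        v += shmem[tid + 16]; __syncwarp();
        shmem[tid] = v;       __syncwarp();
        v += shmem[tid + 8];  __syncwarp();
        shmem[tid] = v;       __syncwarp();
        v += shmem[tid + 4];  __syncwarp();
        shmem[tid] = v;       __syncwarp();
        v += shmem[tid + 2];  __syncwarp();
        shmem[tid] = v;       __syncwarp();
        v += shmem[tid + 1];
        if (tid == 0) out[blockIdx.x] = v;
    }
}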

Related

Very large instruction replay overhead for random memory access on Kepler

I am studying the performance of random memory access on a Kepler GPU, K40m. The kernel I use is pretty simple as follows,
__global__ void scatter(int *in1, int *out1, int *loc, const size_t n) {
    int globalSize = gridDim.x * blockDim.x;
    int globalId = blockDim.x * blockIdx.x + threadIdx.x;
    for (unsigned int i = globalId; i < n; i += globalSize) {
        int pos = loc[i];
        out1[pos] = in1[i];
    }
}
That is, I will read an array in1 as well as a location array loc. Then I permute in1 according to loc and output to the array out1. Generally, out1[loc[i]] = in1[i]. Note that the location array is sufficiently shuffled and each element is unique.
And I just use the default nvcc compilation settings with the -O3 flag enabled. The L1 dcache is disabled. I also fix the number of blocks to 8192 and the block size to 1024.
I use nvprof to profile my program. Most of the instructions in the kernel are memory accesses. For a warp's instruction, since each thread demands a discrete 4-byte item, the instruction should be replayed multiple times (at most 31 times?) and issue multiple memory transactions to fulfill the needs of all the threads within the warp. However, the metric "inst_replay_overhead" seems to be confusing: when the number of tuples n = 16M, the replay overhead is 13.97, which makes sense to me. But when n = 600M, the replay overhead becomes 34.68. Even for larger data, say 700M and 800M, the replay overhead reaches 85.38 and 126.87.
The meaning of "inst_replay_overhead", according to the documentation, is "Average number of replays for each instruction executed". Does that mean that when n = 800M, each executed instruction has on average been replayed 127 times? How come the replay count is much larger than 31 here? Am I misunderstanding something, or am I missing other factors that also contribute greatly to the replay count? Thanks a lot!
You may be misunderstanding the fundamental meaning of an instruction replay.
inst_replay_overhead includes the number of times an instruction was issued, but wasn't able to be completed. This can occur for various reasons, which are explained in this answer. Pertinent excerpt from the answer:
If the SM is not able to complete the issued instruction due to
a constant cache miss on an immediate constant (a constant referenced in the instruction),
address divergence in an indexed constant load,
address divergence in a global/local memory load or store,
a bank conflict in a shared memory load or store,
an address conflict in an atomic or reduction operation,
a load or store operation requiring data to be written to the load/store unit or read from a unit exceeding the read/write bus width (e.g. a 128-bit load or store), or
a load cache miss (a replay occurs to fetch the data when the data is ready in the cache),
then the SM scheduler has to issue the instruction multiple times. This is called an instruction replay.
I'm guessing this happens because of scattered reads in your case. This concept of instruction replay also exists on the CPU side of things. Wikipedia article here.

Empirically determining how many threads are in a warp

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct (which has some warp-synchronous implications), but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000

__device__ volatile int test2 = 0;
__device__ int test3 = 32767;

__global__ void kernel(){
  for (int i = 0; i < LOOPS; i++){
    unsigned long long time = clock64();
    // while (clock64() < (time + (threadIdx.x * 1000)));
    int start = test2;
    atomicAdd((int *)&test2, 1);
    int end = test2;
    int diff = end - start;
    atomicMin(&test3, diff);
    }
}

int main() {
  kernel<<<1, 1024>>>();
  int result;
  cudaMemcpyFromSymbol(&result, test3, sizeof(int));
  printf("result = %d threads\n", result);
  return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requres a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx, and write the values to memory. Group the data by the three variables to find the warp size. This is even easier if you limit the launch to a single block; then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as a unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch, warps on the same SM but on different warp schedulers cannot issue the first instruction of a warp on the same cycle
Given these assumptions, on CC 1.0 - 3.5 devices (this may change in the future) all threads in the block that read the same value of %clock belong to the same warp, so grouping threads by their %clock value gives the warp size.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with increasing thread count per launch. Determine the thread count that causes warps_launched to go from 1 to 2.
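As a rough illustration of SOLUTION 2, here is a minimal sketch (my own, not part of the original answer) that records clock() per thread and groups threads by the value read; under the assumptions listed above, the size of the first group is the warp size:

#include <stdio.h>

__global__ void clockKernel(unsigned int *ticks)
{
    ticks[threadIdx.x] = (unsigned int)clock();   // clock() reads the %clock register
}

int main()
{
    const int N = 1024;                           // one block, max threads per block assumed
    unsigned int *d_ticks, h_ticks[N];
    cudaMalloc((void **)&d_ticks, N * sizeof(unsigned int));
    clockKernel<<<1, N>>>(d_ticks);
    cudaMemcpy(h_ticks, d_ticks, N * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    // Threads that read the same clock value are assumed to be one warp.
    int warpSize = 1;
    while (warpSize < N && h_ticks[warpSize] == h_ticks[0]) warpSize++;
    printf("estimated warp size = %d\n", warpSize);

    cudaFree(d_ticks);
    return 0;
}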

How to count the number of executed threads over the whole CUDA kernel execution?

I want to count the number of thread executions gradually over the whole kernel execution. Is there a native counter for this, or is there any other method to do that? I know that keeping a global variable and incrementing it from each thread would not work, since a variable in global memory does not guarantee synchronized access by the threads.
There are numerous ways to measure thread-level execution efficiency. This answer provides a list of different collection mechanisms. Robert Crovella's answer provides a manual instrumentation method that allows for accurate collection of the information. A similar technique can be used to collect divergence information in the kernel.
Number of Threads Launched for Execution (static)
gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z
Number of Threads Launched
gridDim.x * gridDim.y * gridDim.z * ROUNDUP((blockDim.x * blockDim.y * blockDim.z), WARP_SIZE)
This number includes threads that are inactive for the lifetime of the warp.
This can be collected using the PM counter threads_launched.
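As a small illustration of the two formulas above, a host-side calculation might look like this (the launch configuration is hypothetical):

#include <stdio.h>

int main()
{
    const int WARP_SIZE = 32;
    dim3 grid(64, 64), block(20, 10);   // hypothetical launch configuration

    long long staticThreads =
        (long long)grid.x * grid.y * grid.z * block.x * block.y * block.z;

    long long threadsPerBlock = (long long)block.x * block.y * block.z;
    long long launchedThreads =
        (long long)grid.x * grid.y * grid.z *
        ((threadsPerBlock + WARP_SIZE - 1) / WARP_SIZE) * WARP_SIZE;

    printf("static threads   = %lld\n", staticThreads);   // 64*64*200  = 819200
    printf("launched threads = %lld\n", launchedThreads); // 200 rounds up to 224 per block
    return 0;
}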
Warp Instructions Executed
The counter inst_executed counts the number of warp instructions executed/retired.
Warp Instructions Issued
The counter inst_issued counts the number of instructions issued. inst_issued >= inst_executed. Some instructions will be issued multiple times per instruction executed in order to handle dispatch to narrow execution units or in order to handle address divergence in shared memory and L1 operations.
Thread Instructions Executed
The counter thread_inst_executed counts the number of thread instructions executed. The metric avg_threads_executed_per_instruction can be derived as thread_inst_executed / inst_executed. The maximum value of this metric is WARP_SIZE.
Not Predicated Off Threads Instructions Executed
Compute capability 2.0 and above devices use instruction predication to disable write-back for threads in a warp as a performance optimization for short sequences of divergent instructions.
The counter not_predicated_off_thread_inst_executed counts the number of thread instructions executed that were not predicated off. This counter is only available on compute capability 3.0 and above devices.
not_predicated_off_thread_inst_executed <= thread_inst_executed <= WARP_SIZE * inst_executed
This relationship will be off slightly on some chips due to small bugs in thread_inst_executed and not_predicated_off_thread_inst_executed counters.
Profilers
Nsight Visual Studio Edition 2.x support collecting the aforementioned counters.
Nsight VSE 3.0 supports a new Instruction Count experiment that can collect per SASS instruction statistics and show the data in table form or next to high level source, PTX, or SASS code. The information is rolled up from SASS to high level source. The quality of the roll up depends on the ability of the compiler to output high quality symbol information. It is recommended that you always look at both source and SASS at the same time. This experiment can collect the following per instruction statistics:
a. inst_executed
b. thread_inst_executed (or active mask)
c. not_predicated_off_thread_inst_executed (active predicate mask)
d. histogram of active_mask
e. histogram of predicate_mask
Visual Profiler 5.0 can accurately collect the aforementioned SM counters. nvprof can collect and show the per SM details. Visual Profiler 5.x does not support collection of per instruction statistics available in Nsight VSE 3.0. Older versions of the Visual Profiler and CUDA command line profiler can collect many of the aforementioned counters but the results may not be as accurate as the 5.0 and above version of the tools.
Maybe something like this:
#include <stdio.h>

__global__ void mykernel(int *current_thread_count, ...){
  atomicAdd(current_thread_count, 1);
  // the rest of your kernel code
}

int main() {
  int tally, *dev_tally;
  cudaMalloc((void **)&dev_tally, sizeof(int));
  tally = 0;
  cudaMemcpy(dev_tally, &tally, sizeof(int), cudaMemcpyHostToDevice);
  ....
  // set up block and grid dimensions, etc.
  dim3 grid(...);
  dim3 block(...);
  mykernel<<<grid, block>>>(dev_tally, ...);
  cudaMemcpy(&tally, dev_tally, sizeof(int), cudaMemcpyDeviceToHost);
  printf("total number of threads that executed was: %d\n", tally);
  ....
  return 0;
}
You can read more about atomic functions here
Part of the reason for the confusion expressed by many in the comments is that when mykernel is complete (assuming it ran successfully), everyone expects tally to end up with a value equal to grid.x*grid.y*grid.z*block.x*block.y*block.z.
I don't think there is a way to calculate the number of threads in a specific branch path. For example, for a histogram, it would be nice to have the following:
PS: Histogram is about counting the pixels for each color.
for (i = 0; i < 256; i++)        // 256 colors, 1 pixel = 1 thread
    if (threadIdx.x == i)
        Histogramme[i] = CUDA_NbActiveThreadsInBranch(); // threads having i as color
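CUDA_NbActiveThreadsInBranch does not exist, but a similar count can be obtained with one atomic counter per branch. A hedged sketch (the kernel name and data layout are illustrative, not an existing API):

__global__ void countThreadsPerBranch(const unsigned char *pixelColor, int n,
                                      unsigned int *branchCount)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Every thread that enters the "branch" for color c increments its
        // counter, so branchCount[c] holds the number of threads that took it.
        atomicAdd(&branchCount[pixelColor[idx]], 1u);
    }
}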

Is there a performance penalty for CUDA method not running in sync?

If I have a kernel which looks back over the last X minutes and calculates the average of all the values in a float[], would I experience a performance drop if all the threads are not executing the same line of code at the same time?
eg:
Say at x = 1500 there are 500 data points spanning the last 2-hour period.
At x = 1510 there are 300 data points spanning the last 2-hour period.
The thread at x = 1500 will have to look back 500 places, yet the thread at x = 1510 only looks back 300, so the later thread will move on to the next position before the first thread is finished.
Is this typically an issue?
EDIT: Example code. Sorry, but it's in C# as I was planning to use CUDAfy.net. Hopefully it provides a rough idea of the type of programming structures I need to run (the actual code is more complicated but has a similar structure). Any comments regarding whether this is suitable for a GPU / coprocessor or just a CPU would be appreciated.
public void PopulateMeanArray(float[] data)
{
    float lookFwdDistance = 108000000000f;
    float lookBkDistance = 12000000000f;
    int counter = thread.blockIdx.x * 1000; //Ensures unique position in data is written to (assuming i have less than 1000 entries).
    float numberOfTicksInLookBack = 0;
    float sum = 0; //Stores the sum of difference between two time ticks during x min look back.

    //Note:Time difference between each time tick is not consistent, therefore different value of numberOfTicksInLookBack at each position.
    //Thread 1 could be working here.
    for (float tickPosition = SDS.tick[thread.blockIdx.x]; SDS.tick[tickPosition] < SDS.tick[(tickPosition + lookFwdDistance)]; tickPosition++)
    {
        sum = 0;
        numberOfTicksInLookBack = 0;
        //Thread 2 could be working here. Is this warp divergence?
        for (float pastPosition = tickPosition - 1; SDS.tick[pastPosition] > (SDS.tick[tickPosition - lookBkDistance]); pastPosition--)
        {
            sum += SDS.tick[pastPosition] - SDS.tick[pastPosition + 1];
            numberOfTicksInLookBack++;
        }
        data[counter] = sum / numberOfTicksInLookBack;
        counter++;
    }
}
CUDA runs threads in groups called warps. On all CUDA architectures that have been implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code. Within a warp, threads are always in the same location. Any threads that should not be executing the code in a given location are disabled as that code is executed. The disabled threads are then just taking up room in the warp and cause their corresponding processing cycles to be lost.
In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.
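For illustration, here is a hedged CUDA sketch (my own, not a translation of the C# above) of the divergence pattern being described: each thread's inner loop runs a different number of iterations, so the warp keeps issuing the loop body until its longest-running lane finishes.

__global__ void variableLookback(const float *tick, const int *lookbackLen,
                                 float *mean, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int len = lookbackLen[i];          // differs per thread, so the warp diverges
    float sum = 0.0f;
    for (int k = 1; k <= len; ++k)     // lanes that finish early sit idle (masked off)
        sum += tick[i] - tick[i - k];  // assumes i >= len, i.e. enough history exists

    mean[i] = (len > 0) ? sum / len : 0.0f;
}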
In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.
Your algorithm is very simple and, as it stands, the algorithm does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way though.
A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm in such a way that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetches. These only work when you iterate forwards through your data.

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

How are threads organized to be executed by a GPU?
Hardware
If a GPU device has, for example, 4 multiprocessing units, and they can run 768 threads each: then at a given moment no more than 4*768 threads will be really running in parallel (if you planned more threads, they will be waiting their turn).
Software
threads are organized in blocks. A block is executed by a multiprocessing unit.
The threads of a block can be identified (indexed) using 1 dimension (x), 2 dimensions (x,y) or 3 dimensions (x,y,z), but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z; see the guide and your device's capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
Blocks may also be indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are being executed simultaneously).
Now a simple case: processing a 512x512 image
Suppose we want one thread to process one pixel (i,j).
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks
(so to have 512x512 threads = 4096*64)
It's common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth/threadsPerBlock.x, /* for instance 512/8 = 64*/
imageHeight/threadsPerBlock.y);
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
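Putting the pieces together, a minimal sketch of such a kernel and its launch might look like this (the kernel body, which simply inverts each pixel, is illustrative):

__global__ void myKernel(unsigned char *image, int width, int height)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;   // column
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;   // row

    if (i < (unsigned int)width && j < (unsigned int)height)
        image[j * width + i] = 255 - image[j * width + i];      // e.g. invert the pixel
}

// launch for the 512x512 example:
// dim3 threadsPerBlock(8, 8);
// dim3 numBlocks(512 / threadsPerBlock.x, 512 / threadsPerBlock.y);
// myKernel<<<numBlocks, threadsPerBlock>>>(d_image, 512, 512);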
Suppose a 9800GT GPU:
it has 14 multiprocessors (SM)
each SM has 8 thread-processors (AKA stream-processors, SP or cores)
allows up to 512 threads per block
warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)
https://www.tutorialspoint.com/cuda/cuda_threads.htm
A block cannot have more active threads than 512, therefore __syncthreads can only synchronize a limited number of threads. i.e. if you execute the following with 600 threads:
func1();
__syncthreads();
func2();
__syncthreads();
then the kernel must run twice and the order of execution will be:
func1 is executed for the first 512 threads
func2 is executed for the first 512 threads
func1 is executed for the remaining threads
func2 is executed for the remaining threads
Note:
The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.
I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.
Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.
I would greatly appreciate if someone test this again for a more accurate piece of information.
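As a minimal illustration of __syncthreads being a block-wide barrier (my own sketch, assuming a block of 256 threads): every thread in the block must reach the barrier before any thread proceeds past it, but threads in other blocks are unaffected.

__global__ void blockBarrierExample(const float *in, float *out)
{
    __shared__ float tile[256];                   // assumes blockDim.x == 256
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[gid];                  // stage data in shared memory
    __syncthreads();                              // wait for the whole block

    // Now it is safe to read a neighbour's element, but only within this block.
    out[gid] = tile[(threadIdx.x + 1) % blockDim.x];
}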