Best way to "warm up" the GPU with CUDA?

I'm aware of the thread "Is the warmup code necessary when measuring CUDA kernel running time?"
But even with the whole internet, this CUDA book, and its accompanying code samples available, I can't find an example of a warm-up kernel.
My question is: what is the best, cleanest way to "warm up" the GPU with CUDA before running timed experiments?

In the examples of the book "Professional CUDA C Programming" a simple kernel similar to this was defined to be executed on each visible GPU as a warmup:
__global__ void warm_up_gpu() {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;
    ib += ia + tid;
}
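As a minimal sketch of how such a kernel might be used, the program below launches it once on every visible device and then times a second kernel with CUDA events. The helper name warm_up_all_gpus, the placeholder kernel my_kernel, the problem size, and the launch configurations are illustrative assumptions and are not taken from the book.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warm-up kernel from the answer above; its result is deliberately discarded.
__global__ void warm_up_gpu() {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ia, ib;
    ia = ib = 0.0f;
    ib += ia + tid;
}

// Hypothetical stand-in for the kernel you actually want to time.
__global__ void my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i] + 1.0f;
}

// Launch the warm-up kernel once on every visible device and wait for it.
cudaError_t warm_up_all_gpus() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    for (int dev = 0; err == cudaSuccess && dev < count; ++dev) {
        err = cudaSetDevice(dev);
        if (err != cudaSuccess) break;
        warm_up_gpu<<<1, 32>>>();
        err = cudaDeviceSynchronize();   // returns only after the launch has completed
    }
    if (err == cudaSuccess && count > 0) err = cudaSetDevice(0);
    return err;
}

int main() {
    cudaError_t err = warm_up_all_gpus();
    if (err != cudaSuccess) {
        fprintf(stderr, "warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int n = 1 << 20;               // assumed problem size
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until the timed kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("my_kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

A trivial kernel like this mainly absorbs one-time setup costs such as context creation. Running the kernel of interest once untimed before the measured run is a common alternative when its own first-launch overhead matters.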

Related

Memory access in CUDA kernel functions (simple example)

I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not properly understand how threads access and change variables in such a simple example (the dot product of two vectors).
The kernel function is defined as follows:
__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    // set the cache values
    cache[cacheIndex] = temp;
I do not understand three things:
1. What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads of the first block run first, then the threads of the second block, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
2. Do all threads have their own copy of the "temp" variable or not? (If not, why is there no race condition?)
3. How does it work? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How is the summation carried out? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.

Although the code you provided is incomplete, here are some clarifications about what you are asking:
The kernel code is executed by every thread in every block. The way to "split the work" is to make each thread operate on only one or a few elements.
For instance, if you have to process 100 integers with a given algorithm, you probably want 100 threads, each processing one element.
In CUDA, the number of blocks and threads is specified at kernel launch on the host side:
myKernel<<<grid, threads>>>(...);
where grid and threads are dim3 values that define the size in up to three dimensions.
There is no specific order in the execution of threads and blocks, as you can read here:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6: "No specific order in which blocks are dispatched and executed".
Since temp is declared in the kernel as an ordinary local variable (without a qualifier such as __shared__), it is not shared between threads; each thread keeps its own value, normally in a register.
This is the same as on the CPU side. So yes, each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop using reads from the device arrays. Because tid advances by blockDim.x * gridDim.x, each thread accumulates only its own strided subset of the products, not the whole dot product; the per-thread values stored in cache still have to be summed.
Again, this is the same as on the CPU side.
I think you should probably check that you are comfortable enough with C/C++ programming on the CPU before going further into GPU programming. No offense meant, but you seem to be missing several fundamentals.
Since CUDA lets you drive your GPU with C code, the difficulty is not in the syntax but in the specifics of the hardware.
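To make the per-thread nature of temp concrete, here is a sketch of one way the truncated kernel could be completed. It is not the book's listing: the values of N, threadsPerBlock and blocksPerGrid, the shared-memory tree reduction, the final summation on the host, and the helper run_dot are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

const int N = 33 * 1024;           // assumed vector length
const int threadsPerBlock = 256;   // assumed; a power of two, as the reduction below requires
const int blocksPerGrid = 32;      // assumed; deliberately fewer threads than elements

__global__ void dot(const float *a, const float *b, float *c) {
    __shared__ float cache[threadsPerBlock];    // one array per block
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;                             // private to this thread
    while (tid < N) {                           // only this thread's strided elements
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;                   // publish the per-thread partial sum
    __syncthreads();                            // wait until the whole block has written

    // Tree reduction: halve the number of active threads at each step.
    for (int i = blockDim.x / 2; i > 0; i /= 2) {
        if (cacheIndex < i) cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }
    if (cacheIndex == 0) c[blockIdx.x] = cache[0];   // one partial sum per block
}

// Computes the dot product of two constant vectors on the GPU.
double run_dot() {
    std::vector<float> a(N, 1.0f), b(N, 2.0f), partial(blocksPerGrid);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_c, blocksPerGrid * sizeof(float));
    cudaMemcpy(d_a, a.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c);
    cudaMemcpy(partial.data(), d_c, blocksPerGrid * sizeof(float),
               cudaMemcpyDeviceToHost);

    double sum = 0.0;                           // final sum over blocks on the host
    for (float p : partial) sum += p;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return sum;
}

int main() {
    printf("dot = %.1f, expected %.1f\n", run_dot(), 2.0 * N);
    return 0;
}
```

With 32 x 256 = 8192 threads and 33792 elements, each thread's temp ends up holding the sum of only four or five products, which is why the reduction over cache and the final sum over blocks are still needed.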

How to properly coalesce writes from global memory into global memory?

Please bear with me; my English is not good.
My computing environment is:
CPU: 2 x Intel Xeon X5690, 3.46 GHz
OS: CentOS 5.8
GPU: NVIDIA GeForce GTX 580 (compute capability 2.0)
I have already read the material on "coalesced memory access" in the CUDA C Programming Guide,
but I can't apply it to my case.
I have 32x32 blocks per grid and 16x16 threads per block,
which corresponds to the following code:
dim3 grid(32, 32);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
So how can I get coalesced memory access?
I used the following code in the kernel:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i*512+j] = ...;
I used the constant 512 because there are 512x512 threads in total (grid size x block size).
However, the Visual Profiler reports "Low Global Memory Store Efficiency [9.7% avg, for kernels accounting for 100% of compute]".
Its help text suggests using coalesced memory access,
but I cannot work out how I should index the memory.
For the detailed code, see my other question, "The result of an experiment different from CUDA Occupancy Calculator".

Coalescing memory loads and stores in CUDA is a pretty straightforward concept - threads in the same warp need to load or store from/into suitably aligned, consecutive words in memory.
The warp size is 32 in CUDA, and warps are formed from threads within the same block, ordered so that the x dimension of threadIdx.{xyz} varies the fastest, the y the next fastest, and the z the slowest (functionally this is the same as column major ordering in arrays).
The code you have posted isn't achieving coalesced memory stores because threads within the same warp are storing with a pitch of 512 words, not within the required 32 consecutive words.
A simple hack to improve coalescing would be to address the memory in column major order, so:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i+512*j] = ...;
A more general approach on a 2D block and grid to achieve coalescing in the spirit of what you showed in the question would be like this:
int tid_in_block = threadIdx.x + threadIdx.y * blockDim.x;
int bid_in_grid = blockIdx.x + blockIdx.y * gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid = tid_in_block + threads_per_block * bid_in_grid;
global_memory[tid_in_grid] = ...;
The most appropriate solution will depend on other details of the code and data which you have not described.
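The difference between the three indexing schemes can be estimated with a small host-only model (no kernel is launched) that counts how many distinct 128-byte segments one warp touches. This is a rough sketch, not a profiler measurement: it assumes 4-byte elements, a base pointer aligned to 128 bytes, and 128-byte transaction granularity, and the names Scheme, element_index and segments_per_warp are made up for the illustration.

```cuda
#include <cstdio>
#include <set>

const int BLOCK_W = 16, BLOCK_H = 16;   // threads per block, from the question
const int GRID_W = 32;                  // blocks per grid row, from the question
const int WIDTH = 512;                  // row pitch in elements, from the question
const int WARP = 32;                    // threads per warp
const int SEGMENT_BYTES = 128;          // assumed transaction granularity
const int ELEM_BYTES = 4;               // assumes 4-byte elements such as float

enum Scheme { ROW_STRIDED, X_FASTEST, FLAT };

// Element index written by thread (tx, ty) of block (bx, by) under each scheme.
long element_index(Scheme s, int bx, int by, int tx, int ty) {
    int i = bx * BLOCK_W + tx;
    int j = by * BLOCK_H + ty;
    if (s == ROW_STRIDED) return (long)i * WIDTH + j;   // code from the question
    if (s == X_FASTEST)   return i + (long)WIDTH * j;   // the "simple hack"
    int tid_in_block = tx + ty * BLOCK_W;               // the general approach
    int bid_in_grid = bx + by * GRID_W;
    return tid_in_block + (long)(BLOCK_W * BLOCK_H) * bid_in_grid;
}

// Number of distinct 128-byte segments touched by one warp of the given block.
int segments_per_warp(Scheme s, int bx, int by, int warp) {
    std::set<long> segments;
    for (int lane = 0; lane < WARP; ++lane) {
        int linear = warp * WARP + lane;     // x varies fastest within a block
        int tx = linear % BLOCK_W;
        int ty = linear / BLOCK_W;
        long byte_offset = element_index(s, bx, by, tx, ty) * ELEM_BYTES;
        segments.insert(byte_offset / SEGMENT_BYTES);
    }
    return (int)segments.size();
}

int main() {
    printf("i*512+j     : %d segments per warp\n", segments_per_warp(ROW_STRIDED, 0, 0, 0));
    printf("i+512*j     : %d segments per warp\n", segments_per_warp(X_FASTEST, 0, 0, 0));
    printf("tid_in_grid : %d segments per warp\n", segments_per_warp(FLAT, 0, 0, 0));
    return 0;
}
```

Under these assumptions the model reports 16 segments per warp for the original indexing, 2 for the column-major variant (a 16x16 block puts two rows of 16 threads in each warp), and 1 for the flattened index.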

Why is global + shared faster than global alone

I need some help understanding the behavior of Rob Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I'm not understanding how the use of shared memory gives faster performance than the non-shared-memory version. That is, if I add a few more index calculation steps and another read/write cycle to access the shared memory, how can this be faster than using global memory alone? The same number of read/write cycles access global memory in either case. The data is still accessed only once per kernel instance. Data still goes in and out through global memory. The number of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster? (We are not removing any processing steps.) Essentially we are doing more work, and it is getting done faster.
Shared memory access is much faster than global memory access, but its cost is not zero (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;
    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    // Block until all threads in the block have written their data to shared mem
    __syncthreads();
    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is the article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market in 2009, so the author of the article was clearly writing about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
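To check on a given device whether the shared-memory version still pays off, both kernels can be timed side by side. The following is a sketch under stated assumptions: the kernels are renamed reverseArrayGlobal and reverseArrayShared so they can coexist in one file, and the array size, the block size, and the helper check_reverse are invented for the illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Version without shared memory (the "slow" code from the question).
__global__ void reverseArrayGlobal(int *d_out, const int *d_in) {
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

// Version staging through shared memory (the "fast" code from the question).
__global__ void reverseArrayShared(int *d_out, const int *d_in) {
    extern __shared__ int s_data[];
    int in = blockDim.x * blockIdx.x + threadIdx.x;
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    __syncthreads();
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

// Runs one version twice (the first run is an untimed warm-up), stores the
// time of the second run in *ms if ms is non-null, and returns true if the
// output is the reversed input.
bool check_reverse(bool useShared, float *ms) {
    const int threads = 256, blocks = 4096;   // assumed sizes
    const int n = threads * blocks;
    std::vector<int> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float elapsed = 0.0f;
    for (int run = 0; run < 2; ++run) {
        cudaEventRecord(start);
        if (useShared)   // third launch parameter: dynamic shared memory in bytes
            reverseArrayShared<<<blocks, threads, threads * sizeof(int)>>>(d_out, d_in);
        else
            reverseArrayGlobal<<<blocks, threads>>>(d_out, d_in);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&elapsed, start, stop);
    }
    if (ms) *ms = elapsed;

    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == h_in[n - 1 - i]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    for (int useShared = 0; useShared < 2; ++useShared) {
        float ms = 0.0f;
        bool ok = check_reverse(useShared != 0, &ms);
        printf("%s: %.3f ms, %s\n", useShared ? "shared" : "global", ms,
               ok ? "correct" : "WRONG");
    }
    return 0;
}
```

The timings depend entirely on the device, so no particular ratio should be expected; the point of the harness is only to compare the two versions on the hardware at hand.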

This is because shared memory is closer to the computing units, so latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication).
But most importantly, many of the numbers in a tile are reused by many threads. If you read them from global memory, you retrieve those numbers multiple times. Writing them once to shared memory eliminates that wasted bandwidth.

Looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both reads and writes forwards. I think the fast code is faster because the cache hierarchy is optimized, in some way, for accessing global memory in ascending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.

GPGPU - CUDA: global store efficiency

I am trying to figure out how well the global memory write accesses of one of my kernels are coalesced, based on the "global store efficiency" value of NVIDIA's profiler (I am using the CUDA 5 toolkit preview release on a Fermi GPU).
As far as I understand, this value is the ratio of requested memory transactions to the actual number of transactions performed, and therefore reflects whether accesses are all perfectly coalesced (100% efficiency) or not.
Now, for a thread block width of 32, and taking float values as input and output, the following test kernel gives 100% efficiency for both global load and global store, as expected:
__global__ void dummyKernel(float* output, float* input, size_t pitch)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    int offset = y*pitch+x;
    float tmp = input[offset];
    output[offset] = tmp;
}
What I don't understand is why, when I start adding useful code between the input read and the output write, the global store efficiency begins to drop, even though I have not changed the memory write pattern or the thread block geometry. The global load efficiency stays at 100%, as I expect.
Could someone please shed some light on why this happens? I thought that, since all 32 threads in a given warp execute the output store instruction simultaneously (by definition) and use a "coalescing-friendly" pattern, I should still get 100% whatever I do beforehand, but obviously I must be misunderstanding something about either the meaning of global store efficiency or the conditions for global store coalescing.
Thanks,
EDIT:
Here is an example: if I use this code (just adding a "round" operation on the input), global store efficiency drops from 100% to 95%:
__global__ void dummyKernel(float* output, float* input, size_t pitch)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    int offset = y*pitch+x;
    float tmp = round(input[offset]);
    output[offset] = tmp;
}

I'm not sure this is the case, but round probably converts its argument to a double, and if registers spill, each thread would access 8 bytes of memory, which would then be coerced into the 4 bytes of tmp. Accessing 8 bytes would reduce the coalescing to a half-warp.
However, I believe register spilling shouldn't happen, since the number of local variables in your kernel is small. You could check for spills with nvcc --ptxas-options=-v.

Ok, shame on me, I found the problem: I was profiling this simple test code in Debug mode, which gives completely wild numbers for most metrics. Re-profiling in Release mode gave me the expected result : 100% store efficiency in both cases.

CUDA - what is this loop doing

Hey
I've seen on a website this example kernel
__global__ void loop1( int N, float alpha, float* x, float* y ) {
    int i;
    int i0 = blockIdx.x*blockDim.x + threadIdx.x;
    for(i=i0;i<N;i+=blockDim.x*gridDim.x) {
        y[i] = alpha*x[i] + y[i];
    }
}
to compute this function in C:
for(i=0;i<N;i++) {
    y[i] = alpha*x[i] + y[i];
}
Surely the for loop inside the kernel isn't necessary, and you can just do y[i0] = alpha*x[i0] + y[i0] and remove the for loop altogether?
I'm just curious as to why it's there and what its purpose is. This is assuming a kernel call such as loop1<<<64,256>>> so presumably gridDim.x = 1

You need the for loop in the kernel if your vector has more entries than the number of threads you have started. If possible, it is of course more efficient to start enough threads.

Interesting kernel. The loop inside the kernel is necessary whenever N is greater than the total number of threads, which here is 16,384 (blockDim.x*gridDim.x), but I don't think it is good practice (the whole point of CUDA is to use the SIMT concept). According to the CUDA Programming Guide, on devices of this generation a grid can have at most 65535 thread blocks per dimension. Furthermore, starting with compute capability 2.x (Fermi), a block can have at most 1024 threads (512 before Fermi). You can also, where possible, split the work across multiple (sequential) kernels.

Much as we would like to believe that CUDA GPUs have infinite execution resources, they do not, and authors of highly optimized code are finding that unrolled for loops, often with fixed numbers of blocks, give the best performance. Makes for painful coding, but optimized CPU code is also pretty painful.
By the way, a commenter mentioned that this code would have coalescing problems, and I don't see why. If the base addresses are correctly aligned (64B, since those are floats), all of the memory transactions by this code will be coalesced, provided the number of threads per block is also divisible by 64.
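As a sketch of the loop doing real work, the program below launches the kernel from the question with fewer threads than elements and checks every result on the host. Note that for loop1<<<64,256>>> the launch sets gridDim.x to 64 and blockDim.x to 256, so the stride is 16384. The value of N, the test data, and the helper run_saxpy are assumptions made for the illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the question: each thread starts at its global index and then
// strides by the total number of launched threads until it runs past N.
__global__ void loop1(int N, float alpha, const float *x, float *y) {
    int i0 = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = i0; i < N; i += blockDim.x * gridDim.x) {
        y[i] = alpha * x[i] + y[i];
    }
}

// Runs y = alpha*x + y on the GPU and returns the number of wrong elements.
int run_saxpy(int N, int blocks, int threads) {
    const float alpha = 2.0f;
    std::vector<float> x(N), y(N, 1.0f);
    for (int i = 0; i < N; ++i) x[i] = (float)(i % 7);   // small exact values

    float *d_x, *d_y;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    cudaMemcpy(d_x, x.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    loop1<<<blocks, threads>>>(N, alpha, d_x, d_y);
    cudaMemcpy(y.data(), d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < N; ++i) {
        if (y[i] != alpha * x[i] + 1.0f) ++errors;
    }
    cudaFree(d_x);
    cudaFree(d_y);
    return errors;
}

int main() {
    // 64 blocks x 256 threads = 16384 threads; N = 100000 forces each thread
    // to process several elements.
    printf("mismatches: %d\n", run_saxpy(100000, 64, 256));
    return 0;
}
```

Without the loop, the same launch would update only the first 16384 elements; with it, the kernel is correct for any N regardless of the launch configuration.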
