Threads don’t stall on memory access
From the famous paper http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf by Vasily Volkov
Based on this statement, I am assuming that this:
__device__ int a;
int b, c, d;
a = b * c;
// Do some work that is independent of 'a'
// ...
d = a + 1;
is faster than this:
__device__ int a;
int b, c, d;
a = b * c;
d = a + 1;
// Do some work that is independent of 'a'
// ...
I am only assuming that because in the first version the thread has a chance to execute other, independent instructions while the write to global memory is in flight, whereas in the second approach it does not.
Is my assumption right?
And if my assumption is right, is it then good practice to set all variables that are going to be used later at the beginning of the kernel, given that they are independent of each other, and assuming that a is not cached?
The stall referenced is really about a memory read. The point being made is that a memory read by itself does not generate a stall; using the value that was read, before it is actually available, is what causes the stall.
Suppose I have:
__device__ int a[32];
Then this thread code does not cause a stall (although it generates a memory transaction):
int b = a[0];
But if I do this, I will get a stall:
int b = a[0];
int c = a[1];
int d = b * c; // stall occurs here
Therefore, if I can do this:
int b = a[0];
int c = a[1];
// do lots of other work here
int d = b * c; // this might not stall
For Fermi and Kepler GPUs, writes to global memory (and reads of values previously written, assuming they have not been evicted from the cache) are serviced by caches. Thread code that appears to be writing to global memory is therefore usually writing to the L1 or L2 cache, and the actual write transaction to global memory occurs later, not necessarily causing a stall of any kind.
So in your example, ordinarily a will be serviced by a cache:
__device__ int a;
int b, c, d;
a = b * c; // a gets written to cache
d = a + 1; // a is serviced from cache
Note that servicing an access from the cache is still slower than the fastest access mechanisms (e.g. registers and shared memory), but it's much, much faster than a true stall waiting on global memory.
Having said all this, the compiler will ordinarily do a number of things that may affect this. First of all, rather than you manually re-ordering your code, the compiler may spot independent work, and, to some degree, re-order your code for you. Secondly, in your example, the compiler will spot that a is re-used and most likely assign it to a register variable, in addition to updating the value in global memory at some point. The fact that it is in a register means using a in the last line of your example above will most likely get serviced out of the register, not global memory or the cache.
So to answer your questions, I would say that generally, your assumption will not be correct. The compiler will spot the re-use of a and assign it to a register, completely eliminating the hazard that you think exists. In theory, if there were no caches (true for compute 1.x devices) and no registers, then the compiler might be forced to use global memory as you suggest, but in practice it won't happen.
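To make this concrete, here is a minimal sketch (hypothetical wrapper kernel and out parameter added only so the snippet compiles and the result is not optimized away) of what the compiler is effectively free to do with your example:
__device__ int a;

__global__ void example(int b, int c, int *out)
{
    int a_reg = b * c;   // the product lives in a register
    a = a_reg;           // the global store can be scheduled later; no stall here
    // ... independent work ...
    int d = a_reg + 1;   // serviced from the register, not from global memory or the cache
    *out = d;            // kept so the compiler cannot discard the computation
}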
Related
I have a CUDA kernel of the following form:
__global__ void launch_kernel(const int *A, int *B, int *C /* ...other arguments... */)
{
    int i = threadIdx.x;

    // Load required data
    int temp1 = A[i];
    int temp2 = A[i+1];
    int temp3 = A[i+2];

    // compute step
    int output1 = temp1 + temp2 + temp3;
    int output2 = temp1 + temp3;

    // Store the result
    B[i] = output1;
    C[i] = output2;
}
As discussed in the CUDA manual, the consistency model for GPU global memory is not sequential. As a result, memory operations may appear to be performed in an order different from the original program order. To enforce memory ordering, CUDA offers the __threadfence() functions. However, as per the manual, such a function enforces relative ordering among reads and relative ordering among writes. Quoting a line from the manual:
All writes to shared and global memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to shared memory and global memory made by the calling thread after the call to __threadfence_block();
So it is clear that __threadfence() is insufficient to enforce ordering between reads and writes.
How do I enforce ordering across reads and writes to global memory? Alternatively, how do I make sure that all the reads are guaranteed to be complete before executing the compute and store sections of the above kernel?
As @RobertCrovella said in his comment, your code will work fine as it is.
temp1, temp2, and temp3 are local (they will use either registers or local memory, which is per-thread global memory). These aren't shared between threads, so there are no concurrency concerns whatsoever. They will work just like regular C/C++.
A, B, and C are global. These will be subject to synchronization concerns. A is used as read-only, so access order doesn't matter. B and C are written, but each thread only writes to its own index, so the order in which they are written doesn't matter. Your concern about guaranteeing that global memory reads are finished is unnecessary. Within a thread, your code will execute in the order written, with appropriate stalls for global memory access. You wouldn't want to do this for performance reasons, but you could write B[i] = 0; B[i] = 5; temp1 = B[i]; and temp1 is guaranteed to be 5.
You don't use shared memory in this example; however, it is local to a thread block, and you can synchronize within a thread block using __syncthreads().
Synchronization of global memory across different thread blocks requires ending one kernel and beginning another. NVIDIA claims they are working on a better way in one of their future-directions videos on YouTube.
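For illustration, here is a minimal sketch (hypothetical kernel and function names) of that pattern. Kernels launched into the same stream execute in order, so all global memory writes made by the first kernel are visible to every block of the second:
__global__ void stage1(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;           // writes to global memory
}

__global__ void stage2(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] * 2;  // guaranteed to see stage1's writes
}

void run(int *d_data, int *d_out, int n)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    stage1<<<blocks, threads>>>(d_data, n);
    stage2<<<blocks, threads>>>(d_data, d_out, n);  // same stream, so it runs after stage1
    cudaDeviceSynchronize();
}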
Because my pointers all point to non-overlapping memory, I went all out and made the pointers passed to my kernels (and their inlined functions) __restrict__, and made them const too, wherever possible. However, this increased the register usage of some kernels and decreased it for others. This doesn't make much sense to me.
Does anybody know why this can be the case?
Yes, it can increase register usage.
Referring to the programming guide for __restrict__:
The effects here are a reduced number of memory accesses and reduced number of computations. This is balanced by an increase in register pressure due to "cached" loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have negative performance impact on CUDA code, due to reduced occupancy.
const __restrict__ may be beneficial for at least 2 reasons:
On architectures that support it, it may enable the compiler to discover uses for the constant cache which may be a performance-enhancing feature.
As indicated in the above linked programming guide section, it may enable other optimizations to be made by the compiler (e.g. reducing instructions and memory accesses) which also may improve performance if the corresponding register pressure does not become an issue.
Reducing instructions and memory accesses leading to increased register pressure may be non-intuitive. Let's consider the example given in the above programming guide link:
void foo(const float* a, const float* b, float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}
If we allow for pointer aliasing in the above example, then the compiler can't make many optimizations, and the compiler is essentially reduced to performing the code exactly as written. The first line of code:
c[0] = a[0] * b[0];
will require 3 registers. The next line of code:
c[1] = a[0] * b[0];
will also require 3 registers, and because everything is being generated as-written, they can be the same 3 registers, reused. Similar register reuse can occur for the remainder of the example, resulting in low overall register usage/pressure.
But if we allow the compiler to re-order things, then we must have registers assigned for each value loaded up front, and reserved until that value is retired. This re-ordering can increase register usage/pressure, but may ultimately lead to faster code (or it may lead to slower code, if the register pressure becomes a performance limiter.)
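For reference, this is roughly the transformed version shown in that same programming guide section when all three pointers are __restrict__-qualified (paraphrased, so treat the exact temporaries as illustrative): the loads and the common sub-expression are hoisted into registers up front, which is exactly where the extra register pressure comes from:
void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];       // loads hoisted up front and held in registers
    float t1 = b[0];
    float t2 = t0 * t1;    // common sub-expression computed once
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[4] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[5] = t1;
    ...
}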
Suppose a simple kernel like this:
__global__ void fg(struct s_tp tp, struct s_param p)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;

    if(idx >= p.ntp) return;

    double3 r = tp.rh[idx];
    double d = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
    tp.d[idx] = d;
}
Is this true?
double3 r = tp.rh[idx];
The data are loaded from global memory into the r variable.
r is stored in registers, or in local memory if there are too many variables.
r is not stored in shared memory.
d is calculated and then written back to global memory.
Registers are faster than the other memory spaces.
If the register space is exhausted (in some big kernels), local memory is used, and the access is slower.
When I need doubles, is there any way to speed this up? For example, load the data into shared memory first and then operate on it?
Thanks to all.
Yes, it's pretty much all true.
When I need doubles, is there any way to speed this up? For example, load the data into shared memory first and then operate on it?
Using shared memory is useful when there is either data reuse (loading the same data item more than once, usually by more than one thread in a threadblock), or possibly when you are making a specialized use of shared memory to aid in global coalescing, such as during an optimized matrix transpose.
Data reuse means that you load the same data more than once, and for shared memory to be useful, it means the data is loaded more than once by more than one thread. If you are using it more than once in a single thread, then the single load plus the compiler's (automatic) "optimization" of storing it in a register is all you need.
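As a contrast to your kernel, where each element is read by exactly one thread, here is a minimal sketch of the reuse case where shared memory does pay off, using a hypothetical 3-point averaging kernel: each global element is staged once per block but consumed by up to three threads:
__global__ void avg3(const float *in, float *out, int n)
{
    extern __shared__ float tile[];   // blockDim.x + 2 floats, supplied at launch
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    // Each thread stages one element; out-of-range positions are zero-filled.
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // Each staged element is reused by up to three neighbouring threads.
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
Launched as avg3<<<blocks, threads, (threads + 2) * sizeof(float)>>>(d_in, d_out, n), each input element is read from global memory roughly once instead of up to three times.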
EDIT
The answer given by @Jez has some good ideas for optimal loading. Another suggestion is to convert your AoS data storage scheme to an SoA scheme. Data storage transformation is a common approach to improving the speed of CUDA codes.
Your s_tp struct, which you haven't shown, appears to have storage for several double quantities per item/struct. If you instead create separate arrays for each of these quantities, you'll have opportunities for optimal loading/storage. Something like this:
__global__ void fg(struct s_tp tp, double* s_tp_rx, double* s_tp_ry, double* s_tp_rz, double* s_tp_d, struct s_param p)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;

    if(idx >= p.ntp) return;

    double rx = s_tp_rx[idx];
    double ry = s_tp_ry[idx];
    double rz = s_tp_rz[idx];
    double d = sqrt(rx*rx + ry*ry + rz*rz);
    s_tp_d[idx] = d;
}
This approach is likely to have benefits elsewhere in your device code also, for similar types of usage patterns.
It's all true.
When I need doubles, is there any way to speed this up? For example, load the data into shared memory first and then operate on it?
For the example you gave, your implementation is possibly not optimal. The first thing you should do is compare the bandwidth achieved to that of a reference kernel, for example a cudaMemcpy. If the gap is large, and the speedup you'll gain from closing this gap is significant, optimisations may be possible.
Looking at your kernel there are two things that strike me as potentially suboptimal:
There's not much work per thread. If possible, processing multiple elements per thread can improve performance (see the grid-stride sketch after this list). This is, in part, because it avoids thread initialisation/removal overheads.
Loading from a double3 isn't as efficient as loading from other types. The best way to load data is usually using 128-bit loads per thread. Loading three consecutive 64-bit values will be slower, perhaps not by a lot, but slower all the same.
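To illustrate the first point, here is a minimal sketch of a grid-stride loop (a hypothetical simplification that takes a plain double3 pointer rather than your s_tp struct), which lets each thread process several elements and amortises launch and indexing overhead:
__global__ void fg_strided(const double3 *rh, double *d, unsigned int ntp)
{
    // Each thread walks through the array with a stride of the whole grid.
    for (unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < ntp;
         idx += blockDim.x * gridDim.x)
    {
        double3 r = rh[idx];
        d[idx] = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
    }
}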
EDIT: Robert Crovella's answer below gives a good solution to the second point, which requires changing your data type. For some reason I had originally thought this wasn't an option, so the solution below is probably over-the-top if you can just change your data type!
While adding more work per thread is a fairly simple thing to try, optimising your memory access pattern (without changing your data type) is less so. Fortunately there are libraries that can help. I think that using CUB, and in particular the BlockLoad collective, should allow you to load more efficiently. By loading, say, 6 double items per thread using a transpose operator, you can process two elements per thread, pack them into a double2, and store them normally.
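Here is a minimal sketch of that idea, assuming a reasonably recent CUB, a contiguous x,y,z layout passed as a plain double pointer rather than your s_tp struct, and, for brevity, a point count that is a multiple of 2 * BLOCK_THREADS (it also stores the two results individually rather than packing a double2):
#include <cub/cub.cuh>

template <int BLOCK_THREADS>
__global__ void fg_blockload(const double *rh,   // x,y,z triplets: 3 * ntp doubles
                             double *d)          // output distances: ntp doubles
{
    // Six consecutive doubles per thread = two xyz points per thread.
    typedef cub::BlockLoad<double, BLOCK_THREADS, 6, cub::BLOCK_LOAD_TRANSPOSE> BlockLoadT;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    const unsigned int block_point0 = blockIdx.x * BLOCK_THREADS * 2;  // first point handled by this block

    double v[6];
    BlockLoadT(temp_storage).Load(rh + block_point0 * 3, v);  // coalesced load, transposed to per-thread order

    const unsigned int p = block_point0 + threadIdx.x * 2;
    d[p]     = sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    d[p + 1] = sqrt(v[3]*v[3] + v[4]*v[4] + v[5]*v[5]);
}
A launch under those assumptions would look something like fg_blockload<256><<<ntp / (2 * 256), 256>>>(rh, d);.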
It is my understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices and this post in the nvidia Developer Zone) that __threadfence() guarantees that global writes will be visible to other threads before the thread continues. However, another thread could still read a stale value from its L1 cache even after the __threadfence() has returned.
That is:
Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned, and the writes are visible to all other threads, Thread B is asked to read from this memory location. It finds it has the data in L1, so loads that. Unfortunately for the developer, the data in Thread B's L1 is stale (i.e. it is as before Thread A updated this data).
First of all: is this correct?
Supposing it is, then it seems to me that __threadfence() is only useful if either one can be certain that data will not be in L1 (somewhat unlikely?) or if e.g. the read always bypasses L1 (e.g. volatile or atomics). Is this correct?
I ask because I have a relatively simple use-case - propagating data up a binary tree - using atomically-set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where no stale value is grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.
__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;

        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);

        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;

            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);
            nodes[index].data = data;

            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }

            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();

            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }
        tid += blockDim.x*gridDim.x;
    }
    return;
}
First of all: is this correct?
Yes, __threadfence() pushes data into L2 and out to global memory. It has no effect on the L1 caches in other SMs.
Is this correct?
Yes, if you combine __threadfence() with volatile for global memory accesses, you should have confidence that values will eventually be visible to other threadblocks. Note, however, that synchronization between threadblocks is not a well-defined concept in CUDA. There are no explicit mechanisms to do so and no guarantee of the order of threadblock execution, so just because you have code that has a __threadfence() somewhere operating on a volatile item, that still does not really guarantee what data another threadblock may pick up. That also depends on the order of execution.
If you use volatile, the L1 (if enabled -- current Kepler devices don't really have L1 enabled for general global access) should be bypassed. If you don't use volatile, then the L1 for the SM that is currently executing the __threadfence() operation should be consistent/coherent with L2 (and global) at the completion of the __threadfence() operation.
Note that the L2 cache is unified across the device and is therefore always "coherent". For your use case, at least from the device code perspective, there is no difference between L2 and global memory, regardless of which SM you are on.
And, as you indicate, (global) atomics always operate on L2/global memory.
I present here some code
#include <cstring>  // for memset

__constant__ int array[1024];

__global__ void kernel1(int *d_dst) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    d_dst[tId] = array[tId];
}

__global__ void kernel2(int *d_dst, int *d_src) {
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    d_dst[tId] = d_src[tId];
}

int main(int argc, char **argv) {
    int *d_array;
    int *d_src;
    cudaMalloc((void**)&d_array, sizeof(int) * 1024);
    cudaMalloc((void**)&d_src, sizeof(int) * 1024);

    int *test = new int[1024];
    memset(test, 0, sizeof(int) * 1024);
    for (int i = 0; i < 1024; i++) {
        test[i] = 100;
    }

    cudaMemcpyToSymbol(array, test, sizeof(int) * 1024);
    kernel1<<< 1, 1024 >>>(d_array);

    cudaMemcpy(d_src, test, sizeof(int) * 1024, cudaMemcpyHostToDevice);
    kernel2<<< 1, 32 >>>(d_array, d_src);

    delete [] test;
    cudaFree(d_array);
    cudaFree(d_src);

    return 0;
}
This simply shows constant memory and global memory usage. On execution, "kernel2" runs about 4 times faster (in terms of time) than "kernel1".
I understand from the CUDA C Programming Guide that this is because accesses to constant memory are serialized. This brings me to the idea that constant memory is best utilized if a warp accesses a single constant value such as an integer, float, double, etc., but accessing an array is not beneficial at all. In other words, a warp must access a single address in order to get any optimization/speedup benefit from constant memory access. Is this correct?
I also want to know: if I keep a structure instead of a simple type in my constant memory, is any access to the structure by a thread within a warp also considered a single memory access, or more? I mean, a structure might contain multiple simple types and arrays, for example; when accessing these simple types, are those accesses also serialized or not?
My last question would be: in case I do have an array with constant values which needs to be accessed by different threads within a warp, for faster access it should be kept in global memory instead of constant memory. Is that correct?
Can anyone refer me to some example code where efficient constant memory usage is shown?
a warp must access a single address in order to get any optimization/speedup benefit from constant memory access. Is this correct?
Yes, this is generally correct and is the principal intended usage of constant memory and the constant cache. The constant cache can serve up one quantity per SM "at a time". The precise wording is as follows:
The constant memory space resides in device memory and is cached in the constant cache.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
An important takeaway from the text above is the desire for uniform access across a warp to achieve best performance. If a warp makes a request to __constant__ memory where different threads in the warp are accessing different locations, those requests will be serialized. Therefore if each thread in a warp is accessing the same value:
int i = array[20];
then you will have the opportunity for good benefit from the constant cache/memory. If each thread in a warp is accessing a unique quantity:
int i = array[threadIdx.x];
then the accesses will be serialized, and the constant data usage will be disappointing, performance-wise.
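For example, a hypothetical variant of kernel1 from your code, in which the index into the __constant__ array is uniform across the warp, would get the broadcast behaviour and benefit from the constant cache:
__constant__ int array[1024];   // same declaration as in the question

__global__ void kernel1_uniform(int *d_dst, int which)
{
    int tId = threadIdx.x + blockIdx.x * blockDim.x;
    // 'which' is the same for every thread in the warp, so the constant
    // cache can broadcast array[which] in a single access.
    d_dst[tId] = array[which];
}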
I also want to know: if I keep a structure instead of a simple type in my constant memory, is any access to the structure by a thread within a warp also considered a single memory access, or more?
You can certainly put structures in constant memory. The same rules apply:
int i = constant_struct_ptr->array[20];
has the opportunity to benefit, but
int i = constant_struct_ptr->array[threadIdx.x];
does not. If you access the same simple type structure element across threads, that is ideal for constant cache usage.
My last question would be: in case I do have an array with constant values which needs to be accessed by different threads within a warp, for faster access it should be kept in global memory instead of constant memory. Is that correct?
Yes, if you know that in general your accesses will break the constant memory rule of one 32-bit quantity per cycle, then you'll probably be better off leaving the data in ordinary global memory.
There are a variety of CUDA sample codes that demonstrate usage of __constant__ data. Here are a few:
graphics volumeRender
imaging bilateralFilter
imaging convolutionTexture
finance MonteCarloGPU
and there are others.
EDIT: responding to a question in the comments, if we have a structure like this in constant memory:
__constant__ struct Simple { int a; int b; int c; } s;
And we access it like this:
int p = s.a + s.b + s.c;
          ^     ^     ^
          |     |     |
cycle:    1     2     3
We will have good usage of the constant memory/cache. When the C code gets compiled, under the hood it will generate machine code accesses corresponding to 1,2,3 in the diagram above. Let's imagine that access 1 occurs first. Since access 1 is to the same memory location independent of which thread in the warp, during cycle 1, all threads will receive the value in s.a and it will take advantage of the cache for best possible benefit. Likewise for accesses 2 and 3. If on the other hand we had:
__constant__ struct Simple { int a[32]; int b[32]; int c[32]; } s;
...
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int p = s.a[idx] + s.b[idx] + s.c[idx];
This would not give good usage of constant memory/cache. Instead, if this were typical of our accesses to s, we'd probably have better performance locating s in ordinary global memory.