Optimizing random access read and random access write in CUDA

I'd like to optimize the random access read, and random access write in the following code:
__global__ void kernel(float* input, float* output, float* table, size_t size)
{
    int x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size)
        return;

    float in_f = input[x_id];
    int in_i = (int)(floor(in_f));
    int table_index = (int)((in_f - float(in_i)) * 1024000.0f);
    float* t = table + table_index;
    output[table_index] = t[0] * in_f;
}
As you can see, the indices into the table and into the output are determined at run time and are completely random.
I understand that I can use texture memory or __ldg() for reading such data.
So, my questions are:
Is there a better way to read randomly indexed data than using texture memory or __ldg()?
What about random access writes, as in the case of output[table_index] above?
Actually, I'm adding the code here just to give an example of random access read and write. I do not need code optimization; I just need a high-level description of the best way to deal with such a situation.

There are no magic bullets for random access of data on the GPU.
The best advice is to attempt to perform data reorganization or some other method to regularize the data access. For repeated/heavy access patterns, even such intensive methods as sorting operations on the data may result in an overall net improvement in performance.
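For example, if the same set of random indices is reused across many kernel launches, one possible (illustrative, not prescriptive) approach is to sort the work items by their gather index once with Thrust, so that neighboring threads end up touching neighboring table entries; the function and variable names below are hypothetical:
// Hypothetical sketch: sort work items by their gather index so that adjacent
// threads read adjacent table entries, making the "random" reads more cache-friendly.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

void regularize_access(thrust::device_vector<int>& table_index,
                       thrust::device_vector<int>& original_pos)
{
    // remember each item's original position so the permutation can be undone later
    thrust::sequence(original_pos.begin(), original_pos.end());
    // after sorting, consecutive threads process consecutive (or nearby) table entries
    thrust::sort_by_key(table_index.begin(), table_index.end(), original_pos.begin());
}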
Since your question implies that the random access is unavoidable, the main thing you can do is intelligently make use of caches.
The L2 is a device-wide cache, and all DRAM accesses go through it. So thrashing of the L2 may be unavoidable if you have large-scale random accesses. There aren't any functions to disable (selectively or otherwise) the L2 for either read or write accesses (*).
For smaller scale cases, the main thing you can do is route the accesses through one of the "non-L1" caches, i.e. the texture cache (on all GPUs) and the Read-Only cache (i.e. __ldg()) on cc3.5 and higher GPUs. The use of these caches may help in 2 ways:
For some access patterns that would thrash the linear-organized L1, you may get some cache hits in the texture or read-only cache, due to a different caching strategy employed by those caches.
On devices that also have an L1 cache in use, routing the "random" traffic through an alternate cache will keep the L1 "unpolluted" and therefore less likely to thrash. In other words, the L1 may still provide caching benefit for other accesses, since it is not being thrashed by the random accesses.
Note that the compiler may route traffic through the read-only cache for you, without explicit use of __ldg(), if you decorate appropriate pointers with const __restrict__, as described here.
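For illustration, here is one way the question's kernel might be written so the table read goes through the read-only path; this is a sketch of the technique, not a claim that it will help for this particular access pattern:
// Sketch: const __restrict__ lets the compiler use LDG; __ldg() makes it explicit.
__global__ void kernel_ro(const float* __restrict__ input,
                          float* __restrict__ output,
                          const float* __restrict__ table,
                          size_t size)
{
    size_t x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size)
        return;
    float in_f = input[x_id];
    int in_i = (int)floorf(in_f);
    int table_index = (int)((in_f - (float)in_i) * 1024000.0f);
    float t = __ldg(&table[table_index]);   // explicit read-only cache load (cc3.5+)
    output[table_index] = t * in_f;
}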
You can also use cache control hints on loads and stores.
Similar to the above advice for protecting the L1, it may make sense on some devices, in some cases, to perform loads and stores in an "uncached" fashion. You can generally get the compiler to handle this for you through the use of the volatile keyword. You can keep both an ordinary and a volatile pointer to the same data, so that accesses that you can regularize can use the "ordinary" pointer, and "random" accesses can use the volatile version. Other mechanisms to pursue uncached access would be to use the ptxas compiler switches (e.g. -Xptxas -dlcm=cg) or else to manage the load/store operations via appropriate use of inline PTX with appropriate caching modifiers.
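As a rough sketch of the dual-pointer idea (the kernel shape and the hashed "random" index are made up for illustration):
// Sketch: keep an ordinary view for the regular accesses and a volatile view of
// another buffer for the random accesses, so the random traffic is not cached in L1.
__global__ void mixed_access(float* data, float* scratch, size_t size)
{
    volatile float* vscratch = scratch;      // "uncached" view for the random traffic
    size_t x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size)
        return;
    float regular = data[x_id];              // coalesced access through the normal path
    size_t random_idx = (x_id * 2654435761u) % size;   // stand-in for a data-dependent index
    vscratch[random_idx] = regular * 2.0f;   // scattered store via the volatile view
}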
The "uncached" advice is the main advice I can offer for "random" writes. Use of the surface mechanism might provide some benefit for some access patterns, but I think it is unlikely to make any improvement for random patterns.
(*) This has changed in recent versions of CUDA and for recent GPU families such as Ampere (cc 8.x). There is a new capability to reserve a portion of the L2 for data persistence. Also see here
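A hedged sketch of that newer mechanism, in case it is useful (CUDA 11+, cc 8.x; check the programming guide for the full semantics of the access policy window):
// Sketch: ask the driver to keep a frequently-reused region resident in L2.
// `table_bytes` and `persist_bytes` are placeholders chosen by the caller.
void configure_l2_persistence(cudaStream_t stream, float* table,
                              size_t table_bytes, size_t persist_bytes)
{
    // allow up to persist_bytes of the L2 to hold "persisting" lines
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, persist_bytes);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(table);
    attr.accessPolicyWindow.num_bytes = table_bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;   // fraction of accesses treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}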

Related

Trying to understand in CUDA why zero copy is faster if it also travels through PCIe?

It is said that zero copy should be used in situations where the “read and/or write exactly once” constraint is met. That's fine.
I have understood this, but my question is: why is zero copy fast in the first place? After all, whether we use an explicit transfer via cudaMemcpy or zero copy, in both cases the data has to travel through the PCI Express bus. Or is there some other path (i.e. does the copy happen directly into GPU registers, bypassing device RAM)?
Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.
However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes with pinning of the host memory. This has a noticeable time overhead (e.g. when compared to allocating the same amount of data using e.g. malloc or new). The primary overhead that comes to mind with cudaMemcpy is a per-transfer overhead of at least a few microseconds that is associated with the setup costs of using the underlying DMA engine that does the transfer.
Another difference is in accessibility to the data. pinned/zero-copy data is simultaneously accessible between host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with cudaMemcpyAsync for example.
Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.
When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose also that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know which location was updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s.
When you need to check status of a search or convergence algorithm. Suppose that we have an algorithm that consists of a loop that is calling a kernel in each loop iteration. The kernel is doing some kind of search or convergence type algorithm, and so we need a "stopping condition" test. This might be as simple as a boolean value, that we communicate back to the host from the kernel activity, to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates. Otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might be setting the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel does not ever set the flag to false. Therefore if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:
bool *d_continue;
cudaMalloc(&d_continue, sizeof(bool));
bool h_continue = true;
while (h_continue){
    h_continue = false;
    cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
    my_search_kernel<<<...>>>(..., d_continue);
    cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
}
The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:
bool *z_continue;
cudaHostAlloc(&z_continue, sizeof(bool), ...);
*z_continue = true;
while (*z_continue){
    *z_continue = false;
    my_search_kernel<<<...>>>(..., z_continue);
    cudaDeviceSynchronize();
}
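The first design pattern above (large table, only a few locations touched) could be set up along the same lines; the kernel name and its other arguments are placeholders:
// Sketch: map a large pinned host table into the device address space so the
// kernel pulls only the few cache lines it actually touches across PCIe.
float *h_table;                                        // host-visible pointer
cudaHostAlloc(&h_table, table_bytes, cudaHostAllocMapped);
// ... fill h_table on the host ...

float *d_table;                                        // device-visible alias of the same memory
cudaHostGetDevicePointer(&d_table, h_table, 0);

my_lookup_kernel<<<...>>>(d_table, ...);               // reads/writes only a few entries
cudaDeviceSynchronize();
// any locations the kernel modified are already visible in h_table; no cudaMemcpy needed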
For example, assume you wrote a CUDA-accelerated editor algorithm to fix spelling errors in books. If a 2MB text has only 5 bytes in error, only those 5 bytes need to be edited, so there is no need to copy the whole array from GPU VRAM to system RAM. Here, the zero-copy version would access only the page that owns the 5-byte word. Without zero-copy, the whole 2MB text would have to be copied, which takes more time than copying 5 bytes (or just the page that owns those bytes) and so reduces books/second throughput.
As another example, there could be a sparse path-tracing algorithm that adds shiny surfaces to a few small objects in a game scene. The result may only need to update 10-100 pixels instead of 1920x1080 pixels. Zero-copy would work better again.
Sparse matrix multiplication might also work better with zero-copy: if 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, zero-copy could still make a difference when writing the results.

data per block in CUDA -- does it broadcast in one transaction?

I have a data array that is per-block.
I have N blocks inside a CUDA grid and a constant array of data block_data[] with size N.
So, all threads in a given block 'X' access block_data[X] just one time, and do something with that value.
My question is: does this broadcast scheme work efficiently?
If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks. That would mean more than 64KB.
Regards
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one byte of info you need (sizeof(block_data)); if other warps in the block request the same data then they should get it from L1. The efficiency of the load is 1/128 but you should only pay that once for the block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__, which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform, it can optimise the access to use one of the read-only caches (e.g. the constant cache or, on compute capability >= 3.5, the read-only data cache, aka the texture cache).
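A minimal sketch of that alternative, assuming for illustration a float value per block and a trivial use of it:
// Sketch: one value per block, read once per thread through the read-only cache.
__global__ void per_block_kernel(const float* __restrict__ block_data,
                                 float* __restrict__ out, int n)
{
    // every thread in block X reads block_data[X]; the load is uniform within the
    // block, so after one miss the rest of the block is served from the cache
    float v = block_data[blockIdx.x];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = v * 2.0f;   // placeholder for "do something with that value"
}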
If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values of block_data[N], use __constant__ memory or rely on the cache. By using the L2 cache, you can get 1536KB of memory (Kepler).

Writing large unknown size array in Cuda?

I have a process in which I send data to CUDA to process, and it outputs data that matches a certain criterion. The problem is I often don't know the size of the outputted array. What can I do?
I send in several hundred lines of data and have it processed in over 20K different ways on CUDA. If the results match some rules I have, then I want to save the results. The problem is I cannot create a linked list in CUDA (let me know if I can), and memory on my card is small, so I was thinking of using zero copy to have CUDA write directly to the host's memory. This solves my memory size issue but still doesn't give me a way to deal with the unknown size.
My initial idea was to figure out the maximum possible number of results and malloc an array of that size. The problem is it would be huge and most of it would not be used (800 lines of data * 20K possible outcomes = 16 million items in an array... which is not likely).
Is there a better way to deal with variable size arrays in CUDA? I'm new to programming, so ideally it would be something not too complex (although if it is, I'm willing to learn it).
Heap memory allocation using malloc in kernel code is an expensive operation (it forces the CUDA driver to initialize the kernel with a custom heap size and to manage memory operations inside the kernel).
Generally, CUDA device memory allocation is the main bottleneck of program performance. The common practice is to allocate all needed memory at the beginning and reuse it as long as possible.
I think you can create a buffer that is big enough and use it instead of memory allocations. In the worst case you can wrap it to implement memory allocation from this buffer. In the simple case you can keep track of the last free cell in your array so you know where to write data next time, as sketched below.
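One common shape for that "big enough buffer" idea is to append matching results with an atomically incremented counter; a rough sketch (the matching rule and capacity handling are placeholders):
// Sketch: each thread that finds a match reserves a slot in a preallocated
// output buffer via atomicAdd; the final counter value tells the host how many
// results were produced.
__global__ void filter_kernel(const float* in, int n,
                              float* out, int* out_count, int capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float v = in[i];
    if (v > 0.5f) {                          // placeholder matching rule
        int slot = atomicAdd(out_count, 1);  // reserve a unique output slot
        if (slot < capacity)
            out[slot] = v;
    }
}
// After the kernel, copy *out_count back to the host to learn the actual result count.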
Yes, the bottleneck in CUDA (and GPGPU in general) is the transfer from host to device and back.
But inside kernels, always work with known sizes.
A kernel must not do malloc... it is very much at odds with the concept of the platform.
Even if you have a 'for' loop in a CUDA kernel, think twenty times about whether your approach is optimal; you must be implementing a really complex algorithm. Is it really necessary on a parallel platform?
You would not believe what problems can come up if you don't.
Use a buffered approach. You decide on some buffer size, which depends more on CUDA (read: hardware) requirements than on your array. You then call the kernel in a loop: upload a chunk, process it, and retrieve the results.
Eventually your array of data will be finished and the last buffer will not be full.
You can pass the size of each buffer as a single value (a pointer to an int, for example), which each thread compares against its thread id to determine whether it can fetch a value or would be out of bounds.
Only the last block will have divergence.
Here is a useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
You can do something like this in your kernel function, using shared memory:
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    .....
}
and when you call the kernel function on the host, pass the shared memory size, precisely n*sizeof(int), as the third launch configuration parameter:
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
Also, it is best practice to split a huge kernel function, if possible, into several kernel functions that have less code and are easier to execute.

float vs int in cuda

Is it better to use a float instead of an int in CUDA?
Does a float decrease bank conflicts and insure coalescence? (or has it nothing to do with this?)
Bank conflicts when reading shared memory are all about the amount of data read. So, since int and float are the same size (at least I think they are on all CUDA platforms), there's no difference.
Coalescence usually refers to global memory accesses - and again, this is to do with the number of bytes read, not the datatype.
Both int and float are four bytes, so it doesn't make any difference (if you're accessing them both the same way) which you use in terms of coalescing your global memory accesses or bank conflicts on shared memory accesses.
Having said that, you may have better performance with floats since the devices are designed to crunch them as fast as possible, ints are often used for control and indexes etc. and hence have lower performance. Of course it's really more complicated than that - if you had nothing but floats then the integer hardware would sit idle which would be a waste.
Bank conflicts and coalescence are all about memory access patterns (whether the threads within a warp all read/write to different locations with uniform stride). Thus, these concerns are independent of data type (float, int, double, etc.)
Note that data type does have an impact on the computation performance. Single precision float is faster than double precision etc. The beefy FPUs in the GPUs generally means that doing calculations in fixed point is unnecessary and may even be detrimental.
Take a look at the "Mathematical Functions" section of the CUDA Developers Guide. Using device runtime functions (intrinsic functions) may provide better performance for various types. You may perform multiple operations in a single instruction, in fewer clock cycles.
For some of the functions of Section C.1, a less accurate, but faster version exists in the device runtime component; it has the same name prefixed with __ (such as __sinf(x)). The compiler has an option (-use_fast_math) that forces each function in the table to compile to its intrinsic counterpart... selectively replace mathematical function calls by calls to intrinsic functions only where it is merited by the performance gains and where changed properties such as reduced accuracy and different special case handling can be tolerated.
For example, instead of x/y use __fdividef(x, y), and instead of sinf(x) use __sinf(x).
And you may find more cases like x + c*y being performed with one instruction (a fused multiply-add), as in the sketch below.
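For instance, a small sketch contrasting the standard and intrinsic forms (the intrinsics are faster but less accurate, per the caveats quoted above):
// Sketch: the same arithmetic written with standard functions and with intrinsics.
__global__ void intrinsics_demo(const float* x, const float* y, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float a = sinf(x[i]) + x[i] / y[i];                  // standard, more accurate
    float b = __sinf(x[i]) + __fdividef(x[i], y[i]);     // intrinsic, faster, less accurate
    out[i] = __fmaf_rn(0.5f, a, b);                      // 0.5f*a + b fused into one instruction
}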

How to have atomic load in CUDA

My question is how I can have atomic load in CUDA. Atomic exchange can emulate atomic store. Can atomic load be emulated non-expensively in a similar manner?
I can use an atomic add with 0 to load the content atomically but I think it is expensive because it does an atomic read-modify-write instead of only a read.
In addition to using volatile as recommended in the other answer, using __threadfence appropriately is also required to get an atomic load with safe memory ordering.
While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:
A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, int *flag = ...; while (*flag) { ... } may only read flag once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The volatile modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.
Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, and this can be unsafe when using an atomic for communication.
For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cachelines may not be flushed prior to setting the flag, etc etc. The result is that these operations are not guaranteed to be executed in any order, and other threads may not observe these events in the order you expect: The write to the flag is permitted to be happen before the guarded data is written.
Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.
Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. __threadfence() and friends are also covered in the CUDA docs.
Putting all of this together, writing your own atomic load method in CUDA looks something like:
// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
    const volatile unsigned int *vaddr = addr; // volatile to bypass cache
    __threadfence(); // for seq_cst loads. Remove for acquire semantics.
    const unsigned int value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
    volatile unsigned int *vaddr = addr; // volatile to bypass cache
    // fence to ensure that previous non-atomic stores are visible to other threads
    __threadfence();
    *vaddr = value;
}
This can be written similarly for other non-tearing load/store sizes.
From talking with some NVIDIA devs who work on CUDA atomics, it looks like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory ordering semantics -- but there is no way to access them currently without resorting to inline PTX. They're hoping to add them in sometime this year. Once those are in place, a full std::atomic implementation shouldn't be far behind.
To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.
There are two quasi-alternatives, with their advantages and drawbacks:
Use a no-op atomic read-modify-write as you suggest (a short sketch appears after this answer). I have provided a similar answer in the past. Guaranteed atomicity and memory consistency, but you pay the cost of a needless write.
In practice, the second closest thing to an atomic load could be marking a variable volatile, although strictly speaking the semantics are completely different. The language does not guarantee atomicity of the load (for example, you may in theory get a torn read), but you are guaranteed to get the most up-to-date value. But in practice, as indicated in the comments by @Robert Crovella, it is impossible to get a torn read for properly-aligned transactions of at most 32 bytes, which does make them atomic.
Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to 1. The ideal solution would be to add a way to express atomic loads directly in the language.
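For reference, option 1 above (the no-op read-modify-write) can be written as a tiny helper; this is only a sketch of the trade-off being described:
// Sketch of option 1: a no-op read-modify-write whose return value is the atomic load.
__device__ unsigned int atomicLoadRMW(unsigned int *addr)
{
    // adding 0 leaves the value unchanged but still performs an atomic
    // read-modify-write, so the returned value is read atomically
    return atomicAdd(addr, 0u);
}
// atomicOr(addr, 0u) or atomicCAS(addr, 0u, 0u) are equivalent tricks; all of
// them pay for a needless write, as noted above.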