Data per block in CUDA -- does it broadcast in one transaction?

I have a data array that is per-block.
I have N blocks inside a CUDA grid and a constant array of data block_data[] with size N.
All threads in a given block X access block_data[X] exactly once and do something with that value.
My question is: does this broadcast scheme work efficiently?
If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks, which would mean more than 64KB of data.
Regards

If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one piece of info you need (sizeof(block_data)), and if other warps in the block request the same data they should get it from L1. The efficiency of the load is 1/128, but you only pay that once per block.
If the operation is not cached in L1 (e.g. you pass -dlcm=cg to the assembler, i.e. -Xptxas -dlcm=cg) then you will fetch 32 bytes. The efficiency is 1/32, but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__, which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform, it can optimise the access to use one of the read-only caches (e.g. the constant cache or, on compute capability >=3.5, the read-only data cache aka texture cache).
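For illustration, a minimal sketch (my own, not from the question) of a kernel written so the compiler can route the per-block load through the read-only path; block_data, input and output are placeholder names:

__global__ void kernel(const float* __restrict__ block_data,
                       const float* __restrict__ input,
                       float* __restrict__ output)
{
    // One uniform load per block; the value is broadcast to all threads of a warp.
    float b = block_data[blockIdx.x];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread combines the per-block value with its own element.
    output[i] = input[i] * b;
}

Marking the other pointers __restrict__ as well helps the compiler prove there is no aliasing.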

If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values of block_data[N], use __constant__ memory or rely on the caches. With the L2 cache you get 1536KB of memory (Kepler).
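If the table did fit within the 64KB limit, a __constant__ memory version might look like the sketch below (N and the other names are placeholders of my own); the constant cache is built exactly for this kind of uniform, warp-wide broadcast:

#define N 4096                      // must keep the array within the 64KB constant limit

__constant__ float block_data[N];   // lives in constant memory

__global__ void kernel(float* out)
{
    // All threads of the block read the same element: served by the constant
    // cache and broadcast in a single access.
    float b = block_data[blockIdx.x];
    out[blockIdx.x * blockDim.x + threadIdx.x] = b;
}

// Host side: fill the constant array before launching, e.g.
// cudaMemcpyToSymbol(block_data, h_block_data, N * sizeof(float));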

Related

Trying to understand in CUDA why zero copy is faster if it also travels through PCIe?

It is said that zero copy should be used in situations where the "read and/or write exactly once" constraint is met. That's fine.
I have understood this, but my question is: why is zero copy fast in the first place? After all, whether we use an explicit transfer via cudaMemcpy or zero copy, in both cases the data has to travel through the PCI Express bus. Or does there exist another path (i.e. does the copy happen directly into GPU registers, bypassing device RAM)?
Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.
However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes with pinning of the host memory. This has a noticeable time overhead (compared to allocating the same amount of data using malloc or new, for example). The primary overhead that comes to mind with cudaMemcpy is a per-transfer overhead of at least a few microseconds that is associated with the setup costs of using the underlying DMA engine that does the transfer.
Another difference is in accessibility to the data. pinned/zero-copy data is simultaneously accessible between host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with cudaMemcpyAsync for example.
Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.
When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose, also, that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know which locations were updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s.
When you need to check status of a search or convergence algorithm. Suppose that we have an algorithm that consists of a loop that is calling a kernel in each loop iteration. The kernel is doing some kind of search or convergence type algorithm, and so we need a "stopping condition" test. This might be as simple as a boolean value, that we communicate back to the host from the kernel activity, to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates. Otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might be setting the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel does not ever set the flag to false. Therefore if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:
bool *d_continue;
cudaMalloc(&d_continue, sizeof(bool));
bool h_continue = true;
while (h_continue){
    h_continue = false;
    cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
    my_search_kernel<<<...>>>(..., d_continue);
    cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
}
The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:
bool *z_continue;
cudaHostAlloc(&z_continue, sizeof(bool), ...);
*z_continue = true;
while (*z_continue){
    *z_continue = false;
    my_search_kernel<<<...>>>(..., z_continue);
    cudaDeviceSynchronize();
}
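For completeness, a hedged sketch of how z_continue could be allocated for zero-copy use (the exact flags are elided above; cudaHostAllocMapped is the usual choice, and on systems without unified virtual addressing you would additionally call cudaHostGetDevicePointer to obtain the device-side alias):

bool *z_continue;
// Pinned, mapped allocation: the same byte is visible to both host and device.
cudaHostAlloc((void**)&z_continue, sizeof(bool), cudaHostAllocMapped);

// On non-UVA systems, get the device-side pointer to pass to the kernel:
// bool *d_alias;
// cudaHostGetDevicePointer((void**)&d_alias, z_continue, 0);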
For example, assume that you wrote a CUDA-accelerated editor algorithm to fix spelling errors in books. If a 2MB text has only 5 bytes of errors, only those 5 bytes need to be edited, so there is no need to copy the whole array from GPU VRAM to system RAM. Here, the zero-copy version would access only the page that owns the 5-byte word. Without zero-copy, the whole 2MB text would need to be copied. Copying 2MB takes more time than copying 5 bytes (or just the page that owns those bytes), so it would reduce books/second throughput.
Another example: there could be a sparse path-tracing algorithm that adds shiny surfaces for a few small objects of a game scene. The result may need to update just 10-100 pixels instead of 1920x1080 pixels. Zero copy would work better again.
Maybe sparse matrix multiplication would work better with zero-copy. If 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, then zero-copy could still make a difference when writing the results.

Cuda: async-copy vs coalesced global memory read atomicity

I was reading something about the memory model in CUDA. In particular, when copying data from global to shared memory, my understanding of shared_mem_data[i] = global_mem_data[i] is that it is done in a coalesced atomic fashion, i.e. each thread in the warp reads global_mem_data[i] in a single indivisible transaction. Is that correct?
tl;dr: No.
It is not guaranteed, AFAIK, that all values are read in a single transaction. In fact, a GPU's memory bus is not even guaranteed to be wide enough for a single transaction to retrieve a full warp's width of data (1024 bits for a full warp reading 4 bytes each). It is theoretically possible for some values in the read-from locations in memory to change while the read is underway.
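A minimal sketch of the pattern the question describes (names are placeholders of my own); the only guarantee you get is the barrier after the copy, not any atomicity of the copy itself:

__global__ void kernel(const int* global_mem_data, int n)
{
    extern __shared__ int shared_mem_data[];
    int i = threadIdx.x;
    if (i < n)
        shared_mem_data[i] = global_mem_data[i];  // may be split into several transactions
    __syncthreads();                              // only after this are all elements visible
    // ... use shared_mem_data[] ...
}

// launch with the dynamic shared memory size:
// kernel<<<1, n, n * sizeof(int)>>>(d_data, n);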

Optimizing random access read and random access write in CUDA

I'd like to optimize the random access read, and random access write in the following code:
__global__ void kernel(float* input, float* output, float* table, size_t size)
{
    int x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size)   // guard against out-of-range threads
        return;

    float in_f = input[x_id];
    int in_i = (int)(floor(in_f));
    int table_index = (int)((in_f - float(in_i)) * 1024000.0f);

    float* t = table + table_index;       // random read location
    output[table_index] = t[0] * in_f;    // random write location
}
As you can see, the index to the table and to the output are determined at run-time, and completely random.
I understand that I can use texture memory or __ldg() for reading such data.
So, my questions are:
Is there a better way to read randomly indexed data than using texture memory or __ldg()?
What about the random access write as the case of output[table_index] above?
Actually, I'm adding the code here to give an example of random access read and write. I do not need code optimization, I just need a high level description of the best way to deal with such situation.
There are no magic bullets for random access of data on the GPU.
The best advice is to attempt to perform data reorganization or some other method to regularize the data access. For repeated/heavy access patterns, even such intensive methods as sorting operations on the data may result in an overall net improvement in performance.
Since your question implies that the random access is unavoidable, the main thing you can do is intelligently make use of caches.
The L2 is a device-wide cache, and all DRAM accesses go through it. So thrashing of the L2 may be unavoidable if you have large-scale random accesses. There aren't any functions to disable (selectively or otherwise) the L2 for either read or write accesses (*).
For smaller scale cases, the main thing you can do is route the accesses through one of the "non-L1" caches, i.e. the texture cache (on all GPUs) and the Read-Only cache (i.e. __ldg()) on cc3.5 and higher GPUs. The use of these caches may help in 2 ways:
For some access patterns that would thrash the linear-organized L1, you may get some cache hits in the texture or read-only cache, due to a different caching strategy employed by those caches.
On devices that also have an L1 cache in use, routing the "random" traffic through an alternate cache will keep the L1 "unpolluted" and therefore less likely to thrash. In other words, the L1 may still provide caching benefit for other accesses, since it is not being thrashed by the random accesses.
Note that the compiler may route traffic through the read-only cache for you, without explicit use of __ldg(), if you decorate appropriate pointers with const __restrict__ as described here
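As an illustration (my own sketch, not from the question), the table read in the question's kernel could be routed through the read-only cache either by decorating the pointer or by using __ldg() explicitly:

__global__ void kernel(const float* __restrict__ input, float* output,
                       const float* __restrict__ table, size_t size)
{
    size_t x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size) return;

    float in_f = input[x_id];
    int   in_i = (int)floorf(in_f);
    int   table_index = (int)((in_f - (float)in_i) * 1024000.0f);

    // Read-only-cache load of a randomly indexed table entry (cc 3.5+).
    // With const __restrict__ the compiler may generate the same load on its own.
    float t = __ldg(&table[table_index]);
    output[table_index] = t * in_f;
}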
You can also use cache control hints on loads and stores.
Similar to the above advice for protecting the L1, it may make sense on some devices, in some cases, to perform loads and stores in an "uncached" fashion. You can generally get the compiler to handle this for you through the use of the volatile keyword. You can keep both an ordinary and a volatile pointer to the same data, so that accesses you can regularize use the "ordinary" pointer while "random" accesses use the volatile version. Other mechanisms to pursue uncached access would be to use the ptxas compiler switches (e.g. -Xptxas -dlcm=cg) or else to manage the load/store operations via appropriate use of inline PTX with appropriate caching modifiers.
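A sketch of the dual-pointer idea under those assumptions (names are placeholders of my own); the volatile accesses are handled in the "uncached" style described above, while the regular pointer keeps the ordinary cached path:

__global__ void kernel(float* data, const int* rand_idx, int n)
{
    volatile float* vdata = data;        // same buffer, volatile ("uncached") accesses
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = data[i];                   // regular, cacheable load
    vdata[rand_idx[i]] = x;              // "random" store kept off the ordinary cached path
}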
The "uncached" advice is the main advice I can offer for "random" writes. Use of the surface mechanism might provide some benefit for some access patterns, but I think it is unlikely to make any improvement for random patterns.
(*) This has changed in recent versions of CUDA and for recent GPU families such as Ampere (cc 8.x). There is a new capability to reserve a portion of the L2 for data persistence. Also see here
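For reference, a sketch of that newer L2 persistence control (CUDA 11+, compute capability 8.x); the window sizes and names here are placeholders, not values from the question:

// Sketch: reserve part of the L2 for a persisting data window (Ampere, CUDA 11+).
void configure_persisting_l2(cudaStream_t stream, void* d_table, size_t table_bytes)
{
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, 4 * 1024 * 1024);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_table;      // region to keep resident in L2
    attr.accessPolicyWindow.num_bytes = table_bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;         // fraction of accesses treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}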

CUDA memory for lookup tables

I'm designing a set of mathematical functions and implementing them in both CPU and GPU (with CUDA) versions.
Some of these functions are based upon lookup tables. Most of the tables take 4KB, some of them a bit more. The functions based upon lookup tables take an input, pick one or two entry of the lookup table and then compute the result by interpolating or applying similar techniques.
My question is now: where should I save these lookup tables? A CUDA device has many places for storing values (global memory, constant memory, texture memory,...). Provided that every table could be read concurrently by many threads and that the input values, and therefore the lookup indices, can be completely uncorrelated among the threads of every warp (resulting in uncorrelated memory accesses), which memory provides the fastest access?
I add that the contents of these tables are precomputed and completely constant.
EDIT
Just to clarify: I need to store about 10 different 4KB lookup tables. Anyway it would be great to know wether the solution as for this case would be the same for the case with e.g. 100 4KB tables or with e.g. 10 16KB lookup tables.
Texture memory (now called read only data cache) would probably be a choice worth exploring, although not for the interpolation benefits. It supports 32 bit reads without reading beyond this amount. However, you're limited to 48K in total. For Kepler (compute 3.x) this is quite simple to program now.
Global memory, unless you configure it in 32 bit mode, will often drag in 128 bytes for each thread, hugely multiplying the amount of data actually needed from memory, since you (apparently) can't coalesce the memory accesses. Thus the 32 bit mode is probably what you need if you want to use more than 48K (you mentioned 40K).
Thinking of coalescing, if you were to access a set of values in series from these tables, you might be able to interleave the tables such that these combinations could be grouped and read as a 64 or 128 bit read per thread. This would mean the 128 byte reads from global memory could be useful.
The problem you will have is that you're making the solution memory bandwidth limited by using lookup tables. Changing the L1 cache size (on Fermi / compute 2.x) to 48K will likely make a significant difference, especially if you're not using the other 32K of shared memory. Try texture memory and then global memory in 32 bit mode and see which works best for your algorithm. Finally pick a card with a good memory bandwidth figure if you have a choice over hardware.
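As one concrete option (a sketch of my own, not part of the answer), a small float table can be bound to a texture object and fetched with tex1Dfetch, which routes the uncorrelated reads through the texture cache:

// Build a texture object over a linear device buffer holding the lookup table.
cudaTextureObject_t make_table_tex(float* d_table, size_t n_entries)
{
    cudaResourceDesc res = {};
    res.resType                = cudaResourceTypeLinear;
    res.res.linear.devPtr      = d_table;
    res.res.linear.desc        = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n_entries * sizeof(float);

    cudaTextureDesc tex = {};
    tex.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex_obj = 0;
    cudaCreateTextureObject(&tex_obj, &res, &tex, nullptr);
    return tex_obj;
}

__global__ void lookup(cudaTextureObject_t table, const int* idx, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(table, idx[i]);  // served by the texture / read-only cache
}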

Writing large unknown size array in Cuda?

I have a process in which I send data to CUDA to process, and it outputs data that matches certain criteria. The problem is that I often don't know the size of the outputted array. What can I do?
I send in several hundred lines of data and have it processed in over 20K different ways on CUDA. If the results match some rules I have, then I want to save them. The problem is that I cannot create a linked list in CUDA (let me know if I can), and memory on my card is small, so I was thinking of using zero copy to have CUDA write directly to the host's memory. This solves my memory size issue but still doesn't give me a way to deal with the unknown size.
My initial idea was to figure out the maximum possible number of results and malloc an array of that size. The problem is that it would be huge and most of it would not be used (800 lines of data * 20K possible outcomes = 16 million items in an array... which is not likely).
Is there a better way to deal with variable size arrays in CUDA? I'm new to programming, so ideally it would be something not too complex (although if it is, I'm willing to learn it).
Heap memory allocation using malloc in kernel code is an expensive operation (it forces the CUDA driver to initialize the kernel with a custom heap size and to manage memory operations inside the kernel).
Generally, CUDA device memory allocation is a main bottleneck of program performance. The common practice is to allocate all needed memory at the beginning and reuse it for as long as possible.
I think you can create a buffer that is big enough and use it instead of memory allocations. In the worst case you can wrap it to implement memory allocation from this buffer. In the simplest case you can keep track of the last free cell in your array so the next piece of data is written there (see the sketch below).
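A hedged sketch of that "last free cell" idea: preallocate a worst-case (or simply large) buffer plus a single counter, and let each thread reserve its slot with atomicAdd (the atomic counter is my addition; the answer only describes the buffer):

__global__ void filter(const float* in, int n, float* out, int* out_count, int capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i];
    if (v > 0.0f) {                            // stand-in for "matches my rules"
        int slot = atomicAdd(out_count, 1);    // reserve the next free cell
        if (slot < capacity)
            out[slot] = v;                     // compacted result; size is known afterwards
    }
}

// Host: zero *out_count before the launch, copy it back afterwards to learn
// how many results were actually written.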
Yes, the bottleneck of CUDA and all GPGPU work is the transfer from host to device and back.
But in kernels, always work with known, fixed sizes.
A kernel must not do malloc... it is very strange from the point of view of the platform's concept.
Even if you have a 'for' loop in a CUDA kernel, think twenty times about whether your approach is optimal: you must be implementing a really complex algorithm. Is it really necessary on a parallel platform?
You would not believe what problems can arise if you don't.
Use a buffered approach. You determine some buffer size, dictated more by CUDA requirements (read: hardware) than by your array. You call the kernel in a loop and upload, process and retrieve the data there (see the sketch below).
Eventually your array of data will be finished and the last buffer will not be full.
You can pass the valid size of each buffer as a single value (a pointer to an int, for example) that each thread compares to its thread id, to determine whether it can fetch a value or would be out of bounds.
Only the last block will have divergence.
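A sketch of that buffered loop on the host side (the process kernel, buffer names and sizes are hypothetical placeholders of my own):

__global__ void process(float* data, int count);   // hypothetical kernel

void run_buffered(const float* h_data, float* h_result, size_t total,
                  float* d_buf /* device buffer of at least BUF floats */)
{
    const size_t BUF = 1 << 20;                        // sized for the hardware, not for the data
    for (size_t off = 0; off < total; off += BUF) {
        size_t remaining = total - off;
        int count = (int)(remaining < BUF ? remaining : BUF);   // last chunk may be partial

        cudaMemcpy(d_buf, h_data + off, count * sizeof(float), cudaMemcpyHostToDevice);
        // Each thread compares its global id against 'count' to avoid out-of-bounds work.
        process<<<(count + 255) / 256, 256>>>(d_buf, count);
        cudaMemcpy(h_result + off, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    }
}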
Here is a useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
You can do something like this in your kernel function, using dynamic shared memory:
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    .....
}
and when you call the kernel function on the host, pass the shared memory size as the third launch configuration parameter, precisely n*sizeof(int):
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
Also, it is a best practice to split a huge kernel function, if possible, into several smaller kernel functions that contain less code and are easier to execute.