Writing large unknown size array in Cuda? - cuda

I have a process which I send data to Cuda to process and it outputs data that matches a certain criteria. The problem is I often don't know the size out of outputted array. What can I do?
I send in several hundred lines of data and have it processed in over 20K different ways on Cuda. If the results match some rules I have then I want to save the results. The problem is I cannot create a linked list in Cuda(let me know if I can) and memory on my card is small so I was thinking of using zero copy to have Cuda write directly to the hosts memory. This solves my memory size issue but still doesn't give me a way to deal with unknown.
My intial idea was to figure out the max possible results and malloc a array of that size. The problem is it would be huge and most would not be used(800 lines of data * 20K possible outcomes = 16 Million items in a array..which is not likely).
Is there a better way to deal with variable size arrays in Cuda? I'm new to programming so ideally it would be something not too complex(although if it is I'm willing to learn it).

Heap memory allocation using malloc in kernel code is expensive operation (it forces CUDA driver initialize kernel with custom heap size and manage memory operations inside the kernel).
Generally, CUDA device memory allocation is the main bottleneck of program performance. The common practice is to allocate all needed memory at the beginning and reuse it as long as possible.
I think that you can create such buffer that is big enough and use it instead of memory allocations. In worst case you can wrap it to implement memory allocation from this buffer. In simple simple case you can keep last free cell in your array to write data into it next time.

Yes, the CUDA and all GPGPU stuff bottleneck is transfer from host to device and back.
But in kernels, use always everything known size.
Kernel must not do malloc... it is very very weird from the concept of the platform.
Even if you have 'for' - loop in CUDA kernel, think 20 times about is your approach optimal, you must be doing realy complex algorithm. Is it really necessary on the parallel platform ?
You would not believe what problems could come if you don't )))
Use buffered approach. You determine some buffer size, what is more dependent of CUDA requirements( read -> hardware), then of your array. You call a kernel in the loop and upload, process and retrieve data from there.
Ones, your array of data will be finished and last buffer will be not full.
You can pass the size of each buffer as single value (pointer to an int for example), what each thread will compare to its thread id, to determine do it if it is possible to get some value or it would be out of bounds.
Only the last block will have divergence.

Here is an useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
You can do in your kernel function something like this, using shared memory:
__global__ void dynamicReverse(int *d, int n)
{
extern __shared__ int s[];
.....
}
and when you call the kernel function on host, having third parameter the shared memory size, precisely n*sizeof(int):
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
Also, it's a best practice to split a huge kernel function, if possible, in more kernel functions, having less code and are easier to execute.

Related

Trying to understand in CUDA why is zero copy faster if it also travel through PCIe?

It is said that zero copy should be used in situations where “read and/or write exactly once” constraint is met. That's fine.
I have understood this, but my question is why is zero copy fast in first place ? After all whether we use explicit transfer via cudamemcpy or zero copy , in both case data has to travel through pci express bus. Or there exist any other path ( i.e copy happen's directly in GPU register by passing device RAM ?
Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.
However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes with pinning of the host memory. This has a noticeable time overhead (e.g. when compared to allocating the same amount of data using e.g. malloc or new). The primary overhead that comes to mind with cudaMemcpy is a per-transfer overhead of at least a few microseconds that is associated with the setup costs of using the underlying DMA engine that does the transfer.
Another difference is in accessibility to the data. pinned/zero-copy data is simultaneously accessible between host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with cudaMemcpyAsync for example.
Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.
When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose, also that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a-priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know what location was updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s.
When you need to check status of a search or convergence algorithm. Suppose that we have an algorithm that consists of a loop that is calling a kernel in each loop iteration. The kernel is doing some kind of search or convergence type algorithm, and so we need a "stopping condition" test. This might be as simple as a boolean value, that we communicate back to the host from the kernel activity, to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates. Otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might be setting the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel does not ever set the flag to false. Therefore if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:
bool *d_continue;
cudaMalloc(&d_continue, sizeof(bool));
bool h_continue = true;
while (h_continue){
h_continue = false;
cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
my_search_kernel<<<...>>>(..., d_continue);
cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
}
The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:
bool *z_continue;
cudaHostAlloc(&z_continue, sizeof(bool), ...);
*z_continue = true;
while (*z_continue){
*z_continue = false;
my_search_kernel<<<...>>>(..., z_continue);
cudaDeviceSynchronize();
}
For example, assume that you wrote a cuda-accelerated editor algorithm to fix spelling errors for books. If a 2MB text data has only 5 bytes of error, it would need to edit only 5 bytes of it. So it doesn't need to copy whole array from GPU VRAM to system RAM. Here, zero-copy version would access only the page that owns the 5 byte word. Without zero-copy, it would need to copy whole 2MB text. Copying 2MB would take more time than copying 5 bytes (or just the page that owns those bytes) so it would reduce books/second throughput.
Another example, there could be a sparse path-tracing algorithm to add shiny surfaces for few small objects of a game scene. Result may need just to update 10-100 pixels instead of 1920x1080 pixels. Zero copy would work better again.
Maybe sparse-matrix-multiplication would work better with zero-copy. If 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, then zero-copy could still make difference when writing results.

CUDA How Does Kernel Fusion Improve Performance on Memory Bound Applications on the GPU?

I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is the fact that the PCIe bus is generally limited around 8GB/s, and kernel fusion can help reuse data that can be reused and that it can exploit shared memory and locality within the GPU. Most research papers I have found are very difficult to understand and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they ALL FAIL TO EXPLAIN some simple steps to fuse two kernels together.
My question is how does fusion actually work?. What are the steps one would go through to change a normal kernel to a fused kernel? Also, is it necessary to have more than one kernel in order to fuse it, as fusing is just a fancy term for eliminating some memory bound issues, and exploiting locality and shared memory.
I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (The code is not correct but should give an idea) like:
int *device_A;
int *device_B;
int *device_C;
cudaMalloc(device_A,sizeof(int)*N);
cudaMemcpyAsync(device_A,host_A, N*sizeof(int),HostToDevice,stream);
KernelAdd<<<block,thread,stream>>>(device_A,device_B); //put result in C
KernelSubtract<<<block,thread,stream>>>(device_C);
cudaMemcpyAsync(host_C,device_C, N*sizeof(int),DeviceToHost,stream); //send final result through the PCIe to the CPU
The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:
by reusing the data that a kernel may have populated either in registers or shared memory
by reducing (i.e. eliminating) "redundant" loads and stores
Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:
Load my element of vector A from global memory
Add a constant to, or multiply by a constant, my vector element
Store my element back out to vector A (in global memory)
This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:
Add kernel:
Load my element of vector A from global memory
Add a value to my vector element
Store my element back out to vector A (in global memory)
Multiply kernel:
Load my element of vector A from global memory
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.
But if we combine the two kernels together, we could write a kernel like this:
Load my element of vector A from global memory
Add a value to my vector element
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
Here is a trivial code example.

CUDA shared memory - sum reduction from kernel

I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads. The 6 intermediate results are stored in shared memory by each thread (6 float arrays). However, now I need to add each of the intermediate result to produce a sum (6 sum values). I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.
Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?
What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.
This seems unlikely:
I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.
I can't imagine how you have enough space to store your data in shared memory but not in global memory.
Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
Or you can write your own sum-reduction code. It's not terribly hard to do, there are many questions on SO about it, such as this one.
Or you could adapt the cuda sample code.
Update
After seeing all the comments, I understand that instead of doing 1 or a few times of reduction, you need to do the reductions for 450x450x6 times.
In this case there's simpler solution.
You don't need to implement relatively complex parallel reduction for each 1500-D vector。 Since you already have 450x450x6 vectors to reduce, you could reduce all these vectors in parallel using traditional serial reduction method.
You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.
In each thread, you could iterate over the 1500 frames. In each iterration, you coulde first compute the 6 intermediate results, then add them to the sums. When yo finish all the iterations, you could write the 6 sums to global mem.
That finishes the kernel design. And no shared mem is needed.
You wil find that the performance is very good. Since it is a memory bound operation,it won't be much longer than simply access all the image cube data once.
In case you don't have enough global mem for the whole cube, you could split it into 4 sub-cubes of [1500][225][225], and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.
Have a look at this that explains parallel reduction in CUDA thoroughly.
If I understand it correctly, each thread should sum up "only" 6 floats.
I'm not sure if it is worth doing that by a parallel reduction in general, in the sense that you will experience performance gains.
If you are targeting a Kepler, you may try to use shuffle operations if you properly set the block size so that your intermediate results fit the Streaming Multiprocessor's registers in some way.
As also pointed out by Robert Crovella, your statement about the possibility of storing the intermediate results seems strange as the amount of global memory is certainly larger than the amount of shared memory.

data per block in CUDA -- does it broadcast in one transaction?

i have a data array that is per-block.
i have N blocks inside a cuda Grid and a constant array of data "block_data[]" with size N.
so, all threads in a given block 'X' access block_data[X] just one time, and do something with that value.
my question is: does this broadcast scheme work efficiently?
if not, what approach should i take?
edit after comments: my only problem with constant memory is its limited size, since i could have more than 64K blocks. That would mean more than 64KB
regards
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one bit of info you need (sizeof(block_data)), if other warps in the block request the same data then they should get from L1. The efficiency of the load is 1/128 but you should only pay that once for the block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__ which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform then it can optimise the access to use one of the read-only caches (e.g. constant cache or, on compute capability >=3.5, read-only data cache aka texture cache).
If you want to change the values in block_data[N] array, better use the concept of shared memory __shared__. If you are not changing the value of block_data[N], use __const__ or use the concept of cache. By using L2 Cache, you can get 1536KB of memory (Kepler).

How to make cudaMalloc dynamically

I need to perform cudaMalloc dynamically to allocate memory for a dynamically expanding array, which size can vary in a wide range. This array represents the result of join operation over two tables, so it can be zero size or come up to maximal amount of data (in case when the tables contain totally similar data).
If I allocate memory due to expectation that the tables' data is almost similar, I can get a huge amount of memory that's not used at all
So, is there some way to perform memory allocation dynamically with CUDA to make memory usage efficient?
There is no way to dynamically expand previously allocated memory inside a kernel. The closest you get is 'new' and 'delete' on Fermi. But those allocate new chunks, they don't expand your existing chunk. However, I don't see any point in attempting to expand the allocated memory inside a kernel. Just allocate the maximum amount of memory that could be used by the kernels up front. If that means that you don't have enough memory to complete the processing of the data afterwards, then the program would not have been able to handle that case anyway, if you had been able to dynamically expand the memory.
Also, a scheme where you would continously expand the allocated memory to hold new results would require a lot of communication between the threads (since all threads would have to know how many results have currently been found). Instead, don't attempt to create a result set without gaps in it. Let the results of your join be stored throughout the allocated area, in locations that correspond with the thread indexes. Then, scan through the result with a second kernel or with Thrust to gather the results together.