How to perform cudaMalloc dynamically - cuda

I need to perform cudaMalloc dynamically to allocate memory for a dynamically expanding array, whose size can vary over a wide range. This array holds the result of a join operation over two tables, so it can be zero-sized or grow to the maximal amount of data (in the case where the tables contain identical data).
If I allocate memory on the expectation that the tables' data is almost identical, I can end up reserving a huge amount of memory that is never used.
So, is there some way to perform memory allocation dynamically with CUDA to make memory usage efficient?

There is no way to dynamically expand previously allocated memory inside a kernel. The closest you get is 'new' and 'delete' on Fermi, but those allocate new chunks; they don't expand your existing chunk. However, I don't see any point in attempting to expand the allocated memory inside a kernel. Just allocate the maximum amount of memory that could be used by the kernels up front. If that means you don't have enough memory to complete the processing of the data afterwards, then the program would not have been able to handle that case anyway, even if you had been able to dynamically expand the memory.
Also, a scheme where you continuously expand the allocated memory to hold new results would require a lot of communication between the threads (since all threads would have to know how many results have currently been found). Instead, don't attempt to create a result set without gaps in it. Let the results of your join be stored throughout the allocated area, in locations that correspond to the thread indexes. Then scan through the result with a second kernel or with Thrust to gather the results together, as in the sketch below.
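A minimal sketch of that scatter-then-compact pattern. The join_kernel, the equality test, and the -1 sentinel are placeholders for illustration, not part of the original answer:

#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical join kernel: each thread writes its result (or a sentinel)
// into the slot that corresponds to its global thread index.
__global__ void join_kernel(const int *a, const int *b, int n, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (a[i] == b[i]) ? a[i] : -1;    // -1 marks "no result here"
}

struct is_valid {
    __host__ __device__ bool operator()(int x) const { return x != -1; }
};

void run_join(const thrust::device_vector<int> &a,
              const thrust::device_vector<int> &b)
{
    int n = (int)a.size();
    thrust::device_vector<int> scratch(n);      // sized for the worst case up front
    join_kernel<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(a.data()),
        thrust::raw_pointer_cast(b.data()),
        n,
        thrust::raw_pointer_cast(scratch.data()));

    // Compact the gapped results into a dense array with Thrust.
    thrust::device_vector<int> result(n);
    auto end = thrust::copy_if(scratch.begin(), scratch.end(),
                               result.begin(), is_valid());
    result.resize(end - result.begin());
}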

Related

Trying to understand, in CUDA, why zero copy is faster if it also travels through PCIe?

It is said that zero copy should be used in situations where the “read and/or write exactly once” constraint is met. That's fine.
I have understood this, but my question is: why is zero copy fast in the first place? After all, whether we use an explicit transfer via cudaMemcpy or zero copy, in both cases the data has to travel through the PCI Express bus. Or does there exist some other path (i.e., does the copy happen directly into GPU registers, bypassing device RAM)?
Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.
However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes with pinning of the host memory. This has a noticeable time overhead (e.g. when compared to allocating the same amount of data using e.g. malloc or new). The primary overhead that comes to mind with cudaMemcpy is a per-transfer overhead of at least a few microseconds that is associated with the setup costs of using the underlying DMA engine that does the transfer.
Another difference is in accessibility to the data. Pinned/zero-copy data is simultaneously accessible from both host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with cudaMemcpyAsync, for example.
Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.
When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose, also that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a-priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know what location was updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s.
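A rough sketch of that first pattern, with a hypothetical lookup_kernel that touches only a handful of entries of a large mapped table (the sizes, index values, and names are all made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that touches only a handful of entries in the mapped table.
__global__ void lookup_kernel(float *table, const int *indices, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        table[indices[i]] += 1.0f;                // a few scattered reads/writes over PCIe
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);        // required for mapped memory on pre-UVA setups

    const size_t N = 256 * 1024 * 1024;           // ~1GB of floats
    float *h_table;                               // pinned, mapped host allocation
    cudaHostAlloc((void **)&h_table, N * sizeof(float), cudaHostAllocMapped);

    float *d_table;                               // device-side alias of the same memory
    cudaHostGetDevicePointer((void **)&d_table, h_table, 0);

    int h_idx[3] = {42, 1000, 123456};
    for (int k = 0; k < 3; ++k)
        h_table[h_idx[k]] = 0.0f;                 // host writes, visible to the device

    int *d_idx;
    cudaMalloc((void **)&d_idx, sizeof(h_idx));
    cudaMemcpy(d_idx, h_idx, sizeof(h_idx), cudaMemcpyHostToDevice);

    // No bulk 1GB transfer: the kernel pulls only the entries it needs.
    lookup_kernel<<<1, 3>>>(d_table, d_idx, 3);
    cudaDeviceSynchronize();

    printf("table[42] = %f\n", h_table[42]);      // modified data already visible on the host
    cudaFree(d_idx);
    cudaFreeHost(h_table);
    return 0;
}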
When you need to check status of a search or convergence algorithm. Suppose that we have an algorithm that consists of a loop that is calling a kernel in each loop iteration. The kernel is doing some kind of search or convergence type algorithm, and so we need a "stopping condition" test. This might be as simple as a boolean value, that we communicate back to the host from the kernel activity, to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates. Otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might be setting the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel does not ever set the flag to false. Therefore if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:
bool *d_continue;
cudaMalloc(&d_continue, sizeof(bool));
bool h_continue = true;
while (h_continue){
    h_continue = false;
    cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
    my_search_kernel<<<...>>>(..., d_continue);
    cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
}
The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:
bool *z_continue;
cudaHostAlloc(&z_continue, sizeof(bool), ...);
*z_continue = true;
while (*z_continue){
    *z_continue = false;
    my_search_kernel<<<...>>>(..., z_continue);
    cudaDeviceSynchronize();
}
For example, assume that you wrote a CUDA-accelerated editor algorithm to fix spelling errors in books. If a 2MB text has only 5 bytes of errors, it only needs to edit those 5 bytes, so it doesn't need to copy the whole array from GPU VRAM to system RAM. Here, the zero-copy version would access only the page that owns the 5-byte word. Without zero-copy, it would need to copy the whole 2MB of text. Copying 2MB takes more time than copying 5 bytes (or just the page that owns those bytes), so it would reduce books/second throughput.
As another example, there could be a sparse path-tracing algorithm that adds shiny surfaces for a few small objects of a game scene. The result may need to update just 10-100 pixels instead of 1920x1080 pixels. Zero copy would work better again.
Sparse matrix multiplication might also work better with zero-copy. If 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, zero-copy could still make a difference when writing the results.

How to avoid race condition in different blocks in CUDA

I am writing a function in CUDA that divides a set of unsorted points into a 3D grid. Based on the bounds of the point set, I can find the cell coordinate of every point and write the point into an array within that grid cell.
I launch the kernel with as many threads as there are points, splitting them across multiple blocks to stay under the maximum thread count per block.
Now each thread finds its cell coordinate and writes the point into the cell, but other threads within the same or a different block can compute the same coordinate at the same time. The code fails here because of the race condition.
I read about atomics, locks and critical sections, but it seems these synchronizations are used within a thread block only, which does not cover my case.
Any suggestions please?
My initial guess is that I need to sort the points by grid cell, and launch the kernel with each block corresponding to one grid cell.
Atomics can work on global memory and synchronize between blocks. The only issue here is performance. Depending on how much of the run time is taken up by just performing the writes to memory, you may get slower code than just doing it serially on the CPU. Atomics are slow. Maybe try to rethink the problem.
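If you do go the atomics route, a hedged sketch of per-cell counters in global memory might look like this (the cell-index math, the fixed per-cell capacity, and the assumption that points are already translated to the grid origin are all illustrative, not part of the question):

// Each grid cell keeps a counter in global memory; atomicAdd hands out a
// unique slot to every point that falls in that cell, regardless of which
// block the writing thread belongs to. The per-cell capacity is a made-up
// constant, and points are assumed to be translated so the bounding-box
// minimum sits at the origin.
#define MAX_POINTS_PER_CELL 64

__global__ void bin_points(const float3 *points, int num_points,
                           float3 cell_size, int3 grid_dim,
                           int *cell_counts,    // one counter per cell, zero-initialised
                           int *cell_points)    // num_cells * MAX_POINTS_PER_CELL point indices
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_points) return;

    float3 p = points[i];
    int cx = min((int)(p.x / cell_size.x), grid_dim.x - 1);
    int cy = min((int)(p.y / cell_size.y), grid_dim.y - 1);
    int cz = min((int)(p.z / cell_size.z), grid_dim.z - 1);
    int cell = (cz * grid_dim.y + cy) * grid_dim.x + cx;

    // atomicAdd on global memory is safe across blocks.
    int slot = atomicAdd(&cell_counts[cell], 1);
    if (slot < MAX_POINTS_PER_CELL)
        cell_points[cell * MAX_POINTS_PER_CELL + slot] = i;
}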

Is there a way to know what's the extra space that cudaMalloc is going to reserve?

When I use cudaMalloc(100), it reserves more than 100 B (according to some users here, that's due to granularity issues and housekeeping information).
Is it possible to determine how big this space will be, based on the number of bytes I need to reserve?
Thank you so much.
EDIT: I'll explain why I need to know.
I want to apply a convolution algorithm over huge images on the GPU. Since there isn't enough memory on the GPU to hold an image, I need to split it into batches of rows and call the kernel several times.
In fact, I need to send 2 images, the OnlyRead matrix and the Results matrix.
I want to calculate a priori the maximum number of rows I can send to the device according to the amount of free memory.
The first cudaMalloc executes successfully, but the problem appears when trying to execute the second cudaMalloc, since the first reservation took more bytes than expected.
What I'm doing now is treating the free memory amount as 10% less than what is reported... but that's just a magic number that came from nowhere.
"Is there a way to know what's the extra space that cudaMalloc is going to reserve?"
Not without violating CUDA's platform guarantees, no. cudaMalloc() returns a pointer to the requested amount of memory. You can't make any assumptions about the amount of memory that happens to be valid after the end of the requested amount - the CUDA allocator already makes use of suballocators, and unlike CPU-based memory allocators, the data structures to track free lists etc. are not interleaved with the allocated memory. So for example, it would be unwise to assume that the CUDA runtime's guarantees about the alignment of the returned pointers mean anything other than that returned pointers will have a certain alignment.
If you study the CUDA runtime's behavior, that will shed light on the behavior of that particular CUDA runtime, but the behavior may change with future releases and break your code.
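If you only want to observe how much device memory a particular allocation actually consumes on your specific driver/runtime, one empirical (and non-portable) approach is to compare cudaMemGetInfo before and after the allocation; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_before, free_after, total;

    cudaMemGetInfo(&free_before, &total);

    void *p;
    cudaMalloc(&p, 100);                  // request 100 bytes

    cudaMemGetInfo(&free_after, &total);

    // The difference is what this particular driver/runtime actually consumed
    // for the request (often a whole suballocator page). It is an observation
    // only; the behaviour may change between CUDA releases.
    printf("requested 100 B, free memory dropped by %zu B\n",
           free_before - free_after);

    cudaFree(p);
    return 0;
}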

data per block in CUDA -- does it broadcast in one transaction?

I have a data array that is per-block.
I have N blocks inside a CUDA grid and a constant array of data block_data[] with size N.
So, all threads in a given block X access block_data[X] just one time and do something with that value.
My question is: does this broadcast scheme work efficiently?
If not, what approach should I take?
Edit after comments: my only problem with constant memory is its limited size, since I could have more than 64K blocks. That would mean more than 64KB.
Regards
If you just use a normal global memory access then the transaction is fairly inefficient, although depending on how much work your kernel is doing the impact is probably quite small.
I'm assuming sizeof(block_data) is one byte (inferred from your question "...could have more than 64K blocks. That would mean more than 64KB").
If the operation is cached in L1 then you will fetch 128 bytes for the one byte of info you need (sizeof(block_data)); if other warps in the block request the same data then they should get it from L1. The efficiency of the load is 1/128, but you should only pay that once for the block.
If the operation is not cached in L1 (e.g. you pass "-dlcm=cg" to the assembler) then you will fetch 32 bytes. The efficiency is 1/32 but you pay that once for each warp.
Once the data is loaded, it is broadcast to all threads in the warp.
An alternative would be to mark the data as const __restrict__ which indicates to the compiler that the data is a) read-only and b) not aliased by any other pointer. Since the compiler can detect that the access is uniform then it can optimise the access to use one of the read-only caches (e.g. constant cache or, on compute capability >=3.5, read-only data cache aka texture cache).
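For illustration, a sketch of such a kernel signature (the element type and the work done with the value are placeholders):

// Marking the pointer const __restrict__ lets the compiler route the load
// through the read-only data cache (LDG on compute capability >= 3.5); the
// per-block value is then broadcast to all threads of each warp that loads it.
__global__ void my_kernel(const unsigned char * __restrict__ block_data, float *out)
{
    unsigned char v = block_data[blockIdx.x];    // uniform within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = v * 2.0f;                           // placeholder work with the value
}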
If you want to change the values in the block_data[N] array, it is better to use shared memory (__shared__). If you are not changing the values of block_data[N], use __constant__ memory or rely on the caches. By using the L2 cache, you can get 1536KB of memory (Kepler).
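A sketch of the shared-memory variant, where one thread per block performs the single global load and the rest of the block reads the value after a barrier (again, the element type and the work are placeholders):

__global__ void my_kernel_shared(const unsigned char *block_data, float *out)
{
    __shared__ unsigned char v;          // one copy per block

    if (threadIdx.x == 0)
        v = block_data[blockIdx.x];      // a single global load per block
    __syncthreads();                     // make it visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = v * 2.0f;                   // placeholder work with the value
}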

Writing a large array of unknown size in CUDA?

I have a process where I send data to CUDA to process, and it outputs data that matches certain criteria. The problem is I often don't know the size of the output array. What can I do?
I send in several hundred lines of data and have it processed in over 20K different ways on CUDA. If the results match some rules I have, then I want to save them. The problem is I cannot create a linked list in CUDA (let me know if I can), and the memory on my card is small, so I was thinking of using zero copy to have CUDA write directly to the host's memory. This solves my memory size issue but still doesn't give me a way to deal with the unknown size.
My initial idea was to figure out the maximum possible number of results and malloc an array of that size. The problem is it would be huge and most of it would not be used (800 lines of data * 20K possible outcomes = 16 million items in an array... which is not likely).
Is there a better way to deal with variable-size arrays in CUDA? I'm new to programming, so ideally it would be something not too complex (although if it is, I'm willing to learn it).
Heap memory allocation using malloc in kernel code is an expensive operation (it forces the CUDA driver to initialize the kernel with a custom heap size and to manage memory operations inside the kernel).
Generally, CUDA device memory allocation is a main bottleneck of program performance. The common practice is to allocate all needed memory at the beginning and reuse it as long as possible.
I think you can create a buffer that is big enough and use it instead of memory allocations. In the worst case you can wrap it to implement memory allocation from this buffer. In the simplest case you can keep the index of the last free cell in your array so you know where to write data next time.
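A common way to implement "keep the last free cell" on a preallocated buffer is a single output counter bumped with atomicAdd; a sketch under those assumptions (the matches_rules test and the processing are stand-ins):

// Results of unknown count are appended to a preallocated buffer. atomicAdd
// on a single counter hands out a unique slot to each thread that produces a
// result; the final counter value is the number of results.
__device__ bool matches_rules(int value) { return (value % 7) == 0; }   // stand-in test

__global__ void process(const int *input, int n,
                        int *out, int out_capacity, int *out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int value = input[i] * 3;             // stand-in for the real processing
    if (matches_rules(value)) {
        int slot = atomicAdd(out_count, 1);
        if (slot < out_capacity)          // drop anything that overflows the buffer
            out[slot] = value;
    }
}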
Yes, the bottleneck of CUDA and all GPGPU work is the transfer from host to device and back.
But in kernels, always use known sizes.
A kernel must not do malloc... it is very, very weird from the perspective of the platform.
Even if you have a 'for' loop in a CUDA kernel, think 20 times about whether your approach is optimal: you must be implementing a really complex algorithm. Is it really necessary on a parallel platform?
You would not believe what problems can come up if you don't.
Use a buffered approach. You determine some buffer size, which depends more on CUDA requirements (read: the hardware) than on your array. You call the kernel in a loop and upload, process and retrieve data from there.
Eventually your array of data will be finished and the last buffer will not be full.
You can pass the size of each buffer as a single value (a pointer to an int, for example), which each thread compares to its thread id to determine whether there is a value for it to process or whether it would be out of bounds.
Only the last block will have divergence.
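A rough sketch of that buffered loop, passing the valid element count of each chunk so out-of-range threads simply return (all names and the chunk size are illustrative):

#include <algorithm>
#include <cuda_runtime.h>

__global__ void process_chunk(const float *in, float *out, int valid)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= valid) return;              // only the last chunk is partially full
    out[i] = in[i] * 2.0f;               // stand-in for the real processing
}

void process_all(const float *h_in, float *h_out, int total)
{
    const int CHUNK = 1 << 20;           // buffer size chosen to suit the device, not the array
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  CHUNK * sizeof(float));
    cudaMalloc((void **)&d_out, CHUNK * sizeof(float));

    for (int off = 0; off < total; off += CHUNK) {
        int valid = std::min(CHUNK, total - off);
        cudaMemcpy(d_in, h_in + off, valid * sizeof(float), cudaMemcpyHostToDevice);
        process_chunk<<<(valid + 255) / 256, 256>>>(d_in, d_out, valid);
        cudaMemcpy(h_out + off, d_out, valid * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
}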
Here is a useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
You can do something like this in your kernel function, using shared memory:
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    .....
}
and when you call the kernel function on the host, pass the shared memory size, precisely n*sizeof(int), as the third configuration parameter:
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
Also, it's best practice to split a huge kernel function, if possible, into several kernel functions that have less code and are easier to execute.