How best to transfer a large number of arrays of chars to the GPU? - cuda

I am new to CUDA and am trying to do some processing of a large number of arrays. Each array is an array of about 1000 chars (not a string, just stored as chars) and there can be up to 1 million of them, so about 1 GB of data to be transferred. This data is already all loaded into memory and I have a pointer to each array, but I don't think I can rely on all the data being sequential in memory, so I can't just transfer it all with one call.
I currently made a first go at it with thrust, and based my solution kind of on this message ... I made a struct with a static call that allocates all the memory, and then each individual constructor copies that array, and I have a transform call which takes in the struct with the pointer to the device array.
My problem is that this is obviously extremely slow, since each array is copied individually. I'm wondering how to transfer this data faster.
In this question (the question is mostly unrelated, but I think the user is trying to do something similar) talonmies suggests that they try and use a zip iterator but I don't see how that would help transfer a large number of arrays.
I also just found out about cudaMemcpy2DToArray and cudaMemcpy2D while writing this question, so maybe those are the answer, but I don't immediately see how these would work, since neither seems to take pointers to pointers as input...
Any suggestions are welcome...

One way to do this is, as marina.k suggested, to batch your transfers and send data only as you need it. Since each array contains only about 1000 chars, you could assign each char to a thread (on Fermi we can have up to 1024 threads per block) and have each array handled by one block. In that case you may be able to transfer all the arrays for one "round" in one call. Can you use a FORTRAN-style layout, where you make one gigantic array, so that to get element 5 of the third 1000-char array you would write:
third_array[5] = big_array[5 + 2*1000]
so that the first 1000-char array makes up the first 1000 elements of big_array, the second 1000-char array makes up the next 1000 elements, and so on? In that case your chars would be contiguous in memory, and you could move the whole set that one kernel launch will process with a single memcpy. Then, as soon as you launch one kernel, you refill big_array on the CPU side and copy it asynchronously to the GPU.
Within each kernel, you could simply handle each array in one block, so that block N (counting from 1) handles elements (N-1)*1000 through N*1000 - 1 of d_big_array (where you copied all those chars to).
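The packing step described above can be sketched on the host side as follows. This is only a minimal illustration, assuming every array has the same length; packArrays and packedAt are hypothetical helper names, not part of CUDA or of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy separately allocated arrays of `len` chars each into one contiguous
// staging buffer, so that array k starts at offset k * len.
// (Hypothetical helper; assumes all arrays have exactly `len` chars.)
std::vector<char> packArrays(const std::vector<const char*>& arrays, std::size_t len)
{
    std::vector<char> big(arrays.size() * len);
    for (std::size_t k = 0; k < arrays.size(); ++k) {
        assert(arrays[k] != nullptr);
        std::memcpy(big.data() + k * len, arrays[k], len);
    }
    return big;
}

// Element j of array k in the packed layout: big[j + k * len].
inline char packedAt(const std::vector<char>& big, std::size_t len,
                     std::size_t k, std::size_t j)
{
    return big[j + k * len];
}
```

The returned buffer could then be sent with a single call such as cudaMemcpy(d_big, big.data(), big.size(), cudaMemcpyHostToDevice), where d_big is a device allocation of the same size. The extra host-side copy is the price paid for replacing many small transfers with one large one.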

Did you try pinned memory? This may provide a considerable speed-up on some hardware configurations.

Give async transfers a try: you can split the same job across different streams, with each stream processing a small part of the data, so that transfer and computation happen at the same time.
Here is the code:
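// One iteration per stream; numStreams is a placeholder for the number of streams you created.
// hostPtr must point to pinned (page-locked) memory for the copies to overlap with the kernels.
for (int i = 0; i < numStreams; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size, cudaMemcpyDeviceToHost, stream[i]);
}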

Related

how to use CUDA to create an array of indices of matches from another array? With atomic?

I have an array on the device that contains only 0s and 1s, like [0,1,0,0,1...]. I want to create a new array that holds the indices of the 1s.
I think this should use atomics, but I have no idea how to implement it.
This can be achieved using stream compaction. With Thrust it could look like this:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <thrust/distance.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <vector>
// Predicate applied to the stencil (the 0/1 array)
struct IsOne{
    __host__ __device__
    bool operator()(int i) const{
        return i == 1;
    }
};
int main(){
    std::vector<int> h_array{1,0,1,1,0};
    thrust::device_vector<int> d_array = h_array;
    thrust::device_vector<int> d_indicesOfOnes(d_array.size());
    const int n = static_cast<int>(d_array.size());
    // Copy index i to the output whenever d_array[i] == 1
    auto end = thrust::copy_if(
        thrust::device,
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(n),
        d_array.begin(),
        d_indicesOfOnes.begin(),
        IsOne{}
    );
    int numIndices = thrust::distance(d_indicesOfOnes.begin(), end);
    for(int i = 0; i < numIndices; i++){
        std::cout << d_indicesOfOnes[i] << "\n";
    }
}
Output:
0
2
3
So you want to find the indices of all the 1s, producing an output array whose length is the count of ones?
Almost certainly better to work in private chunks than to have all threads (whatever you call them in CUDA) serialized by incrementing a shared atomic output position. That strategy would be disastrously slow, getting no parallelism except when scanning through 0s. Maybe you had some better idea for how to use atomic, but you didn't mention it.
Have each "thread" count 1s in its chunk, then prefix-sum those results across chunks to find the starting point in the output array for each chunk.
Then (after all earlier chunks are done) copy its local array of indices to the right position in the final shared array.
Collecting results between chunks might be done with atomic, but I think you'd want an array of start-positions instead of serializing things with atomic for only 1 chunk start-point at a time.
IDK whether it would be faster to generate a local array of indices during the first pass as you count, or to re-scan the original array to generate indices. That would let the first pass just be a simple sum (which is the same thing as counting the matches in this case). If 1s are very rare, writing a temporary array and copying it might be good. If they're common, keeping the first pass simple and light-weight is probably good, not costing a lot more memory access / cache footprint.
The overall problem is somewhat similar to the parallel prefix-sum problem for dependencies between chunks, but it's the starting position that isn't known here, instead of what offset you need to add.
The prefix-sum part of what I'm suggesting is just over chunks, one value per thread, so this step doesn't have to do a lot of work between waiting for earlier chunks and starting up the 2nd phase.
I've never actually used CUDA so I'm not going to attempt code, but this is how you can parallelize the dependencies inherent in this problem in a way that's probably friendly for what GPUs can do.
It would work well on a multi-core CPU where you'd maybe have an array of atomic<ssize_t> output counts / positions. Perhaps starting zero-initialized, and when a thread finishes its chunk it writes a count to the array biased by 1, so it's non-zero. Or perhaps chunk_counts[thread] = -(count+1); so it's definitely non-zero, and negative. Then it checks if it's the first thread, or if chunk_counts[thread-1] is non-zero and positive (else .wait(0) on it?), and if so writes its chunk_counts[thread] = chunk_counts[thread-1] + count; Actually you wouldn't need to store both the negative local count and the positive prefix-summed count, just one or the other depending on how you collect.
And on a CPU, probably better to have one thread responsible for collecting the results, prefix summing, and notifying all the other threads to wake again. Instead of serializing with a chain of each thread waiting for a previous thread to write its value. Perhaps also on a GPU. That one collector thread can notify workers to start phase 2 as it goes along in the prefix sum.
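The count, prefix-sum, scatter scheme described in this answer can be sketched sequentially in plain C++. This is only an illustration under stated assumptions: indicesOfOnes and numChunks are hypothetical names, each chunk stands in for one thread's private slice, and the second pass re-scans the input rather than keeping a local index array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass compaction: returns the indices i where flags[i] == 1.
// The per-chunk loops in pass 1 and pass 2 touch disjoint data, so each
// iteration could be handed to a separate thread.
std::vector<int> indicesOfOnes(const std::vector<int>& flags, std::size_t numChunks)
{
    assert(numChunks > 0);
    const std::size_t n = flags.size();
    const std::size_t chunkLen = (n + numChunks - 1) / numChunks;

    // Pass 1: count the ones in each chunk (stored at start[c + 1]).
    std::vector<std::size_t> start(numChunks + 1, 0);
    for (std::size_t c = 0; c < numChunks; ++c) {
        const std::size_t lo = std::min(n, c * chunkLen);
        const std::size_t hi = std::min(n, lo + chunkLen);
        start[c + 1] = static_cast<std::size_t>(
            std::count(flags.begin() + lo, flags.begin() + hi, 1));
    }

    // Prefix sum over chunks: start[c] becomes chunk c's output offset.
    for (std::size_t c = 0; c < numChunks; ++c) {
        start[c + 1] += start[c];
    }

    // Pass 2: each chunk writes its indices at its own offset.
    std::vector<int> out(start[numChunks]);
    for (std::size_t c = 0; c < numChunks; ++c) {
        const std::size_t lo = std::min(n, c * chunkLen);
        const std::size_t hi = std::min(n, lo + chunkLen);
        std::size_t pos = start[c];
        for (std::size_t i = lo; i < hi; ++i) {
            if (flags[i] == 1) out[pos++] = static_cast<int>(i);
        }
    }
    return out;
}
```

Because every chunk knows its start offset before pass 2 begins, no atomic increment of a shared output position is needed, and the output stays in index order.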

Is it safe to use cudaHostRegister on only part of an allocation?

I have a C++ class container that allocates, let's say, 1GB of memory of plain objects (e.g. built-ins).
I need to copy part of the object to the GPU.
To accelerate and simplify the transfer I want to register the CPU memory as non-pageable ("pinning"), e.g. with cudaHostRegister(void*, size, ...) before copying.
(This seems to be a good way to copy further subsets of the memory with minimal logic. For example if plain cudaMemcpy is not enough.)
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB?
I may want to register only part because of efficiency, but also because deep down in the call trace I might have lost information of the original allocated pointer.
In other words, can the pointer argument to cudaHostRegister be something other than an allocated pointer? In particular, an arithmetic result derived from allocated memory, but still within the allocated range.
It seems to work but I don't understand if, in general, "pinning" part of an allocation can corrupt somehow the allocated block.
UPDATE: My concern is that allocation is actually mentioned in the documentation for the cudaHostRegister flag options:
cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system with no unified virtual addressing, the memory will be neither mapped nor portable.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a third-party PCIe device, and it will marked as non cache-coherent and contiguous.
cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms without cudaDevAttrPageableMemoryAccessUsesHostPageTables, this flag is required in order to register memory mapped to the CPU as read-only. Support for the use of this flag can be queried from the device attribute cudaDeviceAttrReadOnlyHostRegisterSupported. Using this flag with a current context associated with a device that does not have this attribute set will cause cudaHostRegister to error with cudaErrorNotSupported.
This is a rule-of-thumb answer rather than a proper one:
When the CUDA documentation does not guarantee that something works, you'll need to assume it doesn't. Because if it does happen to work - for you, right now, on the system you have - it might stop working in the future; or on another system; or in another usage scenario.
More specifically - memory pinning happens at page resolution, so unless the part you want to pin starts and ends on a physical page boundary, the CUDA driver will need to pin some more memory before and after the region you asked for - which it could do, but it's going an extra mile to accommodate you, and I doubt that would happen without documentation.
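To make the page-resolution point concrete, here is a small sketch that computes the page-aligned range covering an arbitrary sub-range. It only illustrates the arithmetic; whether the CUDA driver performs this rounding itself is not something the documentation quoted above states. PageRange and coveringPages are hypothetical names, and the page size is passed in as a parameter (4096 is just a common value, not a guarantee).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Page-aligned range that fully covers a requested [addr, addr + size).
struct PageRange {
    std::uintptr_t begin;  // addr rounded down to a page boundary
    std::size_t    size;   // multiple of pageSize covering the request
};

// pageSize must be a power of two; query the OS for the real value.
inline PageRange coveringPages(std::uintptr_t addr, std::size_t size,
                               std::size_t pageSize)
{
    assert(pageSize != 0 && (pageSize & (pageSize - 1)) == 0);
    const std::uintptr_t mask  = static_cast<std::uintptr_t>(pageSize) - 1;
    const std::uintptr_t begin = addr & ~mask;                 // round down
    const std::uintptr_t end   = (addr + size + mask) & ~mask; // round up
    return PageRange{begin, static_cast<std::size_t>(end - begin)};
}
```

If the requested region does not start and end on page boundaries, the covering range is larger than the request, which is the "some more memory before and after" mentioned above. Registering the covering range explicitly is one way to avoid relying on undocumented behaviour, provided the extra bytes still lie inside your own allocation.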
I also suggest you file a bug report via developer.nvidia.com , asking that they clarify this point in the documentation. My experience is that there's... something like a 50% chance they'll do something about such a bug report.
Finally - you could just try it: Write a program which copies to the GPU with and without the pinning of the part-of-the-region, and see whether there's a throughput difference.
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100MB subset of the original 1GB.
While I agree that the documentation could be clearer, I think the answer to the question is 'Yes'.
Here's why: The alternative interpretation would be that only whole memory sections returned by, say, malloc should be allowed to be registered. However, this is unworkable, because malloc could, behind the scenes, have one big section allocated, and only give the user parts of it. So even if you (the user) were cudaHostRegistering those sections returned by malloc, they'd actually be fragments of some bigger previously allocated chunk of memory anyway.
By the way, Linux has a similar kernel call to lock memory called mlock. It accepts arbitrary memory ranges.
One of the other answers claimed (until this test was posted):
If you need to copy the part-of-the-object just once to the GPU - there's no use in using cudaHostRegister(), because it will likely itself copy the data, physically, elsewhere - so you won't be saving anything
But this is incorrect: registering is worth it, if the chunk of memory being copied is big enough, even if the copying is done only once. I'm seeing about a 2x speed-up with this code (comment out the line indicated), or about 50% if unregistering is also done between the timers.
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>
int main()
{
    std::size_t giga = 1024*1024*1024;
    std::vector<char> src(giga, 3);
    char* dst = 0;
    if(cudaMalloc((void**)&dst, giga)) return 1;
    cudaDeviceSynchronize();
    auto t0 = std::chrono::system_clock::now();
    // register (pin) only a 128 MB slice in the middle of the 1 GB allocation
    if(cudaHostRegister(src.data() + src.size()/2, giga/8, cudaHostRegisterDefault)) return 1; // comment out this line
    if(cudaMemcpy(dst, src.data() + src.size()/2, giga/8, cudaMemcpyHostToDevice)) return 1;
    cudaDeviceSynchronize();
    auto t1 = std::chrono::system_clock::now();
    auto d = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << (d / 1e6) << " seconds" << std::endl;
    // un-register and free (the unregister call just returns an error if the register line was commented out)
    cudaHostUnregister(src.data() + src.size()/2);
    cudaFree(dst);
}

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have an OpenGL particle simulation where the position of each particle is calculated in a CUDA kernel. Most of the data resides in GPU memory, but there is a single float value that I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell) this slows down performance quite a bit. I used nvprof to see which calls take the longest, with these results:
Calls  Avg       Min       Max       Name
477    2.9740us  2.8160us  4.5440us  simulation(float3*, float*, float3*, float*)
477    89.033us  18.600us  283.00us  cudaLaunchKernel
477    47.819us  10.200us  120.70us  cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but from the calls, that happen every frame cudaMemcpyAsync() seems to be taking the longest.
I have also tried pinned memory and cudaHostGetDevicePointer() as described here; however, for some reason this increases the kernel launch times even more, more than offsetting the time saved by not needing the memcpy call.
I guess there has to be a better/faster way to update my single float variable to the GPU?
The easiest way is to add an extra parameter to the simulation kernel that takes the value as a plain float, not as a pointer to float, so that the data travels in the kernel launch parameter structure that CUDA sends to the GPU when you launch the kernel. Then you avoid the separate copy command altogether. (I'm assuming CUDA packs the whole kernel parameter descriptor into a single copy command, because kernel parameter space is limited to a few kB or less.)
simulation(fooPointer,
barPointer,
fooBarPointer,
floatVariable
);
Or, try double buffering between data update and rendering or between data update and compute so that simulation image follows the simulation calculation by 1-2 frames behind (and per-frame time gets worse) but "frames per second" increases.
If it's not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to a user-input into simulation?) then you should embed the float variable to the end of an array that you already send/use in simulation or whatever structure you are using. If you already have a 1MB+ float buffer to send to GPU, then appending 4B(float) to end of it should not make much difference then you can access it from there. 1 copy operation should be faster than 2 copy operations with same total size.
If you are literally sending just 4B to GPU at each frame (with a simple function to generate that data), then (as 3Dave said in comments) you can try adding an extra kernel function to update the value in the GPU and just have the overhead of kernel launch command instead of both copy command overhead and data copy overhead. On a positive side, that extra kernel overhead might be hidden if there is a "graph" of kernels running for each frame automatically without enqueueing all of them again and again.
Here,
https://devblogs.nvidia.com/cuda-graphs/
The part
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say per-kernel launch overhead in this "graph" feature is only 3-4 microseconds when there are many(20) kernels in the algorithm.
Since graph supports other commands too, you can try both copy and compute parts in parallel cuda-streams within a graph and switch their inputs with double buffering so all CUDA things can stay within CUDA's context before sending output to rendering.
(Maybe) you don't even have to change the data mechanism at all. Just try passing the float's binary representation as the pointer value, then in the kernel read only the pointer value itself (not the data it points to) and convert it back to float. I don't know whether CUDA returns an error for this, as long as the kernel never dereferences the (wrong) address that the float's bits represent.
simulation(fooPointer,
barPointer,
fooBarPointer,
toPtr(floatData) // <----- float to 64/32 bit pointer value
);
and in kernel
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
But this may not be a preferred practice if you can simply use "value" type parameters.

Launch CUDA kernels sequence and data transfer between them

I have two kernels in the same file. The code should run the first kernel to generate an array, and then I need to pass the generated array to the second kernel.
However, when I do this, the second kernel sees all the array elements as 0.
Here is a simplification (not runnable code, just pseudocode):
cudaMalloc(device input array)
cudaMalloc(result array)
cudaMemcpy(device_input_array,inputarray,size,hosttodevice)
kernel1<<<1,n>>>(device_input_array,device_result_array)
cudaMemcpy(host_result_array,device_result_array ... )
cudaMalloc(dev_secndarray)
kernel2<<<1,n>>>(dev_secndarray,device_result_array )
For testing, in kernel2 I loop over device_result_array; however, it prints all its elements as zero.
What is the proper way to pass data between kernels? Should I reserve space for the result array again? What should I do?
Memory allocated through cudaMalloc exists till the end of application, or until you explicitly free the memory. Thus, the device_result_array can be passed directly to the second kernel as an input. I would recommend the following pattern:
cudaMalloc(device_input_array)
cudaMalloc(device_intermediate_result_array)
cudaMalloc(device_final_result_array)
cudaMemcpy(device_input_array,host_input_array,size,hosttodevice)
kernel1<<<G,B>>>(device_input_array,device_intermediate_result_array)
kernel2<<<G,B>>>(device_intermediate_result_array,device_final_result_array)
cudaMemcpy(host_result_array,device_final_result_array,size,devicetohost)
If for some reason you actually need to make a copy of the intermediate result in the device, you have an option to call cudaMemcpy(...,cudaMemcpyDeviceToDevice).
In either case, don't copy the intermediate result to host (unless you really need it for other reasons). Host<->Device copies are expensive.

CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) to be stored somewhere in CUDA memory. Now, I need a big set of threads to sequentially do a bunch of operations with that array. Every thread will have to read the SAME element of that array, perform tasks, store results in shared memory and read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading many times would be quite useless... Any ideas?
This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;
__global__ void kernel(double *ptr)
{
__shared__ double window[block_size];
// cooperate with my block to load block_size elements
window[threadIdx.x] = ptr[threadIdx.x];
// wait until the window is full
__syncthreads();
// operate on the data
...
}
You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
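The sliding described above needs a little bookkeeping when the array length is not a multiple of block_size, as with the 77,500 doubles in the question. The following host-side sketch lists the windows a block would step through; Tile and makeTiles are hypothetical names used only for illustration, not part of the original answer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One position of the sliding window: it covers [offset, offset + count).
struct Tile {
    std::size_t offset;
    std::size_t count;   // equals blockSize except possibly for the last tile
};

// Split n elements into consecutive windows of at most blockSize elements.
std::vector<Tile> makeTiles(std::size_t n, std::size_t blockSize)
{
    assert(blockSize > 0);
    std::vector<Tile> tiles;
    for (std::size_t offset = 0; offset < n; offset += blockSize) {
        tiles.push_back(Tile{offset, std::min(blockSize, n - offset)});
    }
    return tiles;
}
```

Inside the kernel, the equivalent loop would have each thread load ptr[offset + threadIdx.x] into window[threadIdx.x] only when threadIdx.x < count, with a __syncthreads() after the load and another before the next window overwrites the shared array.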