Is it safe to use cudaHostRegister on only part of an allocation? - cuda

I have a C++ class container that allocates, let's say, 1 GB of memory of plain objects (e.g. built-ins).
I need to copy part of the object to the GPU.
To accelerate and simplify the transfer I want to register the CPU memory as non-pageable ("pinning" it), e.g. with cudaHostRegister(void*, size, ...), before copying.
(This seems to be a good way to copy further subsets of the memory with minimal logic, for example if plain cudaMemcpy is not enough.)
Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100 MB subset of the original 1 GB?
I may want to register only part of it for efficiency, but also because deep down in the call stack I may have lost track of the original allocated pointer.
In other words, can the pointer argument to cudaHostRegister be something other than a pointer returned by an allocation? In particular, can it be a pointer derived by arithmetic from the allocated memory, but still within the allocated range?
It seems to work, but I don't understand whether, in general, "pinning" part of an allocation can somehow corrupt the allocated block.
UPDATE: My concern is that the allocation itself is actually mentioned in the documentation for the cudaHostRegister flag options:
cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system with no unified virtual addressing, the memory will be neither mapped nor portable.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a third-party PCIe device, and it will be marked as non cache-coherent and contiguous.
cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms without cudaDevAttrPageableMemoryAccessUsesHostPageTables, this flag is required in order to register memory mapped to the CPU as read-only. Support for the use of this flag can be queried from the device attribute cudaDeviceAttrReadOnlyHostRegisterSupported. Using this flag with a current context associated with a device that does not have this attribute set will cause cudaHostRegister to error with cudaErrorNotSupported.

This is a rule-of-thumb answer rather than a proper one:
When the CUDA documentation does not guarantee that something works, you need to assume it doesn't. Because if it does happen to work, for you, right now, on the system you have, it might stop working in the future; or on another system; or in another usage scenario.
More specifically, memory pinning happens at page granularity, so unless the part you want to pin starts and ends on a physical page boundary, the CUDA driver will need to pin some more memory before and after the region you asked for. It could do that, but it would be going an extra mile to accommodate you, and I doubt that would happen without documentation.
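To make the page-granularity point concrete, here is a minimal sketch of registering a sub-range rounded out to whole pages (Linux-specific; the helper name and the use of sysconf are my own illustration, not something cudaHostRegister requires):
#include <stdint.h>
#include <unistd.h>              // sysconf(_SC_PAGESIZE)
#include <cuda_runtime_api.h>

// Hypothetical helper: pin [ptr, ptr + size) after rounding the range out to page boundaries.
static cudaError_t register_page_aligned(void* ptr, size_t size, unsigned int flags)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t begin = (uintptr_t)ptr & ~(page - 1);                       // round the start down
    uintptr_t end   = ((uintptr_t)ptr + size + page - 1) & ~(page - 1);   // round the end up
    return cudaHostRegister((void*)begin, (size_t)(end - begin), flags);
}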
I also suggest you file a bug report via developer.nvidia.com, asking that they clarify this point in the documentation. My experience is that there's something like a 50% chance they'll do something about such a bug report.
Finally, you could just try it: write a program which copies to the GPU with and without pinning the part of the region, and see whether there's a throughput difference.

Is it safe to pass a pointer that points to only part of the original allocated memory, for example a contiguous 100 MB subset of the original 1 GB?
While I agree that the documentation could be clearer, I think the answer to the question is 'Yes'.
Here's why: the alternative interpretation would be that only whole memory blocks returned by, say, malloc may be registered. However, this is unworkable, because malloc could, behind the scenes, allocate one big block and hand the user only parts of it. So even if you (the user) were cudaHostRegister-ing exactly the blocks returned by malloc, they would actually be fragments of some bigger, previously allocated chunk of memory anyway.
By the way, Linux has a similar kernel call to lock memory, called mlock. It accepts arbitrary memory ranges.
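For instance, something along these lines is valid with mlock (a sketch with illustrative sizes; error handling kept minimal):
#include <stdlib.h>
#include <sys/mman.h>   // mlock / munlock

int main(void)
{
    size_t giga = 1UL << 30;                    // 1 GB
    size_t part_size = 100UL << 20;             // 100 MB
    char* big = (char*)malloc(giga);
    if (big == NULL) return 1;

    char* part = big + (512UL << 20);           // a pointer into the middle of the allocation
    if (mlock(part, part_size) != 0) return 1;  // lock only that 100 MB sub-range
    /* ... use the locked sub-range ... */
    munlock(part, part_size);
    free(big);
    return 0;
}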
One of the other answers claimed (until this test was posted):
If you need to copy the part-of-the-object just once to the GPU - there's no use in using cudaHostRegister(), because it will likely itself copy the data, physically, elsewhere - so you won't be saving anything
But this is incorrect: registering is worth it if the chunk of memory being copied is big enough, even if the copy is done only once. I'm seeing about a 2x speed-up with this code (versus commenting out the line indicated), or about 50% if the unregistering is also done between the timers.
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>

int main()
{
    std::size_t giga = 1024*1024*1024;
    std::vector<char> src(giga, 3);
    char* dst = nullptr;
    if (cudaMalloc((void**)&dst, giga)) return 1;
    cudaDeviceSynchronize();

    auto t0 = std::chrono::system_clock::now();
    if (cudaHostRegister(src.data() + src.size()/2, giga/8, cudaHostRegisterDefault)) return 1; // comment out this line
    if (cudaMemcpy(dst, src.data() + src.size()/2, giga/8, cudaMemcpyHostToDevice)) return 1;
    cudaDeviceSynchronize();
    auto t1 = std::chrono::system_clock::now();

    auto d = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << (d / 1e6) << " seconds" << std::endl;

    // un-register and free
    cudaHostUnregister(src.data() + src.size()/2);
    cudaFree(dst);
}

Related

What is the fastest way to update a single float value to the GPU to access it in a CUDA kernel?

I have an OpenGL particle simulation where the position of each particle is calculated in a CUDA kernel. Most memory resides in GPU memory, but there is a single float value that I have to update from the CPU each frame.
At the moment I use cudaMemcpyAsync() to copy the float value to the GPU, but (at least from what I can tell) this slows down the performance quite a bit. I tried to use nvprof to see which calls take the longest, with these results:
Calls Avg Min Max Name
477 2.9740us 2.8160us 4.5440us simulation(float3*, float*, float3*, float*)
477 89.033us 18.600us 283.00us cudaLaunchKernel
477 47.819us 10.200us 120.70us cudaMemcpyAsync
I think I can't really do much about the kernel launch itself, but of the calls that happen every frame, cudaMemcpyAsync() seems to be taking the longest.
I have also tried to use pinned memory and cudaHostGetDevicePointer() as described here; however, for some reason this increases the kernel launch times even more, more than making up for the time saved by not needing the memcpy.
I guess there has to be a better/faster way to update my single float variable on the GPU?
The easiest way is to add an extra parameter to the simulation kernel as a plain float value rather than a pointer to float, so that the data travels directly in the kernel launch parameter block that CUDA sends to the GPU when you launch the kernel. Then you avoid the separate copy command altogether. (I'm assuming CUDA packs the whole kernel parameter data into a single copy command, because the kernel parameter space is limited to a few kB or less.)
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           floatVariable); // pass the float by value, not as a pointer
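A sketch of what this could look like (the launch configuration and exact parameter list are assumptions; the profiler output above suggests the last parameter was previously a float*):
// The kernel takes the per-frame value directly as a launch argument.
__global__ void simulation(float3* foo, float* bar, float3* fooBar, float frameValue)
{
    // frameValue arrives with the kernel launch parameters; no separate cudaMemcpy is needed.
    // ... particle update using frameValue ...
}

// Host side, once per frame (numBlocks / threadsPerBlock are placeholders):
simulation<<<numBlocks, threadsPerBlock>>>(fooPointer, barPointer, fooBarPointer, floatVariable);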
Or try double buffering between data update and rendering, or between data update and compute, so that the simulation image trails the simulation calculation by 1-2 frames (per-frame latency gets worse), but "frames per second" increases.
If it's not an interactive simulation, hiding compute/render/data latencies by double or triple buffering should work.
If you are after minimizing per-frame timing (quicker response to user input into the simulation?), then you should embed the float variable at the end of an array that you already send to / use in the simulation, or whatever structure you are using. If you already have a 1 MB+ float buffer to send to the GPU, then appending 4 B (one float) to the end of it should not make much difference, and you can access the value from there. One copy operation should be faster than two copy operations of the same total size.
If you are literally sending just 4 B to the GPU each frame (with a simple function to generate that data), then (as 3Dave said in the comments) you can try adding an extra kernel that updates the value on the GPU, paying only the kernel launch overhead instead of both the copy command overhead and the data copy overhead. On the positive side, that extra kernel's overhead might be hidden if there is a "graph" of kernels running each frame automatically, without enqueueing all of them again and again.
Here,
https://devblogs.nvidia.com/cuda-graphs/
The relevant part:
We are going to create a simple code which mimics this pattern. We will then use this to demonstrate the overheads involved with the standard launch mechanism and show how to introduce a CUDA Graph comprising the multiple kernels, which can be launched from the application in a single operation.
cudaGraphLaunch(instance, stream);
They say the per-kernel launch overhead with this "graph" feature is only 3-4 microseconds when there are many (20) kernels in the algorithm.
Since graphs support other commands too, you can try running the copy and compute parts in parallel CUDA streams within a graph, and switch their inputs with double buffering, so that all CUDA work stays within CUDA's context before the output is sent to rendering.
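A minimal sketch of capturing the per-frame work into a graph and replaying it with the stream-capture API (the kernels, pointers and launch configurations are placeholders; cudaGraphInstantiate is shown with the older 5-argument form used in the linked post, newer toolkits take a flags argument instead):
cudaStream_t stream;
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStreamCreate(&stream);

// Record one frame's worth of work into a graph (done once).
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
updateValue<<<1, 1, 0, stream>>>(devValuePtr);   // tiny placeholder update kernel
simulation<<<numBlocks, threadsPerBlock, 0, stream>>>(fooPointer, barPointer, fooBarPointer, devValuePtr);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

// Every frame: replay the whole captured sequence with a single launch call.
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);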
(Maybe) you don't even have to change the data mechanism at all. Just try sending the float's binary representation as the pointer value, and in the kernel read only the pointer value itself (not the data it points to) and convert it back to a float. I don't know whether CUDA returns an error for this, as long as you don't try to reach the (wrong) pointer address that the float data represents in the kernel.
simulation(fooPointer,
           barPointer,
           fooBarPointer,
           toPtr(floatData)); // <----- float packed into a 64/32-bit pointer value
and in the kernel
float val = fromPtrToFloat(parameter4); // converts pointer itself, not the data
But this may not be a preferred practice if you can simply use "value" type parameters.
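For completeness, a hedged sketch of what toPtr / fromPtrToFloat might look like (these helpers are invented for illustration, and as noted above this is not preferred practice):
#include <cstdint>
#include <cstring>

// Host side: smuggle the float's bit pattern inside a pointer-sized value.
static float* toPtr(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // copy the raw bits of the float
    return reinterpret_cast<float*>(static_cast<std::uintptr_t>(bits));
}

// Device side: recover the float from the pointer's numeric value; never dereference it.
__device__ float fromPtrToFloat(const float* p)
{
    return __uint_as_float((unsigned int)(size_t)p);
}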

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the read instructions appear in the program for instructions that are independent of each other.
However, do we have a guarantee that (dependent) memory operations, as seen from a single thread, are actually consistent? If I do, say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is: what optimizations (done either by the compiler or by the hardware) are allowed?
It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is serviced by the memory controller faster than the already queued-up store.
I don't know the CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which threads are executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
Here the point being that if you have several threads doing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before thread #10.
Take an example:
__device__ void sum_all(float *x, float *result, int N){
    x[threadIdx.x] = threadIdx.x;       // each thread writes its own index into x
    result[threadIdx.x] = 0;
    for(int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];    // sum whatever has been written into x so far
}
Here we have some dumb function which SHOULD fill a shared array (x) with the numbers from m ... n (one number per thread), then sum up the numbers already put into the array and store the result in another array.
Given that your lowest-indexed thread is enumerated thread #0, you would expect that the first time this code runs, x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
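(For what it's worth: within a single thread block you could make the example deterministic by putting a barrier between the writes and the reads. A sketch, assuming the kernel is launched with one block of N threads:)
__global__ void sum_all_synced(float *x, float *result, int N)
{
    x[threadIdx.x] = threadIdx.x;
    result[threadIdx.x] = 0;
    __syncthreads();                    // all writes to x by this block finish before any thread reads
    for (int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];
}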
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically what you are asking is whether running your code on a GPU will change the functionality of your code.
The cores of a GPU are generally the same as those of a CPU, just with less control logic, a smaller instruction set and typically only single-precision support.
In a CUDA GPU there is one program counter for each warp (a section of 32 synchronous cores). Like on a CPU, the program counter advances by one instruction after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler.
So in short: your code will always be executed in the order it is laid out in memory, no matter whether it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understand, you're basically asking whether memory dependencies and alias-analysis information are respected by the CUDA compiler.
The answer to that question is: assuming that the CUDA compiler is free of bugs, yes, because (as Robert noted) the CUDA compiler uses LLVM under the hood, and two basic passes (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even out of block scope) to avoid dangerous optimizations (e.g. you can't write to a live variable before its next read; the data may still be needed).
I don't know the compiler internals, but assuming (as with any other reasonably trusted compiler) that it does its best to be bug-free, the analyses that take place there should really not bother you at all, and assure you that, at least in theory, what you presented as an example (i.e. the dependent load completing faster than the store) cannot happen.
What guarantees you that? Nothing but the fact that the company is shipping a compiler for you to use, and there are disclaimers in case it doesn't work in exceptional cases :)
Also, aside from the compiler topic, instruction execution also depends on the hardware specification. In this case, a SIMT hardware instruction issue unit.
cfr. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t explain much about the way it works. What I think does matter, though, is that it uses 3-dimensional arrays whose three dimensions are all dynamically allocated.
I use VS2010 with CUDA 5.0, and my device is compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory using the Data structure present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don’t forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this works perfectly: the “MSE” is improving. For those who have no clue what this algorithm does, let’s simply say that it learns how to predict a target value based on 14 variables present in the patterns. The “MSE” decreases, meaning that the algorithm makes fewer mistakes after each “epoch”.
I spent a really long time trying to run this code on the device, and I’m still unsuccessful. My last attempt was done by simply copying the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. This code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the “w” array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device memory, and the data are sent to device memory after they have been loaded from adult.data into host memory
I think I did the minimal amount of modification needed to make the code run in a kernel. The “kernel_check_learningData” kernel shows some information about the patterns loaded into device memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into device memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when beginning the forward phase.
the CUDA API is not behaving like it should!
I’ve been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, in a 64 bit machine the two types of storage are the same size, so it ends up working. But in a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, in a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
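In other words, the presumable fix is to allocate double storage for layer 0 as well, just like the other layers:
a[0] = (double *)malloc( N[0] * sizeof(double) );   // allocate doubles, not pointer storage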
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.

Synchronization in CUDA

I read the CUDA reference manual about synchronization in CUDA, but I don't understand it clearly. For example, why do we use cudaDeviceSynchronize() or __syncthreads()? If we don't use them, what happens; will the program not work correctly? What is the difference between cudaMemcpy and cudaMemcpyAsync in action? Can you show an example that demonstrates this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
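For instance, a common debugging/timing pattern looks roughly like this (the kernel here is just a placeholder):
#include <cstdio>
#include <cuda_runtime_api.h>

__global__ void add_one(int* p) { *p += 1; }     // placeholder kernel

int main()
{
    int* d_p = nullptr;
    cudaMalloc((void**)&d_p, sizeof(int));
    add_one<<<1, 1>>>(d_p);                      // the launch is asynchronous and returns immediately
    cudaError_t err = cudaDeviceSynchronize();   // the CPU blocks here until all pending GPU work is done
    if (err != cudaSuccess)
        std::printf("GPU error: %s\n", cudaGetErrorString(err));
    cudaFree(d_p);
    return 0;
}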
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that will operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load its appropriate shared memory location(s) before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
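A minimal sketch of that shared-memory pattern (the block size and the operation are illustrative only):
// Reverse each block-sized tile of 'in' into 'out' via shared memory.
__global__ void reverse_tile(const float* in, float* out)
{
    __shared__ float tile[256];                     // assumes blockDim.x == 256
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[gid];                    // each thread loads one element
    __syncthreads();                                // wait until the whole tile is populated

    // Only after the barrier is it safe to read elements loaded by other threads.
    out[gid] = tile[blockDim.x - 1 - threadIdx.x];
}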
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
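A short sketch of the two calls side by side (this assumes the host buffer is pinned, so the asynchronous copy can actually overlap with host work):
#include <cuda_runtime_api.h>

void copy_both_ways(float* d_buf, float* h_pinned, size_t bytes)
{
    // Blocking: issued to the default stream, returns only after the copy has completed.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    // Asynchronous: enqueued on a stream, returns to the calling thread immediately.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // ... other host work can overlap with the copy here ...
    cudaStreamSynchronize(stream);   // wait for the async copy when the result is needed
    cudaStreamDestroy(stream);
}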
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have some conditional paths in your code and then want to run an operation that depends on several array elements.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reach here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer, based on the if condition. When considering the last line of the example, you need to make sure that all the threads have finished the if-condition and updated myarray correctly. If this were not the case, y might use a mix of updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

CUDA: Move content of a volume texture using only one kernel and a threadfence

I want to move the content of a volume texture along the vector vecShift. I think of a kernel like this:
__global__ void
moveVolume(int* vecShift)
{
// Determine position of current voxel as ptDest
// Determine position of voxel we copy the content from as ptSrc
// Read value at ptSrc and store it to voxelColor
// __threadfence()
// Write voxelColor to voxel at position ptDest
}
The threadfence will ensure that ALL voxels have read the contents of their "partner" and that there will be no write to ptDest before every voxel has done the read operation, won't it?
If this is true, why do I (sometimes) get artifacts of a blurry kind? Or do I have a wrong understanding of how threadfence works?
As talonmies explains in the comments, using __threadfence() here is neither necessary nor sufficient. __threadfence() does not provide global barrier synchronization, it simply ensures that before the thread that calls __threadfence() proceeds, all writes by that thread before the fence are visible to all other active threads in the kernel launch.
What you really want here is to double buffer your volume data (i.e. write to a different array than you read). You cannot overwrite other parts of the array unless you can guarantee that they are only read by other threads in the same thread block. Otherwise you have a race condition and your program is incorrect.
Note: even in a sequential (CPU) implementation, you would need to double buffer your data for this type of computation!
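A sketch of what the double-buffered version could look like, with two plain device buffers standing in for the volume (the names, indexing and the clamping at the border are illustrative choices, not the asker's actual code):
// Read every voxel from 'src' at the shifted position and write it to 'dst'.
__global__ void moveVolume(const float* src, float* dst, int3 dim, int3 vecShift)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= dim.x || y >= dim.y || z >= dim.z) return;

    // Source voxel: destination position shifted back by vecShift, clamped to the volume.
    int sx = min(max(x - vecShift.x, 0), dim.x - 1);
    int sy = min(max(y - vecShift.y, 0), dim.y - 1);
    int sz = min(max(z - vecShift.z, 0), dim.z - 1);

    dst[(z * dim.y + y) * dim.x + x] = src[(sz * dim.y + sy) * dim.x + sx];
}
// The host swaps src and dst between launches, so reads never race with writes.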
What you are implementing is very similar to an advection kernel, as would be used in fluid dynamics simulations, and I'm sure there are multiple examples of what you want on the web (parallel or sequential).
Mark