Do CUDA 3D memory copy parameters need to be kept alive?

Consider the CUDA API function
CUresult cuMemcpy3DAsync (const CUDA_MEMCPY3D* pCopy, CUstream hStream);
described here.
It takes a CUDA_MEMCPY3D structure by pointer, and this pointer is not to some CUDA-driver-created entity - it points to a structure the user has created.
My question: Do we need to keep the pointed-to structure alive past the call to this function returning? e.g. until after we've synchronized the stream we've enqueued the copy on? Or - can we just discard it immediately? I'm guessing it should be the former, but the documentation doesn't really say.
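To make the pattern concrete, here is a minimal sketch of the usage being asked about (driver API, error checking omitted; h_src, d_dst, width, height, depth and hStream are placeholders assumed to be set up elsewhere, not taken from the question):
{
  CUDA_MEMCPY3D copyParams = {};            // user-created, lives on the stack
  copyParams.srcMemoryType = CU_MEMORYTYPE_HOST;
  copyParams.srcHost       = h_src;
  copyParams.srcPitch      = width;
  copyParams.srcHeight     = height;
  copyParams.dstMemoryType = CU_MEMORYTYPE_DEVICE;
  copyParams.dstDevice     = d_dst;
  copyParams.dstPitch      = width;
  copyParams.dstHeight     = height;
  copyParams.WidthInBytes  = width;
  copyParams.Height        = height;
  copyParams.Depth         = depth;

  cuMemcpy3DAsync(&copyParams, hStream);
}   // copyParams goes out of scope here - is that safe before the stream is synchronized?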

Related

Cuda Async MemCpy behavior when 'lapping' itself

Say you have a fairly typical sequence of async compute:
1. Generate data on host, do other CPU stuff
2. Put data in async stream to device (via pinned memory)
3. Compute data in (same) async stream
Now if you're running in a loop, the host will loop back to step 1 as 2&3 are non-blocking for the host.
The question is:
"What happens for the host if it gets to step 2 again and the system isn't yet done transferring data to the device?"
Does the host MemCpyAsync block until the previous copy is complete?
Does it get launched like normal with the outbound data being put in
a buffer?
If the latter, presumably this buffer can run out of space if your host is running too fast wrt the device operations?
I'm aware that modern devices have multiple copy engines, but I'm not sure if those would be useful for multiple copies on the same stream and to the same place.
I get that a system that ran into this wouldn't be a well designed one - asking as a point of knowledge.
This isn't something I have encountered in code yet - looking for any pointers to documentation on how this behavior is supposed to work. Already looked at the API page for the copy function and async behavior and didn't see anything I could recognize as relevant.
Does the host MemCpyAsync block until the previous copy is complete?
No, generally not, assuming by "block" you mean block the CPU thread. Asynchronous work items issued to the GPU go into a queue, and control is immediately returned to the CPU thread, before the work has begun. The queue can hold "many" items. Issuance of work can proceed until the queue is full without any other hindrances or dependencies.
It's important to keep in mind one of the two rules of stream semantics:
Items issued into the same stream execute in issue order. Item B, issued after item A, will not begin until A has finished.
So let's say we had a case like this (and assume h_ibuff and h_obuff point to pinned host memory):
cudaStream_t stream;
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, bufsize, cudaMemcpyHostToDevice, stream); // bufsize = transfer size in bytes
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, bufsize, cudaMemcpyDeviceToHost, stream);
}
On the second pass of the loop, the cudaMemcpyAsync operations will be inserted into a queue, but will not begin to execute (or do anything, really) until stream semantics say they can begin. This is really true for each and every op issued by this loop.
A reasonable question might be "what if on each pass of the loop, I wanted different contents in h_ibuff?" (quite sensible). Then you would need to address that specifically. Inserting a simple memcpy operation, by itself, to "reload" h_ibuff isn't going to work. You'd need some sort of synchronization. For example, you might decide that you wanted to "refill" h_ibuff while the kernel and subsequent cudaMemcpyAsync D->H operation are happening. You could do something like this:
cudaStream_t stream;
cudaEvent_t event;
cudaEventCreate(&event);
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(event, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, obufsize, cudaMemcpyDeviceToHost, stream); // obufsize = output transfer size in bytes
  cudaEventSynchronize(event);
  memcpy(h_ibuff, databuff+i*chunksize, chunksize); // "refill"
}
This refactoring allows the asynchronous work to be issued so as to keep the GPU busy, and overlaps the host-side copy that "refills" h_ibuff with that work. It also prevents the "refill" operation from beginning until the previous buffer contents are safely transferred to the device, and prevents the next buffer copy from beginning until the new contents are "reloaded".
This isn't the only way to do this; it's one possible approach.
For this last question asked/answered above, you might ask a similar question: "how about handling the output buffer side?" The mechanism could be very similar; a possible sketch follows, with the details left to the reader.
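For illustration only, here is one possible sketch using a second event to gate consumption of the output buffer; the names obufsize and resultbuff are placeholders not taken from the question, and this is certainly not the only reasonable arrangement:
cudaStream_t stream;
cudaEvent_t inEvent, outEvent;
cudaEventCreate(&inEvent);
cudaEventCreate(&outEvent);
cudaStreamCreate(&stream);
for (int i = 0; i < frames; i++){
  cudaMemcpyAsync(d_ibuff, h_ibuff, chunksize, cudaMemcpyHostToDevice, stream);
  cudaEventRecord(inEvent, stream);
  kernel<<<...,stream>>>(...);
  cudaMemcpyAsync(h_obuff, d_obuff, obufsize, cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(outEvent, stream);
  cudaEventSynchronize(inEvent);                       // previous input is safely on the device
  memcpy(h_ibuff, databuff+i*chunksize, chunksize);    // "refill" the input buffer
  cudaEventSynchronize(outEvent);                      // this frame's output has arrived
  memcpy(resultbuff+i*obufsize, h_obuff, obufsize);    // "drain" the output buffer
}
Waiting on outEvent inside the loop keeps the sketch simple but prevents the host from running more than one frame ahead; double-buffering h_ibuff and h_obuff would allow more overlap at the cost of extra bookkeeping.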
For structured learning on this topic, you might wish to study the CUDA concurrency section of this lecture series.

Global device variables in CUDA: bad practice?

I am designing a library that has a large contingent of CUDA kernels to perform parallel computations. All the kernels will be acting on a common object, say a computational grid, which is defined using C++ style objects. The computational domain doesn't necessarily need to be accessed from the host side, so creating it on the device side and keeping it there makes sense for now. I'm wondering if the following is considered "good practice":
Suppose my computational grid class is called Domain. First, I define a global device-side variable to store the computational domain:
__device__ Domain* D;
Then I initialize the computational domain using a CUDA kernel:
__global__ void initDomain(paramType P){
  D = new Domain(P);
}
Then, I perform computations using this domain with other kernels:
__global__ void doComputation(double *x, double *y){
  D->doThing(x,y);
  //...
}
If my domain remains fixed (i.e. kernels don't modify the domain once it's created), is this OK? Is there a better way? I initially tried creating the Domain object on the host side and copying it over to the device, but this turned out to be a hassle because Domain is a relatively complex type that makes it a pain to copy over using e.g. cudaMemcpy or even thrust::device_new (at least, I couldn't get it to work nicely).
Yes, it's OK.
You may be able to improve performance by using
__constant__
With this qualifier, your object will be available to all your kernels in very fast, cached memory.
In order to copy your object there, you must use cudaMemcpyToSymbol. Please note there are some restrictions: your object will be read-only in your device code, and its type must not require dynamic initialization (e.g. a non-trivial default constructor).
You can find more information here.
If your object is complex and hard to copy, you might also look at unified memory, and then just pass your variable directly to your kernel.
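For illustration, here is a minimal sketch of the __constant__ route. DomainData and its fields are hypothetical stand-ins for a trivially copyable snapshot of Domain, since __constant__ variables cannot be dynamically initialized:
// Hypothetical POD snapshot of the Domain state (no constructors, no owned pointers).
struct DomainData {
  int    nx, ny, nz;
  double dx, dy, dz;
};

__constant__ DomainData d_domain;  // read-only from device code, lives in cached constant memory

__global__ void doComputation(double *x, double *y){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int n = d_domain.nx * d_domain.ny * d_domain.nz;
  if (i < n)
    y[i] = d_domain.dx * x[i];     // every thread reads the domain from constant memory
}

void setupDomain(const DomainData &hostDomain){
  // Copy the host-side object into the __constant__ variable by symbol.
  cudaMemcpyToSymbol(d_domain, &hostDomain, sizeof(DomainData));
}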

Given a pointer to a __global__ function, can I retrieve its name?

Suppose I have a pointer to a __global__ function in CUDA. Is there a way to programmatically ask CUDART for a string containing its name?
I don't believe this is possible by any public API.
I have previously tried poking around in the driver itself, but that doesn't look too promising. The compiler-emitted code for the <<< >>> kernel invocation clearly registers the mangled function name with the runtime via __cudaRegisterFunction, but I couldn't see any obvious way to perform a lookup by name/value in the runtime library. The driver API equivalent cuModuleGetFunction leads to an equally opaque type from which it doesn't seem possible to extract the function name.
Edited to add:
The host compiler itself doesn't support reflection, so there are no obvious fancy language tricks that could be pulled at runtime. One possibility would be to add another preprocessor pass to the compilation trajectory to build a static kernel function lookup table before the final build. That would be rather a lot of work, but it could be done, at least for "classic" compilation where everything winds up in a single translation unit.
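For what it's worth, here is a hand-rolled illustration of the lookup-table idea. The REGISTER_KERNEL macro and kernelName helper are invented for this sketch and are not part of CUDART:
#include <cstdio>
#include <map>
#include <string>

// Hand-maintained registry: maps a kernel's host-side function pointer to the
// name it was registered under. Nothing here interrogates the CUDA runtime.
static std::map<const void*, std::string> g_kernelNames;
#define REGISTER_KERNEL(k) (g_kernelNames[(const void*)(k)] = #k)

__global__ void myKernel(int *data){ if (data) data[0] = 42; }

const char* kernelName(const void *fptr){
  auto it = g_kernelNames.find(fptr);
  return it != g_kernelNames.end() ? it->second.c_str() : "<unknown kernel>";
}

int main(){
  REGISTER_KERNEL(myKernel);
  printf("%s\n", kernelName((const void*)myKernel));  // prints "myKernel"
  return 0;
}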

cudaMemcpy() vs cudaMemcpyFromSymbol()

I'm trying to figure out why cudaMemcpyFromSymbol() exists. It seems that everything the 'symbol' function can do, the non-symbol functions can do too.
The symbol function appears to make it easy for part of an array or index to be moved, but this could just as easily be done with the non-symbol function. I suspect the non-symbol approach will run faster, as there is no symbol lookup needed. (It is not clear whether the symbol lookup calculation is done at compile time or run time.)
Why would I use cudaMemcpyFromSymbol() vs cudaMemcpy()?
cudaMemcpyFromSymbol is the canonical way to copy from any statically defined variable in device memory.
cudaMemcpy can't be directly used to copy to or from a statically defined device variable because it requires a device pointer, and that isn't known to host code at runtime. Therefore, an API call which can interrogate the device context symbol table is required. The two choices are either cudaMemcpyFromSymbol, which does the symbol lookup and copy in one operation, or cudaGetSymbolAddress, which returns an address that can be passed to cudaMemcpy. The former is probably more efficient if you only want to do one copy, the latter if you want to use the address multiple times in host code.
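As a sketch of the two options (the variable names here are invented for the example):
#include <cstdio>

__device__ float d_scale;   // statically defined device variable (the "symbol")

int main(){
  float h_scale = 2.5f;
  float readback = 0.0f;

  // Option 1: symbol lookup and copy in one call.
  cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));
  cudaMemcpyFromSymbol(&readback, d_scale, sizeof(float));

  // Option 2: resolve the address once, then use plain cudaMemcpy as often as needed.
  float *d_scale_ptr = nullptr;
  cudaGetSymbolAddress((void**)&d_scale_ptr, d_scale);
  cudaMemcpy(d_scale_ptr, &h_scale, sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(&readback, d_scale_ptr, sizeof(float), cudaMemcpyDeviceToHost);

  printf("%f\n", readback);  // 2.500000
  return 0;
}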

Advantage/Disadvantage of function pointers

So the problem I am having has not actually happened yet. I am planning out some code for a game I am currently working on, and I know that I am going to need to conserve memory usage as much as possible from step one. My question is: if I have, for example, 500k objects that will need to constantly be constructed and destructed, would it save me any memory to have the functions those classes are going to use be function pointers?
e.g. without function pointers:
class MyClass{
public:
  void Update();
  void Draw();
  ...
};
e.g. with function pointers:
class MyClass{
public:
  void *Update;
  void *Draw;
  ...
};
Would this save me any memory, or would any new creation of MyClass just access the same functions that were defined for the rest of them? If it does save me any memory, would it be enough to be worthwhile?
Assuming those are not virtual functions, you'd use more memory with function pointers.
The first example
There is no allocation (beyond the base amount required to make new return unique pointers, or your additional implementation that you ... ellipsized).
Non-virtual member functions are basically static functions taking a this pointer.
The advantage is that your objects are really simple, and you'll have only one place to look to find the corresponding code.
The disadvantage is that you lose some flexibility, and have to cram all your code into single update/draw functions. This will be hard to manage for anything but a tiny game.
The second example
You allocate two extra pointers per object instance. Pointers are usually 4 to 8 bytes each (depending on your target platform and your compiler).
The advantage is that you gain a lot of flexibility: you can change the function being pointed to at runtime, and you can have a multitude of functions that implement it - thus supporting (but not guaranteeing) better code organization.
The disadvantage is that it will be harder to tell which function each instance will point to when you're debugging your application, or when you're simply reading through the code.
Other options
Using function pointers this specific way (instance data members) usually makes more sense in plain C, and less sense in C++.
If you want those functions to be bound at runtime, you may want to make them virtual instead. The cost is compiler/implementation dependent, but I believe it is (usually) going to be one v-table pointer per object instance.
Plus, you can take full advantage of virtual function syntax to dispatch based on type, rather than on whatever you happened to bind each pointer to. It will be easier to debug than the function pointer option, since you can simply look at the derived type to figure out what function a particular instance points to.
You also won't have to initialize those function pointers - the C++ type system would do the equivalent initialization automatically (by building the v-table, and initializing each instance's v-table pointer).
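As a rough illustration of the per-instance cost (the class names are invented; exact sizes depend on the platform and compiler):
#include <cstdio>

// Virtual functions: each instance carries a single v-table pointer, no matter
// how many virtual functions the class declares.
class Entity {
public:
  virtual ~Entity() {}
  virtual void Update() {}
  virtual void Draw()   {}
};

// Function-pointer alternative: each instance stores one pointer per function.
class EntityFP {
public:
  void (*Update)(EntityFP*);
  void (*Draw)(EntityFP*);
};

int main(){
  printf("sizeof(Entity)   = %zu\n", sizeof(Entity));    // typically 8 on a 64-bit build
  printf("sizeof(EntityFP) = %zu\n", sizeof(EntityFP));  // typically 16 on a 64-bit build
  return 0;
}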
See: http://www.parashift.com/c++-faq-lite/virtual-functions.html