cudaMemcpyToSymbol giving cudaErrorUnknown [duplicate]

As stated in other questions, and according to the linked documentation, you can no longer pass the symbol name as a string to this function. Now that this feature is gone, when would you ever want to use this over cudaMemcpy? When would you ever want to use it at all? What is the trade-off or benefit?
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g9bcf02b53644eee2bef9983d807084c7

In short, because cudaMemcpy can't do the same thing as cudaMemcpyToSymbol without an additional API call. Consider a constant memory array:
__constant__ float coeffs[8];
To copy values to this array using cudaMemcpyToSymbol, just do
cudaMemcpyToSymbol(coeffs, hostData, 8*sizeof(float));
To do the same with cudaMemcpy requires this:
float *dcoeffs;
cudaGetSymbolAddress((void **)&dcoeffs, coeffs);
cudaMemcpy(dcoeffs, hostData, 8*sizeof(float), cudaMemcpyHostToDevice);
A direct call to cudaMemcpy using coeffs is illegal without the prior symbol lookup, because coeffs is not a valid device address in host code.
[standard disclaimer: all code written in browser without access to documentation or compiler, use at own risk]
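For completeness, here is a minimal self-contained sketch showing both approaches side by side (the hostData contents are made up for illustration, and error checking is reduced to a final query):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ float coeffs[8];

int main()
{
    float hostData[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    // Approach 1: single call, the runtime resolves the symbol itself.
    cudaMemcpyToSymbol(coeffs, hostData, 8 * sizeof(float));

    // Approach 2: explicit symbol lookup, then a plain cudaMemcpy.
    float *dcoeffs;
    cudaGetSymbolAddress((void **)&dcoeffs, coeffs);
    cudaMemcpy(dcoeffs, hostData, 8 * sizeof(float), cudaMemcpyHostToDevice);

    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}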


Is it safe to use cudaHostRegister on only part of an allocation?

I have a C++ class container that allocates, let's say, 1GB of memory holding plain objects (e.g. built-in types).
I need to copy part of the object to the GPU.
To accelerate and simplify the transfer I want to register the CPU memory as non-pageable ("pinned"), e.g. with cudaHostRegister(void*, size, ...), before copying.
(This seems to be a good way to copy further subsets of the memory with minimal logic, for example when a plain cudaMemcpy is not enough.)
Is it safe to pass a pointer that points to only part of the originally allocated memory, for example a contiguous 100MB subset of the original 1GB?
I may want to register only part of the allocation for efficiency, but also because deep down in the call trace I may have lost track of the originally allocated pointer.
In other words, can the pointer argument to cudaHostRegister be something other than a pointer returned by an allocation function? In particular, can it be an arithmetic result derived from allocated memory, as long as it stays within the allocated range?
It seems to work, but I don't understand whether, in general, "pinning" part of an allocation can somehow corrupt the allocated block.
UPDATE: My concern is that allocation is actually mentioned in the documentation for the cudaHostRegister flag options:
cudaHostRegisterDefault: On a system with unified virtual addressing, the memory will be both mapped and portable. On a system with no unified virtual addressing, the memory will be neither mapped nor portable.
cudaHostRegisterPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostRegisterMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostRegisterIoMemory: The passed memory pointer is treated as pointing to some memory-mapped I/O space, e.g. belonging to a third-party PCIe device, and it will be marked as non cache-coherent and contiguous.
cudaHostRegisterReadOnly: The passed memory pointer is treated as pointing to memory that is considered read-only by the device. On platforms without cudaDevAttrPageableMemoryAccessUsesHostPageTables, this flag is required in order to register memory mapped to the CPU as read-only. Support for the use of this flag can be queried from the device attribute cudaDeviceAttrReadOnlyHostRegisterSupported. Using this flag with a current context associated with a device that does not have this attribute set will cause cudaHostRegister to error with cudaErrorNotSupported.
This is a rule-of-thumb answer rather than a proper one:
When the CUDA documentation does not guarantee that something works, you'll need to assume it doesn't. Because even if it does happen to work - for you, right now, on the system you have - it might stop working in the future, or on another system, or in another usage scenario.
More specifically, memory pinning happens at page granularity, so unless the part you want to pin starts and ends on a physical page boundary, the CUDA driver will need to pin some more memory before and after the region you asked for. It could do that, but it would be going an extra mile to accommodate you, and I doubt that would happen without being documented.
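To make "page granularity" concrete, here is a small illustrative helper (not part of any CUDA API) that computes the page-aligned span a page-granular pinning mechanism would effectively have to cover for an arbitrary sub-range:

#include <unistd.h>
#include <cstdint>
#include <cstddef>

// Hypothetical helper: widen [p, p + n) to page boundaries, to show how
// much memory pinning the sub-range would actually touch.
void pageAlignedSpan(const void *p, std::size_t n,
                     std::uintptr_t &start, std::size_t &len)
{
    const std::uintptr_t page = (std::uintptr_t)sysconf(_SC_PAGESIZE);
    const std::uintptr_t addr = (std::uintptr_t)p;
    start = addr & ~(page - 1);                                    // round start down
    const std::uintptr_t end = (addr + n + page - 1) & ~(page - 1); // round end up
    len = end - start;
}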
I also suggest you file a bug report via developer.nvidia.com, asking that they clarify this point in the documentation. My experience is that there's... something like a 50% chance they'll do something about such a bug report.
Finally, you could just try it: write a program which copies to the GPU with and without pinning the partial region, and see whether there's a throughput difference.
Is it safe to pass a pointer that points to only part of the originally allocated memory, for example a contiguous 100MB subset of the original 1GB?
While I agree that the documentation could be clearer, I think the answer to the question is 'Yes'.
Here's why: the alternative interpretation would be that only whole memory regions returned by, say, malloc may be registered. However, this is unworkable, because malloc can, behind the scenes, allocate one big region and hand the user only parts of it. So even if you (the user) were cudaHostRegister-ing exactly the regions returned by malloc, they would actually be fragments of some bigger, previously allocated chunk of memory anyway.
By the way, Linux has a similar kernel call to lock memory called mlock. It accepts arbitrary memory ranges.
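For instance, a minimal sketch of mlock on an arbitrary sub-range (POSIX; the kernel rounds the range to page boundaries internally):

#include <sys/mman.h>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);
    // Lock a 4KB sub-range that starts at an arbitrary offset.
    if (mlock(buf.data() + 12345, 4096) != 0) return 1;
    munlock(buf.data() + 12345, 4096);
    return 0;
}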
One of the other answers claimed (until this test was posted):
If you need to copy the part-of-the-object just once to the GPU - there's no use in using cudaHostRegister(), because it will likely itself copy the data, physically, elsewhere - so you won't be saving anything
But this is incorrect: registering is worth it if the chunk of memory being copied is big enough, even if the copy happens only once. I'm seeing about a 2x speed-up with this code compared with the indicated line commented out, or about a 50% speed-up if the unregistering is also done between the timers.
#include <chrono>
#include <iostream>
#include <vector>
#include <cuda_runtime_api.h>

int main()
{
    std::size_t giga = 1024 * 1024 * 1024;
    std::vector<char> src(giga, 3);
    char* dst = nullptr;
    if (cudaMalloc((void**)&dst, giga)) return 1;
    cudaDeviceSynchronize();

    auto t0 = std::chrono::system_clock::now();
    // Register (pin) a 128MB sub-range starting in the middle of the vector.
    if (cudaHostRegister(src.data() + src.size() / 2, giga / 8, cudaHostRegisterDefault)) return 1; // comment out this line
    if (cudaMemcpy(dst, src.data() + src.size() / 2, giga / 8, cudaMemcpyHostToDevice)) return 1;
    cudaDeviceSynchronize();
    auto t1 = std::chrono::system_clock::now();

    auto d = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << (d / 1e6) << " seconds" << std::endl;

    // Un-register and free.
    cudaHostUnregister(src.data() + src.size() / 2);
    cudaFree(dst);
}

CUDA: cublas: exception raised at dot product [duplicate]

I need to find the index of the maximum element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, and this is slowing down the running time of the application.
Is there a way to compute this index efficiently and store it in the GPU?
Thanks!
Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), routines which return a scalar or an index can store the result directly into a variable in device memory, rather than into a host variable (which entails a device-to-host transfer and might leave the result in the wrong memory space).
To use this, you call cublasSetPointerMode with CUBLAS_POINTER_MODE_DEVICE to tell the CUBLAS context to expect scalar and index arguments to be device pointers. This then implies that in a call like
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
                            const float *x, int incx, int *result)
result must be a device pointer.
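As an illustration, here is a minimal sketch of that pattern (d_x and d_result are assumed to be existing device allocations; error checking omitted):

#include <cublas_v2.h>

void argmaxOnDevice(const float *d_x, int n, int *d_result)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Scalar and index arguments are now expected to be device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    // Writes the 1-based index of the max-magnitude element straight
    // into d_result in device memory; no device-to-host transfer occurs.
    cublasIsamax(handle, n, d_x, 1, d_result);

    cublasDestroy(handle);
}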
If you want to use CUBLAS and you have a GPU with compute capability 3.5 (K20, Titan), then you can use CUBLAS with dynamic parallelism: you can call CUBLAS from within a kernel on the GPU, and no data will be returned to the CPU.
If you have no device with cc 3.5 you will probably have to implement a find-max function yourself or look for an additional library.

How to make an atomic device function in CUDA?

My kernel writes its results to some global device variables, so I need to make those writes atomic.
Is that possible?
If it is not, I am trying to use atomicExch() on every global variable, but some of them are structs or floats, not ints, and as far as I know atomic operations are only for ints.
How can I deal with this problem?
Thanks.
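As a side note on the float part of the question: CUDA does provide float overloads of atomicExch() and atomicAdd(), and operations that lack a float overload can be emulated with an atomicCAS() loop on the value's bit pattern. A minimal sketch of the standard float-max idiom (not from the original answers):

// Retry with atomicCAS on the int bit pattern until no other thread
// has modified the value in between.
__device__ float atomicMaxFloat(float *addr, float value)
{
    int *addrAsInt = (int *)addr;
    int old = *addrAsInt, assumed;
    do {
        assumed = old;
        old = atomicCAS(addrAsInt, assumed,
                        __float_as_int(fmaxf(value, __int_as_float(assumed))));
    } while (assumed != old);
    return __int_as_float(old);
}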
I found the reason why the cudaErrorLaunchOutOfResources error was raised:
my kernel used 28 registers, but the project setting did not cover that.
CUDA C/C++ -> Device -> Max Used Register was set to 0; after changing it to 30, the error disappeared.

can cudaMemcpy accept variables from device as its argument?

cudaMemcpy(dst, src, filesize, cudaMemcpyDeviceToHost);
Where filesize is a variable stored in device global memory.
The simple answer is no.
The argument is passed by value, meaning its value must be known in host code. Therefore you need a first call to cudaMemcpy() to fetch the size, and a second call to cudaMemcpy() to perform the actual copy, as sketched below.
If you're using Thrust vectors you can just read the element in the host code, but that's because Thrust handles the copy for you.
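A minimal sketch of that two-step pattern (d_data and d_filesize are assumed to be existing device allocations):

#include <cstddef>
#include <cuda_runtime.h>

void copyBack(char *hostBuf, const char *d_data, const std::size_t *d_filesize)
{
    // Step 1: bring the size itself into host memory.
    std::size_t filesize = 0;
    cudaMemcpy(&filesize, d_filesize, sizeof(filesize), cudaMemcpyDeviceToHost);

    // Step 2: the byte count is now a host value and can be used normally.
    cudaMemcpy(hostBuf, d_data, filesize, cudaMemcpyDeviceToHost);
}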

why do we have to pass a pointer to a pointer to cudaMalloc

The following codes are widely used for GPU global memory allocation:
float *M;
cudaMalloc((void**)&M,size);
I wonder why we have to pass a pointer to a pointer to cudaMalloc, and why it was not designed like this:
float *M;
cudaMalloc((void*)M,size);
Thanks for any plain descriptions!
cudaMalloc needs to write the allocated address into M itself (not into *M), so M must be passed by reference.
Another way would be to return the pointer in the classic malloc fashion. Unlike malloc, however, cudaMalloc returns an error status, like all CUDA runtime functions.
To explain the need in a little more detail:
Before the call to cudaMalloc, M points... anywhere; its value is undefined. After the call you want a valid array to be present at the memory location it points to. One could naïvely say "then just allocate the memory at that location", but that is of course not possible in general: the undefined address will usually not even be inside valid memory. cudaMalloc needs to be able to choose the location, but if the pointer is passed by value there is no way to tell the caller where that location is.
In C++, one could make the signature
template<typename PointerType>
cudaError_t cudaMalloc(PointerType& ptr, size_t size);
where passing ptr by reference allows the function to modify the caller's pointer, but since cudaMalloc is part of the CUDA C API this is not an option. The only way to pass something as modifiable in C is to pass a pointer to it, and since the object is itself a pointer, what you need to pass is a pointer to a pointer.
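To make the by-value failure concrete, here is a small hypothetical analogue in plain C style (myMalloc merely stands in for cudaMalloc):

#include <cstdlib>

// The output pointer is passed by address, so the function can write
// the freshly allocated location through it.
int myMalloc(void **out, std::size_t size)
{
    *out = std::malloc(size);
    return *out == nullptr;  // nonzero status on failure, like CUDA
}

// A hypothetical int brokenMalloc(void *out, std::size_t size) would only
// receive a copy of the caller's pointer; assigning to it inside the
// function would change nothing in the caller.

int main()
{
    float *M = nullptr;
    myMalloc((void **)&M, 100 * sizeof(float)); // M now points at the buffer
    std::free(M);
    return 0;
}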