Below, I have included a self contained example that uses cudaMemcpyFromSymbol() to retrieve the result from a kernel. The example passes the symbol parameter (the second parameter in the call), as a regular variable. However, as I understand the CUDA documentation, passing the parameter as a string, that is:
cudaMemcpyFromSymbol(&out, "out_d", sizeof(out_d), 0, cudaMemcpyDeviceToHost);
(with quotes around the symbol name), should also work. That does not work for me.
When would the symbol name work and when would the symbol name as a string work?
#include "cuda_runtime.h"
#include <stdio.h>
__device__ int out_d;
__global__ void test() {
out_d = 123;
}
int main() {
test<<<1,1>>>();
int out;
cudaMemcpyFromSymbol(&out, out_d, sizeof(out_d), 0, cudaMemcpyDeviceToHost);
printf("%d\n", out);
return 0;
}
Passing the symbol name as a string parameter was deprecated in CUDA 4.2 and the syntax was eliminated in cuda 5.0. The reasons had to do with enabling of separate device code linker capability, which functionality appeared in CUDA 5. For the cuda 5 toolkit, this change is documented in the release notes:
The use of a character string to indicate a device symbol, which was possible with certain API functions, is no longer supported. Instead, the symbol should be used directly.
Related
In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement this like here. With the new CUDA 8 RC, I run into troubles when I try to compile my code which includes such a function. I guess this is due to the fact that with Pascal and Compute Capability 6.0, a native double version of atomicAdd has been added, but somehow that is not properly ignored for previous Compute Capabilities.
The code below used to compile and run fine with previous CUDA versions, but now I get this compilation error:
test.cu(3): error: function "atomicAdd(double *, double)" has already been defined
But if I remove my implementation, I instead get this error:
test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
I should add that I only see this if I compile with -arch=sm_35 or similar. If I compile with -arch=sm_60 I get the expected behavior, i.e. only the first error, and successful compilation in the second case.
Edit: Also, it is specific for atomicAdd -- if I change the name, it works well.
It really looks like a compiler bug. Can someone else confirm that this is the case?
Example code:
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull = (unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
__global__ void kernel(double *a)
{
double b=1.3;
atomicAdd(a,b);
}
int main(int argc, char **argv)
{
double *a;
cudaMalloc(&a,sizeof(double));
kernel<<<1,1>>>(a);
cudaFree(a);
return 0;
}
Edit: I got an answer from Nvidia who recognize this problem, and here is what the developers say about it:
The sm_60 architecture, that is newly supported in CUDA 8.0, has
native fp64 atomicAdd function. Because of the limitations of our
toolchain and CUDA language, the declaration of this function needs to
be present even when the code is not being specifically compiled for
sm_60. This causes a problem in your code because you also define a
fp64 atomicAdd function.
CUDA builtin functions such as atomicAdd are implementation-defined
and can be changed between CUDA releases. Users should not define
functions with the same names as any CUDA builtin functions. We would
suggest you to rename your atomicAdd function to one that is not the
same as any CUDA builtin functions.
That flavor of atomicAdd is a new method introduced for compute capability 6.0. You may keep your previous implementation of other compute capabilities guarding it using macro definition
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
<... place here your own pre-pascal atomicAdd definition ...>
#endif
This macro named architecture identification macro is documented here:
5.7.4. Virtual Architecture Identification Macro
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
I assume NVIDIA did not place it for previous CC to avoid conflict for users defining it and not moving to Compute Capability >= 6.x. I would not consider it a BUG though, rather a release delivery practice.
EDIT: macro guard was incomplete (fixed) - here a complete example.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif
__device__ double s_global ;
__global__ void kernel () { atomicAdd (&s_global, 1.0) ; }
int main (int argc, char* argv[])
{
kernel<<<1,1>>> () ;
return ::cudaDeviceSynchronize () ;
}
Compilation with:
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
Command lines (both successful):
$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35
You may find why it works with the include file: sm_60_atomic_functions.h, where the method is not declared if __CUDA_ARCH__ is lower than 600.
I could not find anything in internet. Due to the fact that it is possible to use printf in a __device__ function I am wondering if there is a sprintf like function due to the fact that printf is "using" the result from sprintf to be displayed in stdout.
No there isn't anything built into CUDA for this.
Within CUDA the implementation of device printf is a special case and does not use the same mechanisms as the C library printf.
sprintf(), snprintf() and additional printf()-family functions are now available on the development branch of the CUDA Kernel Author's Toolkit, a.k.a. cuda-kat. Signatures:
namespace kat {
__device__ int sprintf(char* s, const char* format, ...);
__device__ int snprintf(char* s, size_t n, const char* format, ...);
}
... and they do exactly what you would expect. In particular, they support the C standard features which CUDA printf() does not, and then some (e.g. specifying a string argument's field width using an extra argument; format specifiers for size_t, and ptrdiff_t, and printing in base-2).
Caveat: I am the author of cuda-kat, so I'm biased...
Always prefer snprintf() which takes the buffer size oversprintf() which might overflow.
I'm doing a thrust transform_reduce and need to access a thrust::device_vector from within the functor. I am not iterating on the device_vector. It allows me to declare the functor, passing in the device_vector reference, but won't let me dereference it, either with begin() or operator[].
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include\thrust/detail/function.h(187): warning : calling a host function("thrust::detail::vector_base > ::operator []") from a host device function("thrust::detail::host_device_function ::operator () ") is not allowed
I assume I'll be able to pass in the base pointer and do the pointer math myself, but is there a reason this isn't supported?
Just expanding on what #JaredHoberock has already indicated. I think he will not mind.
A functor usable by thrust must (for the most part) conform to the requirements imposed on any CUDA device code.
Both thrust::host_vector and thrust::device_vector are host code containers used to manipulate host data and device data respectively. A reference to the host code container cannot be used successfully in device code. This means even if you passed a reference to the container successfully, you could not use it (i.e. could not do .push_back(), for example) in device code.
For direct manipulation in device code (such as functors, or kernels) you must extract raw device pointers from thrust and use those directly, with your own pointer arithmetic. And advanced functions (such as .push_back()) will not be available.
There are a variety of ways to extract the raw device pointer corresponding to thrust data, and the following example code demonstrates two possibilities:
$ cat t651.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
__global__ void printkernel(float *data){
printf("data = %f\n", *data);
}
int main(){
thrust::device_vector<float> mydata(5);
thrust::sequence(mydata.begin(), mydata.end());
printkernel<<<1,1>>>(mydata.data().get());
printkernel<<<1,1>>>(thrust::raw_pointer_cast(&mydata[2]));
cudaDeviceSynchronize();
return 0;
}
$ nvcc -o t651 t651.cu
$ ./t651
data = 0.000000
data = 2.000000
$
I have a __constant__ memory array holding information that is needed by many kernels, which are placed in different source files. This constant memory array is defined in the header GlobalParameters.h, which is #included by all files containing kernels that need to access to this array.
I just discovered (look at talonmies' answer) that __constant memory__ is only available in the translation unit where it is defined, unless you turn on separate compilation (with CUDA 5.0 or later).
I still do not get completely what this means for my case.
Assuming that I cannot turn on separate compilation, is there a way for dealing with my needs? Where should I place the definition of my constant memory array? What if I place it in my header, which is #included in many translation units?
Assuming I can turn on separate compilation, should I declare my __constant__ memory array in the header as extern and place the definition inside a source file (e.g. GlobalParameters.cu)?
One way to make constant memory available to translation units other than the one where it is declared, is to call cudaGetSymbolAddress() and make the pointer available to the other functions.
This technique is playing with fire to some degree, because if you use the pointer to write to the memory without appropriate barriers and synchronization, you may run afoul of the lack of coherency between constant memory and global memory.
Also, you may not get the full performance benefits of constant memory if you use this method. That should be less true on SM 2.x and later hardware - disassemble the object code and make sure the compiler is emitting "load uniform" instructions.
The below example assumes the possibility of using separate compilation. In this case, the below example shows how using extern to work with constant memory across different compilation units.
FILE kernel.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "Utilities.cuh"
__constant__ int N_GPU;
__constant__ float a_GPU;
__global__ void printKernel();
int main()
{
const int N = 5;
const float a = 10.466;
gpuErrchk(cudaMemcpyToSymbol(N_GPU, &N, sizeof(int)));
gpuErrchk(cudaMemcpyToSymbol(a_GPU, &a, sizeof(float)));
printKernel << <1, 1 >> > ();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
FILE otherCompilationUnit.cu
#include <stdio.h>
extern __constant__ int N_GPU;
extern __constant__ float a_GPU;
__global__ void printKernel() {
printf("N = %i; a = %f\n", N_GPU, a_GPU);
}
No, without using separate compilation it won't be possible to use the same constant memory, that is declared once, over several .cu files.
In my oppinion there are two ways for a workaround.
First one is to implement all kernels within one .cu file. Therefore you will get the disadvantage that this file will become very large with a bad overview.
A second way would be to declare in every .cu file the constant memory again. Then once with a wrapper copy the values into the constant memory for every single .cu file - like I described in an answer here. Disadvantages would be that you have to ensure that you don't forget to copy the values in single .cu files and you have to check that you won't run in the limitation of total available constant memory.
Yes. The later CUDA doc says:
When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), device, shared, managed and constant variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated shared variable).
To estimate how much data the program may process in one kernel launch i try to get some memory info with cudaMemGetInfo(). However, the compiler tells me this:
error: identifier "cudaMemGetInfo" is undefined
Other functions like cudaGetDeviceProperties(); work fine. Do I have to install a certain CUDA-version? The library description does not contain infos about the version and so on.
EDIT: the smallest possible code. cudaSetDevice() generates no compiler error while cudaMemGetInfo() does
#include <cuda.h>
#include <cuda_runtime_api.h>
int main(){
unsigned int f, t;
cudaSetDevice(0);
cudaMemGetInfo(&f, &t);
return 0;
}
EDIT 2:
I'm on Linux using "Cuda compilation tools, release 2.0, V0.2.1221" (nvcc).
As I tried to get the cuda driver version installed with cudaDriverGetVersion() the same error occured (same thing when I use the driver function cuDriverGetVersion()).
It seems that the system wont let me know any detail about itself...
For the very old version of CUDA you are using, cudaMemGetInfo is not part of the runtime API. It has a counterpart in the driver cuMemGetInfo, which can be used instead. Note that using the driver API version of this call will require establishing a context first. This should work on CUDA 2.x:
// CUDA 2.x version
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime_api.h>
int main(){
unsigned int f, t;
cudaSetDevice(0);
cudaFree(0); // This will establish a context on the device
cuMemGetInfo(&f, &t);
fprintf(stdout,"%d %d\n",f/1024,t/1024);
return 0;
}
EDIT: this answer applied to CUDA 3.0 and later:
Your problem isn't cudaMemGetInfo, it is the arguments you are supplying it. I would predict that this:
// CUDA 3.0 or later version
#include <cuda.h>
#include <cuda_runtime_api.h>
int main(){
size_t f, t;
cudaSetDevice(0);
cudaMemGetInfo(&f, &t);
return 0;
}
will work where your example fails. Note that nvcc uses a host C++ compiler to compile host code, and it will not find instances of API functions with incorrect arguments. Note that the prototype of cudaMemGetInfo is
cudaError_t cudaMemGetInfo(size_t * free, size_t * total)
and that the arguments should be size_t, which is not the same as unsigned int on many platforms.
to fix this error:
error: argument of type "unsigned int *" is incompatible with parameter of type "size_t *".
I found from nvidia technical report for cuda 3.2 that:
Driver API functions that accept or return memory sizes, such as cuMemAlloc() and cuMemGetInfo(), now use size_t for this purpose rather than unsigned int.
so you must change *.cu code, as the following lines:
Incorrect code:
unsigned int free, total;
cuMemGetInfo(&free, &total);
Correct code:
size_t free, total;
cuMemGetInfo(&free, &total);