I could not find anything about this on the internet. Since it is possible to use printf in a __device__ function, I am wondering whether there is a sprintf-like function as well, given that printf essentially "uses" the result of sprintf-style formatting to display it on stdout.
No, there isn't anything built into CUDA for this.
Within CUDA the implementation of device printf is a special case and does not use the same mechanisms as the C library printf.
sprintf(), snprintf() and additional printf()-family functions are now available on the development branch of the CUDA Kernel Author's Toolkit, a.k.a. cuda-kat. Signatures:
namespace kat {
    __device__ int sprintf(char* s, const char* format, ...);
    __device__ int snprintf(char* s, size_t n, const char* format, ...);
}
... and they do exactly what you would expect. In particular, they support the C standard features which CUDA's printf() does not, and then some (e.g. specifying a string argument's field width with an extra argument, format specifiers for size_t and ptrdiff_t, and printing in base 2).
Caveat: I am the author of cuda-kat, so I'm biased...
Always prefer snprintf(), which takes the buffer size, over sprintf(), which might overflow.
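As a rough usage sketch (assuming cuda-kat is installed and its printing header is included; the exact include path is whatever the library documents, so it is only indicated by a comment here):
// Sketch only: include the cuda-kat header that declares kat::snprintf here
// (see the cuda-kat documentation for the exact path).
#include <cstdio>

__global__ void format_demo(int value)
{
    char buf[64];
    // snprintf never writes more than sizeof(buf) bytes, so the buffer
    // cannot overflow even if the formatted text would be longer.
    kat::snprintf(buf, sizeof(buf), "value = %d (hex %x)", value, value);
    printf("%s\n", buf);   // regular CUDA device-side printf
}

int main()
{
    format_demo<<<1, 1>>>(42);
    cudaDeviceSynchronize();
    return 0;
}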
I have some calculations that I want to parallelize if my user has a CUDA-compliant GPU, otherwise I want to execute the same code on the CPU. I don't want to have two versions of the algorithm code, one for CPU and one for GPU to maintain. I'm considering the following approach but am wondering if the extra level of indirection will hurt performance or if there is a better practice.
For my test, I took the basic CUDA template that adds the elements of two integer arrays and stores the result in a third array. I removed the actual addition operation and placed it into its own function marked with both the __device__ and __host__ qualifiers...
__device__ __host__ void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}
... then modified the kernel to call the aforementioned function on the element identified by threadIdx...
__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}
So now my application can check for the presence of a CUDA device. If one is found I can use...
addKernel <<<1, size>>> (dev_c, dev_a, dev_b);
... and if not I can forego parallelization and iterate through the elements calling the host version of the function...
int* pA = (int*)a;
int* pB = (int*)b;
int* pC = (int*)c;
for (int i = 0; i < arraySize; i++)
{
    addSingleItem(pC++, pA++, pB++);
}
Everything seems to work in my small test app, but I'm concerned about the extra call involved. Do device-to-device function calls incur any significant performance hits? Is there a more generally accepted way to do CPU fallback that I should adopt?
If addSingleItem and addKernel are defined in the same translation unit/module/file, there should be no cost to having a device-to-device function call. The compiler will aggressively inline that code, as if you wrote it in a single function.
That is undoubtedly the best approach if it can be managed, for the reason described above.
If it's desired to still have some file-level modularity, it is possible to break the code into a separate file and include that file in the compilation of the kernel function. Conceptually this is no different from what is described already.
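For example (file names here are illustrative, not from the question), the shared function can live in a header that the kernel's translation unit includes, so the compiler can still inline it:
// add_logic.cuh -- hypothetical header holding the shared host/device code
#pragma once

__host__ __device__ inline void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}

// kernel.cu -- includes the header, so the call can still be inlined
#include "add_logic.cuh"

__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}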
Another possible approach is to use compiler macros to assist in adding, removing, or modifying code to handle the GPU vs. non-GPU case. There are endless possibilities here, but see here for a simple idea. You can redefine what __host__ __device__ means in different scenarios, for example. I would say this probably only makes sense if you are building separate binaries for the GPU vs. non-GPU case, but you may find a clever way to handle it in the same executable.
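One common shape that idea takes (the macro name is illustrative): a single macro expands to __host__ __device__ when the code is compiled by nvcc and to nothing when it is compiled by a plain host compiler:
// Shared header, compiled both by nvcc and by a host-only C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__   // nvcc: emit host and device versions
#else
#define HOST_DEVICE                       // plain C++ compiler: ordinary host function
#endif

HOST_DEVICE inline void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}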
Finally, if you desire this but must place the __device__ function in a separate translation unit, it is still possible, but there may be some performance loss due to the device-to-device function call across module boundaries. The amount of performance loss here is hard to generalize since it depends heavily on code structure, but it's not unusual to see a 10% or 20% performance hit. In that case, you may wish to investigate the link-time optimizations that became available in CUDA 11.
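As a rough sketch of what that looks like on the command line (device link-time optimization appeared around CUDA 11.2, and flag spellings may differ between versions):
$> nvcc -dc -dlto kernel.cu -o kernel.o
$> nvcc -dc -dlto main.cu -o main.o
$> nvcc -dlto kernel.o main.o -o app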
This question may also be of interest, although only tangentially related here.
In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement it as shown here. With the new CUDA 8 RC, I run into trouble when I try to compile my code, which includes such a function. I guess this is because, with Pascal and Compute Capability 6.0, a native double version of atomicAdd has been added, but somehow it is not properly ignored for earlier Compute Capabilities.
The code below used to compile and run fine with previous CUDA versions, but now I get this compilation error:
test.cu(3): error: function "atomicAdd(double *, double)" has already been defined
But if I remove my implementation, I instead get this error:
test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
I should add that I only see this if I compile with -arch=sm_35 or similar. If I compile with -arch=sm_60 I get the expected behavior, i.e. only the first error, and successful compilation in the second case.
Edit: Also, it is specific to atomicAdd -- if I change the name, it works fine.
It really looks like a compiler bug. Can someone else confirm that this is the case?
Example code:
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void kernel(double *a)
{
    double b = 1.3;
    atomicAdd(a, b);
}

int main(int argc, char **argv)
{
    double *a;
    cudaMalloc(&a, sizeof(double));
    kernel<<<1,1>>>(a);
    cudaFree(a);
    return 0;
}
Edit: I got an answer from NVIDIA, who acknowledge this problem; here is what the developers say about it:
The sm_60 architecture, that is newly supported in CUDA 8.0, has
native fp64 atomicAdd function. Because of the limitations of our
toolchain and CUDA language, the declaration of this function needs to
be present even when the code is not being specifically compiled for
sm_60. This causes a problem in your code because you also define a
fp64 atomicAdd function.
CUDA builtin functions such as atomicAdd are implementation-defined
and can be changed between CUDA releases. Users should not define
functions with the same names as any CUDA builtin functions. We would
suggest you to rename your atomicAdd function to one that is not the
same as any CUDA builtin functions.
That flavor of atomicAdd is a new method introduced for compute capability 6.0. You may keep your previous implementation for other compute capabilities by guarding it with a macro definition:
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
<... place here your own pre-pascal atomicAdd definition ...>
#endif
This architecture identification macro is documented here:
5.7.4. Virtual Architecture Identification Macro
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
I assume NVIDIA did not declare it for earlier compute capabilities to avoid conflicts with users who define it themselves and have not yet moved to compute capability >= 6.x. I would not consider it a bug, though, rather a release delivery practice.
EDIT: The macro guard was incomplete (fixed); here is a complete example.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif

__device__ double s_global;

__global__ void kernel() { atomicAdd(&s_global, 1.0); }

int main(int argc, char* argv[])
{
    kernel<<<1,1>>>();
    return ::cudaDeviceSynchronize();
}
Compilation with:
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
Command lines (both successful):
$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35
You can see why this works by looking at the include file sm_60_atomic_functions.h, where the method is not declared if __CUDA_ARCH__ is lower than 600.
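Conceptually, the guard in that header has roughly this shape (a paraphrase of the idea, not the literal header contents):
// Rough idea only: the double overload is only made visible when compiling
// for compute capability 6.0 or higher (or during the host pass).
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
__device__ double atomicAdd(double* address, double val);
#endif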
To pass a device function pointer, we have to
typedef int (*fp_t)(int);

__device__ int dev_fn(int)
{
    // do something
    return 0;   // placeholder return so the function is well-formed
}

__device__ fp_t dev_fp = dev_fn; // [1]

void host_fn()
{
    fp_t fp;
    cudaMemcpyFromSymbol(&fp, (void const*)&dev_fp, sizeof(fp));
    // now you can pass fp to cuda functions
}
which is quite convoluted, as we have to take the address of dev_fn and store it in another symbol, as at [1] above.
Why doesn't a __device__ function become a symbol itself, so that we can use cudaGetSymbolAddress on the host side to get the address of dev_fn directly, instead of going through an intermediate symbol dev_fp?
For whatever reason, CUDA made the design choice of allowing CPU threads to take the address of __device__ variables, probably because there had to be a way to copy data into and out of them. It would make no sense to attempt to copy data to the address of a __device__ function.
It's possible something similar could be engineered to take the address of __device__ functions, but it's unlikely the solution could be made to work well with __host__ __device__ functions. There's no single address that makes sense for __host__ __device__ functions.
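For completeness, here is a hedged sketch of what "pass fp to cuda functions" can look like, building on the typedef and dev_fp declaration from the question above (the kernel and function names here are illustrative):
// A kernel that receives the function pointer retrieved on the host via
// cudaMemcpyFromSymbol and calls through it on the device.
__global__ void apply_fn(fp_t f, int* out, int x)
{
    *out = f(x);   // indirect call through the device function pointer
}

void host_demo(int* d_out)
{
    fp_t fp;
    cudaMemcpyFromSymbol(&fp, (void const*)&dev_fp, sizeof(fp));
    apply_fn<<<1, 1>>>(fp, d_out, 21);   // pass it like any other kernel argument
    cudaDeviceSynchronize();
}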
What are the advantages for using Thrust device_malloc instead of the normal cudaMalloc and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on an array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating CUDA calls makes it easier and usually cleaner. The same thing goes for C++ and STL containers versus C-style arrays and malloc.
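As a small illustration of the "cleaner" point (a sketch, not a complete program):
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>
#include <thrust/fill.h>

void demo(std::size_t n)
{
    // Raw CUDA: untyped allocation, manual size arithmetic, manual free.
    int* raw = nullptr;
    cudaMalloc(&raw, n * sizeof(int));
    cudaFree(raw);

    // Thrust: a typed device_ptr<int> that Thrust algorithms accept directly.
    thrust::device_ptr<int> p = thrust::device_malloc<int>(n);
    thrust::fill(p, p + n, 0);
    thrust::device_free(p);
}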
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new(device_ptr<void> p, const size_t n = 1)

p: A device_ptr to a region of device memory into which to construct one or many Ts.
Basically, this function can be used if the memory has already been allocated. Only the default constructor will be called, and it returns a device_ptr cast to T's type.
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T>
device_ptr<T> thrust::device_new(const size_t n = 1)
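A hedged sketch of the two forms side by side (Foo is just a hypothetical default-constructible type):
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>
#include <thrust/device_new.h>
#include <thrust/device_delete.h>

struct Foo
{
    int x;
    __host__ __device__ Foo() : x(0) {}   // default constructor used by device_new
};

void demo()
{
    // Form 1: memory already allocated; device_new only constructs into it.
    thrust::device_ptr<void> raw = thrust::device_malloc(sizeof(Foo));
    thrust::device_ptr<Foo>  a   = thrust::device_new<Foo>(raw);
    // ... use a ...
    thrust::device_free(raw);     // we own the raw allocation, so we free it

    // Form 2: device_new allocates and default-constructs in one call.
    thrust::device_ptr<Foo> b = thrust::device_new<Foo>(1);
    // ... use b ...
    thrust::device_delete(b);     // counterpart of the allocating device_new
}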
So I think I found one good use for device_new.
It's basically a nicer way of initialising an object and copying it to the device, while holding a pointer to it on the host.
So instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));
I would like to declare the alignment for a global device variable in CUDA. Specifically, I have a string declaration, like
__device__ char str1[] = "some pre-defined string";
In normal gcc, I can request alignment from the compiler as
__device__ char str1[] __attribute__ ((aligned (4))) = "some pre-defined string";
However, when I tried this with nvcc, the compiler ignores these requests. The reason I would like to do this is to copy these strings into a buffer in my kernels, and copying a word at a time is much faster than copying a byte at a time, but it requires that the source string be aligned. Can anyone please tell me how to request alignment from the nvcc compiler?
See section 5.3.2 "Size and Alignment Requirement" of the "CUDA C Programming Guide", which can be found here:
The alignment requirement is automatically fulfilled for the built-in basic types (char, short, int, long, long long, float, double) as well as for the built-in vector types like float2 or float4.
For structs, the size and alignment requirements can be enforced by the compiler using the alignment specifiers __align__(8) or __align__(16).
Example usage:
struct __align__(8) {
    float r;
    float i;
} complex_num;
Can you check if this works?
__device__ char __align__(4) str1[] = "some pre-defined string";