If we have a __host__ __device__ function in CUDA, we can use macros to choose different code paths for host-side and device-side code in its implementations, like so:
__host__ __device__ int foo(int x)
{
#ifdef CUDA_ARCH
return x * 2;
#else
return x;
#endif
}
but why is it that we can't write:
__host__ __device__ int foo(int x);
__device__ int foo(int x) { return x * 2; }
__host__ int foo(int x) { return x; }
instead?
The Clang implementation of CUDA C++ actually supports overloading on __host__ and
__device__ because it considers the execution space qualifiers part of the function signature. Note, however, that even there, you'd have to declare the two functions separately:
__device__ int foo(int x);
__host__ int foo(int x);
__device__ int foo(int x) { return x * 2; }
__host__ int foo(int x) { return x; }
test it out here
Personally, I'm not sure how desirable/important that really is to have though. Consider that you can just define a foo(int x) in the host code outside of your CUDA source. If someone told me they need to have different implementations of the same function for host and device where the host version for some reason needs to be defined as part of the CUDA source, my initial gut feeling would be that there's likely something going in a bit of an odd direction. If the host version does something different, shouldn't it most likely have a different name? If it logically does the same thing just not using the GPU, then why does it have to be part of the CUDA source? I'd generally advocate for keeping as clean and strict a separation between host and device code as possible and keeping any host code inside the CUDA source to the bare minimum. Even if you don't care about the cleanliness of your code, doing so will at least minimize the chances of getting hurt by all the compiler magic that goes on under the hood…
Related
From the documentation:
I.4.20.4. Constexpr functions and function templates
By default, a
constexpr function cannot be called from a function with incompatible
execution space. The experimental nvcc flag --expt-relaxed-constexpr
removes this restriction. When this flag is specified, host code can
invoke a __device__ constexpr function and device code can invoke a
__host__ constexpr function.
I read it, but I don't understand what it means - device code can invoke a host constexpr function? Here is my test:
constexpr int bar(int i)
{
#ifdef __CUDA_ARCH__
return i;
#else
return 555;
#endif
}
__global__ void kernel()
{
int tid = (blockDim.x * blockIdx.x) + threadIdx.x;
printf("%i\n", bar(tid));
}
int main(int argc, char *[])
{
static_assert(bar(5) > 0);
// static_assert(bar(argc) > 0); // compile error
cout << bar(argc) << endl;
kernel<<<2, 2>>>();
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
return 0;
}
It prints:
555
0
1
2
3
According to my understanding, the host invokes the host function, while the device invokes the device function. I.e. it behaves the same as if I declare bar with both __host__ and __device__ attributes. Adding a single attribute (__host__ or __device__) doesn't make any difference.
As a comparison, the documentation for std::initializer_list is much clearer:
I.4.20.2. std::initializer_list
By default, the CUDA compiler will
implicitly consider the member functions of std::initializer_list to
have __host__ __device__ execution space specifiers, and therefore
they can be invoked directly from device code.
Here I don't have any questions.
What does the documentation mean exactly?
Consider the following code.
#include <algorithm> //std::max
__global__ void kernel(int *array, int n) {
array[0] = std::max(array[1], array[2]);
}
This code will not compile by default.
error: calling a constexpr __host__ function("max") from a __global__ function("kernel") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
std::max is a standard host function without __device__ execution space specifiers and thus cannot be called from device code.
However, when the compiler flag --expt-relaxed-constexpr is specified, the code compiles nonetheless. I cannot give you any details about how this is achieved internally
I have a problem with CUDA programing !
Input is a matrix A( 2 x 2 )
Ouput is a matrix A( 2 x 2 ) with every new value is **3 exponent of the old value **
example :
input : A : { 2,2 } output : A { 8,8 }
{ 2,2 } { 8,8 }
I have 2 function in file CudaCode.CU :
__global__ void Power_of_02(int &a)
{
a=a*a;
}
//***************
__global__ void Power_of_03(int &a)
{
int tempt = a;
Power_of_02(a); //a=a^2;
a= a*tempt; // a = a^3
}
and Kernel :
__global__ void CudaProcessingKernel(int *dataA ) //kernel function
{
int bx = blockIdx.x;
int tx = threadIdx.x;
int tid = bx * XTHREADS + tx;
if(tid < 16)
{
Power_of_03(dataA[tid]);
}
__syncthreads();
}
I think it's right, but the error appear : calling a __global__ function("Power_of_02") from a __global__ function("Power_of_03") is only allowed on the compute_35 architecture or above
Why I wrong ? How to repair it ?
The error is fairly explanatory. A CUDA function decorated with __global__ represents a kernel. Kernels can be launched from host code. On cc 3.5 or higher GPUs, you can also launch a kernel from device code. So if you call a __global__ function from device code (i.e. from another CUDA function that is decorated with __global__ or __device__), then you must be compiling for the appropriate architecture. This is called CUDA dynamic parallelism, and you should read the documentation to learn how to use it, if you want to use it.
When you launch a kernel, whether from host or device code, you must provide a launch configuration, i.e. the information between the triple-chevron notation:
CudaProcessingKernel<<<grid, threads>>>(d_A);
If you want to use your power-of-2 code from another kernel, you will need to call it in a similar, appropriate fashion.
Based on the structure of your code, however, it seems like you can make things work by declaring your power-of-2 and power-of-3 functions as __device__ functions:
__device__ void Power_of_02(int &a)
{
a=a*a;
}
//***************
__device__ void Power_of_03(int &a)
{
int tempt = a;
Power_of_02(a); //a=a^2;
a= a*tempt; // a = a^3
}
This should probably work for you and perhaps was your intent. Functions decorated with __device__ are not kernels (so they are not callable directly from host code) but are callable directly from device code on any architecture. The programming guide will also help to explain the difference.
I would like to have two versions of same member function of a class in host and device side.
Lets say
class A {
public:
double stdinvcdf(float x) {
static normal boostnormal(0, 1);
return boost::math::cdf(boostnormal,x);
}
__device__ double stdinvcdf(float x) {
return normcdfinvf(x);
}
};
But when I compile this code using nvcc, it aborts with function redefinition error.
CUDA / C++ does not support this kind of function overloading, because in the end, there is no different function signature. The common approach to have both, i.e. host and device versions is to use __host__ in combination with __device__ alongside with an #ifdef, e.g.
__host__ __device__ double stdinvcdf(float x)
{
#ifdef __CUDA_ARCH__
/* DEVICE CODE */
#else
/* HOST CODE */
#endif
}
This solution was also discussed in this thread in the NVIDIA developer forum.
I have the following template __device__ function in CUDA:
template<typename T>
__device__ void MyatomicAdd(T *address, T val){
atomicAdd(address , val);
}
that compiles and runs just fine if instantiated with T as a float, i.e.
__global__ void myKernel(float *a, float b){
MyatomicAdd<float>(a,b);
}
will run without a problem.
I wanted to specialize this function, as there is no atomicAdd() for doubles, so I can hand code an implementation in double precision. Ignoring the double precision specialization for now, the single precision specialization and template look like this:
template<typename T>
__device__ void MyatomicAdd(T *address, T val){
};
template<>
__device__ void MyatomicAdd<float>(float *address, float val){
atomicAdd(address , val);
}
Now the compiler complains that atomicAdd() is undefined in my specialization, the same applies when I try to use any CUDA functions like __syncthreads() within the specialization. Any ideas? Thanks.
It ended up being a linking problem with some OpenGL code developed by a colleague. Forcing the specializations to be inline fixed the problem, although obviously not the root cause. Still, it'll do for now until I can be be bothered to dig through the other guy's code.
The question is that: is there a way to use the class "vector" in Cuda kernels? When I try I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So there a way to use a vector in global section?
I recently tried the following:
create a new Cuda project
go to properties of the project
open Cuda C/C++
go to Device
change the value in "Code Generation" to be set to this value:
compute_20,sm_20
........ after that I was able to use the printf standard library function in my Cuda kernel.
is there a way to use the standard library class vector in the way printf is supported in kernel code? This is an example of using printf in kernel code:
// this code only to count the 3s in an array using Cuda
//private_count is an array to hold every thread's result separately
__global__ void countKernel(int *a, int length, int* private_count)
{
printf("%d\n",threadIdx.x); //it's print the thread id and it's working
// vector<int> y;
//y.push_back(0); is there a possibility to do this?
unsigned int offset = threadIdx.x * length;
int i = offset;
for( ; i < offset + length; i++)
{
if(a[i] == 3)
{
private_count[threadIdx.x]++;
printf("%d ",a[i]);
}
}
}
You can't use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise just copy the contents of the vector to the device and operate on it normally.
In the cuda library thrust, you can use thrust::device_vector<classT> to define a vector on device, and the data transfer between host STL vector and device vector is very straightforward. you can refer to this useful link:http://docs.nvidia.com/cuda/thrust/index.html to find some useful examples.
you can't use std::vector in device code, you should use array instead.
I think you can implement a device vector by youself, because CUDA supports dynamic memory alloction in device codes. Operator new/delete are also supported. Here is an extremely simple prototype of device vector in CUDA, but it does work. It hasn't been tested sufficiently.
template<typename T>
class LocalVector
{
private:
T* m_begin;
T* m_end;
size_t capacity;
size_t length;
__device__ void expand() {
capacity *= 2;
size_t tempLength = (m_end - m_begin);
T* tempBegin = new T[capacity];
memcpy(tempBegin, m_begin, tempLength * sizeof(T));
delete[] m_begin;
m_begin = tempBegin;
m_end = m_begin + tempLength;
length = static_cast<size_t>(m_end - m_begin);
}
public:
__device__ explicit LocalVector() : length(0), capacity(16) {
m_begin = new T[capacity];
m_end = m_begin;
}
__device__ T& operator[] (unsigned int index) {
return *(m_begin + index);//*(begin+index)
}
__device__ T* begin() {
return m_begin;
}
__device__ T* end() {
return m_end;
}
__device__ ~LocalVector()
{
delete[] m_begin;
m_begin = nullptr;
}
__device__ void add(T t) {
if ((m_end - m_begin) >= capacity) {
expand();
}
new (m_end) T(t);
m_end++;
length++;
}
__device__ T pop() {
T endElement = (*m_end);
delete m_end;
m_end--;
return endElement;
}
__device__ size_t getSize() {
return length;
}
};
You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including, std::vector is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexprness is effectively limited.)
You probably don't really want to
The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T) - that allocation is on the heap. While this could be adapted to the GPU - it would be quite slow, and incredibly slow if each "CUDA thread" would dynamically allocate its own memory.
#Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-kat. Another option, though a bit "heavier", is to use the [NVIDIA thrust library]'s 3 "device vector" class. Under the hood, it's quite different from the standard library vector though.