Does CUDA C++ not have tuples in device code?

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = std::make_tuple(1, 2, 3);       // Does not work.
    c[i] = a[i] + b[i];
}
NVCC has lambdas at least, but std::make_tuple fails to compile. Are tuples not allowed in the current version of CUDA?

I've just tried this out: tuple metaprogramming with std:: (std::tuple, std::get, etc.) works in device code with C++14 and --expt-relaxed-constexpr enabled (available since CUDA 8) during compilation, e.g. nvcc -std=c++14 xxxx.cu -o yyyyy --expt-relaxed-constexpr. CUDA 9 is required for C++14, but basic std::tuple should work in CUDA 8 if you are limited to that. thrust::tuple works but has some drawbacks: it is limited to 10 elements and lacks some of the std::tuple helper functions (e.g. std::tuple_cat). Because tuples and their related functions are compile-time constructs, --expt-relaxed-constexpr should let std::tuple "just work".
#include <cstdio>
#include <tuple>

__global__ void kernel()
{
    auto t = std::make_tuple(1, 2, 3);
    printf("%d\n", std::get<0>(t));
}

int main()
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
}

#include <thrust/tuple.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = thrust::make_tuple(1, 2, 3);
    c[i] = a[i] + b[i];
}
It seems I needed to use the tuples from the Thrust library instead to make this work. The above does compile.
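For completeness, here is a minimal sketch of my own (not part of the original answer) showing how to read the elements back out with thrust::get, which, like thrust::make_tuple, is marked __host__ __device__:
#include <cstdio>
#include <thrust/tuple.h>

__global__ void tupleKernel()
{
    auto t = thrust::make_tuple(1, 2.5f, 'a');
    // thrust::get works in device code, unlike std::get without --expt-relaxed-constexpr.
    printf("%d %f %c\n", thrust::get<0>(t), thrust::get<1>(t), thrust::get<2>(t));
}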

Support for the standard C++ library on the device side is problematic for CUDA, as the standard library does not have the necessary __host__ or __device__ annotations.
That said, both clang and nvcc have partial support for some functionality. Usually it's limited to constexpr functions, which are treated as __host__ __device__ if you pass --expt-relaxed-constexpr to nvcc (or by default in clang). Clang also has somewhat broader support for standard math functions. Neither supports anything that relies on the C++ runtime (except for memory allocation, printf and assert), as that does not exist on the device side.
So, in short: most of the standard C++ library is unusable on the device side in CUDA, though things do slowly improve as more and more functions in the standard library become constexpr.
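To illustrate the constexpr path described above, here is a minimal sketch of my own (not from the answer): an unannotated constexpr function called from a kernel, which compiles with nvcc --expt-relaxed-constexpr:
#include <cstdio>

// An ordinary constexpr function, with no __device__ annotation.
constexpr int square(int x) { return x * x; }

__global__ void squareKernel()
{
    // With --expt-relaxed-constexpr, nvcc treats square() as __host__ __device__.
    printf("%d\n", square(threadIdx.x + 1));
}

int main()
{
    squareKernel<<<1,4>>>();
    cudaDeviceSynchronize();
}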

Indeed, CUDA itself does not offer a device-side-capable version of std::tuple. However, I have a full tuple implementation as part of my cuda-kat library (still very much under initial development at the time of writing). thrust's tuple class is limited in the following senses:
Limited to 10 tuple elements.
Recursively expands templated types for every tuple element.
No/partial support for rvalues (e.g. in get())
The tuple implementation in cuda-kat is an adaptation of the EASTL tuple, which in turn is an adaptation of the LLVM project's libc++ tuple. Unlike EASTL's, however, it is C++11-compatible, so you don't need the absolute latest CUDA version. If you need just the tuple class, it is possible to extract it from the library with about 4 files or so.

Related

Allocating device memory for a __global__ function in CUDA

I want to do this program in CUDA.
In "main.cpp":
struct Center {
    double *Data;
    int dimension;
};
typedef struct Center Center;
// I allocate a pointer to M Center elements with cudaMalloc as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, N = 4;
cudaStatus = cudaMalloc((void**)&V_dev, M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, N); // I always know the dimension N before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm on Visual Studio 2008 and CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed
How can I make this work? (I know that malloc and the other CPU memory allocation functions are not allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.
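For reference, here is a minimal sketch of in-kernel malloc/free (my own example, assuming a cc 2.0+ GPU and compilation with -arch=sm_20 or higher). The device heap defaults to 8 MB; it can be enlarged with cudaDeviceSetLimit if many threads allocate:
#include <cstdio>
#include <cstdlib>

__global__ void allocKernel(int dimension)
{
    // Per-thread allocation from the device heap (requires cc 2.0+).
    double *data = (double*)malloc(dimension * sizeof(double));
    if (data != NULL) {
        for (int i = 0; i < dimension; i++)
            data[i] = 0.0;
        free(data); // device-heap allocations are freed from device code
    }
}

int main()
{
    // Optionally enlarge the device heap before the launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    allocKernel<<<1,100>>>(4);
    cudaDeviceSynchronize();
}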

A function calls another function in CUDA C++

I have a problem with CUDA programming!
The input is a matrix A (2 x 2), and the output is a matrix A (2 x 2) where every new value is the cube of the old value.
Example:
input:  A = { 2, 2 }    output: A = { 8, 8 }
            { 2, 2 }                { 8, 8 }
I have 2 functions in the file CudaCode.CU:
__global__ void Power_of_02(int &a)
{
    a = a * a;
}
//***************
__global__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a * tempt;  // a = a^3
}
and the kernel:
__global__ void CudaProcessingKernel(int *dataA) // kernel function
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int tid = bx * XTHREADS + tx;
    if (tid < 16)
    {
        Power_of_03(dataA[tid]);
    }
    __syncthreads();
}
I think it's right, but this error appears:
calling a __global__ function("Power_of_02") from a __global__ function("Power_of_03") is only allowed on the compute_35 architecture or above
Why am I wrong? How can I fix it?
The error is fairly explanatory. A CUDA function decorated with __global__ represents a kernel. Kernels can be launched from host code. On cc 3.5 or higher GPUs, you can also launch a kernel from device code. So if you call a __global__ function from device code (i.e. from another CUDA function that is decorated with __global__ or __device__), then you must be compiling for the appropriate architecture. This is called CUDA dynamic parallelism, and you should read the documentation to learn how to use it, if you want to use it.
When you launch a kernel, whether from host or device code, you must provide a launch configuration, i.e. the information between the triple-chevron notation:
CudaProcessingKernel<<<grid, threads>>>(d_A);
If you want to use your power-of-2 code from another kernel, you will need to call it in a similar, appropriate fashion.
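For illustration, this is roughly what a device-side (dynamic parallelism) launch looks like; this is my own sketch, not the asker's code, and it assumes a cc 3.5+ GPU plus compilation with -arch=sm_35 -rdc=true -lcudadevrt (note that device-side cudaDeviceSynchronize has since been deprecated in much newer CUDA versions):
__global__ void childKernel(int *data)
{
    data[threadIdx.x] *= data[threadIdx.x]; // square each element
}

__global__ void parentKernel(int *data)
{
    if (threadIdx.x == 0)
    {
        // A device-side launch still needs a full launch configuration.
        childKernel<<<1,16>>>(data);
        cudaDeviceSynchronize(); // wait for the child grid to finish
    }
}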
Based on the structure of your code, however, it seems like you can make things work by declaring your power-of-2 and power-of-3 functions as __device__ functions:
__device__ void Power_of_02(int &a)
{
    a = a * a;
}
//***************
__device__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a * tempt;  // a = a^3
}
This should probably work for you and perhaps was your intent. Functions decorated with __device__ are not kernels (so they are not callable directly from host code) but are callable directly from device code on any architecture. The programming guide will also help to explain the difference.
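To make the fix concrete, here is a minimal sketch (my own completion, with an assumed value for XTHREADS) of the kernel calling the __device__ function, plus the host-side launch:
#define XTHREADS 16 // assumed block size for this sketch

__global__ void CudaProcessingKernel(int *dataA)
{
    int tid = blockIdx.x * XTHREADS + threadIdx.x;
    if (tid < 16)
    {
        Power_of_03(dataA[tid]); // a plain __device__ call, no launch configuration needed
    }
}

// Host side:
// CudaProcessingKernel<<<1, XTHREADS>>>(d_A);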

cudaThreadSynchronize & performance

Some days ago I was comparing the performance of some code of mine, where I perform a very simple replace, with the Thrust implementation of the same algorithm. I discovered a mismatch of one order of magnitude (!) in favor of Thrust, so I started to make my debugger "surf" into their code to discover where the magic happens.
Surprisingly, I discovered that my very straightforward implementation was actually very similar to theirs, once I got rid of all the functor stuff and got to the nitty-gritty. I saw that Thrust has a clever way to decide both block_size and grid_size (by the way: exactly how does that work?!), so I just took their settings and executed my code again, since the implementations are so similar. I gained some microseconds, but the situation was almost the same. Then, in the end, I don't know why, but just "to try" I removed a cudaThreadSynchronize() after my kernel and BINGO! I closed the gap (and better) and gained a whole order of magnitude of execution time. Accessing my array's values, I saw that they held exactly what I expected, so the execution was correct.
The questions, now, are: when can I get rid of cudaThreadSynchronize (et similia)? Why does it cause such a huge overhead? I see that Thrust itself doesn't synchronize at the end (synchronize_if_enabled(const char* message), which is a no-op if the macro __THRUST_SYNCHRONOUS isn't defined, and it isn't).
Details & code follow.
// my replace code
template <typename T>
__global__ void replaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    const int gridSize = blockDim.x * gridDim.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    while (index < n)
    {
        if (dev[index] == oldval)
            dev[index] = newval;
        index += gridSize;
    }
}

// replace invocation - not in main because of cpp - cu separation
template <typename T>
void callReplaceSimple(T* dev, const int n, const T oldval, const T newval)
{
    replaceSimple<<<30,768,0>>>(dev, n, oldval, newval);
    cudaThreadSynchronize();
}

// thrust replace invocation
template <typename T>
void callReplace(thrust::device_vector<T>& dev, const T oldval, const T newval)
{
    thrust::replace(dev.begin(), dev.end(), oldval, newval);
}
Param details: arrays: n=10,000,000 elements set to 2, oldval=2, newval=3
Time to execute thrust callReplace (thrust): 0.057 ms
Time to execute callReplaceSimple with sync: 0.662 ms
Time to execute callReplaceSimple without sync: 0.011 ms
I used CUDA 5.0 with Thrust included, my card is a GeForce GTX 570 and I have a quadcore Q9550 2.83 GHz with 2 GB RAM.
Kernel launches are asynchronous. If you remove the cudaThreadSynchronize() call, you only measure the kernel launch time, not the time until completion of the kernel.
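If you want to time only the kernel itself, a common approach is CUDA events. Below is a minimal sketch of my own (a hypothetical callReplaceSimpleTimed variant of the asker's wrapper), where cudaEventElapsedTime reports the GPU time between the two recorded events:
#include <cstdio>

template <typename T>
void callReplaceSimpleTimed(T* dev, const int n, const T oldval, const T newval)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    replaceSimple<<<30,768>>>(dev, n, oldval, newval);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);             // wait for the kernel and the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop); // GPU time between the two events
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}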

Using std::vector in CUDA device code

The question is: is there a way to use the class "vector" in CUDA kernels? When I try, I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So is there a way to use a vector in the __global__ section?
I recently tried the following:
create a new CUDA project
go to the properties of the project
open CUDA C/C++
go to Device
change the value in "Code Generation" to:
compute_20,sm_20
After that I was able to use the printf standard library function in my CUDA kernel.
Is there a way to use the standard library class vector in the way printf is supported in kernel code? This is an example of using printf in kernel code:
// this code only counts the 3s in an array, using CUDA
// private_count is an array that holds every thread's result separately
__global__ void countKernel(int *a, int length, int *private_count)
{
    printf("%d\n", threadIdx.x); // prints the thread id; this works
    // vector<int> y;
    // y.push_back(0); is there a possibility to do this?
    unsigned int offset = threadIdx.x * length;
    int i = offset;
    for ( ; i < offset + length; i++)
    {
        if (a[i] == 3)
        {
            private_count[threadIdx.x]++;
            printf("%d ", a[i]);
        }
    }
}
You can't use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise just copy the contents of the vector to the device and operate on it normally.
With the Thrust library that ships with CUDA, you can use thrust::device_vector<T> to define a vector on the device, and data transfer between a host STL vector and a device vector is very straightforward. You can refer to this useful link: http://docs.nvidia.com/cuda/thrust/index.html for some useful examples.
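A minimal sketch of that host/device transfer (my own example):
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main()
{
    std::vector<int> h_vec(100, 3);                               // host data
    thrust::device_vector<int> d_vec(h_vec.begin(), h_vec.end()); // copy host -> device

    // ... operate on d_vec with Thrust algorithms, or pass
    // thrust::raw_pointer_cast(d_vec.data()) to your own kernel ...

    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());      // copy device -> host
    return 0;
}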
You can't use std::vector in device code; you should use an array instead.
I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code; operators new/delete are also supported. Here is an extremely simple prototype of a device vector in CUDA. It does work, but it hasn't been tested thoroughly.
template<typename T>
class LocalVector
{
private:
    T* m_begin;
    T* m_end;
    size_t capacity;
    size_t length;

    __device__ void expand() {
        capacity *= 2;
        size_t tempLength = (m_end - m_begin);
        T* tempBegin = new T[capacity];
        // element-wise copy into the larger buffer
        for (size_t i = 0; i < tempLength; i++)
            tempBegin[i] = m_begin[i];
        delete[] m_begin;
        m_begin = tempBegin;
        m_end = m_begin + tempLength;
        length = tempLength;
    }
public:
    __device__ explicit LocalVector() : capacity(16), length(0) {
        m_begin = new T[capacity];
        m_end = m_begin;
    }
    __device__ T& operator[] (unsigned int index) {
        return *(m_begin + index);
    }
    __device__ T* begin() {
        return m_begin;
    }
    __device__ T* end() {
        return m_end;
    }
    __device__ ~LocalVector()
    {
        delete[] m_begin;
        m_begin = nullptr;
    }
    __device__ void add(T t) {
        if (static_cast<size_t>(m_end - m_begin) >= capacity) {
            expand();
        }
        *m_end = t; // elements were already default-constructed by new T[], so assign
        m_end++;
        length++;
    }
    __device__ T pop() {
        m_end--;               // m_end points one past the last element
        T endElement = *m_end;
        length--;
        return endElement;
    }
    __device__ size_t getSize() {
        return length;
    }
};
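A minimal, untested usage sketch of the prototype above (my own addition); because it relies on device-side new/delete, it needs a cc 2.0+ target, and the device heap may need enlarging if many threads build large vectors:
#include <cstdio>

__global__ void localVectorKernel()
{
    LocalVector<int> v;          // per-thread vector living on the device heap
    for (int i = 0; i < 20; ++i) // grows past the initial capacity of 16
        v.add(i);
    size_t n = v.getSize();
    int last = v.pop();
    printf("thread %d: size=%llu last=%d\n", threadIdx.x, (unsigned long long)n, last);
}

int main()
{
    localVectorKernel<<<1,4>>>();
    cudaDeviceSynchronize();
}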
You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including, std::vector is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexprness is effectively limited.)
You probably don't really want to
The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T), that allocation is on the heap. While this could be adapted to the GPU, it would be quite slow, and incredibly slow if each "CUDA thread" dynamically allocated its own memory.
Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-kat. Another option, though a bit "heavier", is to use the NVIDIA Thrust library's device_vector class. Under the hood, it's quite different from the standard library vector, though.
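Here is a minimal sketch of the "pointer and size" approach (my own example): copy a host std::vector's contents into a device buffer and hand the kernel a raw pointer plus an element count:
#include <vector>

__global__ void readKernel(const int *data, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int value = data[i]; // read-only access; no vector needed on the device
        (void)value;
    }
}

int main()
{
    std::vector<int> h_vec(1000, 42);
    int *d_data;
    cudaMalloc((void**)&d_data, h_vec.size() * sizeof(int));
    cudaMemcpy(d_data, h_vec.data(), h_vec.size() * sizeof(int), cudaMemcpyHostToDevice);

    readKernel<<<(h_vec.size() + 255) / 256, 256>>>(d_data, h_vec.size());
    cudaDeviceSynchronize();
    cudaFree(d_data);
}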

Using assert within kernel invocation

Is there a convenient way to use asserts within kernel invocations in device mode?
CUDA now has a native assert function. Use assert(...). If its argument is zero, it will stop kernel execution and return an error. (or trigger a breakpoint if in CUDA debugging.)
Make sure to include "assert.h". Also, this requires compute capability 2.x or higher, and is not supported on MacOS. For more details see CUDA C Programming Guide, Section B.16.
The programming guide also includes this example:
#include <assert.h>

__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;

    // This will have no effect
    assert(is_one);

    // This will halt kernel execution
    assert(should_be_one);
}

int main(int argc, char* argv[])
{
    testAssert<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
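As a small follow-up sketch of my own: when a device-side assert fires, subsequent runtime API calls report it, so the host can check for it after synchronizing, e.g. in main() above:
cudaError_t err = cudaDeviceSynchronize();
if (err == cudaErrorAssert) {
    printf("device-side assert triggered: %s\n", cudaGetErrorString(err));
}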
A simple alternative is a macro that just returns from the kernel when the condition fails:
#define MYASSERT(condition) \
    if (!(condition)) { return; }

MYASSERT(condition);
If you need something fancier, you can use cuPrintf(), which is available from the CUDA site for registered developers.