Using std::vector in CUDA device code - cuda

Is there a way to use the class "vector" in CUDA kernels? When I try, I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So is there a way to use a vector in the __global__ section?
I recently tried the following:
create a new CUDA project
go to the project's properties
open CUDA C/C++
go to Device
change the value of "Code Generation" to:
compute_20,sm_20
After that I was able to use the printf standard library function in my CUDA kernel.
Is there a way to use the standard library class vector in the same way printf is supported in kernel code? This is an example of using printf in kernel code:
// this code only counts the 3s in an array using CUDA
// private_count is an array that holds every thread's result separately
__global__ void countKernel(int *a, int length, int *private_count)
{
    printf("%d\n", threadIdx.x); // prints the thread id; it works
    // vector<int> y;
    // y.push_back(0); is there a possibility to do this?
    unsigned int offset = threadIdx.x * length;
    for (unsigned int i = offset; i < offset + length; i++)
    {
        if (a[i] == 3)
        {
            private_count[threadIdx.x]++;
            printf("%d ", a[i]);
        }
    }
}

You can't use the STL in CUDA device code, but you may be able to use the Thrust library to do what you want. Otherwise, just copy the contents of the vector to the device and operate on it normally.
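For the "just copy it" route, a minimal sketch might look like this (the kernel and variable names here are purely illustrative): copy the vector's contents into a plain device buffer with cudaMemcpy, then hand the kernel a raw pointer and an element count.
#include <cstdio>
#include <vector>

// Counts the 3s in the buffer; one block, threads stride over the data.
__global__ void countThrees(const int *a, int length, int *count)
{
    for (int i = threadIdx.x; i < length; i += blockDim.x)
        if (a[i] == 3)
            atomicAdd(count, 1);
}

int main()
{
    std::vector<int> h_a = {1, 3, 3, 7, 3};
    int *d_a = nullptr, *d_count = nullptr;

    cudaMalloc(&d_a, h_a.size() * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemset(d_count, 0, sizeof(int));
    // h_a.data() exposes the vector's contiguous storage for the copy.
    cudaMemcpy(d_a, h_a.data(), h_a.size() * sizeof(int), cudaMemcpyHostToDevice);

    countThrees<<<1, 32>>>(d_a, (int)h_a.size(), d_count);

    int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("threes: %d\n", h_count);

    cudaFree(d_a);
    cudaFree(d_count);
}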

In the CUDA library Thrust, you can use thrust::device_vector<classT> to define a vector on the device; data transfer between a host STL vector and a device vector is very straightforward. You can refer to this useful link: http://docs.nvidia.com/cuda/thrust/index.html to find some useful examples.
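A hedged sketch of that workflow (thrust::count here is just one example of an algorithm run on the device data; any Thrust algorithm, or a raw pointer obtained via thrust::raw_pointer_cast, would do):
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/copy.h>

int main()
{
    std::vector<int> h_a = {1, 3, 3, 7, 3};

    // Constructing from host iterators performs the host-to-device copy.
    thrust::device_vector<int> d_a(h_a.begin(), h_a.end());

    // Run an algorithm directly on the device data.
    int threes = thrust::count(d_a.begin(), d_a.end(), 3);

    // Copying back to the host is just as direct.
    std::vector<int> h_b(d_a.size());
    thrust::copy(d_a.begin(), d_a.end(), h_b.begin());

    printf("threes: %d\n", threes);
}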

You can't use std::vector in device code; use an array instead.
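If an upper bound on the element count is known at compile time, a fixed-size per-thread array plus a manual counter covers most of what push_back would have done. A minimal sketch (names are illustrative):
#include <cstdio>

__global__ void kernelWithLocalArray()
{
    int local[16]; // fixed-size per-thread array instead of a vector
    int n = 0;     // manually tracked "size"

    for (int i = 0; i < 16; ++i)
        local[n++] = threadIdx.x + i; // "push_back" becomes an indexed store

    printf("thread %d, last element %d\n", threadIdx.x, local[n - 1]);
}

int main()
{
    kernelWithLocalArray<<<1, 4>>>();
    cudaDeviceSynchronize();
}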

I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code; operator new/delete are also supported. Here is an extremely simple prototype of a device vector in CUDA. It does work, but it hasn't been tested thoroughly.
template <typename T>
class LocalVector
{
private:
    T* m_begin;
    T* m_end;
    size_t capacity;
    size_t length;

    __device__ void expand() {
        capacity *= 2;
        size_t tempLength = (m_end - m_begin);
        T* tempBegin = new T[capacity];

        memcpy(tempBegin, m_begin, tempLength * sizeof(T));
        delete[] m_begin;

        m_begin = tempBegin;
        m_end = m_begin + tempLength;
        length = tempLength;
    }

public:
    __device__ explicit LocalVector() : capacity(16), length(0) {
        m_begin = new T[capacity];
        m_end = m_begin;
    }

    __device__ T& operator[](unsigned int index) {
        return *(m_begin + index);
    }

    __device__ T* begin() {
        return m_begin;
    }

    __device__ T* end() {
        return m_end;
    }

    __device__ ~LocalVector()
    {
        delete[] m_begin;
        m_begin = nullptr;
    }

    __device__ void add(T t) {
        if (static_cast<size_t>(m_end - m_begin) >= capacity) {
            expand();
        }

        *m_end = t;   // elements already exist from new T[], so assign
        m_end++;
        length++;
    }

    __device__ T pop() {
        // m_end points one past the last element, so step back first.
        m_end--;
        length--;
        T endElement = *m_end;
        return endElement;
    }

    __device__ size_t getSize() {
        return length;
    }
};
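For completeness, a small usage sketch (assuming the class above is in the same .cu file; note that device-side new/delete draw from the device heap, whose size can be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) if a thread needs more storage):
#include <cstdio>

__global__ void localVectorDemo()
{
    LocalVector<int> v;
    for (int i = 0; i < 100; ++i)
        v.add(threadIdx.x * 100 + i); // grows past the initial capacity of 16

    printf("thread %d: size %llu, last %d\n",
           threadIdx.x,
           (unsigned long long)v.getSize(),
           v[v.getSize() - 1]);
}

int main()
{
    localVectorDemo<<<1, 2>>>();
    cudaDeviceSynchronize();
}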

You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including, std::vector is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexprness is effectively limited.)
You probably don't really want to
The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T) - that allocation is on the heap. While this could be adapted to the GPU - it would be quite slow, and incredibly slow if each "CUDA thread" would dynamically allocate its own memory.
Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like the one in cuda-kat. Another option, though a bit "heavier", is to use the NVIDIA Thrust library's "device vector" class. Under the hood, it's quite different from the standard library vector though.
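To illustrate the "pointer and a size" option, here is a minimal hand-rolled span sketch (the device_span name and the printFirst kernel are made up for this example; cuda-kat and C++20's std::span offer real implementations):
#include <cstdio>
#include <cstddef>

// A trivially copyable view: a raw pointer plus an element count, cheap to
// pass to a kernel by value, readable from device code with no allocation.
template <typename T>
struct device_span {
    T*          data;
    std::size_t size;

    __host__ __device__ T&       operator[](std::size_t i)       { return data[i]; }
    __host__ __device__ const T& operator[](std::size_t i) const { return data[i]; }
};

__global__ void printFirst(device_span<const int> s)
{
    if (s.size > 0)
        printf("%d\n", s[0]);
}

int main()
{
    int h_a[3] = {3, 1, 4};
    int *d_a = nullptr;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);

    printFirst<<<1, 1>>>(device_span<const int>{d_a, 3});
    cudaDeviceSynchronize();
    cudaFree(d_a);
}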

Related

cuda::cub error calling a __host__ function from a __device__ function is not allowed

I use cub::DeviceReduce::Sum to compute the summation of a vector, but it gives me these errors:
error: calling a __host__ function("cub::DeviceReduce::Sum<double *, double *> ") from a __device__ function("dotcubdev") is not allowed
error: identifier "cub::DeviceReduce::Sum<double *, double *> " is undefined in device code
The code sample is as follows:
__device__ void sumcubdev(double *a, double *sum, int N)
{
    // Declare, allocate, and initialize device-accessible pointers
    // for input and output
    // Determine temporary device storage requirements
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
The code runs successfully when placed in the main() body, but it doesn't work inside this function.
To use a cub device-wide function from device code, it is necessary to build your project to support CUDA dynamic parallelism. In the cub documentation, this is indicated here:
Usage Considerations
Dynamic parallelism. DeviceReduce methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
For example, you can compile the code you have shown with:
$ cat t1364.cu
#include <cub/cub.cuh>

__device__ void sumcubdev(double *a, double *sum, int N)
{
    // Declare, allocate, and initialize device-accessible pointers
    // for input and output
    // Determine temporary device storage requirements
    void *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
    // Allocate temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    // Run sum-reduction
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, a, sum, N);
}
$ nvcc -arch=sm_35 -dc t1364.cu
$
(CUDA 9.2, CUB 1.8.0)
This means CUB will be launching child kernels to get the work done.
This is not a complete tutorial on how to use CUDA Dynamic Parallelism (CDP). The above is the compile command only and omits the link step. There are many questions here on the cuda tag which discuss CDP; you can read about it in two blog articles and the programming guide, and there are CUDA sample projects showing how to compile and use it.
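For reference, a hedged sketch of what a full separate-compile-and-link sequence might look like, assuming a hypothetical main.cu that defines a kernel and a main() which use sumcubdev (the file names are illustrative; the essential parts are relocatable device code, a cc 3.5+ architecture, and linking against the device runtime library cudadevrt):
$ nvcc -arch=sm_35 -dc t1364.cu -o t1364.o
$ nvcc -arch=sm_35 -dc main.cu  -o main.o
$ nvcc -arch=sm_35 t1364.o main.o -o app -lcudadevrt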

Does Cuda C++ not have tuples in device code?

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = std::make_tuple(1, 2, 3);       // Does not work.
    c[i] = a[i] + b[i];
}
NVCC has lambdas at least, but std::make_tuple fails to compile. Are tuples not allowed in the current version of CUDA?
I've just tried this out: tuple metaprogramming with std:: (std::tuple, std::get, etc.) works in device code with C++14 and --expt-relaxed-constexpr enabled (CUDA 8+) during compilation, e.g. nvcc -std=c++14 xxxx.cu -o yyyyy --expt-relaxed-constexpr. CUDA 9 is required for C++14, but basic std::tuple should work in CUDA 8 if you are limited to that. thrust::tuple works but has some drawbacks: it is limited to 10 items and lacks some of the std::tuple helper functions (e.g. std::tuple_cat). Because tuples and their related functions are compile-time, --expt-relaxed-constexpr should make your std::tuple "just work".
#include <tuple>

__global__ void kernel()
{
    auto t = std::make_tuple(1, 2, 3);
    printf("%d\n", std::get<0>(t));
}

int main()
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
}
#include <thrust/tuple.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = thrust::make_tuple(1, 2, 3);
    c[i] = a[i] + b[i];
}
It seems I needed to use the ones from the Thrust library instead to make them work; the above does compile.
Support for the standard C++ library on the device side is problematic for CUDA, as the standard library does not have the necessary __host__ or __device__ annotations.
That said, both clang and nvcc do have partial support for some functionality. Usually it's limited to constexpr functions that are considered to be __host__ __device__ if you pass --expt-relaxed-constexpr to nvcc (or by default in clang). Clang also has a bit more support for standard math functions. Neither supports anything that relies on C++ runtime (except for memory allocation, printf and assert) as that does not exist on device side.
So, in short -- most of the standard C++ library is unusable on device side in CUDA, though things do slowly improve as more and more functions in the standard library become constexpr.
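A small illustration of that constexpr escape hatch, as a sketch (compile with something like nvcc -std=c++14 --expt-relaxed-constexpr test.cu):
#include <cstdio>

// An ordinary constexpr function with no __host__/__device__ annotation.
constexpr int square(int x) { return x * x; }

__global__ void kernel()
{
    // Calling it from device code compiles when --expt-relaxed-constexpr is
    // passed to nvcc (clang accepts this by default).
    printf("%d\n", square((int)threadIdx.x));
}

int main()
{
    kernel<<<1, 4>>>();
    cudaDeviceSynchronize();
}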
Indeed, CUDA itself does not offer a device-side-capable version of std::tuple. However, I have a full tuple implementation as part of my cuda-kat library (still very much under initial development at the time of writing). thrust's tuple class is limited in the following senses:
Limited to 10 tuple elements.
Recursively expands templated types for every tuple element.
No/partial support for rvalues (e.g. in get())
The tuple implementation in cuda-kat is an adaptation of the EASTL tuple, which in turn is an adaptation of the LLVM project's libc++ tuple. Unlike the EASTL's, however, it is C++11-compatible, so you don't have to have the absolute latest CUDA version. It is possible to extract only the tuple class from the library with oh, I think 4 files or so, if you need just that.

A function calls another function in CUDA C++

I have a problem with CUDA programming!
Input is a matrix A (2 x 2).
Output is a matrix A (2 x 2) where every new value is the 3rd power of the old value.
Example:
input:  A = { 2,2 }    output:  A = { 8,8 }
            { 2,2 }                 { 8,8 }
I have 2 functions in the file CudaCode.CU:
__global__ void Power_of_02(int &a)
{
    a = a * a;
}
//***************
__global__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a * tempt;  // a = a^3
}
and the kernel:
__global__ void CudaProcessingKernel(int *dataA) // kernel function
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int tid = bx * XTHREADS + tx;

    if (tid < 16)
    {
        Power_of_03(dataA[tid]);
    }
    __syncthreads();
}
I think it's right, but this error appears:
calling a __global__ function("Power_of_02") from a __global__ function("Power_of_03") is only allowed on the compute_35 architecture or above
Why am I wrong? How can I fix it?
The error is fairly explanatory. A CUDA function decorated with __global__ represents a kernel. Kernels can be launched from host code. On cc 3.5 or higher GPUs, you can also launch a kernel from device code. So if you call a __global__ function from device code (i.e. from another CUDA function that is decorated with __global__ or __device__), then you must be compiling for the appropriate architecture. This is called CUDA dynamic parallelism, and you should read the documentation to learn how to use it, if you want to use it.
When you launch a kernel, whether from host or device code, you must provide a launch configuration, i.e. the information between the triple-chevron notation:
CudaProcessingKernel<<<grid, threads>>>(d_A);
If you want to use your power-of-2 code from another kernel, you will need to call it in a similar, appropriate fashion.
Based on the structure of your code, however, it seems like you can make things work by declaring your power-of-2 and power-of-3 functions as __device__ functions:
__device__ void Power_of_02(int &a)
{
    a = a * a;
}
//***************
__device__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a * tempt;  // a = a^3
}
This should probably work for you and perhaps was your intent. Functions decorated with __device__ are not kernels (so they are not callable directly from host code) but are callable directly from device code on any architecture. The programming guide will also help to explain the difference.
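Putting it together, a minimal sketch of how the corrected __device__ functions might be driven from the host (the launch configuration and array size here are illustrative, not taken from the question):
#include <cstdio>

__device__ void Power_of_02(int &a) { a = a * a; }

__device__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a * tempt;  // a = a^3
}

__global__ void CudaProcessingKernel(int *dataA)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 16)
        Power_of_03(dataA[tid]);
}

int main()
{
    int h_A[16];
    for (int i = 0; i < 16; ++i) h_A[i] = 2;

    int *d_A;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    CudaProcessingKernel<<<1, 16>>>(d_A);

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    printf("%d\n", h_A[0]); // prints 8
    cudaFree(d_A);
}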

Accessing cusp variable element from device kernel

I have a problem accessing and assigning to a variable of CUSP array1d type from a device/global kernel. The attached code gives these errors:
alay.cu(8): warning: address of a host variable "p1" cannot be directly taken in a device function
alay.cu(8): error: calling a __host__ function("thrust::detail::vector_base<float, thrust::device_malloc_allocator<float> > ::operator []") from a __global__ function("func") is not allowed
Code Below
#include <cusp/blas.h>

cusp::array1d<float, cusp::device_memory> p1(10, 3);

__global__ void func()
{
    p1[blockIdx.x] = p1[blockIdx.x] + blockIdx.x * 5;
}

int main()
{
    func<<<10,1>>>();
    return 0;
}
CUSP matrices and arrays (and the Thrust containers they are built with) are intended for host use only. You cannot directly use them in GPU code.
The canonical way to populate a CUSP sparse matrix would be to construct it in host memory and then copy it across to device memory using the copy constructor, so your trivial example becomes this:
cusp::array1d<float, cusp::host_memory> p1(10);
for (int i = 0; i < 10; i++) p1[i] = 4.f;
cusp::array1d<float, cusp::device_memory> p2 = p1; // data now on device
If you want to manipulate a sparse matrix in device code, you will need to have a kernel specifically for whichever format you are interested in, and pass pointers to each of the device arrays holding the matrix data as arguments to that kernel. There is good Doxygen source annotation for all of the sparse types included in the CUSP distribution.
Your edit still doesn't present anything which couldn't be done on the host without a kernel, viz:
cusp::array1d<float, cusp::host_memory> p1(10, 3.f);
for (int i = 0; i < 10; i++) p1[i] += (i * 5.f);
cusp::array1d<float, cusp::device_memory> p2 = p1; // data now on device
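If you do need to touch the data from your own kernel, the usual pattern is to pass the raw device pointer and the length. A sketch, under the assumption that CUSP's device array1d exposes its storage the way a Thrust device vector does (so &p2[0] yields a thrust::device_ptr):
#include <cusp/array1d.h>
#include <thrust/device_ptr.h>

__global__ void addFiveTimesIndex(float *p, int n)
{
    int i = blockIdx.x;
    if (i < n)
        p[i] += i * 5.0f;
}

int main()
{
    cusp::array1d<float, cusp::device_memory> p2(10, 3.0f);

    // Strip the Thrust wrapper to get a plain device pointer for the kernel.
    float *raw = thrust::raw_pointer_cast(&p2[0]);

    addFiveTimesIndex<<<10, 1>>>(raw, 10);
    cudaDeviceSynchronize();
    return 0;
}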

Create an object with fields on the device directly

I'm trying to create a class that will get allocated on the device. I want the constructor to run on the device, so that the whole object, including the fields inside, is automatically allocated on the device, instead of having to create a host object and then copy it manually to the device.
I'm using thrust device_new
Here is my code:
using namespace thrust;

class Particle
{
public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle *p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
I annotated the constructor with __device__ and used device_new (Thrust), but this doesn't work. Can someone explain to me why?
Cheers for help
I believe the answer lies in the description given here. Someone who knows thrust under the hood will probably come along and indicate whether this is true or not.
Although thrust has changed a lot since 2009, I believe device_new may still be using some form of operation where the object is actually temporarily instantiated on the host, then copied to the device. I believe the size limitation described in the above reference is no longer applicable, however.
I was able to get this to work:
#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 512

using namespace thrust;

class Particle
{
public:
    int data[N];

    __device__ __host__ Particle()
    {
        // data = new int[10];
        for (int i = 0; i < N; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle *p)
{
    for (int i = 0; i < N; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
Interestingly, it gives bogus results if I omit the __host__ decorator on the constructor, suggesting to me that the temporary object copy mechanism is still in place. It also gives bogus results (and cuda-memcheck reports out-of-bounds access errors) if I switch to using the dynamic allocation for data instead of static, also suggesting to me that device_new is using a temporary object creation on the host followed by a copy to the device.
First of all, thanks to Robert Crovella for his input (and previous answers).
So apparently I "overestimated" what device_new can do: I thought it would initialise the object directly on the device, so that any memory dynamically allocated inside would also be allocated on the device.
But it seems like device_new is basically just doing the same as the manual way:
Particle temp;
Particle *d_p;
cudaMalloc(&d_p, sizeof(Particle));
cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);
So it makes a temp host object and copies it, just as it would be done manually. That means the memory allocated inside the object is allocated on the host, and only the pointer gets copied as part of the object, so you cannot use that memory in a kernel; you have to copy that memory to the device manually, and Thrust doesn't seem to do that.
So it's just a cleaner way of creating a temp host object and copying it, except that you lose the ability to copy the dynamic memory allocated inside since you don't have access to that temp variable.
I hope that in the future there will be a method or a feature in CUDA that lets you initialise the object directly on the device, so that any dynamically allocated data in the constructor (or elsewhere) is allocated on the device too, instead of the tedious way of copying every piece of memory manually.
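For what it's worth, the manual "deep copy" described above currently looks something like this sketch (names are illustrative): allocate the inner buffer on the device, point a host-side copy of the object at that device buffer, then copy the object itself across.
#include <cstdio>

struct Particle {
    int *data; // points to 10 ints
};

__global__ void test(Particle *p)
{
    for (int i = 0; i < 10; ++i)
        printf("%d\n", p->data[i]);
}

int main()
{
    int h_data[10];
    for (int i = 0; i < 10; ++i) h_data[i] = i * 2;

    // 1. Allocate and fill the inner buffer on the device.
    int *d_data;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    // 2. Build a host-side Particle whose pointer holds the *device* address.
    Particle temp;
    temp.data = d_data;

    // 3. Copy the fixed-up object to the device.
    Particle *d_p;
    cudaMalloc(&d_p, sizeof(Particle));
    cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);

    test<<<1, 1>>>(d_p);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFree(d_p);
}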