Create an object with fields on the device directly - CUDA

I'm trying to create a class that gets allocated on the device. I want the constructor to run on the device so that the whole object, including the fields inside, is allocated on the device automatically, instead of having to create a host object and then copy it manually to the device.
I'm using thrust's device_new.
Here is my code:
#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

using namespace thrust;

class Particle
{
public:
    int* data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
I annotated the constructor with __device__ and used thrust's device_new, but this doesn't work. Can someone explain why?
Cheers for help

I believe the answer lies in the description given here. Someone who knows thrust under the hood will probably come along and indicate whether this is true or not.
Although thrust has changed a lot since 2009, I believe device_new may still be using some form of operation where the object is actually temporarily instantiated on the host, then copied to the device. I believe the size limitation described in the above reference is no longer applicable, however.
I was able to get this to work:
#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 512

using namespace thrust;

class Particle
{
public:
    int data[N];

    __device__ __host__ Particle()
    {
        // data = new int[10];
        for (int i = 0; i < N; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i = 0; i < N; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
Interestingly, it gives bogus results if I omit the __host__ decorator on the constructor, suggesting to me that the temporary-object copy mechanism is still in place. It also gives bogus results (and cuda-memcheck reports out-of-bounds access errors) if I switch to dynamic allocation for data instead of the static array, which again suggests that device_new creates a temporary object on the host and then copies it to the device.

First of all, thanks to Robert Crovella for his input (and previous answers).
So apparently I "overestimated" what device_new can do: I thought it could initialise the object directly on the device, so that any dynamically allocated memory inside would be allocated on the device too.
But it seems device_new is basically just doing the same thing as the manual way:
Particle temp;
Particle *d_p;
cudaMalloc(&d_p, sizeof(Particle));
cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);
So it makes a temporary host object and copies it, just as you would do manually. That means the memory allocated inside the object is allocated on the host, and only the pointer gets copied as part of the object, so you cannot use that memory in a kernel; you would have to copy that memory to the device manually, and thrust doesn't seem to be doing that.
So it's just a cleaner way of creating a temporary host object and copying it, except that you lose the ability to copy the dynamic memory allocated inside, since you don't have access to that temporary.
I hope that in the future there will be a method or feature in CUDA that lets you initialise the object directly on the device, so that any dynamically allocated data in the constructor (or elsewhere) is allocated on the device too, instead of the tedious way of copying every piece of memory manually.
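For reference, here is a minimal sketch of what that manual deep copy looks like. This is my illustration, not code from the thread; it assumes the original pointer-based Particle, with the constructor additionally marked __host__ so the temporary can be built on the host:
#include <stdio.h>

class Particle
{
public:
    int* data;

    __host__ __device__ Particle()
    {
        data = new int[10];            // on the host this is a host allocation
        for (int i = 0; i < 10; i++)
            data[i] = i * 2;
    }
};

__global__ void test(Particle* p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    Particle temp;                     // host object; temp.data points to host memory

    Particle *d_p;
    int *d_data;
    cudaMalloc(&d_p, sizeof(Particle));
    cudaMalloc(&d_data, 10 * sizeof(int));

    // shallow-copy the object (its data pointer is still a host pointer at this point)
    cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);
    // copy the array the object points to
    cudaMemcpy(d_data, temp.data, 10 * sizeof(int), cudaMemcpyHostToDevice);
    // patch the device object's data pointer so it points at the device copy
    cudaMemcpy(&(d_p->data), &d_data, sizeof(int*), cudaMemcpyHostToDevice);

    test<<<1,1>>>(d_p);
    cudaDeviceSynchronize();
    printf("Done!\n");
    return 0;
}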

Related

Allocating device memory for a __global__ function in CUDA

I want to do this program in CUDA.
In "main.cpp":
struct Center{
    double *Data;
    int dimension;
};
typedef struct Center Center;
// I allocate a pointer to M Center elements with cudaMalloc as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, n = 4;
cudaStatus = cudaMalloc((void**)&V_dev, M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, n); // I always know the dimension n before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm on Visual Studio 2008 and CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed.
How can I do this? (I know that malloc and other CPU memory allocation functions are not normally allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace them with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.
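For illustration, here is a minimal sketch of the question's pattern once it is built for sm_20 or later. This is my sketch rather than code from the answers; note that memory obtained with in-kernel malloc lives on the device heap and must be freed with in-kernel free:
#include <cstdio>
#include <cstdlib>

struct Center {
    double *Data;
    int dimension;
};

// Requires a cc2.0+ GPU and e.g. -arch=sm_20 (or newer) for in-kernel malloc.
__global__ void Init(Center *V, int dimension)
{
    V[threadIdx.x].dimension = dimension;
    V[threadIdx.x].Data = (double*)malloc(dimension * sizeof(double));
    if (V[threadIdx.x].Data != NULL)          // device malloc can fail, so check
        for (int i = 0; i < dimension; i++)
            V[threadIdx.x].Data[i] = 0.0;
}

__global__ void Cleanup(Center *V)
{
    free(V[threadIdx.x].Data);                // in-kernel allocations need in-kernel free
}

int main()
{
    const int M = 100, n = 4;
    Center *V_dev;
    cudaMalloc((void**)&V_dev, M * sizeof(Center));
    Init<<<1, M>>>(V_dev, n);
    Cleanup<<<1, M>>>(V_dev);
    cudaDeviceSynchronize();
    cudaFree(V_dev);
    printf("done\n");
    return 0;
}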

Simplest way to clear CUDA shared memory between kernel runs

I am trying to implement a box filter in C/CUDA, starting with a matrix-average problem first. When I run the following code without commenting out the lines inside the for loops, I get a certain output. But when I comment those lines out, it generates the same output again!
if (tx == 0)
    for (int i = 1; i <= radius; i++)
    {
        //sharedTile[radius+ty][radius-i] = 6666.0;
    }
if (tx == (Dx-1))
    for (int i = 0; i < radius; i++)
    {
        //sharedTile[radius+ty][radius+Dx+i] = 7777;
    }
if (ty == 0)
    for (int i = 1; i <= radius; i++)
    {
        //sharedTile[radius-i][radius+tx] = 8888;
    }
if (ty == (Dy-1))
    for (int i = 0; i < radius; i++)
    {
        //sharedTile[radius+Dy+i][radius+tx] = 9999;
    }
if ((tx == 0) && (ty == 0))
    for (int i = globalRow, l = 0; i < HostPaddedRow, l < radius; i++, l++)
    {
        for (int j = globalCol, m = 0; j < HostPaddedCol, m < radius; j++, m++)
        {
            //sharedTile[l][m] = 8866;
        }
    }
if ((tx == (Dx-1)) && (ty == (Dx-1)))
    for (int i = (HostPaddedRow+1), l = (radius+Dx); i < (HostPaddedRow+1+radius), l < (TILE+2*radius); i++, l++)
    {
        for (int j = HostPaddedCol, m = (radius+Dx); j < (HostPaddedCol+radius), m < (TILE+2*radius); j++, m++)
        {
            //sharedTile[l][m] = 7799.0;
        }
    }
if ((tx == (Dx-1)) && (ty == 0))
    for (int i = (globalRow), l = 0; i < HostPaddedRow, l < radius; i++, l++)
    {
        for (int j = (HostPaddedCol+1), m = (radius+Dx); j < (HostPaddedCol+1+radius), m < (TILE+2*radius); j++, m++)
        {
            //sharedTile[l][m] = 9966;
        }
    }
if ((tx == 0) && (ty == (Dy-1)))
    for (int i = (HostPaddedRow+1), l = (radius+Dy); i < (HostPaddedRow+1+radius), l < (TILE+2*radius); i++, l++)
    {
        for (int j = globalCol, m = 0; j < HostPaddedCol, m < radius; j++, m++)
        {
            //sharedTile[l][m] = 0.0;
        }
    }
__syncthreads();
You can ignore the details of those for-loop conditions; they are irrelevant here.
My basic question is: why am I getting the same values even after commenting out those lines? I tried making some modifications in my main program and kernel as well. I also introduced deliberate errors, removed them, and compiled and ran the same code again, but I still get the same values. Is there any way to clear cache memory in CUDA?
I am using Nsight + RedHat + CUDA 5.5.
Thanks in advance.
Why am I getting the same values even after commenting out those lines?
It seems sharedTile is pointing to the same piece of memory between multiple consecutive runs, which is absolutely normal. The commented-out code therefore does not "generate" anything; your pointer is simply pointing to the same memory, which was never flushed.
Is there any way to clear cache memory in CUDA?
I believe you are talking about clearing shared memory? If so, you can use an analogue of the approach described here. Instead of using cudaMemset in host code, you zero out your shared memory from inside the kernel. The simplest approach is to place the following code at the beginning of the kernel that declares sharedTile (this is for one-dimensional thread blocks and a one-dimensional shared memory array):
__global__ void your_kernel(int count) {
    // dynamically allocated shared memory (size supplied at kernel launch)
    extern __shared__ float sharedTile[];
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedTile[i] = 0.0f;
    __syncthreads();
    // your code here
}
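For the statically sized 2D tile in the question, a similar zeroing step could look like the sketch below. This is my addition; TILE and RADIUS stand in for whatever compile-time constants the question's kernel actually uses:
#define TILE   16
#define RADIUS 2
#define PADDED (TILE + 2 * RADIUS)

__global__ void box_filter_kernel(/* ... */)
{
    __shared__ float sharedTile[PADDED][PADDED];

    // every thread clears a strided slice of the tile before it is filled
    for (int idx = threadIdx.y * blockDim.x + threadIdx.x;
         idx < PADDED * PADDED;
         idx += blockDim.x * blockDim.y)
    {
        sharedTile[idx / PADDED][idx % PADDED] = 0.0f;
    }
    __syncthreads();

    // ... rest of the kernel ...
}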
The following approaches do not guarantee cleared shared memory, as Robert Crovella pointed out in the comment below:
Or possibly call nvidia-smi with the --gpu-reset parameter.
Yet another solution was offered in another SO thread, which involves unloading and reloading the driver.

Accessing cusp variable element from device kernel

I have a problem accessing and assigning a variable of cusp array1d type from a device/global kernel. The attached code gives these errors:
alay.cu(8): warning: address of a host variable "p1" cannot be directly taken in a device function
alay.cu(8): error: calling a __host__ function("thrust::detail::vector_base<float, thrust::device_malloc_allocator<float> > ::operator []") from a __global__ function("func") is not allowed
Code Below
#include <cusp/blas.h>

cusp::array1d<float, cusp::device_memory> p1(10,3);

__global__ void func()
{
    p1[blockIdx.x] = p1[blockIdx.x] + blockIdx.x * 5;
}

int main()
{
    func<<<10,1>>>();
    return 0;
}
CUSP matrices and arrays (and the Thrust containers they are built with) are intended for host use only. You cannot directly use them in GPU code.
The canonical way to populate a CUSP sparse matrix would be to construct it in host memory and then copy it across to device memory using the copy constructor, so your trivial example becomes this:
cusp::array1d<float, cusp::host_memory> p1(10);
for(int i=0; i<10; i++) p1[i] = 4.f;
cusp::array1d<float, cusp::device_memory> p2 = p1; // data now on device
If you want to manipulate a sparse matrix in device code, you will need to have a kernel specifically for whichever format you are interested in, and pass pointers to each of the device arrays holding the matrix data as arguments to that kernel. There is good Doxygen source annotation for all of the sparse types included in the CUSP distribution.
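As a small illustration of that idea for the array1d case (my sketch, not from the original answer): you can pull a raw device pointer out of a cusp::array1d with thrust::raw_pointer_cast and hand it to your own kernel:
#include <cstdio>
#include <cusp/array1d.h>
#include <thrust/device_ptr.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    cusp::array1d<float, cusp::device_memory> p(10, 3.f);
    float *raw = thrust::raw_pointer_cast(&p[0]);   // raw device pointer to the underlying storage
    scale<<<1, 10>>>(raw, 10, 5.f);
    cudaDeviceSynchronize();
    printf("%f\n", (float)p[0]);                    // expect 15
    return 0;
}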
Your edit still doesn't present anything which couldn't be done on the host without a kernel, viz:
cusp::array1d<float, cusp::host_memory> p1(10, 3.f);
for(int i=0; i<10; i++) p1[i] += (i * 5.f);
cusp::array1d<float, cusp::device_memory> p2 = p1; // data now on device

Using std::vector in CUDA device code

The question is: is there a way to use the class "vector" in CUDA kernels? When I try, I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So is there a way to use a vector in __global__ code?
I recently tried the following:
create a new CUDA project
go to the properties of the project
open CUDA C/C++
go to Device
change the value in "Code Generation" to this value:
compute_20,sm_20
After that I was able to use the printf standard library function in my CUDA kernel.
Is there a way to use the standard library class vector in the same way printf is supported in kernel code? This is an example of using printf in kernel code:
// this code is only to count the 3s in an array using CUDA
// private_count is an array to hold every thread's result separately
__global__ void countKernel(int *a, int length, int* private_count)
{
    printf("%d\n", threadIdx.x); // it prints the thread id and it's working
    // vector<int> y;
    // y.push_back(0); is there a possibility to do this?
    unsigned int offset = threadIdx.x * length;
    int i = offset;
    for( ; i < offset + length; i++)
    {
        if(a[i] == 3)
        {
            private_count[threadIdx.x]++;
            printf("%d ", a[i]);
        }
    }
}
You can't use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise just copy the contents of the vector to the device and operate on it normally.
In the CUDA library Thrust, you can use thrust::device_vector<classT> to define a vector on the device, and data transfer between a host STL vector and a device vector is very straightforward. You can refer to this useful link: http://docs.nvidia.com/cuda/thrust/index.html to find some useful examples.
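To illustrate that approach (a sketch of my own, not from the original answer): copy a host std::vector into a thrust::device_vector, then pass the raw device pointer to a kernel:
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    std::vector<int> h_vec(256, 41);                              // host data
    thrust::device_vector<int> d_vec(h_vec.begin(), h_vec.end()); // host -> device copy
    int *raw = thrust::raw_pointer_cast(d_vec.data());            // raw pointer for the kernel
    increment<<<(256 + 127) / 128, 128>>>(raw, 256);
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());      // device -> host copy
    printf("%d\n", h_vec[0]);                                     // expect 42
    return 0;
}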
You can't use std::vector in device code; you should use a plain array instead.
I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code. Operators new/delete are also supported. Here is an extremely simple prototype of a device vector in CUDA; it does work, but it hasn't been tested sufficiently.
template<typename T>
class LocalVector
{
private:
    T* m_begin;
    T* m_end;

    size_t capacity;
    size_t length;

    // double the capacity and move the existing elements over
    // (memcpy is only appropriate for trivially copyable T)
    __device__ void expand() {
        capacity *= 2;
        size_t tempLength = (m_end - m_begin);
        T* tempBegin = new T[capacity];

        memcpy(tempBegin, m_begin, tempLength * sizeof(T));
        delete[] m_begin;
        m_begin = tempBegin;
        m_end = m_begin + tempLength;
        length = static_cast<size_t>(m_end - m_begin);
    }

public:
    __device__ explicit LocalVector() : capacity(16), length(0) {
        m_begin = new T[capacity];
        m_end = m_begin;
    }

    __device__ T& operator[] (unsigned int index) {
        return *(m_begin + index); // *(begin+index)
    }

    __device__ T* begin() {
        return m_begin;
    }

    __device__ T* end() {
        return m_end;
    }

    __device__ ~LocalVector()
    {
        delete[] m_begin;
        m_begin = nullptr;
    }

    __device__ void add(T t) {
        if (static_cast<size_t>(m_end - m_begin) >= capacity) {
            expand();
        }

        // the slot already holds a default-constructed T (from new T[]), so just assign
        *m_end = t;
        m_end++;
        length++;
    }

    __device__ T pop() {
        // m_end points one past the last element, so step back before reading
        m_end--;
        T endElement = *m_end;
        length--;
        return endElement;
    }

    __device__ size_t getSize() {
        return length;
    }
};
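A hypothetical usage sketch (my addition, not part of the original answer): each thread builds its own LocalVector. This needs a cc2.0+ device for in-kernel new/delete, and the device heap may need to be enlarged via cudaDeviceSetLimit for heavier use:
#include <cstdio>

__global__ void useLocalVector()
{
    LocalVector<int> v;                       // per-thread vector on the device heap
    for (int i = 0; i < 100; i++)
        v.add(i * (int)threadIdx.x);
    printf("thread %d: size %u, last element %d\n",
           (int)threadIdx.x, (unsigned)v.getSize(), v[(unsigned)v.getSize() - 1]);
}

int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024); // enlarge device heap (optional here)
    useLocalVector<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}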
You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including, std::vector is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexprness is effectively limited.)
You probably don't really want to
The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T), that allocation is on the heap. While this could be adapted to the GPU, it would be quite slow, and incredibly slow if each "CUDA thread" dynamically allocated its own memory.
Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-kat. Another option, though a bit "heavier", is to use the NVIDIA Thrust library's "device vector" class. Under the hood, it's quite different from the standard library vector though.

CUDA global (as in C) dynamic arrays allocated to device memory

So, I'm trying to write some code that utilizes NVIDIA's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I am trying to move a large amount of data onto the device.
As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance.
So, I have device functions that want to access a device allocated array.
Ideally, I could do something like:
__device__ float* global_data;

main()
{
    cudaMalloc(global_data);
    kernel1<<<blah>>>(blah); // access global data
    kernel2<<<blah>>>(blah); // access global data again
}
However, I haven't figured out how to create a dynamic array. I figured out a work-around by declaring the array as follows:
__device__ float global_data[REALLY_LARGE_NUMBER];
And while that doesn't require a cudaMalloc call, I would prefer the dynamic allocation approach.
Something like this should probably work.
#include <algorithm>
#include <cstdio>   // fprintf
#include <cstdlib>  // exit

#define NDEBUG
#define CUT_CHECK_ERROR(errorMessage) do {                                    \
        cudaThreadSynchronize();                                              \
        cudaError_t err = cudaGetLastError();                                 \
        if( cudaSuccess != err) {                                             \
            fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
                    errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) );\
            exit(EXIT_FAILURE);                                               \
        } } while (0)

__device__ float *devPtr;

__global__
void kernel1(float *some_neat_data)
{
    // stash the pointer in the global-scope __device__ variable
    devPtr = some_neat_data;
}

__global__
void kernel2(void)
{
    devPtr[threadIdx.x] *= .3f;
}

int main(int argc, char *argv[])
{
    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    CUT_CHECK_ERROR("kernel1");

    kernel2<<<1,128>>>();
    CUT_CHECK_ERROR("kernel2");

    return 0;
}
Give it a whirl.
Spend some time focusing on the copious documentation offered by NVIDIA.
From the Programming Guide:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:
__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x]++;
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}
So now you can invoke them like so:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);
As this data is used in numerous functions, I would like it to be global.
There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
EDIT:
Ok, the fundamental problem is this: your kernels can only access device memory and the only global-scope pointers that they can use are GPU ones. When calling a kernel from your CPU, behind the scenes what happens is that the pointers and primitives get copied into GPU registers and/or shared memory before the kernel gets executed.
So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the background, consider that a different approach might be the Right Thing.
#include <algorithm>

__constant__ float devPtr[1024];

__global__
void kernel1(float *some_neat_data)
{
    some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}

__global__
void kernel2(float *potentially_that_same_neat_data)
{
    potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}

int main(int argc, char *argv[])
{
    float some_data[256];
    for (int i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)
    {
        some_data[i] = i * 2;
    }
    cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr)));

    float* otherDevPtr;
    cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
    cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));

    kernel1<<<1,128>>>(otherDevPtr);
    kernel2<<<1,128>>>(otherDevPtr);

    return 0;
}
Don't forget '--host-compilation=c++' for this example.
I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple global function similar to kernel1.
The good news is that it does work :)
However, I think it confuses the compiler as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.
Anyway, for the record, I have looked at many of the examples, and I did run through the NVIDIA exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an SDK example that does dynamic global device memory allocation, I would still like to know about it.
Erm, moving devPtr to global scope was exactly my problem.
I have an implementation that does exactly that, with the two kernels having a pointer to data passed in. I explicitly don't want to pass in those pointers.
I have read the documentation fairly closely and hit up the NVIDIA forums (and Google-searched for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (I have tried several that compile and then fail in new and interesting ways).
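For what it's worth, here is my sketch of the "exercise" mentioned above: a global-scope __device__ pointer set from the host with cudaMemcpyToSymbol after a cudaMalloc. This is my illustration, not code from the thread:
#include <cstdio>

__device__ float *global_data;     // global-scope device pointer, visible to all kernels

__global__ void kernel1() { global_data[threadIdx.x] = (float)threadIdx.x; }
__global__ void kernel2() { global_data[threadIdx.x] *= 0.3f; }

int main()
{
    float *d_buf;
    cudaMalloc((void**)&d_buf, 128 * sizeof(float));
    // point the global-scope device symbol at the freshly allocated buffer
    cudaMemcpyToSymbol(global_data, &d_buf, sizeof(d_buf));

    kernel1<<<1,128>>>();
    kernel2<<<1,128>>>();

    float h[128];
    cudaMemcpy(h, d_buf, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%f\n", h[1]);          // expect 0.3
    cudaFree(d_buf);
    return 0;
}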
Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.
As this data is used in numerous functions, I would like it to be global.
-
There are few good reasons to use globals. This definitely is not one. I'll leave it as an
exercise to expand this example to include moving "devPtr" to a global scope.
What if the kernel operates on a large const structure consisting of arrays? Using the so-called constant memory is not an option, because it's very limited in size... so then you have to put it in global memory?