CUDA/Thrust double pointer problem (vector of pointers) - cuda

Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to access a double pointer in a CUDA kernel, loaded from the host with a thrust::device_vector of type Object* (a vector of pointers). When compiled with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error upon attempting to run the program.
I have read the Nvidia forums and the solution seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending to the kernel...Has anyone found a solution to this problem? The required code is below, thanks in advance!
--------------------------
main.cpp
--------------------------
Sphere * parseSphere(int i)
{
Sphere * s = new Sphere();
s->a = 1+i;
s->b = 2+i;
s->c = 3+i;
return s;
}
int main( int argc, char** argv ) {
int i;
thrust::host_vector<Sphere *> spheres_h;
thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);
//initialize spheres_h
for(i=0;i<NUM_OBJECTS;i++){
Sphere * sphere = parseSphere(i);
spheres_h.push_back(sphere);
}
//initialize spheres_resh
for(i=0;i<NUM_OBJECTS;i++){
spheres_resh[i].a = 1;
spheres_resh[i].b = 1;
spheres_resh[i].c = 1;
}
thrust::device_vector<Sphere *> spheres_dv = spheres_h;
thrust::device_vector<Sphere> spheres_resv = spheres_resh;
Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);
kernelBegin(spheres_d,spheres_res,NUM_OBJECTS);
thrust::copy(spheres_dv.begin(),spheres_dv.end(),spheres_h.begin());
thrust::copy(spheres_resv.begin(),spheres_resv.end(),spheres_resh.begin());
bool result = true;
for(i=0;i<NUM_OBJECTS;i++){
result &= (spheres_resh[i].a == i+1);
result &= (spheres_resh[i].b == i+2);
result &= (spheres_resh[i].c == i+3);
}
if(result)
{
cout << "Data GOOD!" << endl;
}else{
cout << "Data BAD!" << endl;
}
return 0;
}
--------------------------
cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float
num_objects)
{
int index = threadIdx.x + blockIdx.x*blockDim.x;
spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
spheres_res[index].b = (*(spheres_d+index))->b;
spheres_res[index].c = (*(spheres_d+index))->c;
}
void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
int threads = 512;//per block
int grids = ((num_objects)/threads)+1;//blocks per grid
deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}

The basic problem here is that device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces. So when you copy spheres_h to GPU memory, you are winding up with a GPU array of host pointers. Indirection of host pointers on the GPU is illegal - they are pointers in the wrong memory address space, thus you are getting the GPU equivalent of a segfault inside the kernel.
The solution is going to involve replacing your parseSphere function with something that allocates memory on the GPU; at present it allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and were using CUDA 3.2 or 4.0, then one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
The alternative approach involves creating a host array holding device pointers, then copying that array to device memory. You can see an example of that in this answer. Note that declaring a thrust::device_vector containing thrust::device_vector probably won't work, so you will likely need to construct this array of device pointers using the underlying CUDA API calls.
You should also note that I haven't mentioned the reverse copy operation, which is equally difficult to do.
The bottom line is that thrust (and C++ STL containers for that matter) really are not intended to hold pointers. They are intended to hold values, and abstract away pointer indirection and direct memory access through the use of iterators and underlying algorithms which the user isn't supposed to see. Further, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it executes slower on the GPU as well.
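The second approach (a host array of device pointers, copied to the device with plain CUDA API calls) might look roughly like the sketch below. The Sphere layout with three float members, the kernel name gatherSpheres and the helper runDevicePointerTable are assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed layout; the question does not show the real definition of Sphere.
struct Sphere { float a, b, c; };

// Dereferences one device pointer per thread and copies the pointee by value.
__global__ void gatherSpheres(Sphere **spheres, Sphere *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)                      // guard: the grid may be larger than n
        out[i] = *spheres[i];
}

// Hypothetical helper: returns true if every sphere survives the round trip.
bool runDevicePointerTable(int n)
{
    // Host array whose elements are DEVICE pointers, one allocation per object.
    std::vector<Sphere *> table_h(n);
    for (int i = 0; i < n; ++i) {
        Sphere s = {1.0f + i, 2.0f + i, 3.0f + i};
        cudaMalloc(&table_h[i], sizeof(Sphere));
        cudaMemcpy(table_h[i], &s, sizeof(Sphere), cudaMemcpyHostToDevice);
    }

    // Copy the pointer table itself to the device.
    Sphere **table_d = NULL;
    Sphere *out_d = NULL;
    cudaMalloc(&table_d, n * sizeof(Sphere *));
    cudaMalloc(&out_d, n * sizeof(Sphere));
    cudaMemcpy(table_d, table_h.data(), n * sizeof(Sphere *), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    gatherSpheres<<<blocks, threads>>>(table_d, out_d, n);

    // cudaMemcpy waits for the kernel to finish before copying the results back.
    std::vector<Sphere> out_h(n);
    cudaMemcpy(out_h.data(), out_d, n * sizeof(Sphere), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        ok = ok && out_h[i].a == 1.0f + i && out_h[i].b == 2.0f + i && out_h[i].c == 3.0f + i;
        cudaFree(table_h[i]);
    }
    cudaFree(table_d);
    cudaFree(out_d);
    return ok;
}
```

Calling runDevicePointerTable(16) from main should return true on a working device. One allocation per object is slow for large counts, so storing the objects by value in a single thrust::device_vector<Sphere> is usually the simpler design when it is an option.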

Related

Particular Allocating device memory for __global__ function in cuda

I want to run this program on CUDA.
1. In "main.cpp":
struct Center{
double * Data;
int dimension;
};
typedef struct Center Center;
// I allocate an array of M Center elements with cudaMalloc as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, N = 4;
cudaStatus = cudaMalloc((void**)&V_dev,M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, N); // I always know the value of N before the call
My "kernel.cu" file is something like this:
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm on Visual Studio 2008 and CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed.
How can I do this? (I know that 'malloc' and other CPU memory allocation functions are not allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.
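As a rough sketch of what in-kernel allocation can look like once the project targets cc2.0 or higher: the kernel names initCenters and freeCenters, the failure counter and the helper allocateCenters are hypothetical additions, not part of the original question.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

struct Center {
    double *Data;
    int dimension;
};

// Each thread allocates its own buffer from the device heap (needs cc2.0 or higher).
__global__ void initCenters(Center *V, int count, int dimension, int *failed)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= count) return;
    V[i].dimension = dimension;
    V[i].Data = (double *)malloc(dimension * sizeof(double));
    if (V[i].Data == NULL) {        // device malloc returns NULL when the heap is exhausted
        atomicAdd(failed, 1);
        return;
    }
    for (int k = 0; k < dimension; k++)
        V[i].Data[k] = 0.0;
}

// Memory obtained from device malloc is released with device free, not cudaFree.
__global__ void freeCenters(Center *V, int count)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < count) free(V[i].Data);
}

// Hypothetical helper: returns the number of threads whose allocation failed.
int allocateCenters(int count, int dimension)
{
    Center *V_dev = NULL;
    int *failed_dev = NULL;
    int failed = 0;
    cudaMalloc(&V_dev, count * sizeof(Center));
    cudaMalloc(&failed_dev, sizeof(int));
    cudaMemcpy(failed_dev, &failed, sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (count + threads - 1) / threads;
    initCenters<<<blocks, threads>>>(V_dev, count, dimension, failed_dev);
    freeCenters<<<blocks, threads>>>(V_dev, count);

    cudaMemcpy(&failed, failed_dev, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(V_dev);
    cudaFree(failed_dev);
    return failed;
}
```

The device heap has a limited default size, so checking the pointer returned by malloc inside the kernel is worth the extra lines.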

Dynamic Shared Memory in CUDA

There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays into shared memory. Now, the size of the arrays is given by a variable that is read into the program after the start of execution. Because of this, I cannot use that variable to define the size of the arrays, due to the fact that defining the size of shared arrays requires knowing the value at compile time. I do not want to do something like __shared__ double arr1[1000] because typing in the size by hand is useless to me as that will change depending on the input. In the same vein, I cannot use #define to create a constant for the size.
Now I can follow an example similar to what is in the manual (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared) such as
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
But this still runs into an issue. From what I've read, defining a shared array always makes the memory address the first element. That means I need to make my second array shifted over by the size of the first array, as they appear to do in this example. But the size of the first array is dependent on user input.
Another question (Cuda Shared Memory array variable) has a similar issue, and they were told to create a single array that would act as the array for both arrays and simply adjust the indices to properly match the arrays. While this does seem to do what I want, it looks very messy. Is there any way around this so that I can still maintain two independent arrays, each with sizes that are defined as input by the user?
When using dynamic shared memory with CUDA, there is one and only one pointer passed to the kernel, which defines the start of the requested/allocated area in bytes:
extern __shared__ char array[];
There is no way to handle it differently. However this does not prevent you from having two user-sized arrays. Here's a worked example:
$ cat t501.cu
#include <stdio.h>
__global__ void my_kernel(unsigned arr1_sz, unsigned arr2_sz){
extern __shared__ char array[];
double *my_ddata = (double *)array;
char *my_cdata = arr1_sz*sizeof(double) + array;
for (int i = 0; i < arr1_sz; i++) my_ddata[i] = (double) i*1.1f;
for (int i = 0; i < arr2_sz; i++) my_cdata[i] = (char) i;
printf("at offset %d, arr1: %lf, arr2: %d\n", 10, my_ddata[10], (int)my_cdata[10]);
}
int main(){
unsigned double_array_size = 256;
unsigned char_array_size = 128;
unsigned shared_mem_size = (double_array_size*sizeof(double)) + (char_array_size*sizeof(char));
my_kernel<<<1,1, shared_mem_size>>>(256, 128);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -arch=sm_20 -o t501 t501.cu
$ cuda-memcheck ./t501
========= CUDA-MEMCHECK
at offset 10, arr1: 11.000000, arr2: 10
========= ERROR SUMMARY: 0 errors
$
If you have a random arrangement of arrays of mixed data types, you'll want to either manually align your array starting points (and request enough shared memory) or else use alignment directives (and be sure to request enough shared memory), or use structures to help with alignment.
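A minimal sketch of the manual alignment option, assuming a char array placed first and a double array placed after it. The helper alignUp, the kernel twoArrays and the wrapper runTwoArrays are hypothetical names, and both sizes are assumed to be at least 1.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Rounds offset up to the next multiple of alignment.
__host__ __device__ inline size_t alignUp(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

__global__ void twoArrays(unsigned nChars, unsigned nDoubles, double *result)
{
    // One dynamic allocation, aligned for the widest type stored in it.
    extern __shared__ __align__(8) unsigned char smem[];
    char *chars = (char *)smem;
    // Start the doubles at the next 8-byte boundary after the chars.
    double *doubles = (double *)(smem + alignUp(nChars, sizeof(double)));

    for (unsigned i = threadIdx.x; i < nChars; i += blockDim.x) chars[i] = (char)i;
    for (unsigned i = threadIdx.x; i < nDoubles; i += blockDim.x) doubles[i] = 0.5 * i;
    __syncthreads();

    if (threadIdx.x == 0)
        *result = doubles[nDoubles - 1] + chars[nChars - 1];
}

// Hypothetical wrapper: requests enough shared memory for both arrays plus padding.
double runTwoArrays(unsigned nChars, unsigned nDoubles)
{
    size_t bytes = alignUp(nChars, sizeof(double)) + nDoubles * sizeof(double);
    double *result_d = NULL;
    double result = 0.0;
    cudaMalloc(&result_d, sizeof(double));
    twoArrays<<<1, 32, bytes>>>(nChars, nDoubles, result_d);
    cudaMemcpy(&result, result_d, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(result_d);
    return result;
}
```

With 13 chars and 5 doubles the doubles start at byte offset 16 rather than 13, which is what avoids a misaligned access. Placing the widest type first, as in the worked example above, avoids the padding entirely.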

Accessing cusp variable element from device kernel

I have a problem accessing and assigning to a variable of type cusp::array1d from a device/global kernel. The code below gives these errors:
alay.cu(8): warning: address of a host variable "p1" cannot be directly taken in a device function
alay.cu(8): error: calling a __host__ function("thrust::detail::vector_base<float, thrust::device_malloc_allocator<float> > ::operator []") from a __global__ function("func") is not allowed
Code below:
#include <cusp/blas.h>
cusp::array1d<float, cusp::device_memory> p1(10,3);
__global__ void func()
{
p1[blockIdx.x]=p1[blockIdx.x]+blockIdx.x*5;
}
int main()
{
func<<<10,1>>>();
return 0;
}
CUSP matrices and arrays (and the Thrust containers they are built with) are intended for host use only. You cannot directly use them in GPU code.
The canonical way to populate a CUSP sparse matrix would be to construct it in host memory and then copy it across to device memory using the copy constructor, so your trivial example becomes this:
cusp::array1d<float, cusp::host_memory> p1(10);
for(int i=0; i<10; i++) p1[i] = 4.f;
cusp::array1d<float, cusp::device_memory> p2(p1); // data now on device
If you want to manipulate a sparse matrix in device code, you will need to have a kernel specifically for whichever format you are interested in, and pass pointers to each of the device arrays holding the matrix data as arguments to that kernel. There is good Doxygen source annotation for all of the sparse types included in the CUSP distribution.
Your edit still doesn't present anything which couldn't be done on the host without a kernel, viz:
cusp::array1d<float, cusp::host_memory> p1(10, 3.f);
for(int i=0; i<10; i++) p1[i] += (i * 5.f);
cusp::array1d<float, cusp::device_memory> p2(p1); // data now on device
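If a kernel really is needed, the usual pattern is to keep the container on the host side and hand the kernel a raw device pointer. The sketch below uses a plain thrust::device_vector rather than a CUSP array so that it depends only on Thrust; the kernel name addOffsets and the helper runAddOffsets are hypothetical.

```cuda
#include <thrust/device_vector.h>
#include <cuda_runtime.h>

// The kernel sees only a raw pointer and a length, never the container.
__global__ void addOffsets(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += i * 5.0f;
}

// Hypothetical helper: fills n elements with 3, runs the kernel, returns one element.
float runAddOffsets(int n, int index)
{
    thrust::device_vector<float> p(n, 3.0f);          // container lives in host code
    float *raw = thrust::raw_pointer_cast(p.data());  // raw device pointer for the kernel

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    addOffsets<<<blocks, threads>>>(raw, n);
    cudaDeviceSynchronize();

    return p[index];  // element access from the host copies one value back
}
```

For example, runAddOffsets(10, 4) starts from 3 and adds 4 * 5, giving 23.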

CUDA Global Array declaration and initialization before kernel call example

I need some help with CUDA global memory. In my project I must declare a global array to avoid sending the array on every kernel call.
EDIT:
My application can call the kernel more than 1,000 times, and on every call I'm sending it an array larger than [1000 x 1000]. I think this is taking extra time, which is why my app runs slowly. So I need to declare a global array for the GPU. My questions are:
1. How to declare a global array
2. How to initialize a global array from the CPU before the kernel call
Thanks in advance
Your edited question is confusing because you say you are sending your kernel an array of size 1000 x 1000 but you want to know how to do this using a global array. The only way I know of to send this much data to a kernel is to use a global array, so you are probably already doing this with an array in global memory.
Nevertheless, there are 2 methods, at least, to create and initialize an array in global memory:
1. Statically, using __device__ and cudaMemcpyToSymbol, for example:
#define SIZE 100
__device__ int A[SIZE];
...
int main(){
int myA[SIZE];
for (int i=0; i< SIZE; i++) myA[i] = 5;
cudaMemcpyToSymbol(A, myA, SIZE*sizeof(int));
...
(kernel calls, etc.)
}
(device variable reference, cudaMemcpyToSymbol reference)
2. Dynamically, using cudaMalloc and cudaMemcpy:
#define SIZE 100
...
int main(){
int myA[SIZE];
int *A;
for (int i=0; i< SIZE; i++) myA[i] = 5;
cudaMalloc((void **)&A, SIZE*sizeof(int));
cudaMemcpy(A, myA, SIZE*sizeof(int), cudaMemcpyHostToDevice);
...
(kernel calls, etc.)
}
(cudaMalloc reference, cudaMemcpy reference)
For clarity I'm omitting error checking which you should do on all cuda calls and kernel calls.
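As a sketch of what that error checking might look like for the static method, here is one possible shape. The macro name CUDA_CHECK, the kernel sumA and the helper sumStaticArray are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define SIZE 100

// Hypothetical macro: aborts with file and line if a runtime call fails.
#define CUDA_CHECK(call) do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err_), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } } while (0)

__device__ int A[SIZE];   // statically allocated array in global memory

// Single-thread kernel that reads the global array without receiving it as an argument.
__global__ void sumA(int *out)
{
    int total = 0;
    for (int i = 0; i < SIZE; i++) total += A[i];
    *out = total;
}

// Hypothetical helper: fills A with value from the host and returns the device-side sum.
int sumStaticArray(int value)
{
    int myA[SIZE];
    for (int i = 0; i < SIZE; i++) myA[i] = value;
    CUDA_CHECK(cudaMemcpyToSymbol(A, myA, SIZE * sizeof(int)));

    int *out_d = NULL;
    int out = 0;
    CUDA_CHECK(cudaMalloc(&out_d, sizeof(int)));
    sumA<<<1, 1>>>(out_d);
    CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs
    CUDA_CHECK(cudaMemcpy(&out, out_d, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(out_d));
    return out;
}
```

Kernel launches return no status themselves, which is why the launch is followed by cudaGetLastError and a synchronizing call.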
If I understand this question correctly (it is somewhat unclear), you want to use a global array and send it to the device on every kernel call. That is bad practice and leads to high latency, because every kernel call has to transfer the data to the device. In my experience this practice led to a negative speed-up.
A better way is to use what I call the flip-flop technique. The way you do it is:
1. Declare two arrays on the device, d_arr1 and d_arr2.
2. Copy the data host -> device into one of the arrays.
3. Pass pointers to d_arr1 and d_arr2 as kernel parameters.
4. Process the data in the kernel.
5. In subsequent kernel calls, exchange the pointers you pass as parameters.
This way you avoid transferring the data on every kernel call. You transfer only at the beginning and at the end of your host loop.
int a;
for(a=0;a<1000;a++)
{
if (a % 2 == 0) {
//call to the kernel(pointer_a, pointer_b)
} else {
//call to the kernel(pointer_b, pointer_a)
}
}
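A fuller sketch of the same flip-flop idea, swapping the two device pointers on the host instead of branching on the iteration count. The kernel step (which just adds 1 to every element) and the helper runPingPong are hypothetical stand-ins for the real computation.

```cuda
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Stand-in for the real work: reads from in, writes to out.
__global__ void step(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Hypothetical helper: one upload, many kernel calls, one download.
float runPingPong(int n, int iterations)
{
    std::vector<float> h(n, 0.0f);
    float *d_in = NULL;
    float *d_out = NULL;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iterations; ++it) {
        step<<<blocks, threads>>>(d_in, d_out, n);
        std::swap(d_in, d_out);   // the output of this call is the input of the next
    }

    // After the final swap, d_in points at the most recent result.
    cudaMemcpy(h.data(), d_in, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h[n - 1];
}
```

Only the two pointer values travel with each launch; the array itself crosses the bus once in each direction.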

CUDA global (as in C) dynamic arrays allocated to device memory

So, I'm trying to write some code that utilizes Nvidia's CUDA architecture. I noticed that copying to and from the device was really hurting my overall performance, so now I am trying to move a large amount of data onto the device.
As this data is used in numerous functions, I would like it to be global. Yes, I can pass pointers around, but I would really like to know how to work with globals in this instance.
So, I have device functions that want to access a device allocated array.
Ideally, I could do something like:
__device__ float* global_data;
main()
{
cudaMalloc(global_data);
kernel1<<<blah>>>(blah); //access global data
kernel2<<<blah>>>(blah); //access global data again
}
However, I haven't figured out how to create a dynamic array. I figured out a workaround by declaring the array as follows:
__device__ float global_data[REALLY_LARGE_NUMBER];
And while that doesn't require a cudaMalloc call, I would prefer the dynamic allocation approach.
Something like this should probably work.
#include <stdio.h>
#include <stdlib.h>
#define CUT_CHECK_ERROR(errorMessage) do { \
cudaDeviceSynchronize(); \
cudaError_t err = cudaGetLastError(); \
if( cudaSuccess != err) { \
fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n", \
errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) );\
exit(EXIT_FAILURE); \
} } while (0)
__device__ float *devPtr;
__global__
void kernel1(float *some_neat_data)
{
devPtr = some_neat_data;
}
__global__
void kernel2(void)
{
devPtr[threadIdx.x] *= .3f;
}
int main(int argc, char *argv[])
{
float* otherDevPtr;
cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));
kernel1<<<1,128>>>(otherDevPtr);
CUT_CHECK_ERROR("kernel1");
kernel2<<<1,128>>>();
CUT_CHECK_ERROR("kernel2");
return 0;
}
Give it a whirl.
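An alternative to the helper kernel, sketched here under the assumption that the pointer only needs to be set from host code, is to write the cudaMalloc result into the __device__ pointer with cudaMemcpyToSymbol. The names g_data, scaleGlobal and runGlobalPointer are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

__device__ float *g_data;   // global-scope device pointer, set from the host

// Uses the global pointer directly; no pointer argument is passed in.
__global__ void scaleGlobal(int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) g_data[i] *= factor;
}

// Hypothetical helper: allocates dynamically, publishes the pointer, runs a kernel.
float runGlobalPointer(int n)
{
    std::vector<float> h(n, 2.0f);
    float *buf = NULL;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemcpy(buf, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Copy the pointer VALUE (not the data it points to) into the device symbol.
    cudaMemcpyToSymbol(g_data, &buf, sizeof(buf));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scaleGlobal<<<blocks, threads>>>(n, 0.5f);

    cudaMemcpy(h.data(), buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    return h[n - 1];
}
```

The host keeps its own copy of the pointer in buf for the later cudaMemcpy and cudaFree calls, since g_data itself cannot be read directly from host code.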
Spend some time focusing on the copious documentation offered by NVIDIA.
From the Programming Guide:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
That's a simple example of how to allocate memory. Now, in your kernels, you should accept a pointer to a float like so:
__global__
void kernel1(float *some_neat_data)
{
some_neat_data[threadIdx.x]++;
}
__global__
void kernel2(float *potentially_that_same_neat_data)
{
potentially_that_same_neat_data[threadIdx.x] *= 0.3f;
}
So now you can invoke them like so:
float* devPtr;
cudaMalloc((void**)&devPtr, 256 * sizeof(*devPtr));
cudaMemset(devPtr, 0, 256 * sizeof(*devPtr));
kernel1<<<1,128>>>(devPtr);
kernel2<<<1,128>>>(devPtr);
"As this data is used in numerous functions, I would like it to be global."
There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving "devPtr" to a global scope.
EDIT:
Ok, the fundamental problem is this: your kernels can only access device memory and the only global-scope pointers that they can use are GPU ones. When calling a kernel from your CPU, behind the scenes what happens is that the pointers and primitives get copied into GPU registers and/or shared memory before the kernel gets executed.
So the closest I can suggest is this: use cudaMemcpyToSymbol() to achieve your goals. But, in the background, consider that a different approach might be the Right Thing.
#include <algorithm>
__constant__ float devPtr[1024];
__global__
void kernel1(float *some_neat_data)
{
some_neat_data[threadIdx.x] = devPtr[0] * devPtr[1];
}
__global__
void kernel2(float *potentially_that_same_neat_data)
{
potentially_that_same_neat_data[threadIdx.x] *= devPtr[2];
}
int main(int argc, char *argv[])
{
float some_data[256];
for (int i = 0; i < sizeof(some_data) / sizeof(some_data[0]); i++)
{
some_data[i] = i * 2;
}
cudaMemcpyToSymbol(devPtr, some_data, std::min(sizeof(some_data), sizeof(devPtr) ));
float* otherDevPtr;
cudaMalloc((void**)&otherDevPtr, 256 * sizeof(*otherDevPtr));
cudaMemset(otherDevPtr, 0, 256 * sizeof(*otherDevPtr));
kernel1<<<1,128>>>(otherDevPtr);
kernel2<<<1,128>>>(otherDevPtr);
return 0;
}
Don't forget '--host-compilation=c++' for this example.
I went ahead and tried the solution of allocating a temporary pointer and passing it to a simple global function similar to kernel1.
The good news is that it does work :)
However, I think it confuses the compiler as I now get "Advisory: Cannot tell what pointer points to, assuming global memory space" whenever I try to access the global data. Luckily, the assumption happens to be correct, but the warnings are annoying.
Anyway, for the record - I have looked at many of the examples and did run through the nvidia exercises where the point is to get the output to say "Correct!". However, I haven't looked at all of them. If anyone knows of an sdk example where they do dynamic global device memory allocation, I would still like to know.
Erm, it was exactly that problem of moving devPtr to global scope that was my problem.
I have an implementation that does exactly that, with the two kernels having a pointer to data passed in. I explicitly don't want to pass in those pointers.
I have read the documentation fairly closely, and hit up the nvidia forums (and google searched for an hour or so), but I haven't found an implementation of a global dynamic device array that actually runs (i have tried several that compile and then fail in new and interesting ways).
Check out the samples included with the SDK. Many of those sample projects are a decent way to learn by example.
"As this data is used in numerous functions, I would like it to be global."
"There are few good reasons to use globals. This definitely is not one. I'll leave it as an exercise to expand this example to include moving 'devPtr' to a global scope."
What if the kernel operates on a large const structure consisting of arrays? Using the so called constant memory is not an option, because it's very limited in size.. so then you have to put it in global memory..?