OpenCL clBuildProgram() access violation exception

I'm having a weird error executing an OpenCL kernel. The problem appears when I try to build the kernel with clBuildProgram():
err = clBuildProgram(program, 1, &ocl->device, "", NULL, NULL);
My process starts using more and more memory, until it reaches 13 GB (normally it uses about 400 MB), and then it fails with:
"0xC0000005: Access violation executing location"
The weird part is that this happens only if I use the integrated card, which is an Intel HD 4000. If I choose another device, like the GTX 960 or the CPU, it works fine.
Another strange thing is that if there is any syntax error, clBuildProgram finishes fine and reports the compilation error; the problem appears only when there are no mistakes. Also, if I comment out part of my code, it builds.
This is my function:
__kernel void update(__global struct PhysicsComponent_ocl_t* vecPhy, __constant struct BoxCollider_ocl_t* vecBx, __constant ulong* vecIdx, __constant float* deltaTime) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);
    if (j > i) { //From size_t j = i + 1; i < vec.size()...
        //Copy data to local memory to avoid race conditions
        struct AuxPhy_ocl_t phy1;
        copyPhyGL(&vecPhy[vecIdx[i]], &phy1);
        struct AuxPhy_ocl_t phy2;
        copyPhyGL(&vecPhy[vecIdx[j]], &phy2);
        if (collide(&phy1, &phy2, &vecBx[i], &vecBx[j])) {
            //Check speed correction for obj 1
            struct mivec3_t speed1 = phy1.speed;
            struct mivec3_t speed2 = phy2.speed;
            modifySpeedAndVelocityOnCollision(&phy1, &phy2, &vecBx[i], &vecBx[j], *deltaTime); //Check both objects, hence the swapped parameters
            modifySpeedAndVelocityOnCollision(&phy2, &phy1, &vecBx[j], &vecBx[i], *deltaTime);
            //Make the objects not move
            struct mivec3_t auxSub;
            multiplyVectorByScalarLL(&speed1, *deltaTime, &auxSub);
            substractVectorsLL(&phy1.position, &auxSub, &phy1.position);
            multiplyVectorByScalarLL(&speed2, *deltaTime, &auxSub);
            substractVectorsLL(&phy2.position, &auxSub, &phy2.position);
            //Copy data back to global
            copyPhyLG(&phy1, &vecPhy[vecIdx[i]]);
            copyPhyLG(&phy2, &vecPhy[vecIdx[j]]);
        }
    }
}
For example, if I comment out the last two function calls, the program builds.
//Copy data back to global
//copyPhyLG(&phy1, &vecPhy[vecIdx[i]]);
//copyPhyLG(&phy2, &vecPhy[vecIdx[j]]);
But they are not the cause, because if I keep these functions and instead comment out part of the body, it also works.
__kernel void update(__global struct PhysicsComponent_ocl_t* vecPhy, __constant struct BoxCollider_ocl_t* vecBx, __constant ulong* vecIdx, __constant float* deltaTime) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);
    if (j > i) { //From size_t j = i + 1; i < vec.size()...
        //Copy data to local memory to avoid race conditions
        struct AuxPhy_ocl_t phy1;
        copyPhyGL(&vecPhy[vecIdx[i]], &phy1);
        struct AuxPhy_ocl_t phy2;
        copyPhyGL(&vecPhy[vecIdx[j]], &phy2);
        //Removed code was here
        copyPhyLG(&phy1, &vecPhy[vecIdx[i]]);
        copyPhyLG(&phy2, &vecPhy[vecIdx[j]]);
    }
}
I'm mind blown by this; the only thing that comes to mind is that the code somehow takes up too much space.
Here is the complete kernel code.

I ran into a similar problem, and in my case it was an infinite loop in one of my kernels. I guess the compiler tried to unroll it or optimize it in some way without checking for bounds.
To validate my hypothesis I built my OpenCL program with optimizations turned off:
int err = program.build("-cl-opt-disable");
and the build succeeded as I expected.
When you introduce a syntax error, the compilation process stops early and never reaches the optimization stage where the compiler bug resides.
The compilers for the other devices don't have this bug; they will give you back an executable that you can run, but it probably won't terminate (correctly).
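If you want to try the same experiment with the C API the question uses, a rough, untested sketch would be to pass the option string to clBuildProgram and dump the build log on failure (ocl->device is the device handle from the question):

cl_int err = clBuildProgram(program, 1, &ocl->device, "-cl-opt-disable", NULL, NULL);
if (err != CL_SUCCESS) {
    size_t logSize = 0;
    clGetProgramBuildInfo(program, ocl->device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
    char *log = (char *)malloc(logSize + 1);
    clGetProgramBuildInfo(program, ocl->device, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
    log[logSize] = '\0';
    printf("clBuildProgram failed (%d):\n%s\n", err, log);
    free(log);
}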

Related

racecheck error from a data structure in shared memory

I have a hash table data structure that uses linear probing and is designed to be lock-free with CAS.
The hash table
constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff;

struct OnceLock {
    static const unsigned LOCK_FRESH   = 0;
    static const unsigned LOCK_WORKING = 1;
    static const unsigned LOCK_DONE    = 2;
    volatile unsigned lock;

    __device__ void init() {
        lock = LOCK_FRESH;
    }
    __device__ bool enter() {
        unsigned lockState = atomicCAS ( (unsigned*) &lock, LOCK_FRESH, LOCK_WORKING );
        return lockState == LOCK_FRESH;
    }
    __device__ void done() {
        __threadfence();
        lock = LOCK_DONE;
        __threadfence();
    }
    __device__ void wait() {
        while ( lock != LOCK_DONE );
    }
};

template <typename T>
struct agg_ht {
    OnceLock lock;
    uint64_t hash;
    T payload;
};

template <typename T>
__global__ void initAggHT ( agg_ht<T>* ht, int32_t num ) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; i += blockDim.x * gridDim.x) {
        ht[i].lock.init();
        ht[i].hash = HASH_EMPTY;
    }
}

// returns candidate bucket
template <typename T>
__device__ int hashAggregateGetBucket ( agg_ht<T>* ht, int32_t ht_size, uint64_t grouphash, int& numLookups, T* payl ) {
    int location=-1;
    bool done=false;
    while ( !done ) {
        location = ( grouphash + numLookups ) % ht_size;
        agg_ht<T>& entry = ht [ location ];
        numLookups++;
        if ( entry.lock.enter() ) {
            entry.payload = *payl;
            entry.hash = grouphash;
            entry.lock.done();
        }
        entry.lock.wait();
        done = (entry.hash == grouphash);
        if ( numLookups == ht_size ) {
            printf ( "agg_ht hash table full at threadIdx %d & blockIdx %d \n", threadIdx.x, blockIdx.x );
            break;
        }
    }
    return location;
}
Then I have a minimal kernel, as well as a main function, just to exercise the hash table. An important detail is that the hash table is annotated with __shared__, so it is allocated in an SM's shared memory for fast access.
(I did not add any input data with cudaMalloc, to keep the example minimal.)
#include <cstdint>
#include <cstdio>
/**hash table implementation**/
constexpr int HT_SIZE = 1024;
__global__ void kernel() {
    __shared__ agg_ht<int> aht2[HT_SIZE];
    {
        int ht_index;
        unsigned loopVar = threadIdx.x;
        unsigned step = blockDim.x;
        while(loopVar < HT_SIZE) {
            ht_index = loopVar;
            aht2[ht_index].lock.init();
            aht2[ht_index].hash = HASH_EMPTY;
            loopVar += step;
        }
    }
    int key = 1;
    int value = threadIdx.x;
    __syncthreads();
    int bucket = -1;
    int bucketFound = 0;
    int numLookups = 0;
    while(!(bucketFound)) {
        bucket = hashAggregateGetBucket ( aht2, HT_SIZE, key, numLookups, &(value));
        int probepayl = aht2[bucket].payload;
        bucketFound = 1;
        bucketFound &= ((value == probepayl));
    }
}

int main() {
    kernel<<<1, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
The standard way to compile it, if the file is called test.cu:
$ nvcc -G test.cu -o test
I have to say, this hash table has always given me the correct answer during concurrent insertions, even with huge inputs.
However, when I ran racecheck on it, I saw Errors everywhere:
$ compute-sanitizer --tool racecheck ./test
========= COMPUTE-SANITIZER
========= Error: Race reported between Write access at 0xd20 in /tmp/test.cu:61:int hashAggregateGetBucket<int>(agg_ht<T1> *, int, unsigned long, int &, T1 *)
========= and Read access at 0xe50 in /tmp/test.cu:65:int hashAggregateGetBucket<int>(agg_ht<T1> *, int, unsigned long, int &, T1 *) [1016 hazards]
=========
========= Error: Race reported between Write access at 0x180 in /tmp/test.cu:25:OnceLock::done()
========= and Read access at 0xd0 in /tmp/test.cu:30:OnceLock::wait() [992 hazards]
=========
========= Error: Race reported between Write access at 0xcb0 in /tmp/test.cu:60:int hashAggregateGetBucket<int>(agg_ht<T1> *, int, unsigned long, int &, T1 *)
========= and Read access at 0x1070 in /tmp/test.cu:103:kernel() [508 hazards]
=========
========= RACECHECK SUMMARY: 3 hazards displayed (3 errors, 0 warnings)
I was confused that this linear-probing hash table passes my unit tests but has data race hazards everywhere. I suppose those hazards are irrelevant to correctness. (?)
After a while of debugging, I still could not make the hazard errors go away. I strongly suspect the volatile is the cause. I was hoping someone might be able to shed some light on it and give me a hand fixing those annoying hazards.
I also hope this question can surface some design ideas on the topic of data structures in shared memory. While searching StackOverflow, all I found were plain raw arrays in shared memory.
I suppose those hazards are irrelevant to correctness. (?)
I wouldn't try to certify the "correctness" of your application or algorithm. If that is what you are looking for, please just disregard my answer.
I was hoping someone might be able to shed some light on it
A shared memory race condition occurs when one thread writes to a location in shared memory, and another thread reads from that location, and there is no intervening synchronization in the code to ensure that the write happens before the read (or perhaps, more correctly, that the written value is visible to the reading thread). This is not a careful, exhaustive definition, but it suffices for what we are dealing with here.
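As a minimal, hypothetical illustration of that definition (this is not taken from your code):

__global__ void raceDemo(int *out)
{
    __shared__ int s;
    if (threadIdx.x == 0)
        s = 42;                // one thread writes the shared location
    // racecheck flags a hazard here: other threads read s with no
    // intervening synchronization after the write.
    // __syncthreads();        // adding this barrier removes the hazard
    out[threadIdx.x] = s;      // every thread reads that same location
}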
In so far as that definition goes, you certainly have that activity in your code. One specific case that is being flagged is one thread writing here:
entry.hash = grouphash;
and another thread reading the same location here:
done = (entry.hash == grouphash);
Inspecting your code we can see that there is no __syncthreads() statement between those two code positions. Furthermore, due to the loop that encompasses that activity, there is more than one hazard associated with this (there are two).
The other interaction being flagged is one thread writing to lock here:
entry.lock.done();
and another thread reading the same lock location here:
entry.lock.wait();
The hazards reported here are actually attributed to other lines of code, because these are both function calls. Again, there is no intervening synchronization.
I acknowledge that, due to the looping nature of your application, I'm not sure it's necessary for "correctness" that either of these thread-to-thread communication paths gets picked up at the earliest opportunity. However, I have not studied your application carefully, nor do I intend to state anything about correctness.
and give me a hand fixing those annoying hazards.
As it happens, both of these interactions are in a small section of your code, so we can cause these 3 hazards to go away with the following additions, according to my testing:
__syncthreads(); // add this line
entry.lock.wait();
done = (entry.hash == grouphash);
__syncthreads(); // add this line
The first sync intersects the obvious write-read connections between the lines I have already indicated. The second sync is needed due to the looping nature of the code at this point.
Also note that proper usage of __syncthreads() requires that all threads in the threadblock can reach the sync point. A quick perusal of what you have here didn't suggest to me that the above additions would need to be handled carefully, but you should confirm that and be aware of it for general application/usage. It may be that the while (!bucketFound) loop creates a situation that should be handled differently; however, compute-sanitizer --tool synccheck did not report any issues, running on a V100, with the additions I suggested here.
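For reference, here is a sketch of what the probing loop in hashAggregateGetBucket looks like with those two additions in place (the loop body is otherwise unchanged from the question; as noted above, this assumes all threads of the block reach these barriers together):

while ( !done ) {
    location = ( grouphash + numLookups ) % ht_size;
    agg_ht<T>& entry = ht [ location ];
    numLookups++;
    if ( entry.lock.enter() ) {
        entry.payload = *payl;
        entry.hash = grouphash;
        entry.lock.done();
    }
    __syncthreads();   // added: makes the writes above visible before the reads below
    entry.lock.wait();
    done = (entry.hash == grouphash);
    __syncthreads();   // added: needed because of the looping nature of the code
    if ( numLookups == ht_size ) {
        printf ( "agg_ht hash table full at threadIdx %d & blockIdx %d \n", threadIdx.x, blockIdx.x );
        break;
    }
}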

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas;        // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;

static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    // Do stuff with the line indicated by index i
    // ...
    // Save result into array, gets reduced to single number in setDist
    phs_deltas[i] = phs_delta;
}

static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
    // Final step; does it need to be its own kernel if it only runs once??
    d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);
    // Save some other stuff to refer to next frame
    // ...
}

void fftExec(unsigned __int32 *host_data)
{
    // Copy image to device, do FFT, etc
    // ...
    // Last parallel analysis step, sets d_phs_deltas
    getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
    // Should this be a serial part at the end of getDists somehow?
    setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}

// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;
__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
I would recommend benchmarking both ways and choosing which is faster. On most platforms kernel launch latency is only a few microseconds, so a short running kernel to finish an action after a long running kernel can be the most efficient way to get this done.
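If you do benchmark, a rough event-timing sketch around the launches from the question (getDists, setDist, and the buffer names come from the question; error checking omitted) could look like this, repeated once per variant:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);   // tail-kernel variant being timed
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("tail-kernel variant: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Time the fused, last-block-pattern version the same way and compare.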

CUDA pointer arithmetic causes uncoalesced memory access?

I am working with a CUDA kernel that must operate on pointers-to-pointers. The kernel basically performs a large number of very small reductions, which are best done in serial since the reductions are of size Nptrs=3-4.
Here are two implementations of the kernel:
__global__
void kernel_RaiseIndexSLOW(double*__restrict__*__restrict__ A0gpu,
                           const double*__restrict__*__restrict__ B0gpu,
                           const double*__restrict__*__restrict__ C0gpu,
                           const int Nptrs, const int Nx){
    const int i = blockIdx.y;
    const int j = blockIdx.z;
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<Nptrs) {
        if(j<Nptrs) {
            for (int x = idx; x < Nx; x += blockDim.x*gridDim.x){
                A0gpu[i+3*j][x] = B0gpu[i][x]*C0gpu[3*j][x]
                                 +B0gpu[i+3][x]*C0gpu[1+3*j][x]
                                 +B0gpu[i+6][x]*C0gpu[2+3*j][x];
            }
        }
    }
}

__global__
void kernel_RaiseIndexsepderef(double*__restrict__*__restrict__ A0gpu,
                               const double*__restrict__*__restrict__ B0gpu,
                               const double*__restrict__*__restrict__ C0gpu,
                               const int Nptrs, const int Nx){
    const int i = blockIdx.y;
    const int j = blockIdx.z;
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<Nptrs) {
        if(j<Nptrs){
            double*__restrict__ A0ptr = A0gpu[i+3*j];
            const double*__restrict__ B0ptr0 = B0gpu[i];
            const double*__restrict__ C0ptr0 = C0gpu[3*j];
            const double*__restrict__ B0ptr1 = B0ptr0+3;
            const double*__restrict__ B0ptr2 = B0ptr0+6;
            const double*__restrict__ C0ptr1 = C0ptr0+1;
            const double*__restrict__ C0ptr2 = C0ptr0+2;
            for (int x = idx; x < Nx; x +=blockDim.x *gridDim.x){
                double d2 = C0ptr0[x];
                double d4 = C0ptr1[x]; //FLAGGED
                double d6 = C0ptr2[x]; //FLAGGED
                double d1 = B0ptr0[x];
                double d3 = B0ptr1[x]; //FLAGGED
                double d5 = B0ptr2[x]; //FLAGGED
                A0ptr[x] = d1*d2 + d3*d4 + d5*d6;
            }
        }
    }
}
As indicated by the names, the kernel "sepderef" performs about 40% faster than its counterpart, achieving, once launch overhead is figured in, about 85GBps effective bandwidth at Nptrs=3, Nx=60000 on an M2090 with ECC on (~160GBps would be optimal).
Running these through nvvp shows that the kernel is bandwidth bound. Strangely, however, the lines I have marked //FLAGGED are highlighted by the profiler as areas of sub-optimal memory access. I don't understand why this is, as the access here looks coalesced to me. Why would it not be?
Edit: I forgot to point this out, but notice that the //FLAGGED regions are accessing pointers upon which I have done arithmetic, whereas the others were accessed using the square bracket operator.
To understand this behaviour one needs to be aware that all CUDA GPUs so far execute instructions in-order. After an instruction to load an operand from memory is issued, other independent instructions continue to be executed. However, once an instruction is encountered that depends on the operand from memory, all further operations on this instruction stream are stalled until the operand becomes available.
In your "sepderef" example, you are loading all operands from memory before summing them, which means that potentially the global memory latency is incurred only once per loop iteration (there are six loads per loop iteration, but they can all overlap. Only the first addition of the loop will stall, until it's operands are available. After the stall, all other additions will have their operands readily or very soon available).
In the "SLOW" example, loading from memory and addition are intermixed, so global memory latency is incurred multiple times per loop operation.
You may wonder why the compiler doesn't automatically reorder load instructions before computation. The CUDA compilers used to do this very aggressively, expending additional registers where the operands are waiting until used. CUDA 8.0 however seems far less aggressive in this respect, sticking much more to the order of instructions in the source code. This gives the programmer better opportunity to structure the code in the best way performance-wise where the compiler's instruction scheduling was suboptimal. At the same time, it also puts more burden on the programmer to explicitly schedule instructions even where previous compiler versions got it right.
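To make the scheduling point concrete in isolation, here is a stripped-down, hypothetical kernel (not the question's code) written in the "batched loads" style, where all independent loads are issued before the first instruction that consumes them:

__global__ void batchedLoads(const double* a, const double* b, const double* c,
                             double* out, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n) {
        // Issue the independent loads first, so their latencies overlap...
        double da = a[x];
        double db = b[x];
        double dc = c[x];
        // ...then consume them. Only the first addition has to wait, and by
        // then the other loads are already in flight.
        out[x] = da + db + dc;
    }
}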

Create an object with fields on the device directly

I'm trying to create a class that will get allocated on the device. I want the constructor to run on the device so that the whole object including the fields inside are automatically allocated on the device instead of having to create a host object then copy it manually to the device.
I'm using thrust device_new
Here is my code:
using namespace thrust;

class Particle
{
public:
    int* data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
I annotated the constructor with __device__ and used device_new (thrust), but this doesn't work. Can someone explain to me why?
Cheers for help
I believe the answer lies in the description given here. Someone who knows thrust under the hood will probably come along and indicate whether this is true or not.
Although thrust has changed a lot since 2009, I believe device_new may still be using some form of operation where the object is actually temporarily instantiated on the host, then copied to the device. I believe the size limitation described in the above reference is no longer applicable, however.
I was able to get this to work:
#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

#define N 512
using namespace thrust;

class Particle
{
public:
    int data[N];

    __device__ __host__ Particle()
    {
        // data = new int[10];
        for (int i=0; i<N; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i=0; i<N; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
Interestingly, it gives bogus results if I omit the __host__ decorator on the constructor, suggesting to me that the temporary object copy mechanism is still in place. It also gives bogus results (and cuda-memcheck reports out-of-bounds access errors) if I switch to using the dynamic allocation for data instead of static, also suggesting to me that device_new is using a temporary object creation on the host followed by a copy to the device.
First of all, thanks to Robert Crovella for his input (and previous answers).
So apparently I "overestimated" what device_new can do: I thought it could initialise the object directly on the device, so that any memory dynamically allocated inside would end up on the device too.
But it seems device_new basically just does the same thing as the manual way:
Particle temp;
Particle *d_p;
cudaMalloc(&d_p, sizeof(Particle));
cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);
So it makes a temporary host object and copies it, just like it would be done manually. That means the memory allocated inside the object is allocated on the host, and only the pointer gets copied as part of the object, so you cannot use that memory in a kernel; you would have to copy that memory to the device manually, and thrust doesn't seem to be doing that.
So it's just a cleaner way of creating a temporary host object and copying it, except that you lose the ability to copy the dynamically allocated memory inside, since you don't have access to that temporary variable.
I hope that in the future there will be a method or a feature in CUDA that lets you initialise the object directly on the device, so that any data dynamically allocated in the constructor (or elsewhere) is allocated on the device too, instead of the tedious process of copying every piece of memory manually.
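Until then, the closest I have found to "construct directly on the device" is to allocate raw device memory and run the __device__ constructor from a tiny kernel via placement new. This is only a sketch under that assumption (the constructParticle kernel and the placement-new approach are mine, not part of thrust), but it keeps the data = new int[10] allocation on the device heap:

#include <new>      // placement new
#include <cstdio>

class Particle
{
public:
    int* data;
    __device__ Particle()
    {
        data = new int[10];        // allocated on the device heap
        for (int i = 0; i < 10; i++)
            data[i] = i * 2;
    }
};

// Runs the __device__ constructor on the device, in memory we already own.
__global__ void constructParticle(Particle* p)
{
    new (p) Particle();            // placement new: construct, don't allocate
}

__global__ void test(Particle* p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main()
{
    Particle* d_p;
    cudaMalloc(&d_p, sizeof(Particle));
    constructParticle<<<1, 1>>>(d_p);
    test<<<1, 1>>>(d_p);
    cudaDeviceSynchronize();
    cudaFree(d_p);
    return 0;
}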

CUDA/Thrust double pointer problem (vector of pointers)

Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to access, in a CUDA kernel, a double pointer loaded from the host as a thrust::device_vector of type Object* (vector of pointers). When compiled with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error upon attempting to run the program.
I have read the Nvidia forums and the solution seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending to the kernel...Has anyone found a solution to this problem? The required code is below, thanks in advance!
--------------------------
main.cpp
--------------------------
Sphere * parseSphere(int i)
{
    Sphere * s = new Sphere();
    s->a = 1+i;
    s->b = 2+i;
    s->c = 3+i;
    return s;
}

int main( int argc, char** argv ) {
    int i;
    thrust::host_vector<Sphere *> spheres_h;
    thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);

    //initialize spheres_h
    for(i=0;i<NUM_OBJECTS;i++){
        Sphere * sphere = parseSphere(i);
        spheres_h.push_back(sphere);
    }

    //initialize spheres_resh
    for(i=0;i<NUM_OBJECTS;i++){
        spheres_resh[i].a = 1;
        spheres_resh[i].b = 1;
        spheres_resh[i].c = 1;
    }

    thrust::device_vector<Sphere *> spheres_dv = spheres_h;
    thrust::device_vector<Sphere> spheres_resv = spheres_resh;
    Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
    Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);

    kernelBegin(spheres_d,spheres_res,NUM_OBJECTS);

    thrust::copy(spheres_dv.begin(),spheres_dv.end(),spheres_h.begin());
    thrust::copy(spheres_resv.begin(),spheres_resv.end(),spheres_resh.begin());

    bool result = true;
    for(i=0;i<NUM_OBJECTS;i++){
        result &= (spheres_resh[i].a == i+1);
        result &= (spheres_resh[i].b == i+2);
        result &= (spheres_resh[i].c == i+3);
    }

    if(result)
    {
        cout << "Data GOOD!" << endl;
    }else{
        cout << "Data BAD!" << endl;
    }

    return 0;
}
--------------------------
cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
    spheres_res[index].b = (*(spheres_d+index))->b;
    spheres_res[index].c = (*(spheres_d+index))->c;
}

void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
    int threads = 512;                      //per block
    int grids = ((num_objects)/threads)+1;  //blocks per grid

    deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}
The basic problem here is that device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces. So when you copy spheres_h to GPU memory, you are winding up with a GPU array of host pointers. Indirection of host pointers on the GPU is illegal - they are pointers in the wrong memory address space, thus you are getting the GPU equivalent of a segfault inside the kernel.
The solution is going to involve replacing your parseSphere function with something that performs memory allocation on the GPU, rather than the current parseSphere, which allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and are using CUDA 3.2 or 4.0, then one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
The alternative approach will involve creating a host array holding device pointers, then copy that array to device memory. You can see an example of that in this answer. Note that it is probably the case that declaring a thrust::device_vector containing thrust::device_vector won't work, so you will likely need to do this array of device pointers construction using the underlying CUDA API calls.
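As a hedged sketch of that second approach with plain CUDA API calls (the names mirror the question, error checking is omitted, and freeing the per-object allocations is left out):

Sphere** spheres_d;                                  // device array of device pointers
Sphere*  h_sphere_ptrs[NUM_OBJECTS];                 // host array of device pointers

for (int i = 0; i < NUM_OBJECTS; i++) {
    Sphere* s_host = parseSphere(i);                 // host-side object, as in the question
    cudaMalloc(&h_sphere_ptrs[i], sizeof(Sphere));   // device copy of each Sphere
    cudaMemcpy(h_sphere_ptrs[i], s_host, sizeof(Sphere), cudaMemcpyHostToDevice);
    delete s_host;
}

// Copy the array of device pointers itself to the device.
cudaMalloc(&spheres_d, NUM_OBJECTS * sizeof(Sphere*));
cudaMemcpy(spheres_d, h_sphere_ptrs, NUM_OBJECTS * sizeof(Sphere*), cudaMemcpyHostToDevice);

// spheres_d can now be passed to deviceBegin() and safely dereferenced on the GPU.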
You should also note that I haven't mentioned the reverse copy operation, which is equally as difficult to do.
The bottom line is that thrust (and C++ STL containers, for that matter) really are not intended to hold pointers. They are intended to hold values, abstracting away pointer indirection and direct memory access through the use of iterators and underlying algorithms which the user isn't supposed to see. Further, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it also executes more slowly on the GPU.