I have a question regarding copying pointers in the STL. Say I define:
struct A {
    int x;
};
std::map<int, const A*> map1;
I then populate map1, allocating each A on the heap with malloc.
Then I do
std::map<int, const A*> map2 = map1;
For each A* stored in map2, does std::map do a shallow copy of the pointer, or does it allocate new memory from the heap for each pointed-to A?
Cheers
Shanker
It will copy just the pointers. That means that a shallow copy will be made as opposed to a deep copy. You can easily check the actual behavior by using a simple test program:
#include <iostream>
#include <map>

int main() {
    std::map<int, int*> map1;
    map1[0] = new int(10);

    std::map<int, int*> map2 = map1;
    *(map2[0]) = 20;

    // this must print 20 if a shallow copy is used
    std::cout << *(map1[0]) << std::endl;
}
A large data file (1,000,000 rows and 2 columns) needs to be accessed from a device function. At every step of the computation the value of a variable changes inside the device function, and the values of one particular row of the data file are required. So the whole data file has to be available to the device function, since there is no control over the value of the variable at each step.
In the following program, the value of the variable is sent from a device function to a __host__ __device__ function, and then from the __host__ __device__ function to a host function (where the data file is available). The host function picks the required values out of the entire array, and in this way the required values for one particular step can be obtained in the device function. But this process is not working in the code.
I don't know whether the __host__ __device__ keyword combination works in CUDA V0.2.1221 or not.
Please suggest a way to access the large data file in a device function.
The required portion of the code is given below.
__host__
void magnetic( R *xx, R *magfield)
{
float Bx[1000001],By[1000001];
R x[3];
int k;
FILE *fp;
fp = fopen("field.dat", "r");
for (int i=0;i<=1000000;i++)
{
fscanf(fp, "%f%f", &Bx[i], &By[i]);
}
fclose(fp);
printf("%f\t%f\n",Bx[0],By[0]);
for( int i=0;i<3;i++){
x[i]=*xx;
xx++;
}
//printf("%f\t%f\t%f\n",x[0],x[1],x[2]);
k=round((zi+x[2])/dz);
magfield[0]=Bx[k];
magfield[1]=By[k];
magfield[2]=2.;
}
__host__
void magnetic(R *xx, R *magfield);
__host__ __device__
void field(R *xx, R *magfield){
R x[3];
for( int i=0;i<3;i++){
x[i]=*xx;
xx++;
}
magnetic(x,magfield);
}
__host__ __device__
void field( R *xx, R *magfield);
__device__
void eval_rhs( R *f, R *df, R time, int istep) {
R magfield[3],vXB[3];
df[0] = f[3];
df[1] = f[4];
df[2] = f[5];
field( &f[0], magfield);
crossmultiply(&f[3], magfield, vXB);
df[3] = Ex + vXB[0];
df[4] = Ey + vXB[1];
df[5] = Ez + vXB[2];
}
Question:
Please suggest a way to access the large data file in a device function.
Answer:
Load the file on the host side, copy the buffer from system memory to device memory, and then use this device-side buffer in your kernels and/or device functions.
int main()
{
FILE* file;
float* fileData;
... // Allocate buffer here
... // Open and load file here
float* dev_fileData;
cudaMalloc(/*allocate the same-sized array on device side*/);
cudaMemcpy(/*copy from host-side fileData to device-side*/);
kernel<<<blocks, threads>>>(dev_fileData);
system("pause");
return 0;
}
__device__ void doSomething(float* dev_fileData)
{
// Read (and/or write, but avoid concurrent access) data here
}
__global__ void kernel(float* dev_fileData)
{
// Some kernel code
doSomething(dev_fileData);
// Some other kernel code
}
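For what it's worth, a more concrete version of that skeleton might look like the following; the buffer size is an assumption taken from the loop bounds in the question code, the launch configuration is chosen arbitrarily, and kernel is the __global__ function from the skeleton above (assumed to be declared before main).
int main()
{
    const size_t numRows = 1000001;            // assumed row count, matching the question's loop
    const size_t numFloats = 2 * numRows;      // two columns per row

    // Load the file into a host buffer (error checking omitted for brevity)
    float* fileData = (float*)malloc(numFloats * sizeof(float));
    FILE* file = fopen("field.dat", "r");
    for (size_t i = 0; i < numRows; i++)
        fscanf(file, "%f%f", &fileData[2*i], &fileData[2*i + 1]);
    fclose(file);

    // Allocate the same-sized array on the device and copy the host buffer over
    float* dev_fileData;
    cudaMalloc(&dev_fileData, numFloats * sizeof(float));
    cudaMemcpy(dev_fileData, fileData, numFloats * sizeof(float), cudaMemcpyHostToDevice);

    // The kernel, and any __device__ function it calls, can now index dev_fileData
    kernel<<<1, 1>>>(dev_fileData);
    cudaDeviceSynchronize();

    cudaFree(dev_fileData);
    free(fileData);
    return 0;
}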
I'm trying to create a class that will get allocated on the device. I want the constructor to run on the device, so that the whole object, including the fields inside, is automatically allocated on the device, instead of having to create a host object and then copy it manually to the device.
I'm using thrust::device_new.
Here is my code:
#include <cstdio>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>

using namespace thrust;

class Particle
{
public:
    int* data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle* p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    device_ptr<Particle> p = device_new<Particle>();
    test<<<1,1>>>(thrust::raw_pointer_cast(p));
    cudaDeviceSynchronize();
    printf("Done!\n");
}
I annotated the constructor with __device__ and used thrust's device_new, but this doesn't work. Can someone explain why?
Cheers for help
I believe the answer lies in the description given here. Someone who knows thrust under the hood will probably come along and indicate whether this is true or not.
Although thrust has changed a lot since 2009, I believe device_new may still be using some form of operation where the object is actually temporarily instantiated on the host, then copied to the device. I believe the size limitation described in the above reference is no longer applicable, however.
I was able to get this to work:
#include <stdio.h>
#include <thrust/device_ptr.h>
#include <thrust/device_new.h>
#define N 512
using namespace thrust;
class Particle
{
public:
int data[N];
__device__ __host__ Particle()
{
// data = new int[10];
for (int i=0; i<N; i++)
{
data[i] = i*2;
}
}
};
__global__ void test(Particle* p)
{
for (int i=0; i<N; i++)
printf("%d\n", p->data[i]);
}
int main() {
device_ptr<Particle> p = device_new<Particle>();
test<<<1,1>>>(thrust::raw_pointer_cast(p));
cudaDeviceSynchronize();
printf("Done!\n");
}
Interestingly, it gives bogus results if I omit the __host__ decorator on the constructor, suggesting to me that the temporary-object copy mechanism is still in place. It also gives bogus results (and cuda-memcheck reports out-of-bounds access errors) if I switch to dynamic allocation for data instead of static, again suggesting to me that device_new creates a temporary object on the host and then copies it to the device.
First of all, thanks to Robert Crovella for his input (and previous answers).
So apparently I "overestimated" what device_new can do. I thought it could initialise the object directly on the device, so that any memory dynamically allocated inside would end up on the device too.
But it seems like device_new basically just does the same thing as the manual way:
Particle temp;
Particle *d_p;
cudaMalloc(&d_p, sizeof(Particle));
cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);
So it makes a temporary host object and copies it, just as it would be done manually. That means the memory allocated inside the object lives on the host, and only the pointer gets copied as part of the object. You cannot use that memory in a kernel; you would have to copy it to the device manually, and Thrust doesn't seem to do that.
So it's just a cleaner way of creating a temporary host object and copying it, except that you lose the ability to copy the dynamic memory allocated inside, since you don't have access to that temporary variable.
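For completeness, here is a rough sketch (not from the original answer) of what that manual deep copy looks like, assuming Particle has a host-callable constructor and data points at 10 ints:
Particle temp;                        // temp.data points at host memory
Particle *d_p;
int *d_data;

cudaMalloc(&d_p, sizeof(Particle));
cudaMalloc(&d_data, 10 * sizeof(int));

// copy the inner buffer, then the object itself (which still holds a host pointer)
cudaMemcpy(d_data, temp.data, 10 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_p, &temp, sizeof(Particle), cudaMemcpyHostToDevice);

// patch the device object's data field so it points at the device buffer
cudaMemcpy(&(d_p->data), &d_data, sizeof(int *), cudaMemcpyHostToDevice);
Only after that last copy can a kernel safely dereference p->data.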
I hope that in the future there will be a method or feature in CUDA that lets you initialise the object directly on the device, so that any data dynamically allocated in the constructor (or elsewhere) is allocated on the device too, instead of the tedious way of copying every piece of memory manually.
I have a vector, and I would like to do the following, using CUDA and Thrust transformations:
// thrust::device_vector v;
// for k times:
// calculate constants a and b as functions of k;
// for (i=0; i < v.size(); i++)
// v[i] = a*v[i] + b*v[i+1];
How should I correctly implement this? One way I can do it is to have a vector w, apply thrust::transform to v, and save the results to w. But k is unknown ahead of time, and I don't want to create w1, w2, ... and waste a lot of GPU memory. Preferably I want to minimize the amount of data copying. But I'm not sure how to implement this using one vector without the values stepping on each other. Is there something Thrust provides that can do this?
If v.size() is large enough to fully utilize the GPU, you could launch k kernels to do this, with one extra buffer and no extra data transfer.
thrust::device_vector<float> u(v.size()); // element type assumed; match it to v's
// ping-pong between v and u: each step reads one buffer and writes the other
for (int k = 0;;)
{
    // calculate a & b for this step
    thrust::transform(v.begin(), v.end() - 1, v.begin() + 1, u.begin(), a*_1 + b*_2);
    k++;
    if (k >= K)   // K = the total number of steps ("k times")
        break;

    // calculate a & b for this step
    thrust::transform(u.begin(), u.end() - 1, u.begin() + 1, v.begin(), a*_1 + b*_2);
    k++;
    if (k >= K)
        break;
}
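Note that the a*_1 + b*_2 expressions above are Thrust placeholder expressions; a minimal set of includes for the loop to compile would be something like:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>        // thrust::placeholders

using namespace thrust::placeholders; // brings _1 and _2 into scope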
I don't actually understand the "k times", but the following code may help you.
struct OP {
    const int a, b;
    OP(const int p, const int q) : a(p), b(q) {}

    __host__ __device__
    int operator()(const int v1, const int v2) const {
        return a * v1 + b * v2;
    }
};

thrust::device_vector<int> w(v.size());
thrust::transform(v.begin(), v.end() - 1, // input_1
                  v.begin() + 1,          // input_2
                  w.begin(),              // output
                  OP(a, b));              // functor
v = w;
I think learning about functors, and looking at a few of the Thrust examples, will give you a good guide.
Hope this helps you solve your problem. :)
The question is: is there a way to use the class "vector" in CUDA kernels? When I try, I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So is there a way to use a vector in the __global__ section?
I recently tried the following:
create a new Cuda project
go to properties of the project
open Cuda C/C++
go to Device
change the value in "Code Generation" to be set to this value:
compute_20,sm_20
After that I was able to use the standard library function printf in my CUDA kernel.
Is there a way to use the standard library class vector in kernel code, the way printf is supported? This is an example of using printf in kernel code:
// this code only counts the 3s in an array using CUDA
// private_count is an array that holds every thread's result separately
__global__ void countKernel(int *a, int length, int* private_count)
{
    printf("%d\n", threadIdx.x); // it prints the thread id, and it works
    // vector<int> y;
    // y.push_back(0); is there a possibility to do this?
    unsigned int offset = threadIdx.x * length;
    int i = offset;
    for (; i < offset + length; i++)
    {
        if (a[i] == 3)
        {
            private_count[threadIdx.x]++;
            printf("%d ", a[i]);
        }
    }
}
You can't use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise just copy the contents of the vector to the device and operate on it normally.
In Thrust, the CUDA library, you can use thrust::device_vector<classT> to define a vector on the device, and data transfer between a host STL vector and a device vector is very straightforward. You can refer to this useful link: http://docs.nvidia.com/cuda/thrust/index.html to find some useful examples.
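As an illustration (the kernel name and sizes here are made up, not taken from the question), moving a host std::vector to the device with Thrust and handing it to a kernel can be as short as:
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

__global__ void myKernel(int* data, int n)
{
    // operate on data[0..n) here
}

int main()
{
    std::vector<int> h_vec(1024, 3);                                // host data
    thrust::device_vector<int> d_vec(h_vec.begin(), h_vec.end());  // host -> device copy

    int* d_ptr = thrust::raw_pointer_cast(d_vec.data());           // raw pointer for the kernel
    myKernel<<<4, 256>>>(d_ptr, (int)d_vec.size());
    cudaDeviceSynchronize();

    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());       // device -> host copy back
    return 0;
}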
You can't use std::vector in device code; you should use arrays instead.
I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code. Operator new/delete are also supported. Here is an extremely simple prototype of a device vector in CUDA, but it does work. It hasn't been tested sufficiently.
template<typename T>
class LocalVector
{
private:
T* m_begin;
T* m_end;
size_t capacity;
size_t length;
__device__ void expand() {
capacity *= 2;
size_t tempLength = (m_end - m_begin);
T* tempBegin = new T[capacity];
memcpy(tempBegin, m_begin, tempLength * sizeof(T));
delete[] m_begin;
m_begin = tempBegin;
m_end = m_begin + tempLength;
length = static_cast<size_t>(m_end - m_begin);
}
public:
__device__ explicit LocalVector() : length(0), capacity(16) {
m_begin = new T[capacity];
m_end = m_begin;
}
__device__ T& operator[] (unsigned int index) {
return *(m_begin + index);//*(begin+index)
}
__device__ T* begin() {
return m_begin;
}
__device__ T* end() {
return m_end;
}
__device__ ~LocalVector()
{
delete[] m_begin;
m_begin = nullptr;
}
__device__ void add(T t) {
if ((m_end - m_begin) >= capacity) {
expand();
}
new (m_end) T(t);
m_end++;
length++;
}
__device__ T pop() {
    // m_end points one past the last element, so step back first
    m_end--;
    length--;
    return *m_end;  // the storage itself stays owned by m_begin
}
__device__ size_t getSize() {
return length;
}
};
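As a usage sketch (not part of the original answer), each thread could build its own small vector like this; note that per-thread new/delete draws from the device heap, which is slow and limited in size:
__global__ void useLocalVector()
{
    LocalVector<int> v;
    for (int i = 0; i < 20; i++)   // pushes past the initial capacity of 16, forcing expand()
        v.add(i * i);

    printf("thread %d: size = %u, last = %d\n",
           threadIdx.x, (unsigned)v.getSize(), v[v.getSize() - 1]);
}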
You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including, std::vector is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexprness is effectively limited.)
You probably don't really want to
The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T) - that allocation is on the heap. While this could be adapted to the GPU - it would be quite slow, and incredibly slow if each "CUDA thread" would dynamically allocate its own memory.
Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-kat. Another option, though a bit "heavier", is to use the NVIDIA Thrust library's "device vector" class. Under the hood, it's quite different from the standard library vector though.
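A minimal sketch of the "pointer and a size" option, using plain CUDA runtime calls (the kernel name is just an example):
#include <vector>
#include <cuda_runtime.h>

__global__ void consume(const int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // read data[i] here; no vector is needed on the device side
    }
}

int main()
{
    std::vector<int> h(1 << 20, 3);
    int* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(int));
    cudaMemcpy(d, h.data(), h.size() * sizeof(int), cudaMemcpyHostToDevice);

    consume<<<(int)((h.size() + 255) / 256), 256>>>(d, (int)h.size());
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}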
Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to access a double pointer in a CUDA kernel, loaded from the host with a thrust::device_vector of type Object* (a vector of pointers). When compiling with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error upon attempting to run the program.
I have read the Nvidia forums and the solution seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending it to the kernel... Has anyone found a solution to this problem? The required code is below, thanks in advance!
--------------------------
main.cpp
--------------------------
Sphere * parseSphere(int i)
{
Sphere * s = new Sphere();
s->a = 1+i;
s->b = 2+i;
s->c = 3+i;
return s;
}
int main( int argc, char** argv ) {
int i;
thrust::host_vector<Sphere *> spheres_h;
thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);
//initialize spheres_h
for(i=0;i<NUM_OBJECTS;i++){
Sphere * sphere = parseSphere(i);
spheres_h.push_back(sphere);
}
//initialize spheres_resh
for(i=0;i<NUM_OBJECTS;i++){
spheres_resh[i].a = 1;
spheres_resh[i].b = 1;
spheres_resh[i].c = 1;
}
thrust::device_vector<Sphere *> spheres_dv = spheres_h;
thrust::device_vector<Sphere> spheres_resv = spheres_resh;
Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);
kernelBegin(spheres_d,spheres_res,NUM_OBJECTS);
thrust::copy(spheres_dv.begin(),spheres_dv.end(),spheres_h.begin());
thrust::copy(spheres_resv.begin(),spheres_resv.end(),spheres_resh.begin());
bool result = true;
for(i=0;i<NUM_OBJECTS;i++){
result &= (spheres_resh[i].a == i+1);
result &= (spheres_resh[i].b == i+2);
result &= (spheres_resh[i].c == i+3);
}
if(result)
{
cout << "Data GOOD!" << endl;
}else{
cout << "Data BAD!" << endl;
}
return 0;
}
--------------------------
cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
int index = threadIdx.x + blockIdx.x*blockDim.x;
spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
spheres_res[index].b = (*(spheres_d+index))->b;
spheres_res[index].c = (*(spheres_d+index))->c;
}
void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
int threads = 512;//per block
int grids = ((num_objects)/threads)+1;//blocks per grid
deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}
The basic problem here is that device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces. So when you copy spheres_h to GPU memory, you are winding up with a GPU array of host pointers. Indirection of host pointers on the GPU is illegal - they are pointers in the wrong memory address space, thus you are getting the GPU equivalent of a segfault inside the kernel.
The solution is going to involve replacing your parseSphere function with something that performs memory allocation on the GPU, rather than the current parseSphere, which allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and are using CUDA 3.2 or 4.0, then one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
The alternative approach will involve creating a host array holding device pointers, then copy that array to device memory. You can see an example of that in this answer. Note that it is probably the case that declaring a thrust::device_vector containing thrust::device_vector won't work, so you will likely need to do this array of device pointers construction using the underlying CUDA API calls.
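To illustrate that second approach, a rough sketch (the variable names here are made up for illustration) would build the array of device pointers on the host and then copy that array to the device:
Sphere** d_sphere_ptrs;                      // device array of device pointers
Sphere*  h_ptrs[NUM_OBJECTS];                // host array of device pointers

for (int i = 0; i < NUM_OBJECTS; i++) {
    Sphere h_s;
    h_s.a = 1 + i; h_s.b = 2 + i; h_s.c = 3 + i;
    cudaMalloc(&h_ptrs[i], sizeof(Sphere));  // device allocation for this object
    cudaMemcpy(h_ptrs[i], &h_s, sizeof(Sphere), cudaMemcpyHostToDevice);
}

cudaMalloc(&d_sphere_ptrs, NUM_OBJECTS * sizeof(Sphere*));
cudaMemcpy(d_sphere_ptrs, h_ptrs, NUM_OBJECTS * sizeof(Sphere*), cudaMemcpyHostToDevice);
// d_sphere_ptrs can now be passed to deviceBegin in place of spheres_d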
You should also note that I haven't mentioned the reverse copy operation, which is equally difficult to do.
The bottom line is that Thrust (and C++ STL containers, for that matter) really are not intended to hold pointers. They are intended to hold values, and to abstract away pointer indirection and direct memory access through the use of iterators and underlying algorithms which the user isn't supposed to see. Further, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it executes slower on the GPU as well.