What are the advantages of using Thrust's device_malloc instead of the normal cudaMalloc, and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on a array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating CUDA calls makes it easier and usually cleaner. The same thing goes for C++ and STL containers versus C-style arrays and malloc.
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new (device_ptr< void > p, const size_t n = 1)
p: A device_ptr to a region of device memory into which to construct
one or many Ts.
Basically, this overload is for memory that has already been allocated: it only constructs objects there. Only the default constructor is called, and the function returns a device_ptr cast to T.
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T >
device_ptr<T> thrust::device_new (const size_t n = 1)
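To make the difference between the two overloads concrete, here is a minimal sketch of the placement form: allocate untyped memory with device_malloc, then construct objects in it with device_new. The Point type and the constructInPlace function are hypothetical names invented for this illustration, and error handling is left out.

```cuda
#include <thrust/device_malloc.h>
#include <thrust/device_new.h>
#include <thrust/device_free.h>
#include <thrust/device_ptr.h>

// Hypothetical type used only for this illustration.
struct Point
{
    int x, y;
    __host__ __device__ Point() : x(1), y(2) {}
};

// Constructs four Points in pre-allocated device memory and returns
// the x member of the first one, read back on the host.
int constructInPlace()
{
    // Step 1: allocate raw, untyped device memory.
    thrust::device_ptr<void> raw = thrust::device_malloc(4 * sizeof(Point));

    // Step 2: default-construct 4 Points in that memory.
    // This overload performs no allocation of its own.
    thrust::device_ptr<Point> pts = thrust::device_new<Point>(raw, 4);

    // Indexing a device_ptr on the host copies the element back.
    Point first = pts[0];

    thrust::device_free(raw);
    return first.x;
}
```

The single-argument overload, thrust::device_new<Point>(4), would combine both steps into one call.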
So I think I found one good use for device_new.
It's basically a better way of initialising an object and copying it to the device, while holding a pointer to it on the host.
So instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));
I need to find the index of the maximum element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, and this is slowing down the running time of the application.
Is there a way to compute this index efficiently and store it in the GPU?
Thanks!
Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), it is possible to have routines which return a scalar or index to store those directly into a variable in device memory, rather than into a host variable (which entails a device to host transfer and might leave the result in the wrong memory space).
To use this, you need to use the cublasSetPointerMode call to tell the CUBLAS context to expect pointers for scalar arguments to be device pointers by using the CUBLAS_POINTER_MODE_DEVICE mode. This then implies that in a call like
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
const float *x, int incx, int *result)
that result must be a device pointer.
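A minimal sketch of that device pointer mode follows. The function name isamaxOnDevice and the sample data are assumptions made for the illustration, error checking is omitted, and the final copy back to the host exists only so the result can be inspected; in a real pipeline the index would stay in d_result for later kernels. Note that cublasIsamax compares absolute values and returns a 1-based index.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Runs cublasIsamax with the result written to device memory.
// Returns the (1-based) index after copying it back for inspection.
int isamaxOnDevice()
{
    const float h_x[4] = {1.0f, -7.0f, 3.0f, 2.0f};

    float *d_x = nullptr;
    int *d_result = nullptr;
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Tell cuBLAS that scalar/result arguments are device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    // The index is stored in d_result; nothing is returned to the host here.
    cublasIsamax(handle, 4, d_x, 1, d_result);

    // Only for demonstration: fetch the index to look at it.
    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_result);
    cudaFree(d_x);
    return h_result;
}
```

With the sample data above, the largest absolute value is -7 at position 2 (counting from 1).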
If you want to use CUBLAS and you have a GPU with compute capability 3.5 (K20, Titan), then you can use CUBLAS with dynamic parallelism. You can then call CUBLAS from within a kernel on the GPU, and no data will be returned to the CPU.
If you have no device with compute capability 3.5, you will probably have to implement a find-max function yourself or look for an additional library.
So I asked a question before about how to allocate an object on the device directly instead of the "normal":
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
class Particle
{
public:
    int *data;
    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};
__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}
__global__ void test2(Particle *p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}
int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);
    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);
    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();
    printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine!
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point in creating a pointer on the host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of CUDA (CUDA 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it), but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
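One way around this wrinkle, sketched below under the assumption that you only need the contents of the object on the host, is to let a kernel copy the data out of the device heap into a buffer obtained from cudaMalloc, and then cudaMemcpy that buffer. The kernel names create, exportData and destroy, and the host function lastValue, are hypothetical names for this illustration; the Particle class mirrors the one in the question, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Same layout as the Particle class in the question.
struct Particle
{
    int *data;
    __device__ Particle()
    {
        data = new int[10];              // lives in the device heap
        for (int i = 0; i < 10; i++)
            data[i] = i * 2;
    }
};

// Allocates a Particle on the device heap and publishes its address.
__global__ void create(Particle **out) { *out = new Particle(); }

// Copies heap-resident data into a cudaMalloc'ed buffer.
__global__ void exportData(const Particle *p, int *dst)
{
    for (int i = 0; i < 10; i++)
        dst[i] = p->data[i];
}

// Device-heap allocations must be released from device code.
__global__ void destroy(Particle *p)
{
    delete[] p->data;
    delete p;
}

int lastValue()
{
    Particle **d_slot = nullptr;   // cudaMalloc'ed, so cudaMemcpy can read it
    int *d_buf = nullptr;
    cudaMalloc(&d_slot, sizeof(Particle *));
    cudaMalloc(&d_buf, 10 * sizeof(int));

    create<<<1, 1>>>(d_slot);

    // Copying the pointer VALUE is fine; it is stored in cudaMalloc'ed memory.
    Particle *p = nullptr;
    cudaMemcpy(&p, d_slot, sizeof(Particle *), cudaMemcpyDeviceToHost);

    // Stage the heap data in d_buf, then copy d_buf to the host.
    exportData<<<1, 1>>>(p, d_buf);
    int h_buf[10];
    cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);

    destroy<<<1, 1>>>(p);
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    cudaFree(d_slot);
    return h_buf[9];
}
```

The extra kernel costs one device-to-device copy, but it keeps every cudaMemcpy operating on memory that the runtime API itself allocated.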
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1), but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and compiler warning.
Possible Duplicate:
Copying a struct containing pointers to CUDA device
I have a structure of device pointers, pointing to arrays allocated on the device.
like this
struct mystruct {
    int* dev1;
    double* dev2;
    // ...
};
There are a large number of arrays in this structure. I started writing a CUDA kernel in which I passed the pointer to mystruct and then dereferenced it within the CUDA kernel code, like this: mystruct->dev1[i].
But I realized after writing a few lines that this will not work, since by CUDA first principles you cannot dereference a host pointer (in this case to mystruct) within a CUDA kernel.
This is rather inconvenient, since I will have to pass a large number of arguments to my kernels. Is there any way to avoid this? I would like to keep the number of arguments to my kernel calls as small as possible.
As I explain in this answer, you can pass your struct by value to the kernel, so you don't have to worry about dereferencing a host pointer:
__global__ void kernel(mystruct in)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    in.dev1[idx] *= 2;
    in.dev2[idx] += 3.14159;
}
There is the overhead of passing the struct by value to be aware of. However if your struct is not too large, it shouldn't matter.
If you pass the same struct to a lot of kernels, or repeatedly, you may consider copying the struct itself to global or constant memory instead as suggested by aland, or use mapped host memory as suggested by Mark Ebersole. But passing the struct by value is a much simpler way to get started.
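For completeness, here is a sketch of what the host side of the pass-by-value approach might look like. The struct itself lives in host memory, only its pointer members refer to device memory, and the launch copies the struct into the kernel's argument space. The function name passByValueDemo, the extra bounds parameter n, and the array length are assumptions made for this illustration; error checking is omitted.

```cuda
#include <cuda_runtime.h>

struct mystruct
{
    int *dev1;
    double *dev2;
};

// The struct is received by value; its members are device pointers.
__global__ void kernel(mystruct in, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
    {
        in.dev1[idx] *= 2;
        in.dev2[idx] += 3.14159;
    }
}

double passByValueDemo()
{
    const int n = 256;

    // Host-resident struct whose members point to device arrays.
    mystruct s;
    cudaMalloc(&s.dev1, n * sizeof(int));
    cudaMalloc(&s.dev2, n * sizeof(double));
    cudaMemset(s.dev1, 0, n * sizeof(int));
    cudaMemset(s.dev2, 0, n * sizeof(double));   // all-zero bytes == 0.0

    // The struct is copied as a kernel argument; no host pointer is dereferenced.
    kernel<<<1, n>>>(s, n);

    double first = 0.0;
    cudaMemcpy(&first, s.dev2, sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(s.dev2);
    cudaFree(s.dev1);
    return first;
}
```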
(Note: please search StackOverflow before duplicating questions...)
You can copy your mystruct structure to global memory and pass its device address to the kernel.
From a performance viewpoint, however, it would be better to store mystruct in constant memory, since (I guess) many threads will be reading the same fields from it.
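A rough sketch of the constant memory variant: declare a __constant__ instance of the struct and fill it from the host with cudaMemcpyToSymbol, so kernels need no struct argument at all. The names c_params, fillIndices and constantDemo are hypothetical and exist only for this illustration; error checking is omitted.

```cuda
#include <cuda_runtime.h>

struct mystruct
{
    int *dev1;
    double *dev2;
};

// One copy of the struct in constant memory, visible to all kernels
// in this translation unit.
__constant__ mystruct c_params;

// Reads the device pointers from constant memory instead of an argument.
__global__ void fillIndices(int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c_params.dev1[idx] = idx;
}

int constantDemo()
{
    const int n = 32;

    mystruct h;
    h.dev2 = nullptr;                       // unused in this sketch
    cudaMalloc(&h.dev1, n * sizeof(int));

    // Copy the struct (i.e. the pointer values) into constant memory.
    cudaMemcpyToSymbol(c_params, &h, sizeof(mystruct));

    fillIndices<<<1, n>>>(n);

    int last = -1;
    cudaMemcpy(&last, h.dev1 + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(h.dev1);
    return last;
}
```

Only the struct of pointers sits in constant memory here; the arrays it points to remain in ordinary global memory.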
You could also use page-locked (pinned) host memory and create the structure within that region if your setup supports it. Please see 3.2.4 of the CUDA programming guide.
Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About the printf: the problem is that it only works on devices of compute capability 2.0 and higher. There is an alternative, cuPrintf, that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you write extern __shared__ int array[];
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
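The two steps above can be sketched as follows. The kernel reverseInBlock and the host function reverseDemo are hypothetical names chosen for this illustration, a single block is assumed, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Reverses n ints within one block using a shared array whose size
// is chosen at launch time.
__global__ void reverseInBlock(int *data, int n)
{
    extern __shared__ int buf[];   // size comes from the 3rd launch parameter

    int t = threadIdx.x;
    if (t < n)
        buf[t] = data[t];
    __syncthreads();               // every element is staged before reading
    if (t < n)
        data[t] = buf[n - 1 - t];
}

int reverseDemo()
{
    const int n = 64;
    int h[n];
    for (int i = 0; i < n; i++)
        h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in BYTES.
    reverseInBlock<<<1, n, n * sizeof(int)>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h[0];
}
```

If a kernel needs several dynamically sized shared arrays, they all alias the same extern buffer, so you would carve it up manually with offsets.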
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes, stores them in an array, then after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because it would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
This is my code. I have a lot of threads, and those threads call this function many times.
Inside this function I am creating an array. Is it an efficient implementation? If it is not, please suggest an efficient implementation.
__device__ float calculateMinimum(float *arr)
{
    float vals[9]; // for each call to this function I am creating this array
    // Is it efficient? Or how can I implement this efficiently?
    // Do I need to deallocate the memory after using this array?
    for (int i = 0; i < 9; i++)
        vals[i] = /* call some function and assign the values */;
    float min = findMin(vals);
    return min;
}
There is no "array creation" in that code. There is a statically declared array. Further, the standard CUDA compilation model will inline expand __device__ functions, meaning that vals will be compiled to be in local memory, or if possible even in registers.
All of this happens at compile time, not run time.
Perhaps I am missing something, but from the code you have posted, you don't need the temporary array at all. Your code will be (a little) faster if you do something like this:
#include <cfloat> // for FLT_MAX
__device__ float calculateMinimum(float *arr)
{
    float minVal = FLT_MAX;
    for (int i = 0; i < 9; i++)
    {
        float thisVal = /* call some function and assign the value */;
        minVal = fminf(thisVal, minVal); // keep the running minimum
    }
    return minVal;
}
Where an array is actually required, there is nothing wrong with declaring it in this way (as many others have said).
Regarding the "float vals[9]", this will be efficient in CUDA. For small arrays whose indices can all be resolved at compile time (for example once the loop is unrolled), the compiler will almost surely place the elements directly in registers. So "vals[0]" will be a register, "vals[1]" will be a register, etc.
If the compiler starts to run out of registers, or the array size is larger than around 16, then local memory is used. You don't have to worry about allocating or deallocating local memory; the compiler and driver do all that for you.
Devices of compute capability 2.0 and greater do have a call stack to allow things like recursion. For example you can set the stack size to 6KB per thread with:
cudaStatus = cudaDeviceSetLimit(cudaLimitStackSize, 1024*6); // cudaThreadSetLimit is the older, deprecated name
Normally you won't need to touch the stack yourself. Even if you put big static arrays in your device functions, the compiler and driver will see what's there and make space for you.