I would like to use the Texture Memory for Interpolation of Data. I have 2 Arrays (namely A[i] and B[i]) and I would want to interpolate Data between them. I thought I could bind them to Texture Memory and set the interpolation but I am not sure how I can do that.
The examples that come with CUDA use the A[i-1] and A[i+1] for the interpolation.
Is there any way to do what I planned? I'm trying this because I think I can get a good speedup.
Yes, you can do this with texture memory, and it is fast. I personally use ArrayFire to accomplish these kinds of operations, because it is faster than I can hope to code by hand.
If you want to code by hand yourself in CUDA, something like this is what you want:
// outside kernel
texture<float,1> A;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr = NULL;
cudaError_t e = cudaMallocArray(&arr, &desc, 1, length);
A.filterMode = cudaFilterModePoint;
A.addressMode[0] = cudaAddressModeClamp;
cudaBindTextureToArray(A, arr, desc);
...
// inside kernel
valA = tex1D(A,1,idx)
valB = tex1D(B,1,idx)
float f = 0.5;
output = (f)*valA + (1-f)*valB;
If you want to just plug-in ArrayFire (which in my experience is faster than what I try to code by hand, not to mention way simpler to use), then you'll want:
// in arrayfire
array A = randu(10,1);
array B = randu(10,1);
float f = 0.5;
array C = (f)*A + (1-f)*B;
The above assumes you want to interpolate between corresponding indices of 2 different arrays or matrices. There are other interpolation functions available too.
If you're not used to developing with CUDA, using texture memory is not the easiest thing to start with.
I'd suggest you to try writing a first parallel version of your algorithm in CUDA with no optimisation. Then, use the NVIDIA Visual Profiler on your application to figure out whether you need to set up texture memory to optimize your memory accesses.
Remember that the earlier you optimize, the trickier it is to debug.
Last but not least, the latest CUDA version (CUDA 5, still in release candidate) is able to automatically store your data in texture memory as long as you declare the input buffers passed as parameters to your kernel as const restrict pointers.
Related
I have the following code line, gamma is a CPU variable, that after i will need to copy to GPU. gamma_x and delta are also stored on CPU. Is there any way that i can execute the following line and store its result directly on GPU? So basically, host gamma, gamma_x and delta on GPU and get the output of the following line on GPU. It would speed up my code a lot for the lines after.
I tried with magma_dcopy but so far i couldn't find a way to make it working because the output of magma_ddot is CPU double.
gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;
The very short answer is no, you can't do this, or least not if you use magma_ddot.
However, magma_ddot is itself a only very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.
In theory you could do something like this:
// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));
for (int i=....) {
// ...
// magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, &dotresult);
cudaDeviceSynchronize();
cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);
// Now dotresult holds the magma_ddot result in device memory
// ...
}
Note that might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally and how CUBLAS state and asynchronous operations are handled inside Magma are completely undocumented. Having said that, if you are careful, it should be OK.
To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.
I launch a very simple kernel <<<1,512>>> on a CUDA Fermi GPU.
__global__ void kernel(){
int x1,x2;
x1=5;
x2=1;
for (int k=0;k<=1000000;k++)
{
x1+=x2;
}
}
The kernel is very simple, it does 10^6 additions and does not transfer anything back to global memory. The result is correct, i.e. after the loop x1 (in all its 512 thread instances) contains 10^6 + 5
I am trying to measure the execution time of the kernel. using both visual studio parallel nsight and nvvp. Nsight measures 2.5 microseconds and nvvp measures 4 microseconds.
The issue is the following: I may increase largely the size of the loop eg to 10^8 and the time remains constant. Same if I decrease the loop size a lot. Why does this happen?
Please note that if I use shared memory or global memory inside the loop, the measurements reflect the work being performed (i.e. there is proportionality).
As noted, CUDA compiler optimisation is very aggressive at removing dead code. Because x2 doesn't participate in a value which is written to memory, it and the loop can be removed. The compiler will also pre-calculate any results which can be deduced at compile time, so if all the constants in the loop are known to the compiler, it can compute the final result and replace it with a constant.
To get around both of these problems, rewrite your code like this:
__global__
void kernel(int *out, int x0, bool flag)
{
int x1 = x0, x2 = 1;
for (int k=0; k<=1000000; k++) {
x1+=x2;
}
if (flag) out[threadIdx.x + blockIdx.x*blockDim.x] = x1;
}
and then run it like this:
kernel<<<1,512>>>((int *)0, 5, false);
By passing the initial value of x1 as an argument to the kernel, you ensure that the loop result isn't available to the compiler. The flag makes the memory store conditional, and then memory store makes the whole calculation unsafe to remove. As long as the flag is set to false at runtime, there is no store performed, so that doesn't effect the timing of the loop.
Because compiler eliminates the dead paths. You code doesn't actually do anything. Look at the assembly.
If you are actually seeing the value, then the compiler may have just optimized out the loop as it can know the value during compile time.
When you write out the register contents to shared memory, compiler cannot guarantee that the result will not be used, and hence the value will actually be computed. In other words, the value you compute must be used somewhere eventually or written to memory otherwise its computation will be dropped.
Is there a way to convert a 2D vector into an array to be able to use it in CUDA kernels?
It is declared as:
vector<vector<int>> information;
I want to cudaMalloc and copy from host to device, what would be the best way to do it?
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice);
In a word, no there isn't. The CUDA API doesn't support deep copying and also doesn't know anything about std::vector either. If you insist on having a vector of vectors as a host source, it will require doing something like this:
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
int *dst = d_information;
for (std::vector<std::vector<int> >::iterator it = information.begin() ; it != information.end(); ++it) {
int *src = &((*it)[0]);
size_t sz = it->size();
cudaMemcpy(dst, src, sizeof(int)*sz, cudaMemcpyHostToDevice);
dst += sz;
}
[disclaimer: written in browser, not compiled or tested. Use at own risk]
This would copy the host memory to an allocation in GPU linear memory, requiring one copy for each vector. If the vector of vectors is a "jagged" array, you will want to store an indexing somewhere for the GPU to use as well.
As far as I understand, the vector of vectors do not need to reside in a contiguous memory, i.e. they can be fragmented.
Depending on the amount of memory you need to transfer I would do one of two issues:
Reorder your memory to be a single vector, and then use your cudaMemcpy.
Create a series of cudaMemcpyAsync, where each copy handles a single vector in your vector of vectors, and then synchronize.
I need to perform a parallel reduction to find the min or max of an array on a CUDA device. I found a good library for this, called Thrust. It seems that you can only perform a parallel reduction on arrays in host memory. My data is in device memory. Is it possible to perform a reduction on data in device memory?
I can't figure how to do this. Here is documentation for Thrust: http://code.google.com/p/thrust/wiki/QuickStartGuide#Reductions. Thank all of you.
You can do reductions in thrust on arrays which are already in device memory. All that you need to do is wrap your device pointers inside thrust::device_pointer containers, and call one of the reduction procedures, just as shown in the wiki you have linked to:
// assume this is a valid device allocation holding N words of data
int * dmem;
// Wrap raw device pointer
thrust::device_ptr<int> dptr(dmem);
// use max_element for reduction
thrust::device_ptr<int> dresptr = thrust::max_element(dptr, dptr+N);
// retrieve result from device (if required)
int max_value = dresptr[0];
Note that the return value is also a device_ptr, so you can use it directly in other kernels using thrust::raw_pointer_cast:
int * dres = thrust::raw_pointer_cast(dresptr);
If thrust or any other library does not provides you such a service you can still create that kernel yourself.
Mark Harris has a great tutorial about parallel reduction and its optimisations on cuda.
Following his slides it is not that hard to implement and modify it for your needs.
This is my code. I have lot of threads so that those threads calling this function many times.
Inside this function I am creating an array. It is an efficient implementation?? If it is not please suggest me the efficient implementation.
__device__ float calculate minimum(float *arr)
{
float vals[9]; //for each call to this function I am creating this arr
// Is it efficient?? Or how can I implement this efficiently?
// Do I need to deallocate the memory after using this array?
for(int i=0;i<9;i++)
vals[i] = //call some function and assign the values
float min = findMin(vals);
return min;
}
There is no "array creation" in that code. There is a statically declared array. Further, the standard CUDA compilation model will inline expand __device__functions, meaning that the vals will be compiled to be in local memory, or if possible even in registers.
All of this happens at compile time, not run time.
Perhaps I am missing something, but from the code you have posted, you don't need the temporary array at all. Your code will be (a little) faster if you do something like this:
#include "float.h" // for FLT_MAX
__device__ float calculate minimum(float *arr)
{
float minVal = FLT_MAX:
for(int i=0;i<9;i++)
thisVal = //call some function and assign the values
minVal = min(thisVal,minVal);
return minVal;
}
Where an array is actually required, there is nothing wrong with declaring it in this way (as many others have said).
Regarding the "float vals[9]", this will be efficient in CUDA. For arrays that have small size, the compiler will almost surely allocate all the elements into registers directly. So "vals[0]" will be a register, "vals[1]" will be a register, etc.
If the compiler starts to run out of registers, or the array size is larger than around 16, then local memory is used. You don't have to worry about allocating/deallocating local memory, the compiler/driver do all that for you.
Devices of compute capability 2.0 and greater do have a call stack to allow things like recursion. For example you can set the stack size to 6KB per thread with:
cudaStatus = cudaThreadSetLimit(cudaLimitStackSize, 1024*6);
Normally you won't need to touch the stack yourself. Even if you put big static arrays in your device functions, the compiler and driver will see what's there and make space for you.