Just like the topic says. Can one access a CUDA texture using integer coordinates?
ex.
tex2D(myTex, 1, 1);
I'd like to store float values in a texture and use it as my framebuffer.
I will then pass it to OpenGL to render on screen.
Is this addressing possible? I don't want to interpolate between pixels; I want the value from an exactly specified point.
Note: there isn't really any interpolation going on when you use the 0.5 offset notation for multi-dimensional textures (texel centers sit at half-integer coordinates, so the first pixel value is at (0.5, 0.5)). If you're really worried, set the filter mode to point sampling (cudaFilterModePoint) rather than linear filtering.
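For illustration, here is a minimal sketch of an exact-value read using the (legacy) texture reference API that the tex2D call above implies; myTex, width and height are placeholder names, and the texture is assumed to be bound elsewhere with cudaBindTexture2D:

// Point (nearest) filtering; on the host you would set
// myTex.filterMode = cudaFilterModePoint before binding.
texture<float, cudaTextureType2D, cudaReadModeElementType> myTex;

__global__ void readExact(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        // Texel centers sit at half-integer coordinates, so the +0.5f
        // offset lands exactly on a stored value and nothing is blended.
        out[y * width + x] = tex2D(myTex, x + 0.5f, y + 0.5f);
}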
If you use 1D textures instead (when the underlying data is 2D), you may lose performance due to lack of data locality in the other dimension.
If you want to use the texture cache without using any of the texture-specific operations such as interpolation, you can use tex1Dfetch(). This lets you index with integers.
The size limit is 2^27 elements, so you will be able to access 512 MB with floats, or 1 GB with int2 [which can also be used to retrieve doubles via __hiloint2double()]. Larger data can be accessed by mapping multiple textures over it that together cover the data.
You will have to map any multi-dimensional array accesses to the one-dimensional array supported by tex1Dfetch(). I have always used simple C macros for that.
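As an illustration of the macro approach (not the exact macros I use), here is a sketch that flattens a 2D access for tex1Dfetch, assuming a float texture bound to linear device memory with cudaBindTexture; INDEX2D, dataTex, width and height are made-up names:

#define INDEX2D(x, y, pitchInElements) ((y) * (pitchInElements) + (x))

texture<float, cudaTextureType1D, cudaReadModeElementType> dataTex;

__global__ void gather(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        // tex1Dfetch takes a plain integer index, so the 2D access is
        // flattened through the macro; the read still goes through the
        // texture cache, but no filtering or addressing modes apply.
        out[INDEX2D(x, y, width)] = tex1Dfetch(dataTex, INDEX2D(x, y, width));
}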
If I have a 200-element array in texture memory with linear interpolation enabled, then to access the value of the first element I need to fetch at coordinate 0.5, not 0. In general I need to fetch at desiredIndex + 0.5; this ensures the coordinates cover [0, 200] across the image.
How does this work with normalized texture coordinates? Are 0 and 1 the corners of the array, or the first and last element values? To access the first element, would I need to use (0 + 0.5)/200?
As seen in the documentation on Texture Fetching, and specifically in the figures there:
[0, 1] are the corners of the array, so indeed to access a specific array value in normalized units one needs to fetch at (desiredIndex + 0.5)/totalSize.
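A small sketch of what that looks like in a kernel, assuming a 1D float texture bound to a cudaArray with tex.normalized = 1 and linear filtering set on the host; tex, out and N are illustrative names:

texture<float, cudaTextureType1D, cudaReadModeElementType> tex;

__global__ void fetchAll(float *out, int N)  // N = 200 in the question
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        // (i + 0.5f) / N lands on the center of texel i, so linear
        // filtering returns the stored value unblended.
        out[i] = tex1D(tex, (i + 0.5f) / N);
}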
I want to calculate the mean image from a dataset of images (around 100). All the images are two-dimensional. Can I use the built-in cudaMalloc3D function, or is there any other way to allocate the memory?
I often treat a multidimensional array as a 1D array in CUDA. Say you want to allocate a 3D array of size N x M x K. Then, with cudaMalloc, you can allocate a 1D array a of size N*M*K. To access the element with indices [i][j][k], you just read a[i + j*N + k*N*M] (assuming 0-based indexing and column-major ordering).
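For example, here is a sketch of that indexing scheme applied to the mean-image problem; all names are illustrative:

#define IDX3(i, j, k, N, M) ((i) + (j) * (N) + (k) * (N) * (M))

__global__ void meanImage(const float *stack, float *mean, int N, int M, int K)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column
    if (i < N && j < M) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)  // walk through the image stack
            sum += stack[IDX3(i, j, k, N, M)];
        mean[i + j * N] = sum / K;
    }
}

// Host side: a single allocation covers the whole stack, e.g.
// float *dStack; cudaMalloc(&dStack, N * M * K * sizeof(float));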
This linear indexing is also the way threads are indexed in multidimensional blocks (you can have 1D, 2D or 3D blocks):
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
This question is about the buffer required by cuFFT. In the User Guide it is documented that
In the worst case, the CUFFT Library allocates space for
8*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements
(where batch denotes the number of transforms that will be executed in
parallel, rank is the number of dimensions of the input data (see
Multidimensional transforms) and n[] is the array of transform
dimensions) for single- and double-precision transforms respectively.
What does "array of transform dimensions" mean? How much buffer does cuFFT need? What I understand from the above is that it needs at least 8x the size of the array being FFTed, but that does not make sense to me.
Thanks in advance
Daniel
The "array of transform dimensions" is the array containing the problem size in each dimension, see the section on multidimensional transforms for more information.
cuFFT allocates temporary space to accommodate the intermediate data. The part of the doc you quoted says this is "the worst case", so it's not "at least 8x", it's at most 8x. The doc goes on to say:
Depending on the configuration of the plan, less memory may be used.
In some specific cases, the temporary space allocations can be as low
as 1*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex
elements.
So for an NxM 2D single-precision transform:
1*N*M*sizeof(cufftComplex) <= space for tmp data <= 8*N*M*sizeof(cufftComplex)
Use cufftEstimate1d and cufftGetSize1d to find out how much memory is allocated for the buffer. The documentation says cufftEstimate1d gives an estimate of the maximum amount, while cufftGetSize1d provides a more precise estimate.
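For illustration, here is a minimal host-side sketch querying both estimates for a 1D complex-to-complex transform; nx and batch stand in for your own problem size:

#include <cufft.h>
#include <cstdio>

int main()
{
    int nx = 8192, batch = 100;
    size_t upperBound = 0, refined = 0;

    // Upper-bound estimate; no plan handle is required.
    cufftEstimate1d(nx, CUFFT_C2C, batch, &upperBound);

    // More precise estimate for a specific plan handle.
    cufftHandle plan;
    cufftCreate(&plan);
    cufftGetSize1d(plan, nx, CUFFT_C2C, batch, &refined);

    printf("upper bound: %zu bytes, refined: %zu bytes\n", upperBound, refined);
    cufftDestroy(plan);
    return 0;
}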
In my case I use both 64- and 8192-point FFTs, and I see the same thing: the buffer allocates only 1*batch*n[0] elements. I've tested with different amounts of data and different FFT sizes and always get this same value.
To conclude, if you need to determine the memory used by an FFT, the cuFFT library provides functions to do this.
Is it possible to assign a value to texture memory at a non-integer co-ordinate?
i.e. assume we have a one-dimensional texture memory array. I understand we can allocate array elements at integer co-ordinates. We can then READ values at fractional co-ordinates, using linear interpolation.
My question is: does CUDA allow the programmer to WRITE values to fractional co-ordinates?
Thanks.
It is not possible to write to fractional coordinates. There would be nowhere for the hardware to store the new values. Even though you can read with linear interpolation, the values between which interpolation is being performed can only be stored at integer locations in memory.
One way to implement this might be to write a kernel that reads your initial array of values and creates a higher-resolution array with interpolated values. Then you write your new values into this new array at the integer locations closest to the ones you actually want to write to.
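Here is a sketch of that idea for the 1D case, assuming the original values sit in a cudaArray bound to srcTex with linear filtering enabled; srcTex, the upsampling factor R and the kernel name are all made up:

texture<float, cudaTextureType1D, cudaReadModeElementType> srcTex;

__global__ void upsample(float *dst, int srcLen, int R)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < srcLen * R)
        // Fine index i maps back to source coordinate i/R; the +0.5f
        // offset aligns with texel centers, so the hardware interpolates
        // between the two neighbouring stored values.
        dst[i] = tex1D(srcTex, (float)i / R + 0.5f);
}

You can then write your new values into dst at the integer index nearest the fractional co-ordinate you had in mind.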
I have a curve as follows:
float points[] = {1, 4, 6, 9, 14, 25, 69};
float images[] = {0.3, 0.4, 0.7, 0.9, 1, 2.5, 5.3};
In order to interpolate, say, f(3), I would use linear interpolation between the points 1 and 4.
In order to interpolate, say, f(15), I would apply a binary search on the array of points, get the lower bound (which is 25), and interpolate over the interval [14, 25], and so on.
I have found that this method makes my device function very slow. I've heard I can use texture memory and tex1D to do this instead. Is that possible even if points[] is not uniform (i.e. not incremented by a constant step)?
Any ideas?
It looks like this problem can be broken into two parts:
1. Use the points array to convert the x value in f(x) to a floating point index between 0 and 6 (this requires a binary search on points[]).
2. Use that floating point index to get a linearly interpolated value from the images array.
CUDA texture memory can make step 2 very fast. I am guessing, however, that most of the time in your kernel is spent on step 1, and I don't think texture memory can help you there.
If you aren't already taking advantage of shared memory, moving your arrays to shared memory will give you a much bigger speedup than using texture memory. There is 48 KB of shared memory on recent hardware, so if each array is less than 24 KB (6K float elements) they will both fit. Step 1 can benefit greatly from shared memory because it requires non-contiguous reads of points[], which are very slow in global memory.
If your arrays don't fit in shared memory, break them into equally sized pieces of 6K elements each and assign each piece to a block. Have each block read through all of the points you are interpolating, and have it ignore a point if it does not fall within the portion of the points[] array stored in its shared memory.
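Putting both steps together, here is a minimal sketch of the shared-memory version for arrays that fit entirely in a block's shared memory; points and images are the arrays from the question, everything else is illustrative:

#define NUM_POINTS 7  // size of points[] and images[] in the question

__global__ void interp(const float *points, const float *images,
                       const float *xs, float *ys, int n)
{
    __shared__ float sPoints[NUM_POINTS];
    __shared__ float sImages[NUM_POINTS];
    if (threadIdx.x < NUM_POINTS) {  // stage both arrays once per block
        sPoints[threadIdx.x] = points[threadIdx.x];
        sImages[threadIdx.x] = images[threadIdx.x];
    }
    __syncthreads();  // assumes blockDim.x >= NUM_POINTS

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float x = xs[tid];

    // Step 1: binary search for the interval [lo, lo+1] containing x
    // (values outside the range are linearly extrapolated).
    int lo = 0, hi = NUM_POINTS - 1;
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (sPoints[mid] <= x) lo = mid; else hi = mid;
    }

    // Step 2: linear interpolation within that interval.
    float t = (x - sPoints[lo]) / (sPoints[lo + 1] - sPoints[lo]);
    ys[tid] = sImages[lo] + t * (sImages[lo + 1] - sImages[lo]);
}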