Required buffer for cuFFT - cuda

This question is about the buffer required by cuFFT. In the User Guide it is documented that
In the worst case, the CUFFT Library allocates space for
8*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements
(where batch denotes the number of transforms that will be executed in
parallel, rank is the number of dimensions of the input data (see
Multidimensional transforms) and n[] is the array of transform
dimensions) for single and doubleprecision transforms respectively.
What does "array of transform dimensions" mean? How much buffer does cuFFT need? What I understand with the above is that it needs at least 8x the size of the array being FFTed but this does not make sense to me
Thanks in advance
Daniel

The "array of transform dimensions" is the array containing the problem size in each dimension, see the section on multidimensional transforms for more information.
cuFFT is allocating temporary space to be able to accommodate the intermediate data, the part of the doc you quoted says this is "the worst case", so it's not "at least 8x", it's at most. The doc goes on to say:
Depending on the configuration of the plan, less memory may be used.
In some specific cases, the temporary space allocations can be as low
as 1*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex
elements.
So for a NxM 2D single precision transform:
1*N*M*sizeof(cufftComplex) <= space for tmp data <= 8*N*M*sizeof(cufftComplex)

Use cufftGetSize1d and cufftEstimate1d to give you the amount of memory allocated for the buffer. The documentation says cufftPlan1d gives an estimation of the maximum amount and cufftGetSize1d provide a more precise estimation.
In my case I use both 64 and 8192 point FFTs. I get the same issue, the buffer size allocate only 1*batch*n[0] elements.I've made the test with different amount of data and different FFT size and I get this same value.
To conclude, if you need to determine the memory used by a FFT, the CuFFT library provide a fonction to do this.

Related

How to allocate pinned memory to a 2-dimensional array using CudaMallocHost?

How to allocate pinned memory to a 2-dimensional array using CudaMallocHost?
Looking forward to any help!
(Host) memory is one-dimensional. Just like you allocate n * m * sizeof(T) bytes for a two-dimensional, n-by-m, array of type-T elements, with malloc() (or new[], or std::make_unique()) - you do the same with cudaMallocHost().
Now, it's true that the above is not the only way to model a two-dimensional array. As explained in the C FAQ, question 6.16, we may sometimes use an array-of-pointers, each of which points to a 1-D array of the minor dimension. This too can be done using cudaMallocHost() - again, by simply substituting it for malloc(). However, note this indirection has a performance penalty.
If you want array rows to be nicely-aligned, you might want to pad each row with some unused elements; but that's again true both for regular host-side memory allocation and for cudaMallocHost().

Allocating memory for multiple images in cuda to calculate their mean image

I want to calculate the mean image from the dataset of images(around 100). All the images are 2 dimensional. Can i go for cudaMalloc3D inbuilt function or is their any other way to allocate memory..
I often treat multidimensional array as 1D array in cuda. Let's say, you want to allocate 3D array of size (NxMxK). Then, with the cudaMalloc command, you can allocate 1D array a of size (N*M*K). In order to access element with indexes [i][j][k], you just call a[i+j*N+k*N*M] (assuming 0-based indexing, column-major ordering).
This is also a way to index threads in multidimensional blocks (you can have 1D,2D or 3D blocks):
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

cublas: same input and output matrix for better performance?

I see CUBLAS may be an efficient algorithm package for a single large matrices multiplication or addition etc. But in a common setting, most computations are dependent. So the next step relies on the result of the previous step.
This causes one problem, because the output matrix has to be different from input matrix in CUBLAS routine( as input matrices are const ), much time are spend to malloc space and copy data from device to device for these temporary matrices.
So is it possible to do things like multiply(A, A, B), where the first argument is ouput matrix and the second/third are input matrices, to avoid extra memory manipulation time? Or is there a better workaround?
Thanks a lot !
No, it is not possible to perform in-place operations like gemm using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work).
Having said that, this comment:
.... much time are spend to malloc space and copy data from device to device for these temporary matrices.
makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device to device memory copies when using such allocations. This:
// If A, B & C are pointers to allocations in device memory
// compute C = A*B and copy result to A
multiply(C, A, B);
cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
// now A = A*B
can be replaced by
multiply(C, A, B);
float * tmp = A; A = C; C = tmp;
ie. you only need to exchange pointers on the host to perform the equivalent of a device to device memory copy, but with no GPU time cost. This can't be used in every situation (for example, there are some in-place block operations which might still require an explicit memory transfer), but in most cases an explicit device to device memory transfer can be avoided.
If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.
You could pre alloc a buffer matrix, and copy the input matrix A to the buffer before the mat-mul operation.
Memcopy(buff, A);
Multiply(A, buffer, B);
By reusing the buffer, you don't need to allocate the buffer every time, and the overhead will be only one mem copy for each mat-mul. When your matrix is large enough, the time cost of the overhead will take very small portion and can be ignored.

CUDA allocation alignment is 256 bytes - seriously?

In "CUDA C Programming Guide 5.0", p73 (also here) says "Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes". I do not know the exact meaning of this sentence. Could anyone show an example for me? Many thanks.
A derivative question:
So, what about allocating an one-dimensional array of basic elements (like int) or self-defined ones? The starting address of the array will be multiples of 256B, while the address of each element in the array is not necessarily multiples of 256B?
The pointers which are allocated by using any of the CUDA Runtime's device memory allocation functions e.g cudaMalloc or cudaMallocPitch are guaranteed to be 256 byte aligned, i.e. the address is a multiple of 256.
Consider the following example:
char *ptr1, *ptr2;
int bytes = 1;
cudaMalloc((void**)&ptr1,bytes);
cudaMalloc((void**)&ptr2,bytes);
Suppose the address returned in ptr1 is some multiple of 256, then the address returned in ptr2 will be atleast (ptr1 + 256).
This is a restriction imposed by the device on which the memory is being allocated. Mostly, pointers are aligned due to performance purposes. (Some NVIDIA guy should be able to tell if there is some other reason also).
Important:
Pointer alignment is not always 256. On my device (GTX460M), it is 512. You can get the device pointer alignment by the cudaDeviceProp::textureAlignment field.
Alignment of pointers is also a requirement for binding the pointer to textures.

How best to transfer a large number of arrays of chars to the GPU?

I am new to CUDA and am trying to do some processing of a large number of arrays. Each array is an array of about 1000 chars (not a string, just stored as chars) and there can be up to 1 million of them, so about 1 gb of data to be transfered. This data is already all loaded into memory and I have a pointer to each array, but I don't think I can rely on all the data being sequential in memory, so I can't just transfer it all with one call.
I currently made a first go at it with thrust, and based my solution kind of on this message ... I made a struct with a static call that allocates all the memory, and then each individual constructor copies that array, and I have a transform call which takes in the struct with the pointer to the device array.
My problem is that this is obviously extremely slow, since each array is copied individually. I'm wondering how to transfer this data faster.
In this question (the question is mostly unrelated, but I think the user is trying to do something similar) talonmies suggests that they try and use a zip iterator but I don't see how that would help transfer a large number of arrays.
I also just found out about cudaMemcpy2DToArray and cudaMemcpy2D while writing this question, so maybe those are the answer, but I don't see immediately how these would work, since neither seem to take pointers to pointers as input...
Any suggestions are welcome...
One way to do this is as marina.k suggested, batching your transfers only as you need them. Since you said each array only contains about 1000 chars, you could assign each char to a thread (since on Fermi we can allocate 1024 threads per block) and have each array handled by one block. In this case you may be able to transfer all the arrays for one "round" in one call - can you use a FORTRAN style, where you make one gigantic array and to get the 5th element of the "third" 1000 char array you would go:
third_array[5] = big_array[5 + 2*1000]
so that the first 1000 char array makes up the first 1000 elements of big_array, the second 1000 char array makes up the second 1000 elements of big_array, etc. ? In this case your chars would be continuous in memory and you could move the set you were going to process with one kernel launch in only one memcpy. Then as soon as you launch one kernel, you refill big_array on the CPU side and copy it asynchronously to the GPU.
Within each kernel, you could simply handle each array within 1 block, so that block N handles the (N-1)-thousandth element up to the N-thousandth of d_big_array (where you copied all those chars to).
Did you try pinned memory? This may provide a considerable speed-up on some hardware configurations.
Take try of async, you can assign the same job to different streams, each stream process a small part of date, make tranfer and computation at the same time
here is code:
cudaMemcpyAsync(
inputDevPtr + i * size, hostPtr + i * size, size, cudaMemcpyHostToDevice, stream[i]
);
MyKernel<<<100, 512, 0, stream[i]>>> (outputDevPtr + i * size, inputDevPtr + i * size, size);
cudaMemcpyAsync(
hostPtr + i * size, outputDevPtr + i * size, size, cudaMemcpyDeviceToHost, stream[i]
);