I'm trying to write a custom XAudio2 effect that involves a Fourier transform. However, the number of samples passed to the process method on each call is not a power of 2, which is a precondition of the FFT implementation I have.
Is there a way to force the sample count to be a power of 2? Or is there a technique for working with sizes that are not a power of 2?
Don't send samples to the FFT on every call. Accumulate them in an intermediate buffer until you have at least a power-of-2 number of samples, then process that many samples from the buffer and keep the remainder for the next round. Rinse and repeat.
Also, newer FFTs will often allow sizes with prime factors larger than 2.
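The buffering idea above can be sketched roughly as follows. This is a minimal Python illustration, not XAudio2 code: the class name BlockAccumulator, its push method, and the block size of 8 are all hypothetical, and a real effect would hand each returned block to its FFT.

```python
class BlockAccumulator:
    """Collects variable-sized chunks and hands out fixed-size blocks.

    Hypothetical helper for illustration; not part of any audio API.
    """

    def __init__(self, block_size):
        # The FFT in question requires a power-of-2 length.
        if block_size <= 0 or block_size & (block_size - 1):
            raise ValueError("block_size must be a power of 2")
        self.block_size = block_size
        self.pending = []  # samples received but not yet processed

    def push(self, samples):
        """Append a chunk and return every complete block now available."""
        self.pending.extend(samples)
        blocks = []
        while len(self.pending) >= self.block_size:
            blocks.append(self.pending[:self.block_size])
            del self.pending[:self.block_size]
        return blocks


acc = BlockAccumulator(8)
first = acc.push(range(5))       # only 5 samples buffered: no block yet
second = acc.push(range(5, 12))  # 12 samples buffered: one block of 8, 4 kept
```

Note that this approach adds latency: a sample may wait for up to one full block before it is processed, so the effect has to be designed to output previously processed samples in the meantime.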
If your implementation requires a power-of-2 sample count, you can pad the input up to the next power of 2. Zero padding is the easiest and most straightforward option.
Here is an article that explains another way to do it:
The Chirp z-Transform Algorithm and Its Application
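For the zero-padding option mentioned above, a minimal sketch could look like this. The function names next_power_of_two and zero_pad are made up for illustration. Keep in mind that padding changes the transform length, so the resulting spectrum is an interpolated version of the unpadded one rather than an identical result.

```python
def next_power_of_two(n):
    """Smallest power of 2 that is >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def zero_pad(samples):
    """Return a copy of samples, extended with zeros to a power-of-2 length."""
    padded = list(samples)
    target = next_power_of_two(len(padded))
    padded.extend([0.0] * (target - len(padded)))
    return padded


padded = zero_pad([1.0, 2.0, 3.0])  # length 3 is padded to length 4
```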
I have been reading the Deep Learning book by Ian Goodfellow and it mentions in Section 6.5.7 that
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
I understand that backprop stores the gradients in a similar fashion to dynamic programming so as not to recompute them. But I am confused as to why it stores the input as well.
Backpropagation is a special case of reverse mode automatic differentiation (AD).
In contrast to the forward mode, the reverse mode has the major advantage that you can compute the derivative of an output w.r.t. all inputs of a computation in one pass.
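However, the downside is that you need to store all intermediate results of the algorithm you want to differentiate in a suitable data structure (like a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you're basically "working your way backwards" through the algorithm. For a neural network, these intermediate results include the input to each nonlinearity, because the derivative of the activation function is evaluated at that input during the backward pass.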
Forward mode AD does not have this disadvantage, but you need to repeat its calculation for every input, so it only makes sense if your algorithm has a lot more output variables than input variables.
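To make the storage requirement concrete, here is a small hand-written sketch of a forward and backward pass through one hidden unit with a ReLU. The network shape, the function names, and the cache dictionary are all assumptions chosen for illustration; real frameworks record these values automatically.

```python
def relu(z):
    return z if z > 0.0 else 0.0


def relu_grad(z):
    # Derivative of ReLU, evaluated at the INPUT of the nonlinearity.
    return 1.0 if z > 0.0 else 0.0


def forward(x, w1, w2):
    z = w1 * x   # input to the nonlinearity
    h = relu(z)  # hidden activation
    y = w2 * h   # output
    # Everything the backward pass will need is kept here.
    cache = {"x": x, "z": z, "h": h}
    return y, cache


def backward(cache, w2, dy=1.0):
    dw2 = dy * cache["h"]
    dh = dy * w2
    dz = dh * relu_grad(cache["z"])  # cannot be computed without the stored z
    dw1 = dz * cache["x"]
    return dw1, dw2


y, cache = forward(x=2.0, w1=3.0, w2=0.5)
dw1, dw2 = backward(cache, w2=0.5)
```

Without the stored z, the backward pass would have to rerun the forward computation to find out where the ReLU was active.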
I'm working with some large data using the cublas library for matrix multiplication. To save memory space, I want something like A=A*B where A and B are both n-by-n square matrices, i.e. using the same memory space for the output and one of the input matrices.
While some old posts say this is not allowed in the cublas library, I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly, the calculation is totally correct and is stable across repeated runs. So I'm wondering whether overlapping input and output is supported by the current cublas library. If yes, how much memory does it actually save? Intuitively, the function needs at least some extra memory to store intermediate results, since A_ij = sum_k A_ik * B_kj depends on a whole row of A. Is this particularly memory-saving for batched GEMMs?
While some old posts say this is not allowed in the cublas library,
And they are completely correct (noting that the "old posts" were referring to the standard GEMM calls, not the batched implementations you are asking about).
I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly, the calculation is totally correct and is stable across repeated runs
This isn't documented as being safe and I suspect you are probably only getting stable results by luck, given that small matrices are probably preloaded into shared memory or registers and so an in-place operation works. If you went to larger matrices, I guess you would see failures, because eventually there would be a case where a single GEMM could not be performed without multiple trips to the source matrix after a write cycle, which would corrupt the source matrix.
I would not recommend in-place operations even if you find it works for one case. Different problem sizes, library versions, and hardware could produce failures which you simply haven't tested. The choice and associated risk is up to you.
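The data dependency behind this warning can be illustrated with a plain sequential sketch. It says nothing about how cuBLAS actually schedules its reads and writes; the function names are hypothetical and the matrices are ordinary Python lists.

```python
def matmul(a, b):
    """Reference product written to a separate output matrix."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_inplace_naive(a, b):
    """Overwrites a[i][j] immediately. Incorrect: later entries of the
    same row still need the ORIGINAL values of row i."""
    n = len(a)
    for i in range(n):
        for j in range(n):
            a[i][j] = sum(a[i][k] * b[k][j] for k in range(n))


def matmul_inplace_row_buffer(a, b):
    """Computes a full output row into a temporary before writing it back."""
    n = len(a)
    for i in range(n):
        row = [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        a[i][:] = row


b = [[0, 1], [1, 0]]  # multiplying by this swaps the two columns
expected = matmul([[1, 2], [3, 4]], b)

naive = [[1, 2], [3, 4]]
matmul_inplace_naive(naive, b)

buffered = [[1, 2], [3, 4]]
matmul_inplace_row_buffer(buffered, b)
```

In this sequential setting a single row of scratch space is enough to make the in-place update correct. A parallel kernel gives no such ordering guarantee unless the library documents one, which is why a result that happens to be correct for one size is not evidence that aliasing is safe.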
It looks like my application is becoming (i)FFT-bound: it does a lot of 2D correlations on rectangles with an average size of about 500x200 (width and height are always even). The scenario is the usual one: do two FFTs (one per field), multiply the complex fields, then do one iFFT.
On the CPU (Intel Q6600, with the JTransforms library), the FFT transforms take about 70% of the time according to the profiler; on the GPU (GTX670, cuFFT library), about 50% (so there is some performance gain with CUDA, but not as much as I want). I realize that the GPU may not be fully saturated (bandwidth limited), but on the other hand, doing the calculations in batches would significantly increase application complexity.
Questions:
What can I do to reduce the time spent on FFTs by at least several times?
Should I try the FFTW library (at the moment I am not sure it will give a significant gain compared to JTransforms)?
Is there any specialized hardware that can be plugged into a PC for FFT computations?
I'm answering your first question: what can I do to further decrease the time spent by cuFFT?
Quoting the CUFFT LIBRARY USER'S GUIDE
Restrict the size along all dimensions to be representable as 2^a*3^b*5^c*7^d. The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.
Restrict the size along each dimension to use fewer distinct prime factors. For example, a transform of size 3^n will usually be faster than one of size 2^i*3^j even if the latter is slightly smaller.
Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms. This further aids with memory coalescing.
Restrict the x dimension of single-precision transforms to be strictly a power of two either between 2 and 8192 for Fermi-class, Kepler-class, and more recent GPUs or between 2 and 2048 for earlier architectures. These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.
Use native compatibility mode for in-place complex-to-real or real-to-complex transforms. This scheme reduces the write/read of padding bytes hence helping with coalescing of the data.
Starting with version 3.1 of the CUFFT Library, the conjugate symmetry property of real-to-complex output data arrays and complex-to-real input data arrays is exploited when the power-of-two factorization term of the x dimension is at least a multiple of 4. Large 1D sizes (powers-of-two larger than 65,536), 2D, and 3D transforms benefit the most from the performance optimizations in the implementation of real-to-complex or complex-to-real transforms.
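As an illustration of the first guideline, the following sketch finds the smallest size at or above a given dimension whose only prime factors are 2, 3, 5, and 7. The function names are hypothetical, and whether padding to such a size actually pays off for a given transform is something to measure rather than assume.

```python
def is_7_smooth(n):
    """True if n can be written as 2^a * 3^b * 5^c * 7^d."""
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1


def next_7_smooth(n):
    """Smallest m >= n whose prime factors are all in {2, 3, 5, 7}."""
    if n < 1:
        raise ValueError("n must be at least 1")
    while not is_7_smooth(n):
        n += 1
    return n


width = next_7_smooth(500)   # 500 = 2^2 * 5^3, already suitable
height = next_7_smooth(200)  # 200 = 2^3 * 5^2, already suitable
padded = next_7_smooth(1009)  # 1009 is prime, so it has to be padded
```

The 500x200 sizes from the question already satisfy this particular guideline, so any remaining gains would have to come from the other recommendations.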
Other things you can do are (Quoting Robert Crovella's answer to running FFTW on GPU vs using CUFFT):
cuFFT routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans which is another way to execute multiple transforms "at once".
Please note that:
cuFFT may not be advantageous compared to an optimized sequential or multicore FFT if the dimensions of the transform are not large enough;
You can get a rough idea of the performance of cuFFT compared to Intel MKL from the CUDA Toolkit 4.0 Performance Report.
I need to compute the median of an array of size p inside a CUDA kernel (in my case, p is small e.g. p = 10). I am using an O(p^2) algorithm for its simplicity, but at the cost of time performance.
Is there a "function" to find the median efficiently that I can call inside a CUDA kernel?
I know I could implement a selection algorithm, but I'm looking for a function and/or tested code.
Thanks!
Here are a few hints:
Use a better selection algorithm: QuickSelect is a faster version of QuickSort for selecting the kth element in an array. For compile-time-constant mask sizes, sorting networks are even faster, thanks to high TLP and an O(log^2 n) critical path. If you only have 8-bit values, you can use a histogram-based approach. This paper describes an implementation that takes constant time per pixel, independent of mask size, which makes it very fast for very large mask sizes. You can parallelize it by using a minimal launch strategy (only run as many threads as you need to keep all SMs at max capacity), tiling the image, and letting threads of the same block cooperate on each kernel histogram.
Sort in registers. For small mask sizes, you can keep the entire array in registers, making median selection with a sorting network much faster. For larger mask sizes, you can use shared memory.
Copy all pixels used by the block to shared memory first, and then copy to thread-local buffers that are also in shared memory.
If you only have a few masks that need to go really fast (such as 3x3 and 5x5), use templates to make them compile time constants. This can speed things up a lot because the compiler can unroll loops and re-order a lot more instructions, possibly improving load batching and other goodies, leading to large speed-ups.
Make sure your reads are coalesced and aligned.
There are many other optimizations you can do. Make sure you read through the CUDA documents, especially the Programming Guide and the Best Practices Guide.
When you really want to gun for high performance, don't forget to take a good look at a CUDA profiler, such as the Visual Profiler.
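The sorting-network hint can be sketched as below for a small array such as p = 10. This uses an odd-even transposition network, chosen only because it is short; it has O(n) depth, not the O(log^2 n) depth of the networks mentioned above. The function names are hypothetical, and for an even length this sketch returns the mean of the two middle values.

```python
def compare_exchange(v, i, j):
    """Put the smaller of v[i], v[j] at i and the larger at j."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]


def median_small(values):
    """Median via a fixed compare-exchange sequence.

    The pairs compared depend only on len(values), never on the data,
    which is what makes this a sorting network.
    """
    v = list(values)
    n = len(v)
    if n == 0:
        raise ValueError("need at least one value")
    for rnd in range(n):
        # Even rounds compare (0,1), (2,3), ...; odd rounds (1,2), (3,4), ...
        for i in range(rnd % 2, n - 1, 2):
            compare_exchange(v, i, i + 1)
    mid = n // 2
    if n % 2:
        return v[mid]
    return 0.5 * (v[mid - 1] + v[mid])


m = median_small([9, 1, 8, 2, 7, 3, 6, 4, 5, 0])
```

Because the loop bounds are fixed for a given length, a compiler can fully unroll the equivalent kernel code and keep the array in registers when the length is a compile-time constant.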
Even in a single thread one can sort the array and pick the value in the middle in O(p*log(p)), which makes O(p^2) look excessive. If you have p threads at your disposal it's also possible to sort the array as fast as O(log(p)), although that may not be the fastest solution for small p. See the top answer here:
Which parallel sorting algorithm has the best average case performance?
I am dealing with a set of (largish 2k x 2k) images
I need to do per-pixel operations down a stack of a few sequential images.
Are there any opinions on using a single 2D large texture + calculating offsets vs using 3D arrays?
It seems that 3D arrays are a bit 'out of the mainstream' in the CUDA API: the allocation and transfer functions are very different from their 2D counterparts.
There doesn't seem to be any good documentation on the higher-level "how and why" of CUDA, as opposed to the specific calls.
There is the Best Practices Guide, but it doesn't address this.
I would recommend reading the book "CUDA by Example". It goes through all these things that aren't well documented elsewhere and explains the "how and why".
If you're rendering the result of the CUDA kernel, I think you should use OpenGL interop. That way, your code processes the image on the GPU and leaves the processed data there, which makes rendering much faster. There's a good example of this in the book.
If each CUDA thread needs to read only one pixel from the first frame and one pixel from the next frame, you don't need to use textures. Textures only benefit you if each thread is reading in a bunch of consecutive pixels. So you're best off using a 3D array.
Here is an example of using CUDA and 3D CUDA arrays:
https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st
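For comparison, the "single buffer plus calculated offsets" alternative from the question can be sketched like this. It is a sequential Python illustration with hypothetical names; in a CUDA kernel, each thread would typically handle one (x, y) position and loop over z.

```python
def flat_index(x, y, z, width, height):
    """Offset of pixel (x, y) in frame z of a tightly packed image stack."""
    return (z * height + y) * width + x


def mean_down_stack(stack, width, height, depth):
    """Per-pixel mean over all frames of a flattened stack."""
    out = [0.0] * (width * height)
    for y in range(height):
        for x in range(width):
            total = 0.0
            for z in range(depth):
                total += stack[flat_index(x, y, z, width, height)]
            out[y * width + x] = total / depth
    return out


# Three 2x2 frames stored back to back.
stack = [1, 2, 3, 4,
         3, 4, 5, 6,
         5, 6, 7, 8]
result = mean_down_stack(stack, width=2, height=2, depth=3)
```

With this layout, positions that differ only in x sit at consecutive addresses within a frame, so neighbouring threads read neighbouring memory locations for each z.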