GPGPU: an effective way to handle 'irregular' transform?

GPGPU: an effective way to handle 'irregular' transform? - cuda

On regular transform, every GPU-threads are expected to have same time complexity O. For example:
for i=0 to 10: c[i] = a[i]*b[i]
On irregular transform, it isn't:
for i=0 to len(arr)
for k=0 to random()%100
arr[i] += 1
which results an array like [2,50,32,77,1,5,66, ...] where each element indicates, roughly, a computational cost.
GPGPU programming is well suited to regular transforms like 'element-wise addition', 'matrix-multiplication', 'convolution', ...
But how about irregular transforms? How to 'well' distribute GPU-threads? How to design a 'good' kernel? Is there a common methodology?

If hardware is not Vega nor Volta (both can have nearly independent command execution per item), then your best bet is to re-group suspicious works together. For example, a mandelbrot image generator(different amounts of work per item) can be faster with 2D tiled generation since all items in same group can have more or less same amount of work neighbour workitems and more balanced than a 1-D (scanline) generation (which has more divergent result per group). Eirther you should re-order elements depending on the last iteration or use a spatial grouping.
On worst case, max cycles per compute unit(each having 8,64,128,192 cores) determines resulting performance which will be faster with more compute units. But all other workitems work will still be hidden behind those max cycles and be more efficient than a CPU.

Related

DM Script, why does the fourier transform of gaussian-kenel needs modulus

Recently I learn DM_Script for TEM image processing
I needed Gaussian blur process and I found one whose name is 'Gaussian Blur' in http://www.dmscripting.com/recent_updates.html
This code implements Gaussian blur algorithm by multiplying the fast fourier transform(FFT) of source image by the FFT of Gaussian-kernel image and finally doing inverse fourier transform of it.
Here is the part of the code,
// Carry out the convolution in Fourier space
compleximage fftkernelimg:=realFFT(kernelimg) (-> FFT of Gaussian-kernel image)
compleximage FFTSource:=realfft(warpimg) (-> FFT of source image)
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
The point I want to ask is this
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of Gaussian-kernel need '.modulus().sqrt()' for the convolution?
It is related to the fact that the fourier transform of a Gaussian function becomes another Gaussian function?
Or It is related to a sort of limitation of discrete fourier transform?
Please answer me
Thanks

This is related to the general precision limitation of any floating point numeric computing. (see f.e. here, or more in depth here)
A rotational (real-valued) Gaussian of stand.dev. sigma should be transformed into a 100% real-values rotational Gaussioan of 1/sigma. However, doing this numerically will show you deviations: Just try the following:
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
complexImage second := FFT(first)
second.Showimage()
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before back-transferring) you are introducing even more errors. By using modulus, one ensures that the forward transformed kernel is purely real and hence a better "damping" curve.
A better implementation of a FFT filtering code would actually create the FFT(Gaussian) directly with a std.dev of 1/sigma, as this is the analytically correct result. Doing a FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
In general: When implementing any "maths" into a program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute force numerical computation) and try to "reshape" equations when possible, f.e. avoid large sums over many small numbers, be careful about checks against exact numeric values, try to avoid expressions which are very sensitive on small numerica errors etc.

STFT Clarification (FFT for real-time input)

I get how the DFT via correlation works, and use that as a basis for understanding the results of the FFT. If I have a discrete signal that was sampled at 44.1kHz, then that means if I were to take 1s of data, I would have 44,100 samples. In order to run the FFT on that, I would have to have an array of 44,100 and a DFT with N=44,100 in order to get the resolution necessary to detect a frequencies up to 22kHz, right? (Because the FFT can only correlate the input with sinusoidal components up to a frequency of N/2)
That's obviously a lot of data points and calculation time, and I have read that this is where the Short-time FT (STFT) comes in. If I then take the first 1024 samples (~23ms) and run the FFT on that, then take an overlapping 1024 samples, I can get the continuous frequency domain of the signal every 23ms. Then how do I interpret the output? If the output of the FFT on static data is N/2 data points with fs/(N/2) bandwidth, what is the bandwidth of the STFT's frequency output?
Here's an example that I ran in Mathematica:
100Hz sine wave at 44.1kHz sample rate:
Then I run the FFT on only the first 1024 points:
The frequency of interest is then at data point 3, which should somehow correspond to 100Hz. I think 44100/1024 = 43 is something like a scaling factor, which means that a signal with 1Hz in this little window will then correspond to a signal of 43Hz in the full data array. However, this would give me an output of 43Hz*3 = 129Hz. Is my logic correct but not my implementation?

As I have already stated in my earlier comments, the variable N affects the resolution achievable by the output frequency spectrum and not the range of frequencies you can detect.A larger N gives you a higher resolution at the expense of higher computation time and a lower N gives you lower computation time but can cause spectral leakage, which is the effect you have seen in your last figure.
As for your other question, well, theoretically the bandwidth of an FFT is infinite but we band-limit our result to the band of frequencies in the range [-fs/2 to fs/2] because all frequencies outside that band are susceptible to aliasing and are therefore of no use.Furthermore, if the input signal is real (which is true in most cases including ours) then the frequencies from [-fs/2 to 0] are just a reflection of the frequencies from [0 to fs/2] and so some FFT procedures just output the FFT spectrum from [0 to fs/2], which I think applies to your case.This means that the N/2 data points that you received as output represent the frequencies in the range [0 to fs/2] so that is the bandwidth you are working with in the case of the FFT and also in the case of the STFT (the STFT is just a series of FFT's, each FFT in a STFT will give you a spectrum with data points in this band).
I would also like to point out that the STFT will most likely not reduce your computation time if your input is a varying signal such as music because in that case you will need to take perform it several times over the duration of the song for it to be of any use, it will however enable you to understand the frequency characteristics of your song much better that you would do if you just performed one FFT.
To visualise the results of an FFT you use frequency (and/or phase) spectrum plots but in order to visualise the results of an STFT you will most probably need to create a spectrogram which is basically a graph can is made by just basically putting the individual FFT spectrums side by side.The process of creating a spectrogram can be seen in the figure below (Source: Dan Ellis - Introduction to Speech Processing).The spectrogram will show you how your signal's frequency characteristics change over time and how you interpret it will depend on what specific features you are looking to extract/detect from the audio.You might want to look at the spectrogram wikipedia page for more information.

CURAND properties of generators

CURAND comes with an array of random number generators, but I have failed to find any comparison of the performance (and randomness) properties of each of them; mostly, I'd be interested in which generator to use for which application to gain maximum performance. I'd be happy if someone could quickly outline the differences between them or link me a resource that does so.
Thanks in advance.

This picture shows the performance for different RNGs.
For randomness, it should be only related to the RNG type/algorithm. So you can refer to Intel MKL doc. There's detail info and research papers in it. The type names in both CURAND and MKL are very similar.
http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-3D7D2650-A414-4C95-AF33-BE291BAB2AC3.htm

First difference is efficiency. XORWOW is default generator, but isn't always most efficient. For instance, Philox is faster for generating normally distributed floats.
Another difference is, that in practice You can generate more than one float with each call with some generators.
For example, with Philox You can generate even 4 floats normally or uniformly distributed with each call, while with XORWOW you can generate max two floats normally or uniformly distributed.
__device__ float4
curand_normal4 (curandStatePhilox4_32_10_t *state)
Next difference is period of pseudorandom sequence (Total state space of the PRNG before
you start to see repeats). Xorwow has period about 2^190 (with the state set up after 2^67 for the same seed)*. For Philox, subsequence and offset together define offset in a sequence with period 2^128.
Note that if You run millions of threads with the same seed You could run out of state space per thread and start seeing repeats. ((2^190) / (10^6)) / (2^67) = 1.0633824 × 10^31
One more difference is size of the states. For Xorwow sizeof(curandState_t) is 48 bytes and sizeof(curandStatePhilox4_32_10_t) is 64 bytes.
When You run millions of threads (each thread has its own curand state) you can run out of device memory. 1024^2*64 ~= 64 megabytes per million threads.
XORWOW, Philox, MRR32k3a, MTGP32 are Pseudo-random generators while both Sobols are Quasi-ranom generators.
*When calling curand_init with a seed, it scrambles that seed and then skips ahead 2^67 numbers (this is kind of expensive but has some nice properties)
sources:
https://developer.nvidia.com/cuRAND
http://cs.brown.edu/courses/cs195v/lecture/week11.pdf

How to convert a sparse histogram into dense histogram in CUDA?

I am implementing an algorithm using raw CUDA kernels, in which every threadblock needs the dense histogram of available data to that threadblock, now the question is that do I have to calculate the dense histogram from the scratch? (is it worth calculating the dense histogram at all, provided that i already have the sparse histogram which is implemented using shared memory)
I have come up with this idea of converting, I will try to elaborate my idea with example (temp and hist both are in shared memory)
0,1,2,3,4,5,6... //array indexes
4,3,0,2,1,0,5... //contents of hist[]
0,0,2,0,0,5,0... //contents of temp[] if(hist[x]>0)temp[x]=x;
for_every_element //this is sequential part :(
if(temp[x]>0)
shift elements from index x to 256
4,3,2,1,0,5... //pass 1 of the for loop
4,3,2,1,5... //pass 2 of the for loop
//this goes on until all the 0s are compacted
Now I know above is sequential in nature, but the shifting can be done with constant time (and in parallel) because threads_per_block is already set to 256, so shifting is not the main issue, the main issue is how to improve this (or any other suggestion is welcomed).
Edit: i am thinking of another idea, that is as follows
Assuming threads_per_block=256 if i can count which of histogram bins are non-zeros (this operation is parallel because each thread is assigned to each bin, i can atomicadd the values generated by each thread) let's say that i can then start a new shared index variable sindex=0 and each time a thread wants to store the value into d_hist[] it can take the latest value from sindex and store it's values to d_hist[sindex]=hist[treadIdx.x] after that i can atomicAdd the sindex
Now there is only one problem, there is going to be a race condition to getting the value of sindex, so i may have to setup a flag which can be locked or unlocked when a thread is adding any value to d_hist (but i think there can be a deadlock situation here)
Will this technique work? and is there any other technique better than that?

Converting a sparse histogram to a dense histogram is a scatter operation. If the sparse histogram is composed of s_index[S_N] and s_hist[S_N], then first we create a dense histogram d_hist[N] composed of all zeroes (you can do this from host code, perhaps). Then we populate the dense histogram with d_hist[s_index[i]] = s_hist[i]; This can be done in parallel and uses as many threads as there are valid indices in your sparse histogram (i < S_N). Assuming your histogram is sorted, you'll get whatever coalescing benefit may be possible based on the distribution of your sparse histogram indices.
It may not make sense for your case where each threadblock is doing a separate histogram, but you may also be interested in thrust scatter.

Well I guess the simplest method is to find out which bins>0 and after that, and exclusive scan can be done (in order to calculate the target indexes let's say sum_array[]) after that for allbins>0 move to d_hist[sum_array[threadIdx.x]-1]=s_hist[threadIdx.x]
0,1,2,3,4,5,6... //s_indexes[]
4,3,0,2,1,0,5... //contents of s_hist[]
1,1,0,1,1,0,1... //all bins which are > 0 = sum_array[]
1,2,2,3,4,4,5... //inclusive_scan of summ_array[]
//after the moving part
0,1,3,4,6... //s_indexes[]
4,3,2,1,5... //d_hist[]
0,1,2,3,4... //d_indexes[]
The reason why I am inclined to use this pattern is because it takes log_base_2(256) time in order to calculate the sum_array plus, other than that, moving and checking parts are just constant time operations, if anyone have different idea than this, please share.

What is the proper method of constraining a pseudo-random number to a smaller range?

What is the best way to constrain the values of a PRNG to a smaller range? If you use modulus and the old max number is not evenly divisible by the new max number you bias toward the 0 through (old_max - new_max - 1). I assume the best way would be something like this (this is floating point, not integer math)
random_num = PRNG() / max_orginal_range * max_smaller_range
But something in my gut makes me question that method (maybe floating point implementation and representation differences?).
The random number generator will produce consistent results across hardware and software platforms, and the constraint needs to as well.
I was right to doubt the pseudocode above (but not for the reasons I was thinking). MichaelGG's answer got me thinking about the problem in a different way. I can model it using smaller numbers and test every outcome. So, let's assume we have a PRNG that produces a random number between 0 and 31 and you want the smaller range to be 0 to 9. If you use modulus you bias toward 0, 1, 2, and 3. If you use the pseudocode above you bias toward 0, 2, 5, and 7. I don't think there can be a good way to map one set into the other. The best that I have come up with so far is to regenerate the random numbers that are greater than old_max/new_max, but that has deep problems as well (reducing the period, time to generate new numbers until one is in the right range, etc.).
I think I may have naively approached this problem. It may be time to start some serious research into the literature (someone has to have tackled this before).

I know this might not be a particularly helpful answer, but I think the best way would be to conceive of a few different methods, then trying them out a few million times, and check the result sets.
When in doubt, try it yourself.
EDIT
It should be noted that many languages (like C#) have built in limiting in their functions
int maximumvalue = 20;
Random rand = new Random();
rand.Next(maximumvalue);
And whenever possible, you should use those rather than any code you would write yourself. Don't Reinvent The Wheel.

This problem is akin to rolling a k-sided die given only a p-sided die, without wasting randomness.
In this sense, by Lemma 3 in "Simulating a dice with a dice" by B. Kloeckner, this waste is inevitable unless "every prime number dividing k also divides p". Thus, for example, if p is a power of 2 (and any block of random bits is the same as rolling a die with a power of 2 number of faces) and k has prime factors other than 2, the best you can do is get arbitrarily close to no waste of randomness, such as by batching multiple rolls of the p-sided die until p^n is "close enough" to a power of k.
Let me also go over some of your concerns about regenerating random numbers:
"Reducing the period": Besides batching of bits, this concern can be dealt with in several ways:
Use a PRNG with a bigger "period" (maximum cycle length).
Add a Bays–Durham shuffle to the PRNG's implementation.
Use a "true" random number generator; this is not trivial.
Employ randomness extraction, which is discussed in Devroye and Gravel 2015-2020 and in my Note on Randomness Extraction. However, randomness extraction is pretty involved.
Ignore the problem, especially if it isn't a security application or serious simulation.
"Time to generate new numbers until one is in the right range": If you want unbiased random numbers, then any algorithm that does so will generally have to run forever in the worst case. Again, by Lemma 3, the algorithm will run forever in the worst case unless "every prime number dividing k also divides p", which is not the case if, say, k is 10 and p is 32.
See also the question: How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?, especially my answer there.

If PRNG() is generating uniformly distributed random numbers then the above looks good. In fact (if you want to scale the mean etc.) the above should be fine for all purposes. I guess you need to ask what the error associated with the original PRNG() is, and whether further manipulating will add to that substantially.
If in doubt, generate an appropriately sized sample set, and look at the results in Excel or similar (to check your mean / std.dev etc. for what you'd expect)

If you have access to a PRNG function (say, random()) that'll generate numbers in the range 0 <= x < 1, can you not just do:
random_num = (int) (random() * max_range);
to give you numbers in the range 0 to max_range?

Here's how the CLR's Random class works when limited (as per Reflector):
long num = maxValue - minValue;
if (num <= 0x7fffffffL) {
return (((int) (this.Sample() * num)) + minValue);
}
return (((int) ((long) (this.GetSampleForLargeRange() * num))) + minValue);
Even if you're given a positive int, it's not hard to get it to a double. Just multiply the random int by (1/maxint). Going from a 32-bit int to a double should provide adequate precision. (I haven't actually tested a PRNG like this, so I might be missing something with floats.)

Psuedo random number generators are essentially producing a random series of 1s and 0s, which when appended to each other, are an infinitely large number in base two. each time you consume a bit from you're prng, you are dividing that number by two and keeping the modulus. You can do this forever without wasting a single bit.
If you need a number in the range [0, N), then you need the same, but instead of base two, you need base N. It's basically trivial to convert the bases. Consume the number of bits you need, return the remainder of those bits back to your prng to be used next time a number is needed.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008