How to denormalize values to do the Harmonic Product Spectrum - fft

I'm trying to run the HPS algorithm and the results are not right (48000 Hz, 16 bits).
I split the buffer containing the recorded signal into several chunks, applied a Hanning window to each, and then computed the FFT.
Each FFT shows a peak that corresponds to the frequency I am playing, or an octave of it. But when I do the HPS, the computed fundamental frequency comes out as 0, because the values in the array where I do the sum (multiply) are too small, even smaller than the peak in the original FFT.
This is the code of the HPS:
int i_max_h = 0;
double m_max_h = miniBuffer[0];
//m_max is the value of the peak in the original time domain array
m_max_h = m_max;
//array for the sum
double sum [] = new double[miniBuffer.length];
int fund_freq = 0;
//It could be divided by 3, but I'm not going over 500 Hz, so it should work
for(int k = 0; k < 24000/48 ; k++)
{
//HPS down sampling and multiply
sum[k] = miniBuffer[k] * miniBuffer[2*k] * miniBuffer[3*k];
// find fundamental frequency (maximum value in plot)
if( sum[k] > m_max_h && k > 0 )
{
m_max_h = sum[k];
i_max_h = k;
}
}
//This should get the fundamental freq. from sum
fund_freq = (i_max_h * Fs / 24000);
System.out.print("Fundamental Freq.: ");
System.out.println(fund_freq);
System.out.println("");
The original HPS code is HERE
I don't understand why the sum array contains such small values when it should be larger than the original spectrum, including its peak. I applied a RealForward FFT, so maybe the problem is that the output is normalized to the -1 to 1 range, which makes the products shrink when I multiply.
Any idea how to fix this so the HPS works?
How could I undo the normalization?

The problem was that I was expecting larger amplitudes in the sum array (the HPS array), but my values are normalized because of the FFT I applied to them.
This is the solution I came up with: multiplying the individual spectrum values by 10 before taking the product.
The number 10 is a coefficient I chose myself; it may be wrong for some high-frequency cases, where a larger coefficient could be needed.
for(int k = 0; k < 24000/48 ; k++)
{
sum[k] = ((miniBuffer[k]*10) * (miniBuffer[2*k]*10) * (miniBuffer[3*k]*10));
// find fundamental frequency (maximum value in plot)
if( sum[k] > m_max_h && k > 0 )
{
m_max_h = sum[k];
i_max_h = k;
}
}
The frequency range is 24000/48 = 500, so it covers 0 to 499 Hz, which is more than I need for a bass.
If the chunk of the full array is shorter than 24000 samples, I should decrease the divisor 48. That is fine, because the downsampled arrays only reach 24000/3 and 24000/2, so the divisor could go as low as 3 and it should still work.
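A note for anyone who hits the same shrinking-product problem: an alternative to hand-picking a scaling coefficient is the log-domain HPS, i.e. summing log magnitudes instead of multiplying them. The peak ends up at the same bin (log is monotonic), but the values cannot underflow towards zero. A minimal sketch (in C++ for brevity; spectrum is assumed to hold the FFT magnitudes, fftSize the FFT length and fs the sample rate):

#include <cmath>
#include <cstddef>
#include <vector>

// Log-domain Harmonic Product Spectrum: sum of logs instead of a raw product,
// so normalized magnitudes (|X[k]| <= 1) cannot shrink towards zero.
double fundamentalFrequency(const std::vector<double>& spectrum, std::size_t fftSize, double fs)
{
    const int harmonics = 3;                          // same as multiplying the k, 2k, 3k bins
    std::size_t limit = spectrum.size() / harmonics;
    std::size_t bestBin = 0;
    double bestScore = -1e300;

    for (std::size_t k = 1; k < limit; ++k) {
        double score = 0.0;
        for (int h = 1; h <= harmonics; ++h)
            score += std::log(spectrum[k * h] + 1e-12); // epsilon avoids log(0)
        if (score > bestScore) {
            bestScore = score;
            bestBin = k;
        }
    }
    return bestBin * fs / static_cast<double>(fftSize); // bin index -> Hz
}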


find nearest non-zero element in another vector in CUDA

There are two M x N matrices, A and B (the actual size is 512 x 4096).
In each row of A, the points to be processed are set to 1.
Each row of B contains values obtained through a specific operation.
For each row, I want to find the value in B whose position is closest to that of each 1-entry in A.
An example is shown in the figure below, and my MATLAB code is included as well.
Here's how I thought of doing it:
Pick the indices of the non-zero elements of A with Thrust. Then, for each of those elements, fetch the closest value from the corresponding row of B with a for-loop.
(If there are several non-zero elements in A, I expect this to be slow.)
I want to make good use of the GPU for this operation. Do you have any more efficient ideas?
[idxY,idxX] = find(A == 1);
for Point = 1:length(idxY)
    pointBuf = find(B(:,idxY(Point)) == 1); % find non-zero elements in the row of B
    if ~isempty(pointBuf) % there are non-zero elements in the row of B
        [MinValue, MinIndex] = min(abs(pointBuf - idxY(Point)));
        C(idxY(Point),idxX(Point)) = B(pointBuf(MinIndex(1)),RangeInd(Point)); % get the closest point in B
    else
        C(DopInd(Point),RangeInd(Point)) = 0; % if there are no non-zero elements in the row of B, just set to 0
    end
end
Just for reference, here is a solution that shifts indices left and right 4095 times. It resembles bubble sort variants that bubble up and down at the same time.
The advantage is that it does not depend on the positions of the non-null elements in B and can easily be parallelized between threads.
But the inner loop, which translates to just 2 SASS instructions, is still called far too often: the program takes 26 ms on my notebook, and it would take that long in both the best and the absolute worst case of the input matrices.
Parts and methods of it can probably be reused, as it demonstrates some CUDA programming techniques.
So this is more or less for reference; in the end it is not a final (fast enough) solution:
__global__ void calcmatrix(bool* A, double* B, double* C)
{
// calculate row number
int row = blockIdx.x * blockDim.y + threadIdx.y; // 8 rows per block (launched with blockDim = (32, 8))
if (row >= 512)
return;
// store index of valid double from B, this is moved up and down
// those indices are for the current thread. Each thread is responsible for 128 consecutive columns.
int indices[128];
// prefill the indices with their own number (as if every double from B is valid)
#pragma unroll
for (int i = 0; i < 128; i++)
indices[i] = threadIdx.x * 128 + i;
// Store zero flags (4 * 32 bits) for our 128 elements
unsigned int myzeroflags[4];
// For efficiently loading data from memory, we distribute the data in another way: thread 0 gets columns 0, 32, 64, 96, ...; thread 1 gets columns 1, 33, 65, 97, ...; thread 2 gets columns 2, 34, 66, 98, ...; and so on
#pragma unroll
for (int i = 0; i < 128; i++) {
// load value from B
double in = B[row * 4096 + i * 32 + threadIdx.x];
// compare to zero (!in) and combine all bool results from the 32 threads (__ballot_sync)
unsigned int zeroflag = __ballot_sync(0xFFFFFFFF, !in);
// store the ones, which belong to us
if (threadIdx.x == i / 4)
myzeroflags[i & 3] = zeroflag;
}
// go through our zero flags and set those indices to -1 (there is already a valid index "0", so we use a negative number to signify invalid)
#pragma unroll
for (int i = 0; i < 4; i++)
#pragma unroll
for (int j = 0; j < 32; j++)
if (myzeroflags[i] & (1 << j))
indices[i * 32 + j] = -1;
// main loop, do 4095 times
#pragma unroll 1
for (int i = 0; i < 4095; i++) {
// move all elements to the left (if the index there is invalid)
// send index over thread boundaries
int fromright = __shfl_down_sync(0xFFFFFFFF, indices[0], 1, 32);
#pragma unroll
// if left index is -1, set it to one index to the right
for (int j = 0; j < 127; j++)
if (indices[j] == -1)
indices[j] = indices[j + 1];
// move over thread boundaries (except for the rightmost thread)
if (threadIdx.x != 31 && indices[127] == -1)
indices[127] = fromright;
// move to the right in the same way as to the left
int fromleft = __shfl_up_sync(0xFFFFFFFF, indices[127], 1, 32);
#pragma unroll
for (int j = 127; j > 0; j--)
if (indices[j] == -1)
indices[j] = indices[j - 1];
if (threadIdx.x != 0 && indices[0] == -1)
indices[0] = fromleft;
}
// for the other distribution of elements for memory accesses, we have to redistribute the indices to the correct threads
// To not have bank conflicts, we define the shared memory array with 33 instead of 32 elements in the last dimension, but use only 32. With this method we can put threadIdx.x into the last and previous to last dimension without bank conflicts
__shared__ short2 distribidx[8][32][33];
int indices2[128];
// Redistribute first half; the index can go from 0..4095 (and also theoreticially -1, if there was no non-null element in this row). This fits into a short, convert for faster transfer
#pragma unroll
for (int i = 0; i < 32; i++)
distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i]), static_cast<short>(indices[i + 32]) };
__syncwarp();
#pragma unroll
for (int i = 0; i < 32; i++) {
short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
indices2[4 * i + 0] = idxback.x;
indices2[4 * i + 1] = idxback.y;
}
__syncwarp();
// Redistribute second half
#pragma unroll
for (int i = 0; i < 32; i++)
distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i + 64]), static_cast<short>(indices[i + 96]) };
__syncwarp();
#pragma unroll
for (int i = 0; i < 32; i++) {
short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
indices2[4 * i + 2] = idxback.x;
indices2[4 * i + 3] = idxback.y;
}
// Do final calculation
#pragma unroll
for (int i = 0; i < 128; i++) {
// Default value is zero
double result = 0;
// Read only, if A is true and indices2 is valid
if (A[row * 4096 + i * 32 + threadIdx.x] && indices2[i] != -1)
// Read B with calculated index (this read is not optimized/coalesced, because the indices can be wild, but hopefully was or can be cached)
result = B[row * 4096 + indices2[i]];
// Store result in C
C[row * 4096 + i * 32 + threadIdx.x] = result;
}
}
int main()
{
bool* A;
double* B;
double* C;
cudaMalloc(&A, 2 * 512 * 4096);
cudaMalloc(&B, 8 * 512 * 4096);
cudaMalloc(&C, 8 * 512 * 4096);
// called in this fashion
calcmatrix<<<(512 + 7) / 8, dim3(32, 8)>>>(A, B, C);
return 0;
}
This problem is really far from simple to implement efficiently on a GPU. The main reason is that GPUs are designed to execute SIMD-friendly algorithms efficiently, while this problem can hardly be solved in a SIMD-friendly way.
The naive solution you propose will be very inefficient because of the many small kernels it launches (starting a kernel is expensive, and Thrust tends to run them synchronously by default AFAIK), not to mention that the amount of parallelism in each kernel would be far too small for any modern GPU. I expect this solution to be slower than a naive CPU implementation.
First things first, one needs to find an efficient algorithm. The proposed solution runs in O(n m²), where n is the number of rows and m the number of columns. That being said, it should be fast (i.e. close to O(n m)) if most values are non-zero, which is not the case in the example.
A more efficient solution is to first iterate over the B matrix, find the locations of all the non-zero items and put them in an array L. Then you can iterate over A, track the non-zero values and search L for the index closest to the location of the current item of A. If the number of items in L is large for the target row (e.g. >50), you can use a binary search to find that location faster (since the items of L are sorted). This solution runs in O(n m log m) time.
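As an illustration of that binary-search step (my own sketch, not the answerer's code), assuming L holds the sorted column indices of the non-zero entries of the current row of B and is non-empty:

#include <algorithm>
#include <vector>

// Return the column index in L (sorted, non-empty) closest to column j.
int closestIndex(const std::vector<int>& L, int j)
{
    auto it = std::lower_bound(L.begin(), L.end(), j);    // first element >= j
    if (it == L.end())   return *(it - 1);                // everything is smaller than j
    if (it == L.begin()) return *it;                      // everything is >= j
    return (j - *(it - 1) <= *it - j) ? *(it - 1) : *it;  // pick the nearer neighbour
}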
An even better solution is to iterate over A and L simultaneously, like in a merge algorithm. Indeed, the indices of the non-zero items of A and the items of L are both sorted, so the binary search is not even needed. When the index of the current non-zero item of A is bigger than the current item of L, you can advance to the next value of L (and remember the last discarded value of L, which is needed to compute the closest value). This algorithm runs in O(n m), which is optimal. An efficient CPU implementation consists in processing chunks of rows, each in its own thread.
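Here is a minimal CPU sketch in C++ of that merge-style scan (again my own illustration, assuming row-major storage and that "closest" means the nearest position in the row, as in the question):

#include <cstddef>
#include <cstdlib>
#include <vector>

// For every row: for each column j with A[j] == 1, copy the value of the
// nearest non-zero entry of B (by column index) in the same row into C.
// O(n*m): the 1-positions of A and the non-zero positions of B are both
// visited in increasing column order, like a merge.
void nearestNonZero(const std::vector<char>& A, const std::vector<double>& B,
                    std::vector<double>& C, int rows, int cols)
{
    std::vector<int> L; // columns of the non-zero entries of the current row of B
    for (int r = 0; r < rows; ++r) {
        const char*   a = &A[static_cast<std::size_t>(r) * cols];
        const double* b = &B[static_cast<std::size_t>(r) * cols];
        double*       c = &C[static_cast<std::size_t>(r) * cols];

        L.clear();
        for (int j = 0; j < cols; ++j)
            if (b[j] != 0.0) L.push_back(j); // already sorted by construction

        std::size_t p = 0; // merge pointer into L, only ever moves forward
        for (int j = 0; j < cols; ++j) {
            if (!a[j] || L.empty()) { c[j] = 0.0; continue; }
            while (p + 1 < L.size() && std::abs(L[p + 1] - j) <= std::abs(L[p] - j))
                ++p; // advance while the next candidate is at least as close
            c[j] = b[L[p]];
        }
    }
}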
On a GPU, things are more complex, since none of the previously described algorithms are SIMD-friendly. Computing a row in an SIMD-friendly way turns out to be complex and generally inefficient (the overhead can be higher than the serial algorithm on a CPU). One possible solution would be to compute rows in parallel (1 thread per row) and transpose the matrix block by block in shared memory so as to perform SIMD-friendly memory accesses after that (assuming there is enough space). The non-zero values of A and B certainly need to be extracted first so as to avoid thread divergence as much as possible. This solution works only if the number of non-zeros is relatively uniform across the rows (otherwise I doubt a GPU can actually be helpful). Note that the overhead of the transposition can be significant compared to the computation, so I am not sure it will be faster than a CPU-based solution. In fact, if the data lies in CPU memory, then just transferring it to the GPU will certainly be more expensive than computing the result on the CPU in parallel.

How to interpret FFT data for making a spectrum visualizer

I am trying to visualize a spectrum where the frequency range is divided into N bars, either linearly or logarithmic. The FFT seems to work fine, but I am not sure how to interpret the values in order to decide the max height for the visualization.
I am using FMODAudio, a wrapper for C#. It's set up correctly.
In the case of a linear spectrum, the bars are defined as following:
public int InitializeSpectrum(int windowSize = 1024, int maxBars = 16)
{
numSamplesPerBar_Linear.Clear();
int barSamples = (windowSize / 2) / maxBars;
for (int i = 0; i < maxBars; ++i)
{
numSamplesPerBar_Linear.Add(barSamples);
}
IsInitialized = true;
Data = new float[numSamplesPerBar_Linear.Count];
return numSamplesPerBar_Linear.Count;
}
Data is the array which holds the spectrum values received from the update loop.
The update looks like this:
public unsafe void UpdateSpectrum(ref ParameterFFT* fftData)
{
int length = fftData->Length / 2;
if (length > 0)
{
int indexFFT = 0;
for (int index = 0; index < numSamplesPerBar_Linear.Count; ++index)
{
for (int frec = 0; frec < numSamplesPerBar_Linear[index]; ++frec)
{
for (int channel = 0; channel < fftData->ChannelCount; ++channel)
{
var floatspectrum = fftData->GetSpectrum(channel); //this is a readonlyspan<float> by default.
Data[index] += floatspectrum[indexFFT];
}
++indexFFT;
}
Data[index] /= (float)(numSamplesPerBar_Linear[index] * fftData->ChannelCount); // average of both channels for more meaningful values.
}
}
}
The values I get when testing a song are very low across the bands.
A randomly chosen moment when playing a song gives these values:
16 bars = 0,0326 0,0031 0,001 0,0003 0,0004 0,0003 0,0001 0,0002 0,0001 0,0001 0,0001 0 0 0 0 0
I realize it's more useful to use a logarithmic spectrum in many cases, and I intend to, but I still need to figure out how to find the max values for each bar so that I can set up the visualization on a proper scale.
Q: How can I know the potential max values for each bar based on this setup (it's not 1.0)?
The output of the FFT call is an array where each element is a complex number (A + Bi), where A is the real component and B the imaginary component ... element zero of this array represents frequency zero, i.e. DC, which is the offset bias and can typically be ignored ... as you iterate across the elements of this array you increment the frequency ... this frequency increment is calculated using
Audio_samples <-- array of raw audio samples in PCM format which gets
fed into FFT call
num_fft_bins := float64(len(Audio_samples)) / 2.0 // using Nyquist theorem
freq_incr_per_bin := (input_audio_sample_rate / 2.0) / num_fft_bins
so to answer your question: the output array from the FFT call is a linear progression of evenly spaced frequency bins, spaced by the frequency increment above
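For example (with an assumed sample rate, since the question doesn't state one): a 1024-sample window at 44100 Hz gives 512 usable bins, each covering (44100 / 2) / 512 ≈ 43.07 Hz, so with 16 linear bars each bar spans 32 bins, i.e. roughly 1378 Hz.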
Depends on your input data to the FFT, and the scaling that your particular FFT implementation uses (not all FFTs use the same scale factor).
With an energy preserving forward-FFT, Parseval's theorem applies. So the energy (sum of squares) of the input vector equals the energy of the FFT result vector. Note that for a single integer periodic in aperture sinusoidal input (a pure tone), all that energy can appear in a single FFT result element. So if you know the maximum possible input energy, you can use that to compute the maximum possible result element magnitude for scaling purposes.
The range is often large enough that visualizers commonly need to use log scaling, or else typical input can get pixel quantized to a graph of all zeros.
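Since both answers point towards log scaling, here is a small sketch of mapping an averaged bar magnitude to a 0..1 bar height on a decibel scale (my own illustration, in C++ rather than C#; the -60 dB floor and the default unit maximum are assumptions to tune):

#include <algorithm>
#include <cmath>

// Map an averaged FFT magnitude (0..maxMagnitude) to a 0..1 bar height on a dB scale.
// floorDb is the quietest level still shown (e.g. -60 dB); anything below maps to 0.
double barHeight(double magnitude, double maxMagnitude = 1.0, double floorDb = -60.0)
{
    if (magnitude <= 0.0) return 0.0;
    double db = 20.0 * std::log10(magnitude / maxMagnitude); // <= 0 for magnitude <= max
    double height = 1.0 - db / floorDb;                      // floorDb -> 0, 0 dB -> 1
    return std::clamp(height, 0.0, 1.0);
}

Here maxMagnitude stands for whatever per-bar maximum you derive from the answers above; it is an assumption, not necessarily 1.0.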

Free-Pascal Implementation of the Sieve of Eratosthenes

My teacher gave me an assignment like this:
Using the number n given, find the largest prime number p with p<=n and n<=10^9.
I tried doing this by using the following function:
Const amax=1000000000
Var i,j,n:longint;
a:array [1..amax] of boolean;
Function lp(n:longint):longint;
Var max:longint;
Begin
For i:=1 to n do a[i]:=true;
For i:=2 to round(sqrt(n)) do
If (a[i]=true) then
For j:=1 to n div i do
If (i*i+(j-1)*i<=n) then
a[i*i+(j-1)*i]:=false;
max:=0;
i:=n;
While max=0 do
Begin
If a[i]=true then max:=i;
i:=i-1;
End;
lp:=max;
End;
This code worked flawlessly for numbers such as 1 million, but when I tried n = 10^9 the program took a very long time to print the output. So here's my question: are there any ways to make my code faster? Or maybe a different approach altogether?
The most important aspect here is that the greatest prime that is not greater than n must be fairly close to n. A quick look at The Gaps Between Primes (at The Prime Pages - always worth a look for everything to do with primes) shows that for 32-bit numbers the gaps between primes cannot be greater than 335. This means that the greatest prime not greater than n must be in the range [n - 335, n]. In other words, at most 336 candidates need to be checked - for example via trial division - and this is bound to be lots faster than sieving a billion numbers.
Trial division is a reasonable choice for tasks of this kind, because the range to be scanned is so small. In my answer to Prime sieve implementation (using trial division) in C++ I analysed a couple of ways for speeding it up.
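For illustration, a minimal sketch of that trial-division route (my own code, in C++ rather than Pascal or the C# used below): scan down from n and test each candidate against odd divisors up to its square root. The prime-gap bound above guarantees the scan stops within a few hundred candidates.

#include <cstdint>

// Return the largest prime p <= n (for n >= 2). Because prime gaps below 2^32
// are at most 335, the loop never has to scan down very far.
uint32_t greatestPrimeNotExceeding(uint32_t n)
{
    for (uint32_t candidate = n; candidate >= 2; --candidate) {
        if (candidate % 2 == 0) {
            if (candidate == 2) return 2;
            continue;
        }
        bool isPrime = true;
        for (uint32_t d = 3; d * d <= candidate; d += 2)
            if (candidate % d == 0) { isPrime = false; break; }
        if (isPrime) return candidate;
    }
    return 2; // unreachable for n >= 2
}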
The Sieve of Eratosthenes is also a good choice, it just needs to be modified to sieve only the range of interest instead of all numbers from 1 to n. This is called a 'windowed sieve' because it sieves only a window. Since the window will most likely not contain all the primes up to the square root of n (i.e. all the primes that could be potential least prime factors of composites in the range to be scanned) it is best to sieve the factor primes via a separate, simple Sieve of Eratosthenes.
First I'm showing a simple rendition of normal (non-windowed) sieve, as a baseline for comparing the windowed code to. I'm using C# in order to show the algorithm more clearly than would be possible with Pascal.
List<uint> small_primes_up_to (uint n)
{
if (n == uint.MaxValue)
throw new ArgumentOutOfRangeException("n", "n must be less than UINT32_MAX");
var eliminated = new bool[n + 1]; // +1 because indexed by numbers
eliminated[0] = true;
eliminated[1] = true;
for (uint i = 2, sqrt_n = (uint)Math.Sqrt(n); i <= sqrt_n; ++i)
if (!eliminated[i])
for (uint j = i * i; j <= n; j += i)
eliminated[j] = true;
return remaining_unmarked_numbers(eliminated, 0);
}
The function has 'small' in its name because it is not really suited for sieving big ranges; I use similar code (with a few bells and whistles) only for sieving the small factor primes needed by more advanced sieves.
The code for extracting the sieved primes is equally simple:
List<uint> remaining_unmarked_numbers (bool[] eliminated, uint sieve_base)
{
var result = new List<uint>();
for (uint i = 0, e = (uint)eliminated.Length; i < e; ++i)
if (!eliminated[i])
result.Add(sieve_base + i);
return result;
}
Now, the windowed version. One difference is that the potential least factor primes need to be sieved separately (by the function just shown) as explained earlier. Another difference is that the starting point of the crossing-off sequence for a given prime may lie outside the range to be sieved. If the starting point lies before the start of the window then a bit of modulo magic is necessary to find the first 'hop' that lands in the window. From then on everything proceeds as usual.
List<uint> primes_between (uint m, uint n)
{
m = Math.Max(m, 2);
if (m > n)
return new List<uint>(); // empty range -> no primes
// index overflow in the inner loop unless `(sieve_bits - 1) + stride <= UINT32_MAX`
if (n - m > uint.MaxValue - 65521) // highest prime not greater than sqrt(UINT32_MAX)
throw new ArgumentOutOfRangeException("n", "(n - m) must be <= UINT32_MAX - 65521");
uint sieve_bits = n - m + 1;
var eliminated = new bool[sieve_bits];
foreach (uint prime in small_primes_up_to((uint)Math.Sqrt(n)))
{
uint start = prime * prime, stride = prime;
if (start >= m)
start -= m;
else
start = (stride - 1) - (m - start - 1) % stride;
for (uint j = start; j < sieve_bits; j += stride)
eliminated[j] = true;
}
return remaining_unmarked_numbers(eliminated, m);
}
The two '-1' terms in the modulo calculation may seem strange, but they bias the logic down by 1 to eliminate the inconvenient case stride - foo % stride == stride that would need to be mapped to 0.
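For example (numbers chosen only for illustration): with m = 1000 and prime = 7, the natural start 49 lies before the window, and (7 - 1) - (1000 - 49 - 1) % 7 = 6 - 5 = 1, which is exactly the offset of 1001, the first multiple of 7 inside the window.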
With this, the greatest prime not exceeding n could be computed like this:
uint greatest_prime_not_exceeding (uint n)
{
return primes_between(n - Math.Min(n, 335), n).Last();
}
This takes less than a millisecond all told, including the sieving of the factor primes and so on, even though the code contains no optimisations whatsoever. A good overview of applicable optimisations can be found in my answer to prime number summing still slow after using sieve; with the techniques shown there the whole range up to 10^9 can be sieved in about half a second.

thrust equivalent of cilk::reducer_list_append

I have a list of n intervals or domains. I would like to subdivide each interval into k parts in parallel, producing a new (unordered) list. However, most of the subdivisions won't pass a certain criterion and shouldn't be added to the new list.
cilk::reducer_list_append extends the idea of parallel reduction to building a list with push_back. This way I can collect only the valid sub-intervals, in parallel.
What is the Thrust way of accomplishing this task? I suspect one way would be to form a large n×k list and then use a parallel filter with stream compaction. But I really hope there is a list-append reduction operation, because n×k can be very large indeed.
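For what it's worth, here is a minimal sketch of the expand-then-compact idea mentioned in the question (my own illustration; the Interval struct, the uniform subdivision in MakeSub and the width test in IsValid are all placeholder assumptions). The n×k candidate list is never materialized, because a transform iterator generates the subintervals on the fly and thrust::copy_if keeps only the valid ones:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

struct Interval { double lo, hi; };

// Build the j-th of k subintervals of the i-th input interval from a flat index.
struct MakeSub
{
    const Interval* in; int k;
    __host__ __device__ Interval operator()(int idx) const
    {
        int i = idx / k, j = idx % k;
        double w = (in[i].hi - in[i].lo) / k;
        return Interval{ in[i].lo + j * w, in[i].lo + (j + 1) * w };
    }
};

// Example predicate: keep only subintervals wider than some threshold.
struct IsValid
{
    __host__ __device__ bool operator()(const Interval& s) const
    {
        return (s.hi - s.lo) > 1e-6; // replace with the real acceptance test
    }
};

thrust::device_vector<Interval> subdivideAndFilter(const thrust::device_vector<Interval>& in, int k)
{
    int total = static_cast<int>(in.size()) * k;
    auto first = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                 MakeSub{ thrust::raw_pointer_cast(in.data()), k });
    thrust::device_vector<Interval> out(total);  // worst case: everything is valid
    auto end = thrust::copy_if(first, first + total, out.begin(), IsValid{});
    out.resize(end - out.begin());               // keep only the accepted subintervals
    return out;
}

The output buffer is still sized for the worst case; if that is too large, a first pass with thrust::count_if over the same iterator range could size it exactly.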
I am new to this forum, but maybe you'll find some of this useful.
If you are not fixed on Thrust, you can also have a look at ArrayFire.
I learned about it quite recently, and it's free for these sorts of problems.
For example, with ArrayFire you can evaluate the selection criterion for each interval in parallel using the gfor construct, i.e. consider:
// # of intervals n and # of subintervals k
const int n = 10, k = 5;
// this array represents the original intervals
array A = seq(n); // A = 0,1,2,...,n-1
// for each interval A[i], subI[i] counts # of subintervals
array subI = zeros(n);
gfor(array i, n) { // in parallel for all intervals
// here evaluate your predicate for interval's subdivision
array pred = A(i)*A(i) + 1234;
subI(i) = pred % (k + 1);
}
//array acc = accum(subI);
int n_total = sum<float>(subI); // compute the total # of intervals
// this array keeps intervals after subdivision
array B = zeros(n_total);
std::cout << "total # of subintervals: " << n_total << "\n";
print(A);
print(subI);
gfor(array i, n_total) {
// populate the array of new intervals
B(i) = ...
}
print(B);
Of course, it depends on how your intervals are represented and which criterion you use for subdivision.

1D multiple peak detection?

I am currently trying to implement basic speech recognition in AS3. I need this to be completely client side; as such, I can't access powerful server-side speech recognition tools. The idea I had was to detect the syllables in a word and use that to determine the word spoken. I am aware that this will greatly limit the recognition capabilities, but I only need to recognize a few key words and I can make sure they all have a different number of syllables.
I am currently able to generate a 1D array of voice level for a spoken word, and if I draw it I can clearly see that in most cases there are distinct peaks for the syllables. However, I am completely stuck as to how I would find those peaks. I only really need the count, but I suppose that comes with finding them. At first I thought of grabbing a few maximum values and comparing them with the average of the values, but I had forgotten about the one peak that is bigger than the others, and as a result all my "peaks" ended up on that single actual peak.
I stumbled onto some Matlab code that looks almost too short to be true, but I can't verify that, as I am unable to convert it to any language I know. I tried AS3 and C#. So I am wondering if you could start me on the right path or share some pseudo-code for peak detection?
The matlab code is pretty straightforward. I'll try to translate it to something more pseudocodeish.
It should be easy to translate to ActionScript/C#. You should try it yourself and post follow-up questions with your code if you get stuck; that way you'll get the best learning effect.
Param: delta (defines kind of a tolerance and depends on your data, try out different values)
min = Inf (or some very high value)
max = -Inf (or some very low value)
lookformax = 1
for every datapoint d [0..maxdata] in array arr do
this = arr[d]
if this > max
max = this
maxpos = d
endif
if this < min
min = this
minpos = d
endif
if lookformax == 1
if this < max-delta
there's a maximum at position maxpos
min = this
minpos = d
lookformax = 0
endif
else
if this > min+delta
there's a minimum at position minpos
max = this
maxpos = d
lookformax = 1
endif
endif
Finding the peaks and valleys of a curve is all about looking at the slope of the line: at such a location the slope is 0. Since I am guessing a voice curve is very irregular, it must first be smoothed until only significant peaks remain.
So as I see it, the curve should be treated as a set of points. Groups of points should be averaged to produce a simple, smooth curve. Then the differences between neighbouring points should be compared, and runs of points that differ very little from each other can be identified as peaks, valleys, or plateaus.
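A minimal sketch of that smoothing step (my own illustration, in C++ rather than AS3; the window size of 5 is just a guess to tune against real voice data):

#include <vector>

// Simple moving average: each output sample is the mean of the surrounding
// `window` input samples, which removes small wiggles before peak counting.
std::vector<double> smooth(const std::vector<double>& in, int window = 5)
{
    std::vector<double> out(in.size(), 0.0);
    int half = window / 2;
    for (int i = 0; i < static_cast<int>(in.size()); ++i) {
        double sum = 0.0;
        int count = 0;
        for (int j = i - half; j <= i + half; ++j) {
            if (j < 0 || j >= static_cast<int>(in.size())) continue; // clip at the edges
            sum += in[j];
            ++count;
        }
        out[i] = sum / count;
    }
    return out;
}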
If anyone wants the final code in AS3, here it is:
function detectPeaks(values:Array, tolerance:int):void
{
var min:int = int.MAX_VALUE; // start high so the first sample becomes the minimum (min = Inf in the pseudocode)
var max:int = int.MIN_VALUE; // start low so the first sample becomes the maximum (max = -Inf in the pseudocode)
var lookformax:int = 1;
var maxpos:int = 0;
var minpos:int = 0;
for(var i:int = 0; i < values.length; i++)
{
var v:int = values[i];
if (v > max)
{
max = v;
maxpos = i;
}
if (v < min)
{
min = v;
minpos = i;
}
if (lookformax == 1)
{
if (v < max - tolerance)
{
canvas.graphics.beginFill(0x00FF00);
canvas.graphics.drawCircle(maxpos % stage.stageWidth, (1 - (values[maxpos] / 100)) * stage.stageHeight, 5);
canvas.graphics.endFill();
min = v;
minpos = i;
lookformax = 0;
}
}
else
{
if (v > min + tolerance)
{
canvas.graphics.beginFill(0xFF0000);
canvas.graphics.drawCircle(minpos % stage.stageWidth, (1 - (values[minpos] / 100)) * stage.stageHeight, 5);
canvas.graphics.endFill();
max = v;
maxpos = i;
lookformax = 1;
}
}
}
}