CMSIS real-FFT on 8192 samples in Q15 - fft

I need to perform an FFT on a block of 8192 samples on an STM32F446 microcontroller.
For that I wanted to use the CMSIS DSP library as it's available easily and optimised for the STM32F4.
My 8192 samples of input will ultimately be values from the internal 12-bit ADC (left aligned and converted to q15 by flipping the sign bit)., but for testing purpose I'm feeding the FFT with test-buffers.
With CMSIS's FFT functions, only the Q15 version supports lengths of 8192. Thus I am using arm_rfft_q15().
Because the FFT functions of the CMSIS libraries include by default about 32k of LUTs - to adapt to many FFT lengths, I have "rewritten" them to remove all the tables corresponding to other length than the one I'm interested in. I haven't touched anything except removing the useless code.
My samples are stored on an external SDRAM that I access via DMA.
When using the FFT, I have several problems :
Both my source buffer and my destination buffer get modified ;
the result is not at all as expected
To make sure I had wrong results I did an IFFT right after the FFT but it just confirmed that the code wasn't working.
Here is my code :
status_codes FSM::fft_state(void)
{
// Flush the SDRAM section
si_ovf_buf_clr_u16((uint16_t *)0xC0000000, 8192);
q15_t* buf = (q15_t*)(0xC0000000);
for(int i = 0; i<50; i++)
buf[i] = 0x0FFF; // Fill the buffer with test vector (50 sp gate)
// initialise FFT
// ---> Forward, 8192 samples, bitReversed
arm_rfft_instance_q15 S;
if(arm_rfft_init_q15(&S, 8192, 0, 1) != ARM_MATH_SUCCESS)
return state_error;
// perform FFT
arm_rfft_q15(&S, (q15_t*)0xC0000000, (q15_t*)0xC0400000);
// Post-shift by 12, in place (see doc)
arm_shift_q15((q15_t*)0xC0400000, 12, (q15_t*)0xC0400000, 16384);
// Init inverse FFT
if(arm_rfft_init_q15(&S, 8192, 1, 1) != ARM_MATH_SUCCESS)
return state_error;
// Perform iFFT
arm_rfft_q15(&S, (q15_t*)0xC0400000, (q15_t*)0xC0800000);
// Post shift
arm_shift_q15((q15_t*)0xC0800000, 12, (q15_t*)0xC0800000, 8192);
return state_success;
}
And here is the result (from GDB)
PS : I'm using ChibiOS - not sure if it is relevant.

Related

step doubling Runge Kutta implementation stuck shrinking stepsize to machine precision

I need to integrate a system of ODES using an adaptive RK4 method with stepsize control via step doubling techniques.
The problem is that the program continues forever shrinking the stepsize down to machine precision while not advancing time.
the idea is to step the solution once by a single step and also by two successive half steps, compare the result as their difference and store it in eps. So eps is a measure of the error. Now I want to determine the next step stepsize according to whether eps is greater to a specified accuracy eps0 (as described in the book "Numerical Recipes")
RK4Step(double t, double* Y, double *Yout, void (*RHSFunc)(double, double *, double *),double h) steps the solution vector Y by h and puts the result into Yout using the function RHSFunc.
#define NEQ 4 //problem dimension
int main(int argc, char* argv[])
{
ofstream frames("./frames.dat");
ofstream graphs("./graphs.dat");
double Y[4] = {2.0, 2.0, 1.0, 0.0}; //initial conditions for solution vector
double finaltime = 100; //end of integration
double eps0 = 10e-5; //error to compare with eps
double t = 0.0;
double step = 0.01;
while(t < finaltime)
{
double eps = 0.0;
double Y1[4], Y2[4]; //Y1 will store half step solution
//Y2 will store double step solution
double dt = step; //cache current stepsize
for(;;)
{
//make a step starting from state stored in Y and
//put solution into Y1. Then from Y1 make another half step
//and store into Y1.
RK4Step(t, Y, Y1, RHS, step); //two half steps
RK4Step(t+step,Y1, Y1, RHS, step);
RK4Step(t, Y, Y2, RHS, 2*step); //one long step
//compute eps as maximum of differences between Y1 and Y2
//(an alternative would be quadrature sums)
for(int i=0; i<NEQ; i++)
eps=max(eps, fabs( (Y1[i]-Y2[i])/15.0 ) );
//if error is within tolerance we grow stepsize
//and advance time
if(eps < eps0)
{
//stepsize is accepted, grow stepsize
//save solution from Y1 into Y,
//advance time by the previous (cached) stepsize
Y[0] = Y1[0]; Y[1] = Y1[1];
Y[2] = Y1[2]; Y[3] = Y1[3];
step = 0.9*step*pow(eps0/eps, 0.20); //(0.9 is the safety factor)
t+=dt;
break;
}
//if the error is too big we shrink stepsize
step = 0.9*step*pow(eps0/eps, 0.25);
}
}
frames.close();
graphs.close();
return 0;
}
You never reset eps in the inner loop. This could be the direct cause of your problem. While the actual error reduces with ever decreasing step sizes, the maximum in eps stays constant, and above eps0. This results in a constant reducing factor in the step size update, without any chance to break the loop.
Another "wrong thing" is that the error estimate and tolerance are incompatible. The error tolerance eps0 is an error density or unit-step error. To bring your error estimate eps into that format you need to divide eps by step. Or put another way, currently you are forcing the actual step error to be close to 0.5*eps0, so that the global error is 0.5*eps0 times the number of steps taken, with the number of steps loosely proportional to eps0^0.2. In the version using the unit-step error, the local error is forced to be "dynamically" close to 0.5*eps0*step, so that the global error is about 5*eps0 times the length of the integration interval. I'd say that the second variant is more in line with intuition about the expected behavior.
This is not a critical error, but may lead to sub-optimal step sizes and an actual global error that deviates non-trivially from the desired error tolerance.
You also have a coding inconsistency as in the propagation of the state and declaration of state vectors you have hard-coded 4 components in the state vector, while in the error computation you have a loop over a variable number NEQ of equations and components. As you are using C++, you could use a state vector class that handles all dimension-dependent loops internally. (If done too far, frequent allocation of instances with a short life span could be an efficiency issue.)

Getting pointers to specific elements of a 1D contiguous array on the device

I am trying to use CUBLAS in C++ to rewrite a python/tensorflow script which is operating on batches of input samples (of shape BxD, B: BatchSize, D: Depth of the flattened 2D matrix)
For the first step, I decided to use CUBLAS cublasSgemmBatched to compute MatMul for batches of matrices.
I've found couple working sample codes as the one in link to the question,
but what I want is to allocate one big contiguous device array to store batches of flattened identical shaped matrices. I DO NOT want to store batches separated from each other on device memory(as they are in the provided sample code in the given link to StackOverflow question)
From what I can imagine, somehow I have to get a list of pointers to starting elements of each batch on device memory. something like this:
float **device_batch_ptr;
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
for(int i = 0 ; i < batch_size; i++ ) {
// set device_batch_ptr[i] to starting point of i'th batch on device memory array.
}
Note that cublasSgemmBatched needs a float** that each float* in it, points to starting element of each batch in a given input matrix.
Any advice and suggestions will be greatly appreciated.
If your arrays are in contiguous linear memory (device_array) then all you need to do is calculate the offsets using standard pointer arithmetic and store the device addresses in a host array which you then copy to the device. Something like:
float** device_batch_ptr;
float** h_device_batch_ptr = new float*[batch_size];
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
size_t nelementsperrarray = N * N;
for(int i = 0 ; i < batch_size; i++ ) {
// set h_device_batch_ptr[i] to starting point of i'th batch on device memory array.
h_device_batch_ptr[i] = device_array + i * nelementsperarray;
}
cudaMemcpy(device_batch_ptr, h_device_batch_ptr, batch_size*sizeof(float *)),
cudaMemcpyHostToDevice);
[Obviously never compiled or tested, use at own risk]

How can I record sound with 16 bits per sample (16 bit depth)?

I try to record PCM sound from flash (using Microphone class). I use org.bytearray.micrecorder.MicRecorder helper class.
In Microphone class I cannot find property like bitDepth or bitsPerSample.
I always get 32 bits.
Is it possible to do?
UPDATE: The asker John812 was able to solve this by using..
bit16_bytes.writeShort( data.readFloat() * 32767 ); see comments below for context
METHOD #2: Based on my experience with using the LoadPCMfromByteArray method
I have something you could try but I've only used it with an actual 32bit WAVE file and played via the LoadPCMFromByteArray command.
The AS3 Microphone Class records 32 bits. You have to write the conversion of samples to a different bit-depth by yourself. I have no idea how many samples you are processing but the general code below shows you how to convert. Note: * 512 means use your actual samples amount (example: * 4096? or * 8192?) If you get the numbers wrong there'll be hiss/distortion so either experiment from small or provide the full details in your question for a more helpful edit/answer.
CONVERT: Assuming your recorded byteArray is called data
public var bit16_bytes : ByteArray; //will hold the 16bit version
public function convert_to16Bit () : void
{
bit16_bytes = new ByteArray(); data.position = 0;
while (bit16_bytes.position < data.length - 4)
//if you get noise/distortion try either: 256, 512, 1024, 2048, 4096 or 8192
{ bit16_bytes.writeShort( data.readInt() * 512 ); } //multiply by samples amount
data = new ByteArray(); //recycle for re-use
bit16_bytes.position = 0; //reset or else E-O-File error
bit16_bytes.readBytes( data ); //copy 16bit back into Data byte-array
}
To run the above function whenever you're ready just add the line convert_to16Bit(); inside whatever function deals with your "recording complete" situation.

Low memory copy throughput Host to Device

I have a vector of vectors vector<vector<double>> data.
I want to copy only the information contained in that "2D matrix" as there are no vectors in CUDA.
So the first approach I used was
vector<vector<double>> *values;
vector<vector<double>>::iterator it;
double *d_values;
double *dst;
checkCudaErr(
cudaMalloc((void**)&d_values, sizeof(double)*M*N)
);
dst = d_values;
for (it = values->begin(); it != values->end(); ++it){
double *src = &((*it)[0]);
size_t s = it->size();
checkCudaErr(
cudaMemcpy(dst, src, sizeof(double)*s, cudaMemcpyHostToDevice)
);
dst += s;
}
After profiling with NVVP I got a very low cudaMempcpy throughput. I think this is logic as I'm sending a very small amount of
bytes in each cudaMemcpy call.
So I decided to change a little bit the code to try to improve this, so the second approach is
double *h_values = new double[M*N];
dst = h_values;
for (it = values->begin(); it != values->end(); ++it){
double *src = &((*it)[0]);
size_t s = it->size();
memcpy(dst, src, sizeof(double)*s);
dst += s;
}
checkCudaErr(
cudaMemcpy(d_values, h_values, sizeof(double)*M*N, cudaMemcpyHostToDevice)
);
the result after profiling is still a low memcpy throughput.
So, my question is, how can I improve the copies from host to device?
I'm using a Quadro K4000. I'm getting 25 MB/s for the first case and about 2 GB/s on the second one. M = 5 and N = 2000000. I must say the value for M is a common value, but sometimes it can get up to 50.
A reason for your slow throughput can be that you allocate your double matrix with new. This memory is not page locked. You can either use a system function (dont know which system you use) or the cuda function providing this functionality. It would be cudaMallocHost.
Just remove your =new double[M*N] and set your h_values with cudaMallocHost(&h_values, sizeof(double)*M*N) (and of course dont delete it, but free it (with cudaFreeHost)).
Btw. the theoretical top speed is 8 GB/s (PCI 2.0 x 16 lanes), practical you will stay below it (around 6 GB/s).

Invalid argument in cudaMemcpy3D using width in bytes?

I've made a simple texture3D test and found a strange behavior when copying data to device. The function cudaMemcpy3D return an 'invalid argument'.
I found the problem is related with cudaExtent. According to the CUDA Toolkit Reference Manual 4.0, cudaExtent Parameters are as follow:
w - Width in bytes
h - Height in elements
d - Depth in elements
So, I prepared the texture as follows:
// prepare texture
cudaChannelFormatDesc t_desc = cudaCreateChannelDesc<baseType>();
// CUDA extent parameters w - Width in bytes, h - Height in elements, d - Depth in elements
cudaExtent t_extent = make_cudaExtent(NCOLS*sizeof(baseType), NROWS, DEPTH);
// CUDA arrays are opaque memory layouts optimized for texture fetching
cudaArray *i_ArrayPtr = NULL;
// allocate 3D
status = cudaMalloc3DArray(&i_ArrayPtr, &t_desc, t_extent);
And configured the 3D parameters as follow:
// prepare input data
cudaMemcpy3DParms i_3DParms = { 0 };
i_3DParms.srcPtr = make_cudaPitchedPtr( (void*)h_idata, NCOLS*sizeof(baseType), NCOLS, NROWS);
i_3DParms.dstArray = i_ArrayPtr;
i_3DParms.extent = t_extent;
i_3DParms.kind = cudaMemcpyHostToDevice;
And finally copied the data to device memory:
// copy input data from host to device
status = cudaMemcpy3D( &i_3DParms );
The problem is solved if I only specified the number of element in the x dimension as:
cudaExtent t_extent = make_cudaExtent(NCOLS, NROWS, DEPTH);
which does not produce any error and the test work as expected.
I'm wondering if I miss something with the cudaExtent function or something else. Why the width parameter is not needed to be expressed in bytes ?
For CUDA arrays, the extent is specified with the width given in array elements. For allocating linear memory, the extent is specified with the width given in bytes. Because you are allocating an array with cudaMalloc3DArray, use the width in elements. If you were using cudaMalloc3D, the extent would have a width in bytes.