Kiss FFT on TI MSP430FR6989

I am trying to run Kiss FFT (Kiss FFT GitHub) on an MSP430FR6989 LaunchPad. For now, I'm just trying to get the kiss_fftr test shown here to work. I am running into an issue with kiss_fftr_alloc(int nfft, int inverse_fft, void *mem, size_t *lenmem). My arguments are (16, 0, NULL, NULL). The function reaches this point in kiss_fftr.c and then returns NULL because of the !st check:
if (lenmem == NULL) {
    st = (kiss_fftr_cfg) KISS_FFT_MALLOC (memneeded);
} else {
    if (*lenmem >= memneeded)
        st = (kiss_fftr_cfg) mem;
    *lenmem = memneeded;
}
if (!st)
{
    return NULL;
}
KISS_FFT_MALLOC (which maps to malloc here) returns NULL, so the allocation failed. I am sure there is enough memory available on my MCU: my memory usage in CCS is 35% RAM (736/2048), 3% FRAM1 (1896/48000) and 28% FRAM2 (23144/81912).
Does anyone have advice on how to fix this, or on what I should learn in order to fix it? I don't want to chase down the wrong rabbit hole if memory allocation is not actually the issue.
What I've tried: when I run the test code from the Stack Overflow link and send the output array over UART, I get no output. I checked whether kiss_fftr_alloc was working correctly by adding an if (st == NULL) statement that throws a KISS_FFT_ERROR. The error fired at the point mentioned above, but I cannot figure out how to "fix" malloc failing to allocate memory.
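Two hedged suggestions rather than a confirmed fix. First, the heap that the TI linker reserves for malloc is tiny by default (it can be raised in CCS under the MSP430 linker's --heap_size option), so malloc can fail long before RAM is actually exhausted. Second, kiss_fftr_alloc can bypass malloc entirely if you hand it a caller-supplied buffer through the mem/lenmem parameters, as the snippet above shows. A minimal sketch of that route, with a placeholder buffer size:

#include <stddef.h>
#include "kiss_fftr.h"

#define NFFT 16

static kiss_fftr_cfg make_fftr_cfg(void)
{
    static unsigned char fft_mem[1024]; /* placeholder size; the query below reports the real need */
    size_t len = 0;

    /* First call: with an undersized buffer it returns NULL but writes the
       required size into len (see the kiss_fftr.c snippet above). */
    kiss_fftr_alloc(NFFT, 0, NULL, &len);
    if (len > sizeof fft_mem)
        return NULL; /* enlarge fft_mem at compile time and rebuild */

    /* Second call: the buffer is large enough, so no heap allocation happens. */
    return kiss_fftr_alloc(NFFT, 0, fft_mem, &len);
}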


openCV / unhandled exception or msvcp100d.dll

I realise this problem is pretty common, but I have spent around four days so far trying to fix it on my own, using all the smart advice I found on the Internet, and unfortunately I've failed.
I managed to make OpenCV 2.4.6 work with my Visual Studio 2012, or at least that's what I assumed after I was able to stream video from my webcam with this example:
#include "stdafx.h"
#include "opencv2/opencv.hpp"
int main( int argc, const char** argv )
{
CvCapture* capture;
IplImage* newImg;
while (true)
{
capture = cvCaptureFromCAM(-1);
newImg = cvQueryFrame( capture );
cvNamedWindow("Window1", CV_WINDOW_AUTOSIZE);
cvShowImage("Window1", newImg);
int c = cvWaitKey(10);
if( (char)c == 27 ) { exit(0); }
}
cvReleaseImage(&newImg);
return 0;
}
Everything worked fine, so I decided to play around with it and attempted a simple image-processing operation, converting RGB to grayscale. I modified my code to the following:
#include "stdafx.h"
#include "opencv2/opencv.hpp"
int main( int argc, const char** argv )
{
CvCapture* capture;
IplImage* img1;
IplImage* img2;
while (true)
{
capture = cvCaptureFromCAM(-1);
img1 = cvQueryFrame( capture );
img2 = cvCreateImage(cvGetSize(img1),IPL_DEPTH_8U,1);
cvCvtColor(img1,img2,CV_RGB2GRAY);
cvNamedWindow("Window1", CV_WINDOW_AUTOSIZE);
cvNamedWindow("Window2", CV_WINDOW_AUTOSIZE);
cvShowImage("Window1", img1);
cvNamedWindow("Window2", CV_WINDOW_AUTOSIZE);
int c = cvWaitKey(10);
if( (char)c == 27 ) { exit(0); }
}
cvReleaseImage(&img1);
cvReleaseImage(&img2);
return 0;
}
And that's where the nightmare started. I keep getting
Unhandled exception at 0x000007FEFD57AA7D in opencvbegginer.exe: Microsoft C++ exception: cv::Exception at memory location 0x000000000030F920.
I did some research and tried a few solutions, such as swapping opencv_core246.lib for opencv_core246d.lib, etc. For a second I hoped it might work, but reality punched me again, this time with msvcp100d.dll missing. I updated all the redistributable packages, but that didn't change the fact that I keep getting this error. Looking for another way to fix it, I found a forum that advises going to the C/C++ properties and changing the Runtime Library to MTd, so I tried this as well, but - as you can see by now - it didn't work.
At this current moment I just ran out of ideas on how to fix this, so I would be really grateful for any help.
Cheers
PS. An important thing to add: when I got the unhandled exception, OpenCV "spoke to me", saying
OpenCV Error: Bad argument in
unknown function, file ......\scr\opencv\modules\core\src\array.cpp,
line 1238
However, I had already assumed back then that I'm simply not clever enough, even with my idiot-resistant code, so I tried a few other pieces of code written by competent people - unfortunately, I keep getting exactly the same error (everything else stays the same too, after I change the things mentioned above).
If img1 == NULL, the code crashes on cvGetSize(img1). Try wrapping the code after cvQueryFrame in an if (img1 != NULL) block.
However, if it returns NULL for every frame, it means something is wrong with your camera, your drivers, or the way you capture the frames.
You should also move cvNamedWindow outside of the loop, since there is no need to recreate the window for every frame.
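A minimal sketch of how those suggestions could fit together (this also hoists cvCaptureFromCAM out of the loop, since the original reopens the camera every iteration; treat it as illustrative, not as the one true fix):

#include "opencv2/opencv.hpp"

int main()
{
    CvCapture* capture = cvCaptureFromCAM(-1);        // open the camera once
    if (capture == NULL)
        return -1;

    cvNamedWindow("Window1", CV_WINDOW_AUTOSIZE);     // create the windows once
    cvNamedWindow("Window2", CV_WINDOW_AUTOSIZE);

    IplImage* img2 = NULL;
    while (true)
    {
        IplImage* img1 = cvQueryFrame(capture);       // may return NULL
        if (img1 != NULL)
        {
            if (img2 == NULL)                         // allocate the gray image once
                img2 = cvCreateImage(cvGetSize(img1), IPL_DEPTH_8U, 1);
            cvCvtColor(img1, img2, CV_RGB2GRAY);
            cvShowImage("Window1", img1);
            cvShowImage("Window2", img2);
        }
        if ((char)cvWaitKey(10) == 27)
            break;
    }

    cvReleaseImage(&img2);       // frames returned by cvQueryFrame are owned by the
    cvReleaseCapture(&capture);  // capture and must not be released by the caller
    return 0;
}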

Kernel Launch Failure

I'm working on a Linux system with a Tesla C2075. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step-by-step averaged version (time_avg) of a large data set (result). See the code below.
Size of "result" and "time_avg" is same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result. So, first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of 8 samples and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
    __shared__ float temp[1024];
    int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
    int start = 0, index;
    unsigned int npts = *nsamps;
    printf("here here\n");
    // Store a chunk of memory = 2*blockDim.x (which is to be reduced) into shared memory
    if ( (2*gtid) < npts ){
        temp[2*ltid] = result[2*gtid];
        temp[2*ltid+1] = result[2*gtid + 1];
    }
    for (stride=1; stride<blockDim.x; stride>>=1) {
        __syncthreads();
        if (ltid % (stride*2) == 0){
            if ( (2*gtid) < npts ){
                temp[2*ltid] += temp[2*ltid + stride];
                index = (int)(start + gtid/stride);
                time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
            }
        }
        start += npts/(2*stride);
    }
    __syncthreads();
    if (ltid == 0)
    {
        atomicAdd(mean, temp[0]);
    }
    __syncthreads();
    printf("%f\n", *mean);
}
Launch configuration is 40 blocks, 512 threads. Data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I add cudaDeviceSynchronize() (or a cudaMemcpy to check the value of mean) after the kernel call, the program hangs completely. If I remove it, the program runs and exits. In neither case do I get the "here here" output or the mean value printed. I understand that unless the kernel executes successfully, the printfs won't print.
Does this have something to do with __syncthreads() in a recursion? All threads go to the same depth, so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous: if the kernel starts successfully, your host code continues to run and you will see no error. Errors that happen during the kernel run surface only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization, then your kernel probably didn't finish running - it is either stuck in an infinite loop or waiting on some __syncthreads() or another synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left, not right: stride<<=1.
You mentioned recursion, but your code contains only one __global__ function and there are no recursive calls.
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {
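To make the first answer concrete, here is a sketch of the usual error-checking pattern around a launch (hypothetical host-side snippet; the d_* names stand in for your device buffers). cudaGetLastError() immediately after the launch only reports launch-configuration problems; errors raised while the kernel runs only appear at the next synchronization point:

timeavg_mean<<<40, 512>>>(d_result, d_nsamps, d_time_avg, d_mean);

cudaError_t err = cudaGetLastError();   // catches bad launch configurations
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // catches errors (or hangs) during execution
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));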

Issue with periodic discrepancies in cufft-fftw complex-to-real transformations

For my thesis, I have to optimize a special MPI Navier-Stokes solver with CUDA. The original program uses FFTW to solve several PDEs. In detail, several upper-triangular matrices are Fourier transformed in two dimensions, but handled as one-dimensional arrays. For the moment, I'm struggling with parts of the original code (N is always set to 64):
Original:
//Does the complex-to-real in-place fft and normalizes
void fftC2R(double complex *arr) {
    fftw_execute_dft_c2r(plan_c2r, (fftw_complex*)arr, (double*)arr);
    //Currently ignored: normalization
    /* for(int i=0; i<N*(N/2+1); i++)
           arr[i] /= (double complex)sqrt((double complex)(N*N)); */
}
void doTimeStepETDRK2_nonlin_original() {
    //calc velocity
    ux[0] = 0;
    uy[0] = 0;
    for(int i=1; i<N*(N/2+1); i++) {
        ux[i] = I*kvec[1][i]*qvec[i] / kvec[2][i];
        uy[i] = -I*kvec[0][i]*qvec[i] / kvec[2][i];
    }
    fftC2R(ux);
    fftC2R(uy);
    //do some stuff here...
    //...
    return;
}
where ux and uy are allocated as double complex arrays:
ux = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
uy = (double complex*)fftw_malloc(N*(N/2+1) * sizeof(double complex));
The fft-plan is created as:
plan_c2r = fftw_plan_dft_c2r_2d(N, N,(fftw_complex*) qvec, (double*)qvec, FFTW_ESTIMATE);
where qvec is allocated the same way as ux and uy and has type double complex.
Here are the relevant parts of the CUDA code:
NN2_VecSetZero_and_init<<<block_size,grid_size>>>();
cudaSafeCall(cudaDeviceSynchronize());
cudaSafeCall(cudaGetLastError());

int err = (int)cufftExecZ2D(cu_plan_c2r,(cufftDoubleComplex*)sym_ux,(cufftDoubleReal*)sym_ux);
if (err != CUFFT_SUCCESS ) {
    exit(EXIT_FAILURE);
    return;
}
err = (int)cufftExecZ2D(cu_plan_c2r,(cufftDoubleComplex*)sym_uy,(cufftDoubleReal*)sym_uy);
if (err != CUFFT_SUCCESS ) {
    exit(EXIT_FAILURE);
    return;
}
//do some stuff here...
//...
return;
where sym_ux and sym_uy are allocated as:
cudaMalloc((void**)&sym_ux, N*(N/2+1)*sizeof(cufftDoubleComplex));
cudaMalloc((void**)&sym_uy, N*(N/2+1)*sizeof(cufftDoubleComplex));
The initialization of the relevant cufft parts looks like this:
if (cufftPlan2d(&cu_plan_c2r,N,N, CUFFT_Z2D) != CUFFT_SUCCESS){
exit(EXIT_FAILURE);
return -1;
}
if (cufftPlan2d(&cu_plan_r2c,N,N, CUFFT_D2Z) != CUFFT_SUCCESS){
exit(EXIT_FAILURE);
return -1;
}
if ( cufftSetCompatibilityMode ( cu_plan_c2r , CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS ) {
exit(EXIT_FAILURE);
return -1;
}
if ( cufftSetCompatibilityMode ( cu_plan_r2c , CUFFT_COMPATIBILITY_FFTW_ALL) != CUFFT_SUCCESS ) {
exit(EXIT_FAILURE);
return -1;
}
So I use full FFTW compatibility and call every function with the FFTW calling conventions.
When I run both versions, I get almost equal results for ux and uy (sym_ux and sym_uy). But at periodic positions in the arrays, cuFFT seems to ignore the elements where FFTW sets the real part to zero and calculates the complex part (the arrays are too large to show here). The step between the positions where this occurs is N/2+1. So I believe I haven't completely understood the fft padding theory of cuFFT versus FFTW.
I can exclude any earlier discrepancies between these arrays before the cuFFT executions are called, so none of the other arrays in the code above are relevant here.
My question is: am I too optimistic in using almost 100% of the FFTW calling style? Do I have to prepare my arrays before the FFTs? The cuFFT documentation says I'd have to resize the data input and output arrays. But how can I do that when I'm running in-place transformations? I really wouldn't like to stray too far from the original code, and I don't want to add more copy instructions for each FFT call, because memory is limited and the arrays should stay on the GPU and be processed there as long as possible.
I'm thankful for every hint, critical statement or idea!
My configuration:
Compiler: gcc 4.6 (C99 Standard)
MPI package: mvapich2-1.5.1p1 (shouldn't play a role, since I debug in reduced single-process mode)
CUDA-Version: 4.2
GPU: CUDA-arch-compute_20 (NVIDIA GeForce GTX 570)
FFTW 3.3
When I had to deal with cuFFT, the only solution I got to work was to use complex-to-complex plans ("cu_plan_c2c") exclusively - the transformations between real and complex arrays are easy:
- fill the imaginary part with 0 to emulate cu_plan_r2c
- use atan2 (not atan) on the complex result to emulate cu_plan_c2r
Sorry for not pointing you to a better solution, but this is how I ended up solving the problem. I hope you don't get into serious trouble with low memory on the CPU side...
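A minimal sketch of the first bullet, emulating the real-to-complex direction with a Z2Z plan by zero-filling the imaginary parts (the kernel and names are hypothetical, not part of the original code):

// Widen a real array into interleaved complex values with zero imaginary
// parts, so that a CUFFT_Z2Z plan can stand in for the D2Z transform.
__global__ void real_to_complex(const double *in, cufftDoubleComplex *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = in[i];  // real part copied
        out[i].y = 0.0;    // imaginary part zero-filled
    }
}

// Usage sketch:
//   cufftPlan2d(&cu_plan_c2c, N, N, CUFFT_Z2Z);
//   real_to_complex<<<blocks, threads>>>(d_real, d_tmp, N*N);
//   cufftExecZ2Z(cu_plan_c2c, d_tmp, d_tmp, CUFFT_FORWARD);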

CUDA: atomicAdd takes too much time, serializing threads

I have a kernel that makes some comparisons and decides whether two objects collide or not. I want to store the colliding objects' ids in an output buffer, without gaps: each collision should be recorded at a unique index in the output buffer.
So I created an atomic counter in shared memory (local sum) and one in global memory (global sum). The code below shows the shared counter being incremented as collisions are found. Incrementing the atomic counter in global memory is not a problem for now.
__global__ void mykernel(..., unsigned int *gColCnt) {
    ...
    __shared__ unsigned int sColCnt;
    __shared__ unsigned int sIndex;
    if (threadIdx.x == 0) {
        sColCnt = 0;
    }
    __syncthreads();

    unsigned int index = 0;
    if (colliding)
        index = atomicAdd(&sColCnt, 1); //!!Time Consuming!!
    __syncthreads();

    if (threadIdx.x == 0)
        sIndex = atomicAdd(gColCnt, sColCnt);
    __syncthreads();

    if (sColCnt + sIndex > outputSize) { //output buffer is not big enough
        //printf("Exceeds outputsize: %d + %d > %d\n", sColCnt, sIndex, outputSize);
        return;
    }
    if (colliding) {
        output[sIndex + index] = make_uint2(startId, toId);
    }
}
My problem is that when many threads try to increment the atomic variable, they get serialized. Before writing something like a prefix sum myself, I wanted to ask whether there is a way of doing this efficiently.
The elapsed time of my kernel increases from 13 ms to 44 ms because of this one line.
I found a prefix-sum example, but its referenced links fail because NVIDIA's discussion board is down.
https://stackoverflow.com/a/3836944/596547
Edit:
I have added the end of my code above too. In fact, I do have a hierarchy. To see the effect of every code line, I set up scenes where every object collides with every other object (one extreme case) and another where approximately no objects collide (the other extreme).
At the end, I add the shared atomic counter to a global variable (gColCnt) to report the number of collisions and find correct index values. I think I have to use atomicAdd there either way.
Consider using a parallel stream-compaction algorithm, for instance thrust::copy_if.
A related NVIDIA blog article: http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
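A minimal sketch of the compaction idea (hypothetical setup: it assumes per-candidate collision flags and pair ids have already been written to device vectors, which removes the in-kernel atomic counter entirely):

#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct flag_set
{
    __host__ __device__ bool operator()(unsigned int f) const { return f != 0; }
};

// flags[i] != 0 marks candidate pair i as colliding; ids[i] holds its pair ids.
// copy_if compacts the colliding ids into 'out' with no gaps between them.
thrust::device_vector<uint2> compact_collisions(
    const thrust::device_vector<unsigned int> &flags,
    const thrust::device_vector<uint2> &ids)
{
    thrust::device_vector<uint2> out(ids.size());
    size_t n = thrust::copy_if(ids.begin(), ids.end(),
                               flags.begin(),          // stencil
                               out.begin(), flag_set()) - out.begin();
    out.resize(n);                                     // keep only the collisions
    return out;
}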

CUDA/Thrust double pointer problem (vector of pointers)

Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to access, from a CUDA kernel, a double pointer backed by a thrust::device_vector of type Object* (a vector of pointers) filled on the host. When compiling with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error when attempting to run the program.
I have read the NVIDIA forums and the advice seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending it to the kernel... Has anyone found a solution to this problem? The required code is below; thanks in advance!
--------------------------
main.cpp
--------------------------
Sphere * parseSphere(int i)
{
    Sphere * s = new Sphere();
    s->a = 1+i;
    s->b = 2+i;
    s->c = 3+i;
    return s;
}

int main( int argc, char** argv ) {
    int i;
    thrust::host_vector<Sphere *> spheres_h;
    thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);

    //initialize spheres_h
    for(i=0;i<NUM_OBJECTS;i++){
        Sphere * sphere = parseSphere(i);
        spheres_h.push_back(sphere);
    }
    //initialize spheres_resh
    for(i=0;i<NUM_OBJECTS;i++){
        spheres_resh[i].a = 1;
        spheres_resh[i].b = 1;
        spheres_resh[i].c = 1;
    }

    thrust::device_vector<Sphere *> spheres_dv = spheres_h;
    thrust::device_vector<Sphere> spheres_resv = spheres_resh;
    Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
    Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);

    kernelBegin(spheres_d,spheres_res,NUM_OBJECTS);

    thrust::copy(spheres_dv.begin(),spheres_dv.end(),spheres_h.begin());
    thrust::copy(spheres_resv.begin(),spheres_resv.end(),spheres_resh.begin());

    bool result = true;
    for(i=0;i<NUM_OBJECTS;i++){
        result &= (spheres_resh[i].a == i+1);
        result &= (spheres_resh[i].b == i+2);
        result &= (spheres_resh[i].c == i+3);
    }
    if(result)
    {
        cout << "Data GOOD!" << endl;
    }else{
        cout << "Data BAD!" << endl;
    }
    return 0;
}
--------------------------
cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
    spheres_res[index].b = (*(spheres_d+index))->b;
    spheres_res[index].c = (*(spheres_d+index))->c;
}

void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
    int threads = 512; //per block
    int grids = ((num_objects)/threads)+1; //blocks per grid
    deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}
The basic problem here is that the device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces, so when you copy spheres_h to GPU memory, you wind up with a GPU array of host pointers. Indirection through host pointers on the GPU is illegal - they are pointers in the wrong memory address space - so you get the GPU equivalent of a segfault inside the kernel.
The solution will involve replacing your parseSphere function with something that performs memory allocation on the GPU, rather than the current parseSphere, which allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and were using CUDA 3.2 or 4.0, one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
The alternative approach involves creating a host array holding device pointers, then copying that array to device memory. You can see an example of that in this answer. Note that declaring a thrust::device_vector containing thrust::device_vector probably won't work, so you will likely need to build this array of device pointers using the underlying CUDA API calls.
You should also note that I haven't mentioned the reverse copy operation, which is equally difficult to do.
The bottom line is that thrust (and C++ STL containers, for that matter) really are not intended to hold pointers. They are intended to hold values, abstracting away pointer indirection and direct memory access through iterators and underlying algorithms which the user isn't supposed to see. Furthermore, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it executes more slowly on the GPU as well.
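To illustrate the second approach, here is a minimal sketch of building an array of device pointers with plain CUDA API calls (a hypothetical helper, error checking omitted):

// Allocate each Sphere individually on the device, collect the resulting
// device pointers in a host-side array, then copy that array to the device.
Sphere** makeDeviceSpheres(int n)
{
    Sphere** h_ptrs = new Sphere*[n];
    for (int i = 0; i < n; ++i) {
        Sphere s;                                     // staged on the host
        s.a = 1 + i; s.b = 2 + i; s.c = 3 + i;
        cudaMalloc((void**)&h_ptrs[i], sizeof(Sphere));
        cudaMemcpy(h_ptrs[i], &s, sizeof(Sphere), cudaMemcpyHostToDevice);
    }
    Sphere** d_ptrs;                                  // device array of device pointers
    cudaMalloc((void**)&d_ptrs, n * sizeof(Sphere*));
    cudaMemcpy(d_ptrs, h_ptrs, n * sizeof(Sphere*), cudaMemcpyHostToDevice);
    delete[] h_ptrs;
    return d_ptrs;                                    // safe to dereference twice in a kernel
}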