FFTW - computing the IFFT without first computing an FFT

This may seem like a simple question, but I've been trying to find the answer on the FFTW page and have been unable to.
I created FFTW plans for the forward and backward transforms, and I fed some data into the fftw_complex *fft array directly (instead of computing an FFT from input data first). Then I computed an IFFT on it, and the result is not correct. Am I doing this right?
EDIT: So what I did is the following:
#include <iostream>
#include <fftw3.h>

int main()
{
    int ht = 2, wd = 2;
    fftw_complex *inp  = fftw_alloc_complex(ht * wd);
    fftw_complex *fft  = fftw_alloc_complex(ht * wd);
    fftw_complex *ifft = fftw_alloc_complex(ht * wd);
    fftw_plan plan_f = fftw_plan_dft_1d(wd * ht, inp, fft, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan plan_b = fftw_plan_dft_1d(wd * ht, fft, ifft, FFTW_BACKWARD, FFTW_ESTIMATE);
    for (int i = 0; i < 2; i++)
    {
        for (int j = 0; j < 2; j++)
        {
            inp[wd*i + j][0] = 1.0;
            inp[wd*i + j][1] = 0.0;
        }
    }
    // fftw_execute(plan_f);
    // Fill the frequency-domain array by hand instead (intended: a single DC spike of 4.0).
    for (int i = 0; i < 2; i++)
    {
        for (int j = 0; j < 2; j++)
        {
            fft[wd*i + j][1] = 0.0;
            if (i == j == 0)
                fft[wd*i + j][0] = 4.0;
            else
                fft[wd*i + j][0] = 0.0;
            std::cout << fft[wd*i + j][0] << " and " << fft[wd*i + j][1] << std::endl;
        }
    }
    fftw_execute(plan_b);
    // FFTW's backward transform is unnormalized, so divide by N = wd*ht.
    for (int i = 0; i < 2; i++)
    {
        for (int j = 0; j < 2; j++)
            std::cout << ifft[wd*i + j][0] / (double)(wd*ht) << " and "
                      << ifft[wd*i + j][1] / (double)(wd*ht) << std::endl;
    }
    fftw_destroy_plan(plan_f);
    fftw_destroy_plan(plan_b);
    fftw_free(inp);
    fftw_free(fft);
    fftw_free(ifft);
    return 0;
}
This is the full code. The IFFT should return [1 1 1 1] for the real part, but it doesn't.

I did the stupidest thing: in the if condition I wrote
i == j == 0
instead of
i == 0 && j == 0
Since == associates left to right, i == j == 0 parses as (i == j) == 0, which is true exactly when i and j differ, so the 4.0 was written to the two off-diagonal elements instead of the DC bin. Once I fixed that, it works. Thank you all so much for helping me out.
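For anyone else who hits this: the chained comparison does not mean what it looks like. A tiny standalone check makes the parse visible:
#include <iostream>
int main()
{
    int i = 0, j = 0;
    // i == j == 0 parses as (i == j) == 0
    std::cout << (i == j == 0) << '\n';          // prints 0: (0 == 0) == 0 -> 1 == 0 -> false
    std::cout << ((i == 0) && (j == 0)) << '\n'; // prints 1: the intended test
    return 0;
}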

Related

What is the behavior of the vector here?

I can't understand why, in the test shown below, the iterator p never reaches the end, so the loop only breaks when k = 20. What exactly does the push_back do to cause undefined behavior? Is it because the vector dynamically allocates a bunch of additional storage for the new elements, and the amount is not necessarily the amount I will use?
#include <iostream>
#include <vector>
#include <list>
using namespace std;

const int MAGIC = 11223344;

void test()
{
    bool allValid = true;
    int k = 0;
    vector<int> v2(5, MAGIC);
    k = 0;
    for (vector<int>::iterator p = v2.begin(); p != v2.end(); p++, k++)
    {
        if (k >= 20) // prevent infinite loop
            break;
        if (*p != MAGIC)
        {
            cout << "Item# " << k << " is " << *p << ", not " << MAGIC << "!" << endl;
            allValid = false;
        }
        if (k == 2)
        {
            for (int i = 0; i < 5; i++)
                v2.push_back(MAGIC);
        }
    }
    if (allValid && k == 10)
        cout << "Passed test 3" << endl;
    else
        cout << "Failed test 3" << "\n" << k << endl;
}

int main()
{
    test();
}
Inserting into a vector while iterating over it is a really bad idea. An insertion may cause a reallocation that invalidates all iterators. In this case, the capacity was not enough for the additional elements, so the vector's storage was reallocated at a different address. You can check it yourself:
void test()
{
    bool allValid = true;
    int k = 0;
    vector<int> v2(5, MAGIC);
    k = 0;
    for (vector<int>::iterator p = v2.begin(); p != v2.end(); p++, k++)
    {
        cout << v2.capacity() << endl; // Print the vector capacity
        if (k >= 20) // prevent infinite loop
            break;
        if (*p != MAGIC)
        {
            //cout << "Item# " << k << " is " << *p << ", not " << MAGIC << "!" << endl;
            allValid = false;
        }
        if (k == 2)
        {
            for (int i = 0; i < 5; i++)
                v2.push_back(MAGIC);
        }
    }
    if (allValid && k == 10)
        cout << "Passed test 3" << endl;
    else
        cout << "Failed test 3" << "\n" << k << endl;
}
This code will output something like the following:
5
5
5
10 <-- the capacity has changed
10
... skipped ...
10
10
Failed test 3
20
We can see that when k is equal to 2 (third line), the capacity of the vector doubles (fourth line) because we are adding new elements. The storage is reallocated, and the vector's elements are most likely located somewhere else now. You can also check this by printing the vector's base address with the data() member function instead of the capacity.
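A minimal sketch of that print (it replaces the capacity line inside the loop; data() is available since C++11):
cout << "Address: " << static_cast<const void*>(v2.data()) << " k: " << k << endl;
With that line, the output looks like: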
Address: 0x136dc20 k: 0
Address: 0x136dc20 k: 1
Address: 0x136dc20 k: 2
Address: 0x136e050 k: 3 <-- the address has changed
Address: 0x136e050 k: 4
... skipped ...
Address: 0x136e050 k: 19
Address: 0x136e050 k: 20
Failed test 3
20
The code is poorly written; you can make it more robust by using indices instead of iterators, as in the sketch below.
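A minimal sketch of the index-based version (an index stays valid across reallocations as long as size() is re-read on every iteration):
for (size_t k = 0; k < v2.size(); ++k) // size() is re-evaluated after each push_back
{
    if (v2[k] != MAGIC)
        allValid = false;
    if (k == 2)
    {
        for (int i = 0; i < 5; i++)
            v2.push_back(MAGIC); // safe: no iterator is held across the reallocation
    }
}
With this version the loop simply grows to cover all ten elements, k ends at 10, and the test passes.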

System get stuck on running matrix multiplication using CUDA

When I run this code on my system, after a few seconds the system gets stuck and I have to restart it. So my question is: what am I doing wrong here? Any suggestion will be appreciated.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matMul(float* d_M, float* d_N, float* d_P, int width)
{
    int row = blockIdx.y * width + threadIdx.y;
    int col = blockIdx.x * width + threadIdx.x;
    if (row < width && col < width)
    {
        float product_val = 0;
        for (int k = 0; k < width; k++)
        {
            product_val += d_M[row * width + k] * d_N[k * width + col];
        }
        d_P[row * width + col] = product_val;
    }
}

int main()
{
    const int n = 9;
    const int SIZE = n; // assumption: SIZE (not shown in the post) is the element count
    float* d_M;
    float* d_N;
    float* d_P;
    cudaMallocManaged(&d_M, SIZE * sizeof(float));
    cudaMallocManaged(&d_N, SIZE * sizeof(float));
    cudaMallocManaged(&d_P, SIZE * sizeof(float));
    for (int i = 0; i < n; ++i)
    {
        d_P[i] = 0;
    }
    int count = 0;
    for (int i = 0; i < n; ++i)
    {
        d_N[i] = ++count;
    }
    count = 0;
    for (int i = 0; i < n; ++i)
    {
        d_M[i] = ++count;
    }
    matMul<<<1, n>>>(d_M, d_N, d_P, 3);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i)
    {
        printf("%f\n", d_P[i]);
    }
    cudaFree(d_N);
    cudaFree(d_M);
    cudaFree(d_P);
    return 0;
}
Assuming that by "my system gets stuck" you mean you get some kind of error in your program, it's likely that you're accessing invalid memory.
This could happen at the higher indexes of your d_M and d_N accesses, when row*width + k indexes beyond the memory you allocated with cudaMallocManaged.
It's always good practice in situations like these to add some error handling using calls such as cudaPeekAtLastError().
This link might be helpful for implementing some debugging.
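A minimal sketch of such error checking (the macro name is illustrative; the CUDA calls themselves are standard runtime API):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA API call; report and abort on failure.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",         \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage around the kernel launch:
//   matMul<<<1, n>>>(d_M, d_N, d_P, 3);
//   CUDA_CHECK(cudaPeekAtLastError());    // catches launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution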

cuda addvectors memory intuitive explanation

I have the following code:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>
#include <ctime>
#include <vector>
#include <numeric>

float random_float(void)
{
    return static_cast<float>(rand()) / RAND_MAX;
}

// CPU reference: result = alpha*v1 + v2
std::vector<float> add(float alpha, std::vector<float>& v1, std::vector<float>& v2)
{
    /* Do quick size check on vectors before proceeding */
    std::vector<float> result(v1.size());
    for (unsigned int i = 0; i < result.size(); ++i)
    {
        result[i] = alpha * v1[i] + v2[i];
    }
    return result;
}

// GPU kernel: grid-stride loop computing y = alpha*x + y
__global__ void Addloop(int N, float alpha, float* x, float* y)
{
    int i;
    int i0 = blockIdx.x * blockDim.x + threadIdx.x;
    for (i = i0; i < N; i += blockDim.x * gridDim.x)
        y[i] = alpha * x[i] + y[i];
    /*
    if (i0 < N)
        y[i0] = alpha * x[i0] + y[i0];
    */
}

int main(int argc, char** argv)
{
    float alpha = 0.3;
    // create array of 256k elements
    int num_elements = 10; // 1<<18;
    // generate random input on the host
    std::vector<float> h1_input(num_elements);
    std::vector<float> h2_input(num_elements);
    for (int i = 0; i < num_elements; ++i)
    {
        h1_input[i] = random_float();
        h2_input[i] = random_float();
    }
    for (std::vector<float>::iterator it = h1_input.begin(); it != h1_input.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    for (std::vector<float>::iterator it = h2_input.begin(); it != h2_input.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    std::vector<float> host_result; // (std::vector<float> h1_input, std::vector<float> h2_input);
    host_result = add(alpha, h1_input, h2_input);
    for (std::vector<float>::iterator it = host_result.begin(); it != host_result.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    // move input to device memory
    float* d1_input = 0;
    cudaMalloc((void**)&d1_input, sizeof(float) * num_elements);
    cudaMemcpy(d1_input, &h1_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice);
    float* d2_input = 0;
    cudaMalloc((void**)&d2_input, sizeof(float) * num_elements);
    cudaMemcpy(d2_input, &h2_input[0], sizeof(float) * num_elements, cudaMemcpyHostToDevice);
    Addloop<<<1, 3>>>(num_elements, alpha, d1_input, d2_input);
    // copy the result back to the host
    std::vector<float> device_result(num_elements);
    cudaMemcpy(&device_result[0], d2_input, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);
    for (std::vector<float>::iterator it = device_result.begin(); it != device_result.end(); ++it)
        std::cout << ' ' << *it;
    std::cout << '\n';
    cudaFree(d1_input);
    cudaFree(d2_input);
    h1_input.clear();
    h2_input.clear();
    device_result.clear();
    std::cout << "DONE! \n";
    getchar();
    return 0;
}
I am trying to understand the GPU memory access. For simplicity, the kernel is launched as Addloop<<<1,3>>>. I am trying to understand how this code works by imagining the for loops running on the GPU as instances. More specifically, I imagine the following instances, but they do not help.
Instance 1:
for (i = 0; i < N; i += 3*1) // (i += 0*1 --> i += 3*1 after Eric's comment)
    y[i] = alpha*x[i] + y[i];
Instance 2:
for (i = 1; i < N; i += 3*1)
    y[i] = alpha*x[i] + y[i];
Instance 3:
for (i = 2; i < N; i += 3*1)
    y[i] = alpha*x[i] + y[i];
Looking inside each loop, the logic of adding two vectors doesn't make sense to me. Can someone help?
The reason I am adopting this logic of instances is that it works well for the commented-out code inside the kernel.
If these thoughts are correct, what would the instances be if we had multiple blocks inside the grid? In other words, what would the i values and the update rates (+= update rate) be in some examples?
PS: The kernel code is borrowed from here.
UPDATE:
After Eric's answer, I think the execution for N = 15, i.e. the number of elements, goes like this (correct me if I am wrong):
For instance 1 above, i = 0, 3, 6, 9, 12, which computes the corresponding y[i] values.
For instance 2 above, i = 1, 4, 7, 10, 13, which computes the corresponding remaining y[i] values.
For instance 3 above, i = 2, 5, 8, 11, 14, which computes the rest of the y[i] values.
Your blockDim.x is 3 and gridDim.x is 1 according to your setup <<<1,3>>>, so in each thread (what you call an instance) the update should be i += 3*1.
UPDATE:
With the for loop you can compute 15 elements using only 3 threads. In general, you can use a limited number of threads to do an "infinite" amount of work, and more work per thread can improve performance by reducing the launch overhead and hiding instruction stalls.
Another advantage is that you can use a fixed number of threads/blocks to handle work of various sizes, which requires less tuning. The sketch below shows how the indices unfold when there are multiple blocks.
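To make the multiple-block case concrete, here is a hypothetical launch Addloop<<<2,3>>> with N = 15; every thread starts at i0 = blockIdx.x*blockDim.x + threadIdx.x and strides by blockDim.x*gridDim.x = 6:
for (i = blockIdx.x*blockDim.x + threadIdx.x; i < 15; i += 3*2)
    y[i] = alpha*x[i] + y[i];
// block 0, thread 0 visits i = 0, 6, 12
// block 0, thread 1 visits i = 1, 7, 13
// block 0, thread 2 visits i = 2, 8, 14
// block 1, thread 0 visits i = 3, 9
// block 1, thread 1 visits i = 4, 10
// block 1, thread 2 visits i = 5, 11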

cuda reach device function from global

I am trying to call a device function from a global function. This function only initializes an array to be used by all threads. But my problem is that when I printed the array, its elements were not in the order I expected. Is it because all the threads are creating the array again? I am confused about threads. If so, can I find out which thread runs first in the global function, and allow only that thread to initialize the array for the others? Thanks.
Here is my function that creates the array:
__device__ float myArray[20][20];

// filterWidth, filterHeight, sum and filterFactor are declared elsewhere
__device__ void calculation(int no)
{
    filterWidth  = 3 + (2 * no);
    filterHeight = 3 + (2 * no);
    int arraySize = filterWidth;
    int middle = (arraySize - 1) / 2;
    int startIndex = middle;
    int stopIndex = middle;
    // at first, all values of the array are 0
    for (int i = 0; i < arraySize; i++)
        for (int j = 0; j < arraySize; j++)
        {
            myArray[i][j] = 0;
        }
    // until the middle line of the array, the required indexes are 1
    for (int i = 0; i < middle; i++)
    {
        for (int j = startIndex; j <= stopIndex; j++)
        {
            myArray[i][j] = 1;
            sum += 1;
        }
        startIndex -= 1;
        stopIndex += 1;
    }
    // for the middle line
    for (int i = 0; i < arraySize; i++)
    {
        myArray[middle][i] = 1;
        sum += 1;
    }
    // after the middle line of the array, the required indexes are 1
    startIndex += 1;
    stopIndex -= 1;
    for (int i = (middle + 1); i < arraySize; i++)
    {
        for (int j = startIndex; j <= stopIndex; j++)
        {
            myArray[i][j] = 1;
            sum += 1;
        }
        startIndex += 1;
        stopIndex -= 1;
    }
    filterFactor = 1.0f / sum;
}
And the global function:
__global__ void FilterKernel(Format24bppRgb* imageData)
{
    int tidX = threadIdx.x + blockIdx.x * blockDim.x;
    int tidY = threadIdx.y + blockIdx.y * blockDim.y;
    Colour Cpixel = Colour(imageData[tidX + tidY * imageWidth]);
    float depthPixel = Colour(depthData[tidX + tidY * imageWidth]).Red;
    float absoluteDistanceFromFocus = fabs(depthPixel - focusDepth);
    if (depthPixel == 0)
        return;
    Colour Cresult = Cpixel;
    for (int i = 0; i < 8; i++)
    {
        calculation(i);
        ...
        ...
    }
}
If you really want to select one thread to call the function and force the rest to wait for it, use __shared__ memory for the array created by the device function, so that all threads in a block see the same one, and call it like this:
for (int i = 0; i < 8; i++)
{
    if (threadIdx.x == 0 && threadIdx.y == 0)
        calculation(i);
    __syncthreads(); // the other threads wait here until the array is ready
    ...
}
Of course, this won't work between blocks: in a __global__ function you have no control over the order in which blocks are executed.
Instead, if you can, you should do the initialization calculation (which only one thread needs to do) on the CPU and memcpy it to the GPU before launching your kernel. It looks like you'll use 8x the memory for your myArrays, but it will dramatically speed up your computation.
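A minimal sketch of that host-side precomputation, assuming the eight masks fit in a fixed 8x20x20 array (d_masks and buildMasks are illustrative names, not from the original code; the diamond test reproduces the 1-pattern that the nested loops in calculation() produce):
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

__device__ float d_masks[8][20][20]; // one mask per filter size, visible to all kernels

void buildMasks(float h_masks[8][20][20])
{
    std::memset(h_masks, 0, sizeof(float) * 8 * 20 * 20);
    for (int no = 0; no < 8; ++no)
    {
        int arraySize = 3 + 2 * no;
        int middle = (arraySize - 1) / 2;
        for (int i = 0; i < arraySize; ++i)
            for (int j = 0; j < arraySize; ++j)
                // same diamond shape the loops in calculation() draw
                if (std::abs(i - middle) + std::abs(j - middle) <= middle)
                    h_masks[no][i][j] = 1.0f;
    }
}

// Before launching FilterKernel:
//   float h_masks[8][20][20];
//   buildMasks(h_masks);
//   cudaMemcpyToSymbol(d_masks, h_masks, sizeof(h_masks));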

Complicated for loop to be ported to a CUDA kernel

I have the following nested for loop that I would like to port to CUDA to run on a GPU:
int current = 0;
int ptr = 0;
for (int i = 0; i < Nbeams; i++) {
    for (int j = 0; j < NbeamletsPerbeam[i]; j++) {
        current = j + ptr;
        for (int k = 0; k < Nmax; k++) {
            ......
        }
    }
    ptr += NbeamletsPerbeam[i]; // advance past this beam's beamlets
}
I would be very happy if anybody has an idea of how to do it, or how it can be done.
We are talking about Nbeams = 5, with NbeamletsPerBeam around 200 each.
This is what I currently have, but I am not sure it is right...
for (int i = blockIdx.x; i < d_params->Nbeams; i += gridDim.x) {
    for (int j = threadIdx.y; j < d_beamletsPerBeam[i]; j += blockDim.y) {
        currentBeamlet = j + k;
        for (int ivoxel = threadIdx.x; ivoxel < totalVoxels; ivoxel += blockDim.x) {
I would suggest the following idea, though you may need to make some minor modifications for your code.
dim3 blocks(NoOfThreads, 1); // threads per block
dim3 grid(Nbeams, 1);        // one block per beam
kernel<<<grid, blocks>>>();
__global__ void kernel()
{
    // Offset of this beam's first beamlet: the sum of the beamlet
    // counts of all previous beams (one block per beam).
    int ptr = 0;
    for (int b = 0; b < blockIdx.x; b++)
        ptr += NbeamletsPerbeam[b];
    // Number of passes this block needs to cover all of its beamlets.
    int noOfPasses = (NbeamletsPerbeam[blockIdx.x] + blockDim.x - 1) / blockDim.x;
    for (int j = 0; j < noOfPasses; j++) {
        // use threads and compute....
        int beamlet = j * blockDim.x + threadIdx.x;
        if (beamlet < NbeamletsPerbeam[blockIdx.x]) {
            int current = beamlet + ptr;
            for (int k = 0; k < Nmax; k++) {
                ......
            }
        }
    }
}
This should do the trick and give you better parallelization.
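As a usage sketch, note that the kernel above reads NbeamletsPerbeam and Nmax as if they were visible on the device; in a real port you would pass them in, e.g. as a device array plus a scalar (d_counts is an illustrative name):
int* d_counts;
cudaMalloc((void**)&d_counts, Nbeams * sizeof(int));
cudaMemcpy(d_counts, NbeamletsPerbeam, Nbeams * sizeof(int), cudaMemcpyHostToDevice);
kernel<<<grid, blocks>>>(/* e.g. d_counts, Nmax, ... */);
Mapping one block per beam keeps the bookkeeping simple; with Nbeams = 5 and roughly 200 beamlets per beam this is a small launch, so spreading the inner Nmax loop across threadIdx.x as well (as in your attempt) could expose more parallelism.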