GSL Fast-Fourier Transform - Non-zero Imaginary for Transformed Gaussian?

As an extension to a question I asked previously: the Fourier transform of a real Gaussian is a real Gaussian. Now of course the DFT of a set of points that only resembles a Gaussian will not always be a perfect Gaussian, but it should certainly be close. In the code below I'm taking this discrete Fourier transform using GSL. Aside from the issue with the returned/transformed real components (outlined in the linked question), I'm getting a weird result for the imaginary component (which should be identically zero). Granted, it's very small in magnitude, but it's still weird. What is the cause of this asymmetric and funky output?
#include <gsl/gsl_fft_complex.h>
#include <gsl/gsl_errno.h>
#include <cmath>
#include <fstream>
#include <iostream>
#include <iomanip>
#define REAL(z,i) ((z)[2*(i)]) //complex arrays stored as [Re(z0),Im(z0),Re(z1),Im(z1),...]
#define IMAG(z,i) ((z)[2*(i)+1])
#define MODU(z,i) (((z)[2*(i)])*((z)[2*(i)])+((z)[2*(i)+1])*((z)[2*(i)+1])) //|z_i|^2
#define PI 3.14159265359
using namespace std;
int main(){
    const int n = 512; // 2^9 samples
    double data[2*n];
    double N = (double) n;
    ofstream file_out("out.txt");
    double xmin = -10.;
    double xmax = 10.;
    double dx = (xmax - xmin)/N;
    double x = xmin;
    for (int i = 0; i < n; ++i){
        REAL(data,i) = exp(-100.*x*x);
        IMAG(data,i) = 0.;
        x += dx;
    }
    gsl_fft_complex_radix2_forward(data, 1, n);
    for (int i = 0; i < n; ++i){
        // fftshift by hand: write frequencies in the order -n/2 ... n/2-1
        file_out << (i - n/2) << " " << IMAG(data,((i+n/2)%n)) << '\n';
    }
    file_out.close();
}

Your result for the imaginary part is correct and expected.
The deviation from zero (~10^-15) is smaller than the precision to which you specify pi (12 digits; pi is used in the FFT, but I can't tell whether you are overriding the pi inside the routine).
The FFT of a real function is not in general a real function. When you do the math analytically, you integrate over the following expression:
f(t) e^{iwt} = f(t) cos(wt) + i f(t) sin(wt),
so only if the function f(t) is real and even will the imaginary part (which is otherwise odd) vanish during integration. (For the DFT the analogous condition is evenness about index 0, i.e. data[k] == data[N-k] with indices taken mod N.) This has little meaning though, since the real part and imaginary part have physical meaning only in special cases.
Direct physical meaning lies in the absolute value (magnitude spectrum), the absolute value squared (intensity spectrum) and the phase or angle (phase spectrum).
A more significant offset from zero in the imaginary part would occur if the Gaussian were not centered in your time window. Try shifting the x vector by some fraction of dx.
See below how shifting the input by dx/2 (right column) affects the imaginary part but not the magnitude (example written in Python with NumPy).
from __future__ import division
import numpy as np
import matplotlib.pyplot as p
# %matplotlib inline  (Jupyter magic: uncomment in a notebook, invalid in a plain .py script)
n=512 # number of samples 2**9
x0,x1=-10,10
dx=(x1-x0)/n
x = np.arange(x0, x1, dx) # even number of samples, asymmetric range [-10, 10-dx]
#make signal
s1= np.exp(-100*x**2)
s2= np.exp(-100*(x+dx/2 )**2)
#make ffts
f1=np.fft.fftshift(np.fft.fft(s1))
f2=np.fft.fftshift(np.fft.fft(s2))
#plots
p.figure(figsize=(16,12))
p.subplot(421)
p.title('gaussian (just ctr shown)')
p.plot(s1[250:262])
p.subplot(422)
p.title('same, shifted by dx/2')
p.plot(s2[250:262])
p.subplot(423)
p.plot(np.imag(f1))
p.title('imaginary part of FFT')
p.subplot(424)
p.plot(np.imag(f2))
p.subplot(425)
p.plot(np.real(f1))
p.title('real part of FFT')
p.subplot(426)
p.plot(np.real(f2))
p.subplot(427)
p.plot(np.abs(f1))
p.title('abs. value of FFT')
p.subplot(428)
p.plot(np.abs(f2))
p.show()

Related

step doubling Runge Kutta implementation stuck shrinking stepsize to machine precision

I need to integrate a system of ODEs using an adaptive RK4 method with step-size control via the step-doubling technique.
The problem is that the program keeps shrinking the step size down to machine precision forever while not advancing time.
The idea is to advance the solution once by a single step and also by two successive half steps, compare the results by taking their difference, and store it in eps. So eps is a measure of the error. Then I want to determine the next step size according to whether eps is greater than a specified accuracy eps0 (as described in the book "Numerical Recipes").
RK4Step(double t, double* Y, double *Yout, void (*RHSFunc)(double, double *, double *), double h) steps the solution vector Y by h and puts the result into Yout using the function RHSFunc.
#define NEQ 4 //problem dimension

int main(int argc, char* argv[])
{
    ofstream frames("./frames.dat");
    ofstream graphs("./graphs.dat");

    double Y[4] = {2.0, 2.0, 1.0, 0.0}; //initial conditions for solution vector
    double finaltime = 100; //end of integration
    double eps0 = 10e-5; //error to compare with eps
    double t = 0.0;
    double step = 0.01;

    while(t < finaltime)
    {
        double eps = 0.0;
        double Y1[4], Y2[4]; //Y1 will store half step solution
                             //Y2 will store double step solution
        double dt = step; //cache current stepsize
        for(;;)
        {
            //make a step starting from state stored in Y and
            //put solution into Y1. Then from Y1 make another half step
            //and store into Y1.
            RK4Step(t, Y, Y1, RHS, step); //two half steps
            RK4Step(t+step, Y1, Y1, RHS, step);
            RK4Step(t, Y, Y2, RHS, 2*step); //one long step

            //compute eps as maximum of differences between Y1 and Y2
            //(an alternative would be quadrature sums)
            for(int i=0; i<NEQ; i++)
                eps = max(eps, fabs( (Y1[i]-Y2[i])/15.0 ) );

            //if error is within tolerance we grow stepsize
            //and advance time
            if(eps < eps0)
            {
                //stepsize is accepted, grow stepsize,
                //save solution from Y1 into Y,
                //advance time by the previous (cached) stepsize
                Y[0] = Y1[0]; Y[1] = Y1[1];
                Y[2] = Y1[2]; Y[3] = Y1[3];
                step = 0.9*step*pow(eps0/eps, 0.20); //(0.9 is the safety factor)
                t += dt;
                break;
            }
            //if the error is too big we shrink stepsize
            step = 0.9*step*pow(eps0/eps, 0.25);
        }
    }

    frames.close();
    graphs.close();
    return 0;
}
You never reset eps in the inner loop. This could well be the direct cause of your problem: while the actual error decreases with ever smaller step sizes, the maximum stored in eps stays constant and above eps0. The result is a constant shrinking factor in the step-size update, with no chance to break out of the loop.
Another "wrong thing" is that the error estimate and tolerance are incompatible. The error tolerance eps0 is an error density or unit-step error. To bring your error estimate eps into that format you need to divide eps by step. Or put another way, currently you are forcing the actual step error to be close to 0.5*eps0, so that the global error is 0.5*eps0 times the number of steps taken, with the number of steps loosely proportional to eps0^0.2. In the version using the unit-step error, the local error is forced to be "dynamically" close to 0.5*eps0*step, so that the global error is about 5*eps0 times the length of the integration interval. I'd say that the second variant is more in line with intuition about the expected behavior.
This is not a critical error, but may lead to sub-optimal step sizes and an actual global error that deviates non-trivially from the desired error tolerance.
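Putting both fixes together, here is a minimal sketch of the corrected inner loop, as a drop-in replacement for the for(;;) block in your main (assuming the same RK4Step signature; the floor guarding the division by eps, and advancing t by 2*step to match the two half steps actually taken, are my additions):
for(;;)
{
    double eps = 0.0;                      // FIX 1: reset the estimate on every retry
    RK4Step(t, Y, Y1, RHS, step);          // two half steps
    RK4Step(t+step, Y1, Y1, RHS, step);
    RK4Step(t, Y, Y2, RHS, 2*step);        // one long step

    for(int i = 0; i < NEQ; i++)
        eps = max(eps, fabs((Y1[i] - Y2[i])/15.0));
    eps = max(eps/step, 1e-30);            // FIX 2: convert to unit-step error; floor avoids division by zero

    if(eps < eps0)
    {
        for(int i = 0; i < NEQ; i++) Y[i] = Y1[i];
        t += 2*step;                       // Y1 holds the state after TWO steps of size step
        step = 0.9*step*pow(eps0/eps, 0.20);
        break;
    }
    step = 0.9*step*pow(eps0/eps, 0.25);
}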
You also have a coding inconsistency: in the propagation of the state and in the declaration of the state vectors you hard-code 4 components, while the error computation loops over a variable number NEQ of equations and components. Since you are using C++, you could use a state-vector class that handles all dimension-dependent loops internally. (If taken too far, frequent allocation of short-lived instances could become an efficiency issue.)
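For instance, a minimal sketch using std::array (my illustration, not code from the question): the four hard-coded copies collapse into one assignment, with no heap allocation involved:
#include <array>

constexpr int NEQ = 4;
using State = std::array<double, NEQ>;

// accepting a step then needs no per-component code:
void accept_step(State& Y, const State& Y1)
{
    Y = Y1;   // copies all NEQ components at once
}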

Thrust: Stream compaction copying only first N valid elements

I have a const thrust vector of elements from which I would like to extract at most N elements that pass a predicate (in any order), where the thrust vector size and N are known at compile-time. In my specific case, my vector is 500k elements and N is 100k.
My initial thought was to use thrust::copy_if to get all elements that pass the predicate, then to use only the first N elements for my subsequent calculations. However, in that case I would have to allocate two vectors of 500k elements (one for the initial vector, and one for the output of copy_if) and I'd have to process every element.
As this is an operation I have to do many times and across several CUDA streams, I would like to know if there is a way to obtain the N output elements while minimizing the memory footprint required, and ideally, minimizing the number of elements that need to be processed (i.e. breaking the process once N valid elements have been found).
One possible method to perform a stream compaction operation is to perform a predicated prefix-sum followed by a conditional indexed copy. By breaking a "monolithic" operation into these 2 pieces, it becomes fairly easy to insert the desired limiting behavior on output size.
The prefix sum is a fairly involved operation. We will use thrust for that. The conditional indexed copy is fairly trivial, so we will write our own CUDA kernel for that, rather than try to wrestle with a thrust::copy_if operation to get the copy logic just right. This kernel is where we will insert the limiting behavior on the output size.
Here is a worked example:
$ cat t34.cu
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef int mt;
__global__ void my_copy(mt *d, int *i, mt *r, int limit, int size){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < size){
    if ((idx == 0) && (*i == 1) && (limit > 0))
      *r = *d;
    else if ((idx > 0) && (i[idx] > i[idx-1]) && (i[idx] <= limit))
      r[i[idx]-1] = d[idx];
  }
}

int main(){
  int rs = 3;
  mt d[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5};
  int ds = sizeof(d)/sizeof(d[0]);
  thrust::device_vector<mt> data(d, d+ds);
  thrust::device_vector<int> idx(ds);
  thrust::device_vector<mt> result(rs);
  auto my_cmp = thrust::make_transform_iterator(data.begin(), 0+(_1>0));
  thrust::inclusive_scan(my_cmp, my_cmp+ds, idx.begin());
  my_copy<<<(ds+255)/256, 256>>>(thrust::raw_pointer_cast(data.data()), thrust::raw_pointer_cast(idx.data()), thrust::raw_pointer_cast(result.data()), rs, ds);
  thrust::host_vector<mt> h_result = result;
  thrust::copy_n(h_result.begin(), rs, std::ostream_iterator<mt>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -std=c++14 -o t34 t34.cu -arch=sm_52
$ ./t34
1,2,3,
$
(CUDA 11.0, Fedora 29, GTX 960)
Note that this code is provided for demonstration purposes. You should not assume that it is defect-free or suitable for any particular purpose. Use it at your own risk.
A bit of study with a profiler will show that the thrust::inclusive_scan operation does perform a cudaMalloc and cudaFree operation "under the hood". So even though we have pulled most of the allocations "out into the open" here, thrust apparently still needs to perform a single temporary allocation (of unknown size) to support the scan operation.
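If that hidden allocation matters (for example when this runs in a loop or across several streams), Thrust's documented custom-temporary-allocation mechanism lets you route the scratch space through your own pool via an execution policy. A minimal sketch, assuming a preallocated device buffer pool of capacity bytes (both hypothetical names, and no fallback path):
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cstddef>
#include <new>

// the allocator shape thrust expects for temporaries:
// value_type char, allocate(n) / deallocate(p, n)
struct pool_allocator
{
    typedef char value_type;
    char *pool;                // preallocated device buffer (hypothetical)
    std::ptrdiff_t capacity;   // its size in bytes

    char *allocate(std::ptrdiff_t n)
    {
        if (n > capacity) throw std::bad_alloc();   // sketch: no fallback
        return pool;
    }
    void deallocate(char *, std::size_t) {}         // pool is reused; nothing to free
};

// usage (sketch):
//   pool_allocator alloc{raw_device_ptr, num_bytes};
//   thrust::inclusive_scan(thrust::cuda::par(alloc), my_cmp, my_cmp+ds, idx.begin());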
Responding to a question from the comments: to understand 0+(_1>0), there are two things to note:
The expression uses thrust::placeholders. This capability of thrust allows us to write simple unary or binary functions inline, avoiding the need to use lambdas or write separate functors.
The reason for the 0+ is as follows. If we simply used (_1>0), then thrust would use as its unary function a boolean test of the item returned by dereferencing the iterator, compared to zero. The result of that comparison is a boolean, and if we leave it that way, the prefix sum will ultimately be computed using boolean arithmetic, which we do not want. We want the result of the boolean greater-than test (i.e. true/false) to be converted to an integer, so that the subsequent prefix sum gets performed using integer arithmetic. Prepending the (_1>0) boolean test with 0+ accomplishes that.
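For comparison, the same thing written as an explicit functor instead of placeholders (a sketch; is_positive is my name, and mt is the typedef from the example above). Returning int is what keeps the scan in integer arithmetic:
// functor equivalent of 0+(_1>0): map each element to 1 if positive, else 0
struct is_positive
{
    __host__ __device__
    int operator()(mt x) const { return (x > 0) ? 1 : 0; }
};

// drop-in replacement for the transform iterator in the example above:
// auto my_cmp = thrust::make_transform_iterator(data.begin(), is_positive());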

Dynamic Shared Memory in CUDA

There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays into shared memory. Now, the size of the arrays is given by a variable that is read into the program after the start of execution. Because of this, I cannot use that variable to define the size of the arrays, due to the fact that defining the size of shared arrays requires knowing the value at compile time. I do not want to do something like __shared__ double arr1[1000] because typing in the size by hand is useless to me as that will change depending on the input. In the same vein, I cannot use #define to create a constant for the size.
Now I can follow an example similar to what is in the manual (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared) such as
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
But this still runs into an issue. From what I've read, every array declared this way starts at the same base address. That means I need to shift my second array over by the size of the first array, as they appear to do in this example. But the size of the first array depends on user input.
Another question (Cuda Shared Memory array variable) has a similar issue, and they were told to create a single array that would act as the array for both arrays and simply adjust the indices to properly match the arrays. While this does seem to do what I want, it looks very messy. Is there any way around this so that I can still maintain two independent arrays, each with sizes that are defined as input by the user?
When using dynamic shared memory with CUDA, there is one and only one pointer passed to the kernel, which defines the start of the requested/allocated area in bytes:
extern __shared__ char array[];
There is no way to handle it differently. However this does not prevent you from having two user-sized arrays. Here's a worked example:
$ cat t501.cu
#include <stdio.h>
__global__ void my_kernel(unsigned arr1_sz, unsigned arr2_sz){
  extern __shared__ char array[];
  double *my_ddata = (double *)array;                // doubles start at the base
  char *my_cdata = arr1_sz*sizeof(double) + array;   // chars start right after them
  for (int i = 0; i < arr1_sz; i++) my_ddata[i] = (double) i*1.1f;
  for (int i = 0; i < arr2_sz; i++) my_cdata[i] = (char) i;
  printf("at offset %d, arr1: %lf, arr2: %d\n", 10, my_ddata[10], (int)my_cdata[10]);
}

int main(){
  unsigned double_array_size = 256;
  unsigned char_array_size = 128;
  unsigned shared_mem_size = (double_array_size*sizeof(double)) + (char_array_size*sizeof(char));
  my_kernel<<<1,1, shared_mem_size>>>(256, 128);
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -arch=sm_20 -o t501 t501.cu
$ cuda-memcheck ./t501
========= CUDA-MEMCHECK
at offset 10, arr1: 11.000000, arr2: 10
========= ERROR SUMMARY: 0 errors
$
If you have a random arrangement of arrays of mixed data types, you'll want to either manually align your array starting points (and request enough shared memory) or else use alignment directives (and be sure to request enough shared memory), or use structures to help with alignment.
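As an illustration of the manual-alignment option (my sketch, not part of the worked example above): if the char array came first, the double pointer would have to be rounded up to an 8-byte boundary:
__global__ void my_kernel2(unsigned char_sz, unsigned dbl_sz){
  extern __shared__ char array[];
  char *my_cdata = array;                          // chars first this time
  size_t offset = (char_sz + 7) & ~(size_t)7;      // round up to an 8-byte boundary
  double *my_ddata = (double *)(array + offset);   // now correctly aligned for double
  // the host must request offset + dbl_sz*sizeof(double) bytes of shared memory
}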

Parallel Anti diagonal 'for' loop?

I have an N x N square matrix of integers (which is stored in the device as a 1-d array for convenience).
I'm implementing an algorithm which requires the following to be performed:
There are 2N-1 anti-diagonals in this square (anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge).
I need a for loop with 2N-1 iterations, with each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.
In each iteration, all the elements in that anti-diagonal must run parallelly.
Each anti-diagonal is calculated based on the values of the previous anti-diagonal.
So, how do I index the threads with this requirement in CUDA?
As far as I understand, you want something like the scheme in Parallelizing the Smith-Waterman Local Alignment Algorithm using CUDA:
at each iteration, the kernel is launched with a different number of threads.
Perhaps the code in Parallel Anti diagonal 'for' loop could be modified as follows:
int iDivUp(const int a, const int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }
#define BLOCKSIZE 32
__global__ void antiparallel(float* d_A, int step, int N) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    int j = step - i;
    if (i < N && j >= 0 && j < N) {
        /* do work on d_A[i*N+j] */
    }
}
for (int step = 0; step < 2*N-1; step++) {
    dim3 dimBlock(BLOCKSIZE);
    dim3 dimGrid(iDivUp(step + 1, dimBlock.x)); // diagonal step has at most step+1 elements; step+1 avoids a zero-block launch at step 0
    antiparallel<<<dimGrid, dimBlock>>>(d_A, step, N);
}
This code is untested and is just a sketch of a possible solution (provided that I have not misunderstood your question). Furthermore, I do not know how efficient such a solution would be, since the kernels for the shortest anti-diagonals will be launched with very few threads.
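On that last point, the exact number of elements on each anti-diagonal is easy to compute, so every launch can be sized to just the live threads (a sketch reusing iDivUp and BLOCKSIZE from above):
#include <algorithm>

// number of elements on anti-diagonal step of an N x N matrix
// (0 <= step <= 2N-2): grows 1,2,...,N, then shrinks back to 1
int diagLength(int step, int N) {
    return std::min(step, N - 1) - std::max(0, step - N + 1) + 1;
}

// sized launch:
// antiparallel<<<iDivUp(diagLength(step, N), BLOCKSIZE), BLOCKSIZE>>>(d_A, step, N);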

CUDA: Getting max value and its index in an array

I have several blocks, where each block executes on a separate part of an integer array. As an example: block one works on array[0] to array[9] and block two on array[10] to array[19].
What is the best way I can get the index of the max value of the array for each block?
Example: block one, a[0] to a[9], has the following values:
5 10 2 3 4 34 56 3 9 10
So 56 is the largest value, at index 6.
I cannot use shared memory because the size of the array may be very big and therefore it won't fit. Are there any libraries that allow me to do this fast?
I know about the reduction algorithm, but I think my case is different because I want to get the index of the largest element.
If I understood correctly, what you want is: given an array A, get the index of the max value inside it.
If that is true, then I would suggest using the Thrust library.
Here is how you would do it:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>
using namespace thrust;

// return the bigger of two (value, index) tuples; comparison is
// lexicographic, so values are compared first and indices break ties
template <class T>
struct bigger_tuple {
    __device__ __host__
    tuple<T,int> operator()(const tuple<T,int> &a, const tuple<T,int> &b)
    {
        if (a > b) return a;
        else return b;
    }
};

template <class T>
int max_index(device_vector<T>& vec) {
    // create implicit index sequence [0, 1, 2, ... )
    counting_iterator<int> begin(0);
    counting_iterator<int> end((int)vec.size());
    tuple<T,int> init(vec[0], 0);
    tuple<T,int> largest = reduce(make_zip_iterator(make_tuple(vec.begin(), begin)),
                                  make_zip_iterator(make_tuple(vec.end(), end)),
                                  init, bigger_tuple<T>());
    return get<1>(largest);
}

int main(){
    thrust::host_vector<int> h_vec(1024);
    thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices
    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    int index = max_index(d_vec);
    std::cout << "Max index is: " << index << std::endl;
    std::cout << "Value is: " << h_vec[index] << std::endl;
    return 0;
}
This will not benefit the original poster, but for those who came to this page looking for an answer, I would second the recommendation to use thrust, which already has a function thrust::max_element (in thrust/extrema.h) that does exactly that: it returns an iterator to the largest element, and the index follows by subtracting the begin iterator. min_element and minmax_element functions are also provided. See the thrust documentation for details.
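For reference, a minimal sketch of that approach; since thrust::max_element returns an iterator, the index is recovered by subtracting the begin iterator:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/sequence.h>
#include <iostream>

int main(){
    thrust::device_vector<int> d_vec(1024);
    thrust::sequence(d_vec.begin(), d_vec.end());   // values = indices
    thrust::device_vector<int>::iterator it =
        thrust::max_element(d_vec.begin(), d_vec.end());
    int index = it - d_vec.begin();                 // position of the maximum
    int value = *it;                                // the maximum itself
    std::cout << "Max index is: " << index << ", value is: " << value << std::endl;
    return 0;
}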
As well as the suggestion to use Thrust, you could also use the CUBLAS cublasIsamax function. Note that it compares absolute values and, following BLAS convention, returns a 1-based index.
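A minimal sketch of that call using the cuBLAS v2 API (d_x is assumed to be a device array of n floats, allocated and filled elsewhere):
#include <cublas_v2.h>

// returns the 0-based index of the element with the largest absolute value
int argmax_abs(const float *d_x, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    int idx1 = 0;
    cublasIsamax(handle, n, d_x, 1, &idx1);   // 1-based index of max |x|
    cublasDestroy(handle);
    return idx1 - 1;                          // convert to 0-based
}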
The size of your array in comparison to shared memory is almost irrelevant, since the number of threads in each block is the limiting factor rather than the size of the array. One solution is to have each thread block work on a portion of the array the same size as the thread block. That is, if you have 512 threads, then block n will look at array[n*512] through array[n*512 + 511]. Each block does a reduction to find the largest member in that portion of the array. Then you bring the max of each section back to the host and do a simple linear search to locate the largest value in the overall array. Each reduction on the GPU reduces the linear search by a factor of 512. Depending on the size of the array, you might want to do more reductions before you bring the data back. (If your array is 3*512^10 in size, you might want to do 10 reductions on the GPU and have the host search through the 3 remaining data points.)
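A sketch of the per-block max+index reduction described above (my illustration: one block per blockDim.x-sized slice, with each block's winner written out for the host to search):
#include <climits>

__global__ void block_max(const int *a, int n, int *blk_val, int *blk_idx)
{
    __shared__ int sval[512];                 // one slot per thread (blockDim.x <= 512)
    __shared__ int sidx[512];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sval[tid] = (i < n) ? a[i] : INT_MIN;     // pad the tail with -infinity
    sidx[tid] = i;
    __syncthreads();
    // tree reduction: keep the larger value (and its index) at each stage
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sval[tid + s] > sval[tid]) {
            sval[tid] = sval[tid + s];
            sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {                           // thread 0 publishes this block's winner
        blk_val[blockIdx.x] = sval[0];
        blk_idx[blockIdx.x] = sidx[0];
    }
}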
One thing to watch out for when doing a max-value-plus-index reduction is that if there is more than one element with the same maximum value in your array, i.e. in your example if there were two or more values equal to 56, then the returned index would not be unique and could differ from run to run, because the ordering of threads on the GPU is not deterministic.
To get around this kind of problem you can use a unique ordering index, such as threadid + threadsperblock * blockid, or the element's index location if that is unique. The max test is then along these lines:
if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far))
{
    max_so_far = a;
    index_max_so_far = index_a;
    order_max_so_far = order_a;
}
(index and order can be the same variable, depending on the application.)