Understanding heisenbug example: different precision in registers vs main memory - terminology

I read the wiki page about heisenbug, but don't understand this example. Can
anyone explain it in detail?
One common example
of a heisenbug is a bug that appears when the program is compiled with an
optimizing compiler, but not when the same program is compiled without
optimization (as is often done for the purpose of examining it with a debugger).
While debugging, values that an optimized program would normally keep in
registers are often pushed to main memory. This may affect, for instance, the
result of floating-point comparisons, since the value in memory may have smaller
range and accuracy than the value in the register.

Here's a concrete example recently posted:
Infinite loop heisenbug: it exits if I add a printout
It's a really nice specimen because we can all reproduce it: http://ideone.com/rjY5kQ
These bugs are so dependent on very precise features of the platform that people also find them very difficult to reproduce.
In this case, when the print-out is omitted, the program performs the comparison at the higher precision of the CPU registers (wider than a double stored in memory). But to print the value, the compiler moves the result to main memory, which implicitly truncates it to double precision; the comparison then uses the truncated value and succeeds.
#include <iostream>
#include <cmath>

double up = 19.0 + (61.0/125.0);
double down = -32.0 - (2.0/3.0);
double rectangle = (up - down) * 8.0;

double f(double x) {
    return (pow(x, 4.0)/500.0) - (pow(x, 2.0)/200.0) - 0.012;
}

double g(double x) {
    return -(pow(x, 3.0)/30.0) + (x/20.0) + (1.0/6.0);
}

double area_upper(double x, double step) {
    return (((up - f(x)) + (up - f(x + step))) * step) / 2.0;
}

double area_lower(double x, double step) {
    return (((g(x) - down) + (g(x + step) - down)) * step) / 2.0;
}

double area(double x, double step) {
    return area_upper(x, step) + area_lower(x, step);
}

int main() {
    double current = 0, last = 0, step = 1.0;
    do {
        last = current;
        step /= 10.0;
        current = 0;
        for (double x = 2.0; x < 10.0; x += step) current += area(x, step);
        current = rectangle - current;
        current = round(current * 1000.0) / 1000.0;
        //std::cout << current << std::endl; //<-- COMMENT BACK IN TO "FIX" BUG
    } while (current != last);
    std::cout << current << std::endl;
    return 0;
}
Edit: Verified that the bug and the fix still behave as described: 03-Feb-22, 20-Feb-17
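The mechanism can be shown in miniature. Below is a minimal sketch (my own, not from the post), assuming an x87-style build where intermediates are kept in 80-bit registers; on SSE2 builds the two values typically compare equal, which is exactly what makes such bugs platform-dependent:

#include <cstdio>

int main() {
    double a = 1.0 / 3.0;
    double sum = a + a + a;       // may be held in an 80-bit x87 register
    volatile double stored = sum; // forces a round-trip through 64-bit memory
    if (stored == sum)
        std::printf("no extra register precision observed\n");
    else
        std::printf("register value differs from its stored copy\n");
    return 0;
}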

It comes from the Uncertainty Principle, which basically states that there is a fundamental limit to the precision with which certain pairs of physical properties of a particle can be known simultaneously. If you observe a particle too closely (i.e., you know its position precisely), then you can't measure its momentum precisely; and if you know its speed precisely, then you can't tell its exact position.
Following this, a heisenbug is a bug which disappears when you are watching it closely.
In your example: if you need the program to perform well, you will compile it with optimization, and there will be a bug. But as soon as you enter debugging mode, you will compile it without optimization, which will remove the bug.
So if you start observing the bug too closely, you become unable to pin down its properties (or to find it at all). This resembles Heisenberg's Uncertainty Principle, hence the name heisenbug.

The idea is that code is compiled in two states: one is the normal or debug mode, and the other is the optimised or production mode.
Just as it is important to know what happens to matter at the quantum level, we should also know what happens to our code at the compiler level!

Related

step doubling Runge Kutta implementation stuck shrinking stepsize to machine precision

I need to integrate a system of ODEs using an adaptive RK4 method with stepsize control via step-doubling techniques.
The problem is that the program continues forever, shrinking the stepsize down to machine precision while not advancing time.
The idea is to step the solution once by a single step and also by two successive half steps, take the difference of the two results as a measure of the error, and store it in eps. I then want to determine the next stepsize according to whether eps is greater than a specified accuracy eps0 (as described in the book "Numerical Recipes").
RK4Step(double t, double* Y, double *Yout, void (*RHSFunc)(double, double *, double *),double h) steps the solution vector Y by h and puts the result into Yout using the function RHSFunc.
#define NEQ 4 // problem dimension

int main(int argc, char* argv[])
{
    ofstream frames("./frames.dat");
    ofstream graphs("./graphs.dat");

    double Y[4] = {2.0, 2.0, 1.0, 0.0}; // initial conditions for solution vector
    double finaltime = 100;             // end of integration
    double eps0 = 10e-5;                // error to compare with eps
    double t = 0.0;
    double step = 0.01;

    while (t < finaltime)
    {
        double eps = 0.0;
        double Y1[4], Y2[4]; // Y1 will store half step solution
                             // Y2 will store double step solution
        double dt = step;    // cache current stepsize
        for (;;)
        {
            // make a step starting from state stored in Y and
            // put solution into Y1. Then from Y1 make another half step
            // and store into Y1.
            RK4Step(t, Y, Y1, RHS, step);         // two half steps
            RK4Step(t + step, Y1, Y1, RHS, step);
            RK4Step(t, Y, Y2, RHS, 2 * step);     // one long step
            // compute eps as maximum of differences between Y1 and Y2
            // (an alternative would be quadrature sums)
            for (int i = 0; i < NEQ; i++)
                eps = max(eps, fabs((Y1[i] - Y2[i]) / 15.0));
            // if error is within tolerance we grow stepsize
            // and advance time
            if (eps < eps0)
            {
                // stepsize is accepted, grow stepsize,
                // save solution from Y1 into Y,
                // advance time by the previous (cached) stepsize
                Y[0] = Y1[0]; Y[1] = Y1[1];
                Y[2] = Y1[2]; Y[3] = Y1[3];
                step = 0.9 * step * pow(eps0 / eps, 0.20); // (0.9 is the safety factor)
                t += dt;
                break;
            }
            // if the error is too big we shrink stepsize
            step = 0.9 * step * pow(eps0 / eps, 0.25);
        }
    }
    frames.close();
    graphs.close();
    return 0;
}
You never reset eps in the inner loop. This could be the direct cause of your problem: while the actual error shrinks with ever-decreasing step sizes, the maximum held in eps stays constant and above eps0. The result is a constant reduction factor in the step size update, with no chance to break out of the loop.
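A minimal sketch of that fix, keeping the rest of the loop as posted: reset eps at the top of the retry loop, so every trial stepsize starts its error estimate from zero:

for (;;)
{
    eps = 0.0; // reset the estimate for every trial stepsize
    RK4Step(t, Y, Y1, RHS, step);
    RK4Step(t + step, Y1, Y1, RHS, step);
    RK4Step(t, Y, Y2, RHS, 2 * step);
    for (int i = 0; i < NEQ; i++)
        eps = max(eps, fabs((Y1[i] - Y2[i]) / 15.0));
    // ... accept (grow, advance t, break) or shrink, exactly as before ...
}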
Another "wrong thing" is that the error estimate and tolerance are incompatible. The error tolerance eps0 is an error density or unit-step error. To bring your error estimate eps into that format you need to divide eps by step. Or put another way, currently you are forcing the actual step error to be close to 0.5*eps0, so that the global error is 0.5*eps0 times the number of steps taken, with the number of steps loosely proportional to eps0^0.2. In the version using the unit-step error, the local error is forced to be "dynamically" close to 0.5*eps0*step, so that the global error is about 5*eps0 times the length of the integration interval. I'd say that the second variant is more in line with intuition about the expected behavior.
This is not a critical error, but may lead to sub-optimal step sizes and an actual global error that deviates non-trivially from the desired error tolerance.
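A sketch of that rescaling (my adaptation, not verbatim from the answer); note that once eps is an error density it scales like step^4, so the controller exponent becomes 1/4 in both the grow and the shrink branch:

for (int i = 0; i < NEQ; i++)
    eps = max(eps, fabs((Y1[i] - Y2[i]) / 15.0));
eps /= step; // convert the step error into an error density (per unit step)

// eps ~ step^4 now, so both controller branches use the exponent 0.25:
step = 0.9 * step * pow(eps0 / eps, 0.25);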
You also have a coding inconsistency: in the propagation of the state and in the declaration of the state vectors you hard-code 4 components, while in the error computation you loop over a variable number NEQ of equations and components. Since you are using C++, you could use a state vector class that handles all dimension-dependent loops internally. (If taken too far, frequent allocation of short-lived instances could become an efficiency issue.)
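A minimal sketch of such a class (the name and interface are mine, not from the answer), fixing the dimension once and hiding the component loops:

#include <array>
#include <algorithm>
#include <cmath>

template <int N>
struct StateVec {
    std::array<double, N> v{};
    double& operator[](int i)       { return v[i]; }
    double  operator[](int i) const { return v[i]; }
    // maximum absolute componentwise difference, as used for eps
    double maxAbsDiff(const StateVec& o) const {
        double m = 0.0;
        for (int i = 0; i < N; i++)
            m = std::max(m, std::fabs(v[i] - o.v[i]));
        return m;
    }
};

// usage sketch: StateVec<4> Y, Y1, Y2; ... eps = Y1.maxAbsDiff(Y2) / 15.0;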

should divide by zero raise an exception

I've been debugging a C++ application in VS2015 and found that a number of my double variables were ending up as NaN following a divide by zero. While this is reasonable, I have floating point exceptions enabled (/fp:except), so I would have expected this to raise an exception. The MS help page doesn't list what causes a floating point exception. According to this answer to a related question, divide by zero is a floating point exception. That appears not to be the case: the following test program, with /fp:except enabled,
#include <cstdio>

int main()
{
    try
    {
        double x = 1;
        double y = 0;
        double z = x / y;
        printf("%f\n", z);
        return 0;
    }
    catch (...)
    {
        printf("Exception!\n");
        return 0;
    }
}
displays "inf". Should this raise a floating point exception?
Edit: I checked that the exceptions were enabled in the debugger and got the same result regardless.
Edit2: Further reading here on IEEE 754 suggests to me that with floating point exceptions enabled I should be getting an exception. A comment on the previously linked question, however, states: 'The name "floating point exception" is a historical misnomer. Floating point division by zero is well-defined (per Annex F/IEEE754) and does not produce any signal.'
Floating point exceptions are not the same as C++ exceptions. They either have to be checked using std::fetestexcept() (below) or trapped (SIGFPE). Using traps requires calling _controlfp() in MSVC and feenableexcept() in GCC.
#include <iostream>
#include <cfenv>
#include <stdexcept>

// MSVC specific, see
// https://learn.microsoft.com/en-us/cpp/preprocessor/fenv-access?view=msvc-160
#pragma fenv_access (on)

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    double y = 0.0;
    double result{};
    result = 1 / y;
    std::cout << result << std::endl;
    if (std::fetestexcept(FE_ALL_EXCEPT) & FE_DIVBYZERO) {
        throw std::runtime_error("Division by zero");
    }
    return 0;
}
There are more examples of this in the SEI CERT C Coding Standard: https://wiki.sei.cmu.edu/confluence/display/c/FLP03-C.+Detect+and+handle+floating-point+errors.
Note that division by zero is undefined in the C and C++ standards themselves; an implementation is not required to signal it. However, C and C++ implementations that define __STDC_IEC_559__ "shall" raise the IEEE divide-by-zero flag.
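For the trap route mentioned above, here is a minimal sketch, assuming GCC with glibc on Linux (feenableexcept() is a glibc extension, not standard C++):

#define _GNU_SOURCE // for feenableexcept() (glibc extension)
#include <fenv.h>
#include <cstdio>

int main() {
    feenableexcept(FE_DIVBYZERO); // unmask the trap: SIGFPE on divide by zero
    volatile double y = 0.0;      // volatile keeps the division from being folded away
    double z = 1.0 / y;           // now raises SIGFPE (default action terminates)
    std::printf("%f\n", z);      // not reached if the trap fires
    return 0;
}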

Thrust - accessing neighbors

I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it, it gives me a "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix this?
struct IsEmpty2 {
    float* xi;
    IsEmpty2(float* pXi) { xi = pXi; }
    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};

thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
#phoad: you're right about the shared mem; it struck me after I had already posted my reply, and I subsequently thought that the cache would probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.
Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that: I get some unexplainable results and sometimes end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;
    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.
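For reference, here is a hedged sketch of the index-based variant (the names and the row-major 1D layout are my assumptions, not from the thread): iterate over indices with a counting_iterator and reach any neighbor through a raw device pointer, so no tuples or zip_iterator are needed:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate over indices: neighbors are reached via a raw pointer.
struct MarkWithNeighbors {
    const float* data; // full grid, row-major
    int width;         // row width of the 2D grid (hypothetical layout)
    MarkWithNeighbors(const float* d, int w) : data(d), width(w) {}
    __host__ __device__ bool operator()(int i) const {
        bool mark = data[i] < 0.0f;
        if (mark) {
            if (data[i - 1]     > 1.0f) return false; // left
            if (data[i + 1]     > 1.0f) return false; // right
            if (data[i - width] > 1.0f) return false; // top
            if (data[i + width] > 1.0f) return false; // bottom
            // ... diagonal neighbors analogously, with bounds checks
        }
        return mark;
    }
};

// Usage sketch, interior indices only (boundary handling omitted):
// thrust::device_vector<int> out(grid.size());
// auto first = thrust::make_counting_iterator(width + 1);
// auto last  = thrust::make_counting_iterator((int)grid.size() - width - 1);
// auto end   = thrust::copy_if(first, last, out.begin(),
//     MarkWithNeighbors(thrust::raw_pointer_cast(grid.data()), width));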

CUDA: Unspecified Launch Failure

I was using CUDA-GDB to find out what the problem was with my kernel execution. It would always output: Cuda error: kernel execution: unspecified launch failure. That's probably the worst error anyone could possibly get, because there is no indication whatsoever of what is going on!
Back to CUDA-GDB... When I was using the debugger, it would arrive at the kernel and output:
Breakpoint 1, myKernel (__cuda_0=0x200300000, __cuda_1=0x200400000, __cuda_2=320, __cuda_3=7872, __cuda_4=0xe805c0, __cuda_5=0xea05e0, __cuda_6=0x96dfa0, __cuda_7=0x955680, __cuda_8=0.056646065580379823, __cuda_9=-0.0045986640087569072, __cuda_10=0.125,
__cuda_11=18.598229033761132, __cuda_12=0.00048828125, __cuda_13=5.9604644775390625e-08)
at myFunction.cu:60
Then I would type: next.
output:
0x00007ffff7f7a790 in __device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd ()
from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
The notable part in that section is that it has a tag to a typedef'd datatype. COMPLEX16 is defined as: typedef double complex COMPLEX16
Then I would type: next.
output:
Single stepping until exit from function Z84_device_stub__Z31chisquared_LogLikelihood_KernelPdS_iiP12tagCOMPLEX16S1_S1_S_ddddddPdS_iiP12tagCOMPLEX16S1_S1_S_dddddd#plt,
which has no line number information.
0x00007ffff7f79560 in ?? () from /home/alex/master/opt/lscsoft/lalinference/lib/liblalinference.so.3
Type next...
output:
Cannot find bounds of current function
Type continue...
Cuda error: kernel execution: unspecified launch failure.
This is the error I get without debugging. I have seen some forum topics on something similar, where the debugger cannot find the bounds of the current function, possibly because the library is somehow not linked or something along those lines? The ?? was said to appear because the debugger is, for some reason, somewhere in a shell and not in any function.
I believe the problem lies deeper, in the fact that I have these interesting data types in my code: COMPLEX16 and REAL8.
Here is my kernel...
__global__ void chisquared_LogLikelihood_Kernel(REAL8 *d_temp, double *d_sum, int lower, int dataSize,
                                                COMPLEX16 *freqModelhPlus_Data,
                                                COMPLEX16 *freqModelhCross_Data,
                                                COMPLEX16 *freqData_Data,
                                                REAL8 *oneSidedNoisePowerSpectrum_Data,
                                                double FplusScaled,
                                                double FcrossScaled,
                                                double deltaF,
                                                double twopit,
                                                double deltaT,
                                                double TwoDeltaToverN)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ REAL8 ssum[MAX_THREADS];

    if (idx < dataSize)
    {
        idx += lower; // accounts for the shift that was made in the original loop
        memset(ssum, 0, MAX_THREADS * sizeof(*ssum));
        int tid = threadIdx.x;
        int bid = blockIdx.x;

        REAL8 plainTemplateReal = FplusScaled * freqModelhPlus_Data[idx].re
                                  + freqModelhCross_Data[idx].re;
        REAL8 plainTemplateImag = FplusScaled * freqModelhPlus_Data[idx].im
                                  + freqModelhCross_Data[idx].im;

        /* do time-shifting... */
        /* (also un-do 1/deltaT scaling): */
        double f = ((double) idx) * deltaF;
        /* real & imag parts of exp(-2*pi*i*f*deltaT): */
        double re = cos(twopit * f);
        double im = -sin(twopit * f);
        REAL8 templateReal = (plainTemplateReal*re - plainTemplateImag*im) / deltaT;
        REAL8 templateImag = (plainTemplateReal*im + plainTemplateImag*re) / deltaT;
        double dataReal = freqData_Data[idx].re / deltaT;
        double dataImag = freqData_Data[idx].im / deltaT;
        /* compute squared difference & 'chi-squared': */
        double diffRe = dataReal - templateReal;            // Difference in real parts...
        double diffIm = dataImag - templateImag;            // ...and imaginary parts, and...
        double diffSquared = diffRe*diffRe + diffIm*diffIm; // ...squared difference of the 2 complex figures.

        //d_temp[idx - lower] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);
        //ssum[tid] = ((TwoDeltaToverN * diffSquared) / oneSidedNoisePowerSpectrum_Data[idx]);

        /***** REDUCTION *****/
        //__syncthreads(); // all the temps should have data before we add them up
        //for (int i = blockDim.x / 2; i > 0; i >>= 1) { /* per block */
        //    if (tid < i)
        //        ssum[tid] += ssum[tid + i];
        //    __syncthreads();
        //}
        //d_sum[bid] = ssum[0];
    }
}
When I'm not debugging (-g -G not included in the command), the kernel only runs fine if I don't include the line(s) that begin with d_temp[idx - lower] and ssum[tid]. I only tried d_temp to make sure that it wasn't a shared memory error; it ran fine. I also tried running with ssum[tid] = 20.0 and various other values to make sure it wasn't that sort of problem; that ran fine too. When I run with either of them included, the kernel exits with the CUDA error above.
Please ask me if something is unclear or confusing.
There was a lack of context in my question. The assumption was probably that I had done cudaMalloc and other such preliminary things before the kernel execution for ALL the pointers involved. However, I had only done it for d_temp and d_sum (I was making tons of changes and barely realized I hadn't set up the other four pointers). Once I did cudaMalloc and cudaMemcpy for all the data needed, everything ran perfectly.
Thanks for the insight.
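For completeness, a sketch of the kind of setup that was missing (the host-side names and the count n are mine, not from the post): every device pointer the kernel dereferences must be allocated and filled before launch.

COMPLEX16 *d_hPlus = 0, *d_hCross = 0, *d_data = 0;
REAL8 *d_noise = 0;
cudaMalloc((void**)&d_hPlus,  n * sizeof(COMPLEX16));
cudaMalloc((void**)&d_hCross, n * sizeof(COMPLEX16));
cudaMalloc((void**)&d_data,   n * sizeof(COMPLEX16));
cudaMalloc((void**)&d_noise,  n * sizeof(REAL8));
cudaMemcpy(d_hPlus,  h_hPlus,  n * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_hCross, h_hCross, n * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_data,   h_data,   n * sizeof(COMPLEX16), cudaMemcpyHostToDevice);
cudaMemcpy(d_noise,  h_noise,  n * sizeof(REAL8),     cudaMemcpyHostToDevice);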

Translation from Complex-FFT to Finite-Field-FFT

Good afternoon!
I am trying to develop an NTT algorithm based on the naive recursive FFT implementation I already have.
Consider the following code (the coefficients' length, call it m, is an exact power of two):
/// <summary>
/// Calculates the result of the recursive Number Theoretic Transform.
/// </summary>
/// <param name="coefficients"></param>
/// <returns></returns>
private static BigInteger[] Recursive_NTT_Skeleton(
    IList<BigInteger> coefficients,
    IList<BigInteger> rootsOfUnity,
    int step,
    int offset)
{
    // Calculate the length of vectors at the current step of recursion.
    // -
    int n = coefficients.Count / step - offset / step;

    if (n == 1)
    {
        return new BigInteger[] { coefficients[offset] };
    }

    BigInteger[] results = new BigInteger[n];

    IList<BigInteger> resultEvens =
        Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset);
    IList<BigInteger> resultOdds =
        Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset + step);

    for (int k = 0; k < n / 2; k++)
    {
        BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;

        results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
        results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
    }

    return results;
}
It worked for the complex FFT (replace BigInteger with a complex numeric type; I had my own). It doesn't work here, even though I changed the procedure for finding the primitive roots of unity appropriately.
Supposedly, the problem is this: the rootsOfUnity parameter passed in originally contained only the first half of the m-th complex roots of unity, in this order:
omega^0 = 1, omega^1, omega^2, ..., omega^(n/2)
That was enough, because in these three lines of code:
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
I originally made use of the fact that, at any level of recursion (for any n and i), the complex roots of unity satisfy -omega^(i) = omega^(i + n/2).
However, that property obviously doesn't hold in finite fields. But is there any analogue of it which would allow me to still compute only the first half of the roots?
Or should I extend the cycle from n/2 to n and pre-compute all the m-th roots of unity?
Maybe there are other problems with this code?..
Thank you very much in advance!
I recently wanted to implement NTT for fast multiplication instead of DFFT too. I read a lot of confusing things, different letters everywhere and no simple solution, and my finite-fields knowledge is rusty, but today I finally got it right (after 2 days of trying and drawing analogies with DFT coefficients), so here are my insights for NTT:
Computation
X(i) = sum(j=0..n-1) of ( Wn^(i*j)*x(j) );
where X[] is the NTT-transformed x[] of size n, and Wn is the NTT basis. All computations are integer modular arithmetic mod p; no complex numbers anywhere.
Important values
Wn = r ^ L mod p is the basis for NTT
Wn = r ^ (p-1-L) mod p is the basis for INTT
Rn = n ^ (p-2) mod p is the scaling multiplicative constant for INTT, ~(1/n)
p is a prime such that p mod n == 1 and p > max'
max is the max value of x[i] for NTT or X[i] for INTT
r = <1,p)
L = <1,p) and also divides p-1
r,L must be combined so that r^(L*i) mod p == 1 if i=0 or i=n
r,L must be combined so that r^(L*i) mod p != 1 if 0 < i < n
max' is the max value of sub-results and depends on n and the type of computation. For a single (I)NTT it is max' = n*max, but for a convolution of two n-sized vectors it is max' = n*max*max, etc. See Implementing FFT over finite fields for more info about it.
The functional combination of r,L,p is different for different n.
This is important: you have to recompute or select parameters from a table before each NTT layer (n is always half of the previous recursion).
Here is my C++ code that finds the r,L,p parameters (it needs modular arithmetic functions, which are not included; you can replace them with (a+b)%c, (a-b)%c, (a*b)%c, ..., but in that case beware of overflows, especially in modpow and modmul). The code is not optimized yet; there are ways to speed it up considerably. Also, the prime table is fairly limited, so either use SoE or any other algorithm to obtain primes up to max' in order to work safely.
DWORD _arithmetics_primes[]=
{
    2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,
    179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409,
    419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,
    661,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883,887,907,911,919,929,937,941,
    947,953,967,971,977,983,991,997,1009,1013,1019,1021,1031,1033,1039,1049,1051,1061,1063,1069,1087,1091,1093,1097,1103,1109,1117,1123,1129,1151,
    0}; // end of table is 0; the more primes there are, the bigger numbers and n can be used

// compute NTT consts W=r^L%p for n
int i,j,k,n=16;
long w,W,iW,p,r,L,l,e;
long max=81*n; // edit1: max num for NTT for my multiplication purposes
for (e=1,j=0;e;j++) // find prime p such that p%n==1 AND p>max ... 9*9=81
{
    p=_arithmetics_primes[j];
    if (!p) break;
    if ((p>max)&&(p%n==1))
        for (r=2;r<p;r++) // check all r
        {
            for (l=1;l<p;l++) // all l that divide p-1
            {
                L=(p-1);
                if (L%l!=0) continue;
                L/=l;
                W=modpow(r,L,p);
                e=0;
                for (w=1,i=0;i<=n;i++,w=modmul(w,W,p))
                {
                    if ((i==0)&&(w!=1)) { e=1; break; }
                    if ((i==n)&&(w!=1)) { e=1; break; }
                    if ((i>0)&&(i<n)&&(w==1)) { e=1; break; }
                }
                if (!e) break;
            }
            if (!e) break;
        }
}
if (e) { error; } // error: no combination r,l,p for n found
W =modpow(r,L,p);     // Wn for NTT
iW=modpow(r,p-1-L,p); // Wn for INTT
And here are my slow NTT and INTT implementations (I haven't gotten to the fast NTT/INTT yet); they have both been tested successfully with Schönhage–Strassen multiplication.
//---------------------------------------------------------------------------
void NTT(long *dst,long *src,long n,long m,long w)
{
    long i,j,wj,wi,a,n2=n>>1;
    for (wj=1,j=0;j<n;j++)
    {
        a=0;
        for (wi=1,i=0;i<n;i++)
        {
            a=modadd(a,modmul(wi,src[i],m),m);
            wi=modmul(wi,wj,m);
        }
        dst[j]=a;
        wj=modmul(wj,w,m);
    }
}
//---------------------------------------------------------------------------
void INTT(long *dst,long *src,long n,long m,long w)
{
    long i,j,wi=1,wj=1,rN,a,n2=n>>1;
    rN=modpow(n,m-2,m);
    for (wj=1,j=0;j<n;j++)
    {
        a=0;
        for (wi=1,i=0;i<n;i++)
        {
            a=modadd(a,modmul(wi,src[i],m),m);
            wi=modmul(wi,wj,m);
        }
        dst[j]=modmul(a,rN,m);
        wj=modmul(wj,w,m);
    }
}
//---------------------------------------------------------------------------
dst is destination array
src is source array
n is array size
m is modulus (p)
w is basis (Wn)
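A hypothetical usage sketch (the sample data is mine, not from the post): with the W, iW, and p found by the parameter search above for n = 16, a forward transform followed by the inverse should reproduce the input mod p:

long x[16]={1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0};
long X[16],y[16];
NTT (X,x,16,p,W);  // forward transform with basis Wn
INTT(y,X,16,p,iW); // inverse basis; INTT applies the ~(1/n) scaling itself
// afterwards y[i]==x[i] for all i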
Hope this helps someone. If I forgot something, please write ...
[edit1: fast NTT/INTT]
Finally I managed to get the fast NTT/INTT to work. It was a little more tricky than the normal FFT:
//---------------------------------------------------------------------------
void _NFTT(long *dst,long *src,long n,long m,long w)
{
    if (n<=1) { if (n==1) dst[0]=src[0]; return; }
    long i,j,a0,a1,n2=n>>1,w2=modmul(w,w,m);
    // reorder even,odd
    for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
    for (    j=1;i<n ;i++,j+=2) dst[i]=src[j];
    // recursion
    _NFTT(src   ,dst   ,n2,m,w2); // even
    _NFTT(src+n2,dst+n2,n2,m,w2); // odd
    // restore results
    for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w,m))
    {
        a0=src[i];
        a1=modmul(src[j],w2,m);
        dst[i]=modadd(a0,a1,m);
        dst[j]=modsub(a0,a1,m);
    }
}
//---------------------------------------------------------------------------
void _INFTT(long *dst,long *src,long n,long m,long w)
{
    long i,rN;
    rN=modpow(n,m-2,m);
    _NFTT(dst,src,n,m,w);
    for (i=0;i<n;i++) dst[i]=modmul(dst[i],rN,m);
}
//---------------------------------------------------------------------------
[edit3]
I have optimized my code (3x faster than the code above), but I am still not satisfied with it, so I started a new question about it. There I have optimized my code even further (about 40x faster than the code above), so it is now almost the same speed as an FFT on floating point of the same bit size. The link to it is here:
Modular arithmetics and NTT (finite field DFT) optimizations
To turn the Cooley-Tukey (complex) FFT into a modular arithmetic approach, i.e. NTT, you must replace the complex definition of omega. For the approach to be purely recursive, you also need to recalculate omega for each level based on the current signal size. This is possible because the minimum suitable modulus decreases as we move down the call tree, so the modulus used for the root is suitable for the lower layers. Additionally, as we are using the same modulus, the same generator may be used as we move down the call tree. Also, for the inverse transform, you should take the additional step of replacing the recalculated omega a with its inverse b = a^-1 (via the modular inverse operation). Specifically, b = invMod(a, N) such that b * a == 1 (mod N), where N is the chosen prime modulus.
Rewriting an expression involving omega by exploiting periodicity still works in the modular arithmetic realm. You also need to find a way to determine the modulus (a prime) for the problem and a valid generator.
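In particular, the half-roots identity the question relies on survives. A short derivation (mine, not from the answer), assuming \omega is a primitive n-th root of unity modulo a prime p:

(\omega^{n/2})^2 = \omega^{n} \equiv 1 \pmod{p}.

Since \mathbb{Z}/p\mathbb{Z} is a field, the only square roots of 1 are \pm 1, and primitivity rules out \omega^{n/2} \equiv 1. Hence

\omega^{n/2} \equiv -1 \pmod{p} \quad\Longrightarrow\quad \omega^{i+n/2} \equiv -\,\omega^{i} \pmod{p}.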
We note that your code works, though it is not an MWE. We extended it using common sense and got the correct result for a polynomial multiplication application. You just have to provide correct values of omega raised to the appropriate powers.
That said, your code, like that from many other sources, doubles the spacing at each level. This does not lead to recursion that is quite as clean, though it turns out to be identical to recalculating omega based on the current signal size, because the power in the omega definition is inversely proportional to the signal size. To reiterate: halving the signal size is like squaring omega, which is like doubling the powers of omega (which is exactly what doubling the spacing does). The nice thing about recalculating omega is that each subproblem is more cleanly complete in its own right.
There is a paper that shows some of the math for the modular approach: Baktir and Sunar (2006); see the reference at the end of this post.
You do not need to extend the cycle from n / 2 to n.
So, yes, sources which say to just drop in a different omega definition for the modular arithmetic approach are sweeping many details under the rug.
Another issue is that the signal size must be large enough to avoid overflow in the resulting time-domain signal when performing convolution. Additionally, it is useful to have fast implementations of modular exponentiation, since the powers involved can be quite large.
References
Baktir and Sunar - Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform (2006)
You must make sure that roots of unity actually exist. In R there are only 2 roots of unity, 1 and -1, since only for them can x^n = 1 hold.
In C you have infinitely many roots of unity: w = exp(2*pi*i/N) is a primitive N-th root of unity, and all w^k for 0 <= k < N are N-th roots of unity.
Now to your problem: you have to make sure the ring you're working in offers the same property: enough roots of unity.
Schönhage and Strassen (http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) use integers modulo 2^N+1. This ring has enough roots of unity: 2^N == -1 is a 2nd root of unity, 2^(N/2) is a 4th root of unity, and so on. Furthermore, these roots of unity have the advantage of being powers of two, so multiplication by them can be implemented as binary shifts (with a modulo reduction afterwards, which comes down to an add/subtract).
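A toy sketch of that shift trick (parameters and names are mine; small N just for illustration): multiplying by 2^k modulo 2^N+1 needs only a shift, a split, and a subtraction, because 2^N ≡ -1:

#include <cstdint>
#include <cstdio>

int main() {
    const int N = 16;
    const uint64_t M = (1u << N) + 1;    // modulus 2^N + 1
    uint64_t x = 12345 % M;
    int k = 5;
    uint64_t shifted = x << k;           // x * 2^k, not yet reduced
    uint64_t lo = shifted & ((1u << N) - 1);
    uint64_t hi = shifted >> N;          // shifted = hi*2^N + lo ≡ lo - hi (mod M)
    uint64_t r = (lo + M - hi % M) % M;  // lo - hi, kept non-negative
    std::printf("%llu\n", (unsigned long long)r);
    return 0;
}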
I think QuickMul (http://www.cs.nyu.edu/exact/doc/qmul.ps) works modulo 2^N-1.