How to optimize finding values in a 2D array in Verilog - function

I need to set up a function that determines if a match exists in a 2D array (index).
My current implementation works, but is creating a large chain of LUTs due to if statements checking each element of the array.
function result_type index_search ( index_type index, logic [7:0] address );
    result_type result;
    for ( int i = 0; i < 8; i++ ) begin
        if ( index[i] == address ) begin
            result = i;
        end
    end
    return result;
endfunction
Is there a way to check for matches in a more efficient manner?

Not much to be done, really, at least for the code at hand. Since your code targets hardware, think in terms of hardware when optimizing it, not in terms of the function/Verilog code.
For a general-purpose implementation, without any known data patterns, you'll definitely need (a) N W-bit equality checks, plus (b) an N:1 FPA (Fixed Priority Arbiter, aka priority encoder, aka leading-zero detector) that returns the first match, assuming N W-bit inputs. Something like this:
Not much optimization to be done, but here are some possible general-purpose optimizations:
Pipelining, as shown in the figure, if timing is an issue.
Consider an alternative FPA implementation that makes use of 2's complement characteristics and may result in a more LUT-efficient implementation: assign fpa_out = fpa_in & ((~fpa_in)+1); (the result is one-hot encoded, not weighted binary as in your code)
Sticking to one-hot encoding can come in handy and reduce some of the logic further down your path, but I cannot say for sure until we see some more code.
This is what the implementation would look like:
logic [N-1:0] addr_eq_idx;
logic [N-1:0] result;

for (genvar i=0; i<N; i++) begin: g_eq_N
    // multiple matches may exist in this vector
    assign addr_eq_idx[i] = (address == index[i]) ? 1'b1 : 1'b0;

    // pipelined version:
    // always_ff @(posedge clk, negedge arstn)
    //   if (!arstn)
    //     addr_eq_idx[i] <= 1'b0;
    //   else
    //     addr_eq_idx[i] <= (address == index[i]) ? 1'b1 : 1'b0;
end

// result has a '1' at the position where the first match is found
assign result = addr_eq_idx & ((~addr_eq_idx) + 1);
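As a quick software sanity check of that 2's-complement trick (purely illustrative C++, not part of the design): v & ((~v) + 1), i.e. v & -v, keeps only the least significant set bit, which is exactly the one-hot "first match" position:

#include <bitset>
#include <cstdint>
#include <iostream>

int main()
{
    uint8_t addr_eq_idx = 0b10100100;   // example: matches at positions 2, 5 and 7
    uint8_t first_match = addr_eq_idx & static_cast<uint8_t>(~addr_eq_idx + 1);
    std::cout << std::bitset<8>(first_match) << "\n";   // prints 00000100 (position 2)
    return 0;
}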
Finally, try to think if your design can be simplified due to known run-time data characteristics. For example, let's say you are 100% sure that the address you're looking for may exist within the index 2D array in at most one position. If that is the case, then you do not need an FPA at all, since the first match will be the only match. In that case, addr_eq_idx already points to the matching index, as a one-hot vector.

Is the base case value in a recursive function important?

I'm currently learning about recursion; it's pretty hard to understand. I found a very common example of it:
function factorial(N)
    local Value
    if N == 0 then
        Value = 1
    else
        Value = N * factorial(N - 1)
    end
    return Value
end

print(factorial(3))
N == 0 is the base case. But when I changed it to N == 1, the result still remains the same (it will print 6).
Is using the base case important? (will it break or something?)
What's the difference between using N == 0 (base case) and N == 1?
That's just a coincidence, since 1 * 1 = 1, so it ends up working either way.
But consider the edge case where N = 0: if you check for N == 1, then you'd go into the else branch and calculate 0 * factorial(-1), which would lead to an endless loop.
The same would happen in both cases if you just called factorial(-1) directly, which is why you should either check for N > 0 instead (effectively treating every negative value like 0 and returning 1), or add another if condition and raise an error when N is negative.
EDIT: As pointed out in another answer, your implementation is not tail-recursive, meaning it accumulates memory for every recursive function call until it finishes or runs out of memory.
You can make the function tail-recursive, which allows Lua to treat it pretty much like a normal loop that could run as long as it takes to calculate its result:
local function factorial(n, acc)
    acc = acc or 1
    if n <= 0 then
        return acc
    else
        return factorial(n - 1, acc * n)
    end
end

print(factorial(3))
Note though that in the case of factorial, it would take you way longer to run out of stack memory than to overflow Lua's number data type at around 21!, so making it tail-recursive is really just a matter of training yourself to write better code.
As the above answer and comments have pointed out, it is essential to have a base-case in a recursive function; otherwise, one ends up with an infinite loop.
Also, in the case of your factorial function, it is probably more efficient to use a helper function to perform the recursion, so as to take advantage of Lua's tail-call optimizations. Since Lua conveniently allows for local functions, you can define a helper within the scope of your factorial function.
Note that this example is not meant to handle the factorials of negative numbers.
-- Requires: n is an integer greater than or equal to 0.
-- Effects : returns the factorial of n.
function fact(n)
    -- Local function that will actually perform the recursion.
    local function fact_helper(n, i)
        -- This is the base case.
        if (i == 1) then
            return n
        end
        -- Take advantage of tail calls.
        return fact_helper(n * i, i - 1)
    end

    -- Check for edge cases, such as fact(0) and fact(1).
    if ((n == 0) or (n == 1)) then
        return 1
    end

    return fact_helper(n, n - 1)
end

Why are my Verilog output registers only outputting "x"?

I've written the following module to take in a 12-bit stream of input, and run it through the Exponential Moving Average (EMA) formula. When simulating the program, it appears as if my output is a 12-bit stream of "don't cares". Regardless of what my input is, it isn't realized by the output register.
I'm a beginner at Verilog, so I'm sure there are a slew of issues here, but I expected to at least get some kind of output. Why is this behavior occurring and how can I prevent it in the future? I've linked a screenshot of my waveform results at the bottom of this post.
My Verilog module:
`timescale 1ns / 1ps
module EMATesting ( input signed [11:0] signal_in,
                    output reg signed [11:0] signal_out,
                    input clock_in,
                    input reset,
                    input enable,
                    output reg [11:0] newEMA);

    reg [11:0] prevEMA;
    reg [11:0] y;
    reg [11:0] temp;
    integer count = 0;
    integer t = 1;
    integer one = 1;
    integer alpha = 0.5;

    always @(posedge clock_in) begin
        if (reset) begin
            count = 0;
        end
        else if (count < 64) begin
            count = count + 1;
        end
        //set the output equal to the first value received
        if (count == 1) begin
            newEMA = signal_in;
        end
        //if not the first value, run through the formula
        else begin
            temp = newEMA;
            prevEMA = temp;
            newEMA = alpha * signal_in + (one-alpha) * prevEMA;
            count = count + 1;
        end
    end
endmodule
My Verilog testbench:
`timescale 10ns / 1ns
module testbench_EMA;

    reg signal_in;
    reg clock_in, reset, enable;
    wire [11:0] signal_out;
    wire [11:0] newEMA;

    initial begin
        signal_in = 0; reset = 0; enable = 0;
        clock_in = 0;
    end

    EMATesting DUT(signal_in, signal_out, clock_in, reset, enable, newEMA);

    initial
    begin
        //test some values with timing intervals here
        #100;
        reset = 1;
        #100;
        reset = 0;
        enable = 1;
        #100;
        //set signal_in to "1" repeating 12 times
        signal_in = { 12 { 1'b1}};
        #100;
        reset = 1;
        #100; //buffer to end simulation
        $finish;
    end

    always
        #5 clock_in = !clock_in;

endmodule
Waveform results of my testbench. Note that both output registers are giving no values.
There are many mistakes in your code. @Mr. Snrub listed some of them in the comments. You can correct those mistakes and, by further editing your code, make it work as you wish, but there are more important fundamental problems. You have to address them before you make further, hard-to-debug mistakes in the future; doing so also gives you the advantage of understanding HDL concepts better.
First of all, you need to understand the difference between software programming languages and hardware description languages. You should do further research on that; the mistakes you made are commonly made by developers who assume HDL is similar to software programming languages. Check this link for further information.
You need to further understand how Verilog HDL (or any other HDL) code is synthesized and why it is used.
You need to be familiar with when and how to use blocking (=) or non-blocking (<=) assignments and how they work. Check this link.
When you research the listed topics, you will understand that in an edge-sensitive (synchronous or sequential) always block, it is a bad idea to use blocking (=) assignments.
When you use blocking assignments, the assignments are done in series, as in common software programming languages. In the case of an edge-sensitive always block, you have very limited time to finish all required assignments inside the always block.
In your case, as all the assignments inside the edge-sensitive always block are blocking (assuming no reset condition):
First, it checks whether count is less than 64 and completes that assignment if it is the case. All other assignments wait for that assignment to finish.
By the time the first assignment is done, the always block with posedge clock_in sensitivity might have ended execution and be waiting for the next edge of the clock. So, the other assignments are not done.
As you have not initialized your output registers, they were initially filled with 'don't cares' and a new value is never assigned, so, as a result, you got that output (filled with 'don't cares').
Even when count becomes larger than or equal to 64, you have other blocking assignments before you assign a value to newEMA.

Piecewise functions in the Octave symbolic package?

Unlike Matlab, Octave Symbolic has no piecewise function. Is there a workaround? I would like to do something like this:
syms x
y = piecewise(x0, 1)
Relatedly, how does one get pieces of a piecewise function? I ran the following:
>> int (exp(-a*x), x, 0, t)
And got the following correct answer displayed and stored in a variable:
t                      for a = 0

1/a - exp(-a*t)/a      otherwise
But now I would like to access the "otherwise" part of the answer so I can factor it. How do I do that?
(Yes, I can factor it in my head, but I am practicing for when more complicated expressions come along. I am also only really looking for an approach using symbolic expressions -- even though in any single case numerics may work fine, I want to understand the symbolic approach.)
Thanks!
Matlab's piecewise function seems to be fairly new (introduced in 2016b), but it basically just looks like a glorified ternary operator. Unfortunately I don't have 2016 to check if it performs any checks on the inputs or not, but in general you can recreate a 'ternary' operator in octave by indexing into a cell using logical indexing. E.g.
{@() return_A(), @() return_B(), @() return_default()}([test1, test2, true]){1}()
Explanation:
Step 1: You put all the values of interest in a cell array. Wrap them in function handles if you want to prevent them being evaluated at the time of parsing (e.g. if you wanted the output of the ternary operator to be to produce an error)
Step 2: Index this cell array using logical indexing, where at each index you perform a logical test
Step 3: If you need a 'default' case, use a 'true' test for the last element.
Step 4: From the cell (sub)array that results from above, select the first element and 'run' the resulting function handle. Selecting the first element has the effect that if more than one test succeeds, you only pick the first result; given that the 'default' test always succeeds, this also makes sure the default is only picked when no other test succeeds.
Here are the above steps implemented into a function (appropriate sanity checks omitted here for brevity), following the same syntax as matlab's piecewise:
function Out = piecewise (varargin)
    Conditions = varargin(1:2:end);   % Select all 'odd' inputs
    Values     = varargin(2:2:end);   % Select all 'even' inputs
    N = length (Conditions);
    if length (Values) ~= N               % 'default' case has been provided
        Values{end+1}   = Conditions{end};   % move default return-value to 'Values'
        Conditions{end} = true;              % replace final (i.e. default) test with true
    end
    % Wrap return-values into function-handles
    ValFuncs = cell (1, N);
    for n = 1 : N; ValFuncs{n} = @() Values{n}; end
    % Grab funhandle for first successful test and call it to return its value
    Out = ValFuncs([Conditions{:}]){1}();
end
Example use:
>> syms x t;
>> F = @(a) piecewise(a == 0, t, (1/a)*exp(-a*t)/a);
>> F(0)
ans = (sym) t
>> F(3)
ans = (sym)

   -3⋅t
  ℯ
  ─────
    9

How to write arbitrary datatypes into Matlab cell array

This is a general question, not related to a particular operation. I would like to be able to write the results of an arbitrary function into elements of a cell array without regard for the data type the function returns. Consider this pseudocode:
zout = cell(n,m);
myfunc = str2func('inputname'); %assume myfunc puts out m values to match zout dimensions
zout(1,:) = myfunc(x,y);
That will work for "inputname" == "strcat" , for example, given that x and y are strings or cells of strings with appropriate dimension. But if "inputname" == "strcmp" then the output is a logical array, and Matlab throws an error. I'd need to do
zout(1,:) = num2cell(strcmp(x,y));
So my question is: is there a way to fill the cell array zout without having to test for the type of variable generated by myfunc(x,y)? Should I be using a struct in the first place (and if so, what's the best way to populate it)?
(I'm usually an R user, where I could just use a list variable without any pain)
Edit: To simplify the overall scope, add the following "requirement" :
Let's assume for now that, for a function which returns multiple outputs, only the first one need be captured in zout . But when this output is a vector of N values or a vector of cells (i.e. Nx1 cell array), these N values get mapped to zout(1,1:N) .
So my question is: is there a way to fill the cell array zout without having to test for the type of variable generated by myfunc(x,y) ? Should I be using a struct in the first place (and if so, what's the best way to populate it)?
The answer provided by @NotBoStyf is almost there, but not quite. Cell arrays are the right way to go. However, the answer very much depends on the number of outputs from the function.
Functions with only one output
The function strcmp has only one output, which is an array. The reason that
zout{1,:} = strcmp(x,y)
gives you an error message, when zout is dimensioned N x 2, is that the left-hand side (zout{1,:}) expects two outputs from the right-hand side. You can fix this with:
[zout{1,:}] = num2cell(strcmp(x,y)); % notice the square brackets on the LHS
However, there's really no reason to do this. You can simply define zout as an N x 1 cell array and capture the results:
zout = cell(1,1);
x = 'a';
y = { 'a', 'b' };
zout{1} = strcmp(x,y);
% Referring to the results:
x_is_y_1 = zout{1}(1);
x_is_y_2 = zout{1}(2);
There's one more case to consider...
Functions with multiple outputs
If your function produces multiple outputs (as opposed to a single output that is an array), then this will only capture the first output. Functions that produce multiple outputs are defined like this:
function [outA,outB] = do_something( a, b )
    outA = a + 1;
    outB = b + 2;
end
Here, you need to explicitly capture both output arguments. Otherwise, you just get the first output, outA. For example:
outA = do_something( [1,2,3], [4,5,6] ); % outA is [2,3,4]
[outA,outB] = do_something( [1,2,3], [4,5,6] ); % outA is [2,3,4], outB is [6,7,8]
Z1 = cell(1,1);
Z1{1,1} = do_something( [1,2,3], [4,5,6] ); % Z1{1,1} is [2,3,4]
Z2 = cell(1,2);
Z2{1,1:2} = do_something( [1,2,3], [4,5,6] ); % Same error as above.
% NB: You really never want to have a cell expansion that is not surrounded
% by square brackets.
% Do this instead:
[Z2{1,1:2}] = do_something( [1,2,3], [4,5,6] ); % Z2{1,1} is [2,3,4], Z2{1,2} is [6,7,8]
This can also be done programmatically, with some limits. Let's say we're given a function func that takes one input and returns a constant (but unknown) number of outputs. We have a cell array inp that contains the inputs we want to process, and we want to collect the results in a cell array outp:
N = numel(inp);
M = nargout(@func);   % number of outputs produced by func
outp = cell(N,M);
for i=1:N
    [ outp{i,:} ] = func( inp{i} );
end
This approach has a few caveats:
It captures all of the outputs. This is not always what you want.
Capturing all of the outputs can often change the behavior of the function. For example, the find function returns linear indices if only one output is used, row/column indices if two outputs are used, and row/column/value if three outputs are used.
It won't work for functions that have a variable number of outputs. These functions are defined as function [a,b,...,varargout] = func( ... ). nargout will return a negative number if the function has varargout declared in its output list, because there's no way for Matlab to know how many outputs will be produced.
Unpacking array and cell outputs into a cell
All true so far, but: what I am hoping for is a generic solution. I can't use num2cell if the function produces cell outputs. So what worked for strcmp will fail for strcat and vice versa. Let's assume for now that, for a function which returns multiple outputs, only the first one need be captured in zout – Carl Witthoft
To provide a uniform output syntax for all functions that return either a cell or an array, use an adapter function. Here is an example that handles numeric arrays and cells:
function [cellOut] = cellify(input)
    if iscell(input)
        cellOut = input;
    elseif isnumeric(input)
        cellOut = num2cell(input);
    else
        error('cellify currently does not support structs or objects');
    end
end
To unpack the output into a 2-D cell array, the size of each output must be constant. Assuming M outputs:
N = numel(inp);
% M is known and constant
outp = cell(N,M);
for i=1:N
    outp(i,:) = cellify( func( inp{i} ) );   % NB: parentheses instead of curlies on LHS
end
The output can then be addressed as outp{i,j}. An alternate approach allows the size of the output to vary:
N = numel(inp);
% M is not necessary here
outp = cell(N,1);
for i=1:N
    outp{i} = cellify( func( inp{i} ) );   % NB: back to curlies on LHS
end
The output can then be addressed as outp{i}{j}, and the size of the output can vary.
A few things to keep in mind:
Matlab cells are basically inefficient pointers. The JIT compiler does not always optimize them as well as numeric arrays.
Splitting numeric arrays into cells can cost quite a bit of memory. Each split value is actually a numeric array, which has size and type information associated with it. In numeric array form, that overhead occurs once for the whole array; when the array is split, it is incurred once for each element.
Use curly braces instead when assigning a value.
Using
zout{1,:} = strcmp(x,y);
instead should work.

Translation from Complex-FFT to Finite-Field-FFT

Good afternoon!
I am trying to develop an NTT algorithm based on the naive recursive FFT implementation I already have.
Consider the following code (coefficients' length, let it be m, is an exact power of two):
/// <summary>
/// Calculates the result of the recursive Number Theoretic Transform.
/// </summary>
/// <param name="coefficients"></param>
/// <returns></returns>
private static BigInteger[] Recursive_NTT_Skeleton(
    IList<BigInteger> coefficients,
    IList<BigInteger> rootsOfUnity,
    int step,
    int offset)
{
    // Calculate the length of vectors at the current step of recursion.
    // -
    int n = coefficients.Count / step - offset / step;

    if (n == 1)
    {
        return new BigInteger[] { coefficients[offset] };
    }

    BigInteger[] results = new BigInteger[n];

    IList<BigInteger> resultEvens =
        Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset);
    IList<BigInteger> resultOdds =
        Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset + step);

    for (int k = 0; k < n / 2; k++)
    {
        BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;

        results[k]         = (resultEvens[k] + bfly) % NTT_MODULUS;
        results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
    }

    return results;
}
It worked for complex FFT (replace BigInteger with a complex numeric type (I had my own)). It doesn't work here even though I changed the procedure of finding the primitive roots of unity appropriately.
Supposedly, the problem is this: the rootsOfUnity parameter passed in originally contained only the first half of the m-th complex roots of unity, in this order:
omega^0 = 1, omega^1, omega^2, ..., omega^(n/2)
It was enough, because on these three lines of code:
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
I originally made use of the fact that, at any level of recursion (for any n and i), the complex roots of unity satisfy -omega^(i) = omega^(i + n/2).
However, that property obviously doesn't hold in finite fields. But is there any analogue of it which would allow me to still compute only the first half of the roots?
Or should I extend the cycle from n/2 to n and pre-compute all the m-th roots of unity?
Maybe there are other problems with this code?..
Thank you very much in advance!
I recently wanted to implement NTT for fast multiplication instead of DFFT too. I read a lot of confusing things, different letters everywhere and no simple solution, and my finite-fields knowledge is rusty, but today I finally got it right (after 2 days of trying and drawing analogies with DFT coefficients), so here are my insights for NTT:
Computation
X(i) = sum(j=0..n-1) of ( Wn^(i*j) * x(j) );
where X[] is the NTT-transformed x[] of size n and Wn is the NTT basis. All computations are integer modular arithmetic mod p; there are no complex numbers anywhere.
Important values
Wn = r ^ L mod p is basis for NTT
Wn = r ^ (p-1-L) mod p is basis for INTT
Rn = n ^ (p-2) mod p is scaling multiplicative constant for INTT ~(1/n)
p is a prime such that p mod n == 1 and p > max'
max is max value of x[i] for NTT or X[i] for INTT
r = <1,p)
L = <1,p) and also divides p-1
r,L must be combined so r^(L*i) mod p == 1 if i=0 or i=n
r,L must be combined so r^(L*i) mod p != 1 if 0 < i < n
max' is the sub-result max value and depends on n and type of computation. For single (I)NTT it is max' = n*max but for convolution of two n sized vectors it is max' = n*max*max etc. See Implementing FFT over finite fields for more info about it.
the working combination of r, L, p is different for different n
this is important: you have to recompute or select the parameters from a table before each NTT layer (n is always half of the previous recursion's).
Here is my C++ code that finds the r, L, p parameters (it needs modular arithmetic, which is not included; you can replace it with (a+b)%c, (a-b)%c, (a*b)%c, ... but in that case beware of overflows, especially for modpow and modmul). The code is not optimized yet; there are ways to speed it up considerably. Also, the prime table is fairly limited, so either use a Sieve of Eratosthenes or any other algorithm to obtain primes up to max' in order to work safely.
DWORD _arithmetics_primes[]=
{
2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,
179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409,
419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,
661,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883,887,907,911,919,929,937,941,
947,953,967,971,977,983,991,997,1009,1013,1019,1021,1031,1033,1039,1049,1051,1061,1063,1069,1087,1091,1093,1097,1103,1109,1117,1123,1129,1151,
0}; // end of table is 0, the more primes are there the bigger numbers and n can be used
// compute NTT consts W=r^L%p for n
int i,j,k,n=16;
long w,W,iW,p,r,L,l,e;
long max=81*n;   // edit1: max num for NTT for my multiplication purposes

for (e=1,j=0;e;j++)   // find prime p that p%n=1 AND p>max ... 9*9=81
{
    p=_arithmetics_primes[j];
    if (!p) break;
    if ((p>max)&&(p%n==1))
        for (r=2;r<p;r++)   // check all r
        {
            for (l=1;l<p;l++)   // all l that divide p-1
            {
                L=(p-1);
                if (L%l!=0) continue;
                L/=l;
                W=modpow(r,L,p);
                e=0;
                for (w=1,i=0;i<=n;i++,w=modmul(w,W,p))
                {
                    if ((i==0)&&(w!=1)) { e=1; break; }
                    if ((i==n)&&(w!=1)) { e=1; break; }
                    if ((i>0)&&(i<n)&&(w==1)) { e=1; break; }
                }
                if (!e) break;
            }
            if (!e) break;
        }
}
if (e) { error; }   // error: no combination r,l,p for n found
W =modpow(r,    L,p);   // Wn for NTT
iW=modpow(r,p-1-L,p);   // Wn for INTT
And here are my slow NTT and INTT implementations (I haven't got to fast NTT/INTT yet); they have both been tested successfully with Schönhage–Strassen multiplication.
//---------------------------------------------------------------------------
void NTT(long *dst,long *src,long n,long m,long w)
{
    long i,j,wj,wi,a,n2=n>>1;
    for (wj=1,j=0;j<n;j++)
    {
        a=0;
        for (wi=1,i=0;i<n;i++)
        {
            a=modadd(a,modmul(wi,src[i],m),m);
            wi=modmul(wi,wj,m);
        }
        dst[j]=a;
        wj=modmul(wj,w,m);
    }
}
//---------------------------------------------------------------------------
void INTT(long *dst,long *src,long n,long m,long w)
{
    long i,j,wi=1,wj=1,rN,a,n2=n>>1;
    rN=modpow(n,m-2,m);
    for (wj=1,j=0;j<n;j++)
    {
        a=0;
        for (wi=1,i=0;i<n;i++)
        {
            a=modadd(a,modmul(wi,src[i],m),m);
            wi=modmul(wi,wj,m);
        }
        dst[j]=modmul(a,rN,m);
        wj=modmul(wj,w,m);
    }
}
//---------------------------------------------------------------------------
dst is destination array
src is source array
n is array size
m is modulus (p)
w is basis (Wn)
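The modular helpers used above (modadd, modsub, modmul, modpow) are not included in the answer; a minimal sketch of what they could look like, assuming inputs are already reduced mod m and that the product a*b does not overflow the integer type, is:

// Minimal modular helpers (assumptions: 0 <= a,b < m, and a*b fits without overflow).
long modadd(long a, long b, long m) { long c = a + b; return (c >= m) ? c - m : c; }
long modsub(long a, long b, long m) { long c = a - b; return (c <  0) ? c + m : c; }
long modmul(long a, long b, long m) { return (a * b) % m; }
long modpow(long a, long b, long m)   // square-and-multiply: a^b mod m
{
    long r = 1;
    a %= m;
    for (; b > 0; b >>= 1)
    {
        if (b & 1) r = modmul(r, a, m);
        a = modmul(a, a, m);
    }
    return r;
}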
Hope this helps someone. If I forgot something, please write ...
[edit1: fast NTT/INTT]
Finally I managed to get the fast NTT/INTT to work. It was a little bit more tricky than the normal FFT:
//---------------------------------------------------------------------------
void _NFTT(long *dst,long *src,long n,long m,long w)
{
    if (n<=1) { if (n==1) dst[0]=src[0]; return; }
    long i,j,a0,a1,n2=n>>1,w2=modmul(w,w,m);
    // reorder even,odd
    for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
    for (    j=1;i<n ;i++,j+=2) dst[i]=src[j];
    // recursion
    _NFTT(src   ,dst   ,n2,m,w2);   // even
    _NFTT(src+n2,dst+n2,n2,m,w2);   // odd
    // restore results
    for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w,m))
    {
        a0=src[i];
        a1=modmul(src[j],w2,m);
        dst[i]=modadd(a0,a1,m);
        dst[j]=modsub(a0,a1,m);
    }
}
//---------------------------------------------------------------------------
void _INFTT(long *dst,long *src,long n,long m,long w)
{
    long i,rN;
    rN=modpow(n,m-2,m);
    _NFTT(dst,src,n,m,w);
    for (i=0;i<n;i++) dst[i]=modmul(dst[i],rN,m);
}
//---------------------------------------------------------------------------
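As a usage sketch (toy parameters chosen by hand, not produced by the search code above): with p = 5 and n = 4, W = 2 is a primitive 4th root of unity mod 5 and iW = 3 is its modular inverse, so a forward transform followed by the inverse transform with iW recovers the input:

long x[4] = { 1, 2, 3, 4 };          // input values, already reduced mod p
long X[4], y[4];
long n = 4, p = 5, W = 2, iW = 3;    // W = primitive n-th root of unity mod p, iW = W^-1 mod p

_NFTT (X, x, n, p, W);    // forward NTT; note that x is reused as scratch space
_INFTT(y, X, n, p, iW);   // inverse NTT with the inverse basis; y is {1,2,3,4} again

For a real convolution use-case, p must additionally be larger than the maximum possible result value (the max' constraint described above).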
[edit3]
I have optimized my code (3x faster than the code above), but I still was not satisfied with it, so I started a new question about it. There I have optimized my code even further (about 40x faster than the code above), so it is almost the same speed as an FFT on floating point of the same bit size. The link to it is here:
Modular arithmetics and NTT (finite field DFT) optimizations
To turn the Cooley-Tukey (complex) FFT into a modular arithmetic approach, i.e. NTT, you must replace the complex definition of omega. For the approach to be purely recursive, you also need to recalculate omega for each level based on the current signal size. This is possible because the minimum suitable modulus decreases as we move down the call tree, so the modulus used for the root is suitable for the lower layers. Additionally, as we are using the same modulus, the same generator may be used as we move down the call tree. Also, for the inverse transform, you should take the additional step of taking the recalculated omega a and instead using b = a^-1 as omega (via the modular inverse operation). Specifically, b = invMod(a, N) s.t. b * a == 1 (mod N), where N is the chosen prime modulus.
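Since the modulus is prime, a minimal sketch of such an invMod helper (hypothetical names, written in C++ here even though the question's code is C#, purely for illustration) can use Fermat's little theorem, a^(N-2) mod N:

#include <cstdint>

// Square-and-multiply modular exponentiation: base^exp mod m.
// Assumption: m is small enough that m*m fits in 64 bits.
uint64_t modpow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;
    base %= m;
    for (; exp > 0; exp >>= 1)
    {
        if (exp & 1) result = (result * base) % m;
        base = (base * base) % m;
    }
    return result;
}

// Modular inverse of a for a prime modulus N, via Fermat's little theorem.
uint64_t invMod(uint64_t a, uint64_t N)
{
    return modpow(a, N - 2, N);   // b = a^(N-2), so b*a == 1 (mod N)
}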
Rewriting an expression involving omega by exploiting periodicity still works in the modular arithmetic realm. You also need to find a way to determine the modulus (a prime) for the problem and a valid generator.
We note that your code works, though it is not an MWE. We extended it using common sense and got a correct result for a polynomial multiplication application. You just have to provide correct values of omega raised to certain powers.
While your code works, though, like code from many other sources, it doubles the spacing for each level. This does not lead to recursion that is as clean; it turns out to be identical to recalculating omega based on the current signal size, because the power in the omega definition is inversely proportional to the signal size. To reiterate: halving the signal size is like squaring omega, which is like doubling the powers of omega (which is what one would do when doubling the spacing). The nice thing about the approach that recalculates omega is that each subproblem is more cleanly complete in its own right.
There is a paper that shows some of the math for the modular approach; it is a paper by Baktir and Sunar from 2006. See the reference at the end of this post.
You do not need to extend the cycle from n / 2 to n.
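The reason is that the negation identity you relied on does carry over: for a primitive n-th root of unity omega modulo a prime p, omega^(n/2) == -1 (mod p), hence omega^(i + n/2) == -omega^i. A quick numeric check in C++ with toy values chosen by hand (p = 17, n = 8, omega = 2):

#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t p = 17, n = 8, w = 2;   // 2 is a primitive 8th root of unity mod 17
    uint64_t pw[8];                        // pw[i] = w^i mod p
    pw[0] = 1;
    for (int i = 1; i < 8; ++i) pw[i] = (pw[i - 1] * w) % p;
    assert(pw[n / 2] == p - 1);                                   // w^(n/2) == -1 (mod p)
    for (int i = 0; i < 4; ++i) assert(pw[i + 4] == p - pw[i]);   // w^(i+n/2) == -w^i (mod p)
    return 0;
}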
So, yes, some sources which say to just drop in a different omega definition for the modular arithmetic approach are sweeping many details under the rug.
Another issue is that the signal size must be large enough to avoid overflow in the resulting time-domain signal if we are performing convolution. Additionally, it is useful to know that fast implementations of modular exponentiation exist, even when the power is quite large.
References
Baktir and Sunar - Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform (2006)
You must make sure that roots of unity actually exist. In R there are only 2 roots of unity: 1 and -1, since x^n = 1 can only be true for them.
In C you have infinitely many roots of unity: w = exp(2*pi*i/N) is a primitive N-th root of unity, and all w^k for 0 <= k < N are N-th roots of unity.
Now to your problem: you have to make sure the ring you're working in offers the same property: enough roots of unity.
Schönhage and Strassen (http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) use integers modulo 2^N+1. This ring has enough roots of unity. 2^N == -1 is a 2nd root of unity, 2^(N/2) is a 4th root of unity, and so on. Furthermore, these roots of unity have the advantage that they are powers of two and can be implemented as binary shifts (with a modulo operation afterwards, which comes down to an add/subtract).
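As a small numeric illustration (assuming N = 8 just for the sketch, so the modulus is 2^8 + 1 = 257): 2^N is indeed congruent to -1, and squaring it gives 1:

#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t N = 8, p = (1u << N) + 1;   // 2^8 + 1 = 257
    uint64_t x = 1;
    for (uint64_t i = 0; i < N; ++i) x = (x * 2) % p;   // x = 2^N mod p
    assert(x == p - 1);          // 2^N == -1 (mod 2^N + 1): a square root of 1
    assert((x * x) % p == 1);    // so 2 is a 2N-th root of unity in this ring
    return 0;
}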
I think QuickMul (http://www.cs.nyu.edu/exact/doc/qmul.ps) works modulo 2^N-1.