Trying to find a way to construct Julia `generator` - generator

I'm new to Julia.
I mainly program in python.
In python,
if you want to iterate over a large set of values,
it is typical to construct a so-called generator to save memory usage.
Here is one example code:
def generator(N):
for i in range(N):
yield i
I wonder if there is anything alike in Julia.
After reading julia manual,
#task macro seems to have the same (or similar) functionality as generator in python.
However,
after some experiments,
the memory usage seems to be larger than usual array in julia.
I use #time in IJulia to see the memory usage.
Here is my sample code:
[Update]: Add the code for generator method
(The generator method)
function generator(N::Int)
for i in 1:N
produce(i)
end
end
(generator version)
function fun_gener()
sum = 0
g = #task generator(100000)
for i in g
sum += i
end
sum
end
#time fun_gener()
elapsed time: 0.420731828 seconds (6507600 bytes allocated)
(array version)
function fun_arry()
sum = 0
c = [1:100000]
for i in c
sum += i
end
sum
end
#time fun_arry()
elapsed time: 0.000629629 seconds (800144 bytes allocated)
Could anyone tell me why #task will require more space in this case?
And if I want to save memory usage as dealing with a large set of values,
what can I do?

I recommend the "tricked out iterators" blogpost by Carl Vogel, which discusses julia's iterator protocol, tasks and co-routines in some detail.
See also task-aka-coroutines in the julia docs.
In this case you should use the Range type (which defines an iterator protocol):
julia> function fun_arry()
sum = 0
c = 1:100000 # remove the brackets, makes this a Range
for i in c
sum += i
end
sum
end
fun_arry (generic function with 1 method)
julia> fun_arry() # warm up
5000050000
julia> #time fun_arry()
elapsed time: 8.965e-6 seconds (192 bytes allocated)
5000050000
Faster and less memory allocated (just like xrange in python 2).
A snippet from the blogpost:
From https://github.com/JuliaLang/julia/blob/master/base/range.jl, here’s how a Range’s iterator protocol is defined:
start(r::Ranges) = 0
next{T}(r::Range{T}, i) = (oftype(T, r.start + i*step(r)), i+1)
next{T}(r::Range1{T}, i) = (oftype(T, r.start + i), i+1)
done(r::Ranges, i) = (length(r) <= i)
Notice that the next method calculates the value of the iterator in state i. This is different from an Array iterator, which just reads the element a[i] from memory.
Iterators that exploit delayed evaluation like this can have important performance benefits. If we want to iterate over the integers 1 to 10,000, iterating over an Array means we have to allocate about 80MB to hold it. A Range only requires 16 bytes; the same size as the range 1 to 100,000 or 1 to 100,000,000.
You can write a generator method (using Tasks):
julia> function generator(n)
for i in 1:n # Note: we're using a Range here!
produce(i)
end
end
generator (generic function with 2 methods)
julia> for x in Task(() -> generator(3))
println(x)
end
1
2
3
Note: if you replace the Range with this, the performance is much poorer (and allocates way more memory):
julia> #time fun_arry()
elapsed time: 0.699122659 seconds (9 MB allocated)
5000050000

This question was asked (and answered) quite a while ago. Since this question is ranked high on google searches, I'd like to mention that both the question and answer are outdated.
Nowadays, I'd suggest checking out https://github.com/BenLauwens/ResumableFunctions.jl for a Julia library with a macro that implements Python-like yield generators.
using ResumableFunctions
#resumable function fibonnaci(n::Int) :: Int
a = 0
b = 1
for i in 1:n-1
#yield a
a, b = b, a+b
end
a
end
for fib in fibonnaci(10)
println(fib)
end
Since its scope is much more limited than full coroutines, it is also an order of magnitude more efficient than pushing values into a channel since it can compile the generator into a FSM. (Channels have replaced the old produce() function mentioned in the question and previous answers).
With that said, I'd still suggest pushing into a channel as your first approach if performance isn't an issue, because resumablefunctions can sometimes be finicky when compiling your function and can occasionally hit some worst-case behaviour. In particular, because it is a macro that compiles to an FSM rather than a function, you currently need to annotate the types of all variables in the Resumablefunction to get good performance, unlike vanilla Julia functions where this is handled by JIT when the function is first called.

I think that Task has been superseded by Channel(). The usage in terms of Ben Lauwens's Fibonacci generator is:
fibonacci(n) = Channel(ctype=Int) do c
a = 1
b = 1
for i in 1:n
push!(c, a)
a, b = b, a + b
end
end
it can be used using
for a in fibonacci(10)
println(a)
end
1
1
2
3
5
8
13
21
34
55

Related

Is using base case variable in a recursive function important?

I'm currently learning about recursion, it's pretty hard to understand. I found a very common example for it:
function factorial(N)
local Value
if N == 0 then
Value = 1
else
Value = N * factorial(N - 1)
end
return Value
end
print(factorial(3))
N == 0 is the base case. But when i changed it into N == 1, the result is still remains the same. (it will print 6).
Is using the base case important? (will it break or something?)
What's the difference between using N == 0 (base case) and N == 1?
That's just a coincidence, since 1 * 1 = 1, so it ends up working either way.
But consider the edge-case where N = 0, if you check for N == 1, then you'd go into the else branch and calculate 0 * factorial(-1), which would lead to an endless loop.
The same would happen in both cases if you just called factorial(-1) directly, which is why you should either check for > 0 instead (effectively treating every negative value as 0 and returning 1, or add another if condition and raise an error when N is negative.
EDIT: As pointed out in another answer, your implementation is not tail-recursive, meaning it accumulates memory for every recursive functioncall until it finishes or runs out of memory.
You can make the function tail-recursive, which allows Lua to treat it pretty much like a normal loop that could run as long as it takes to calculate its result:
local function factorial(n, acc)
acc = acc or 1
if n <= 0 then
return acc
else
return factorial(n-1, acc*n)
end
return Value
end
print(factorial(3))
Note though, that in the case of factorial, it would take you way longer to run out of stack memory than to overflow Luas number data type at around 21!, so making it tail-recursive is really just a matter of training yourself to write better code.
As the above answer and comments have pointed out, it is essential to have a base-case in a recursive function; otherwise, one ends up with an infinite loop.
Also, in the case of your factorial function, it is probably more efficient to use a helper function to perform the recursion, so as to take advantage of Lua's tail-call optimizations. Since Lua conveniently allows for local functions, you can define a helper within the scope of your factorial function.
Note that this example is not meant to handle the factorials of negative numbers.
-- Requires: n is an integer greater than or equal to 0.
-- Effects : returns the factorial of n.
function fact(n)
-- Local function that will actually perform the recursion.
local function fact_helper(n, i)
-- This is the base case.
if (i == 1) then
return n
end
-- Take advantage of tail calls.
return fact_helper(n * i, i - 1)
end
-- Check for edge cases, such as fact(0) and fact(1).
if ((n == 0) or (n == 1)) then
return 1
end
return fact_helper(n, n - 1)
end

Julia: Base function 10,000x quicker then similar function

I'm playing around with the decimal to binary converter 'bin()' in Julia, wanting to improve performance. I need to use BigInts for this problem, and calling bin() with a bigInt from within my file outputs the correct binary representation; however, calling a function similar to the bin() function costs a minute in time, while bin() takes about .003 seconds. Why is there this huge difference?
function binBase(x::Unsigned, pad::Int, neg::Bool)
i = neg + max(pad,sizeof(x)<<3-leading_zeros(x))
a = Array(Uint8,i)
while i > neg
a[i] = '0'+(x&0x1)
x >>= 1
i -= 1
end
if neg; a[1]='-'; end
ASCIIString(a)
end
function bin1(x::BigInt, pad::Int)
y = bin(x)
end
function bin2(x::BigInt, pad::Int,a::Array{Uint8,1}, neg::Bool)
while pad > neg
a[pad] = '0'+(x&0x1)
x >>= 1
pad -= 1
end
if neg; a[1]='-'; end
ASCIIString(a)
end
function test()
a = Array(Uint8,1000001)
x::BigInt= 2
x = (x^1000000)
#time bin1(x,1000001)
#time bin2(x,1000001,a,true)
end
test()
As noted by Felipe Lema, Base delegates BigInt printing to GMP, which can print BigInts without doing any intermediate computations with them – doing lots of computations with BigInts to figure out their digits is quite slow and ends up allocating a lot of memory. The bottom line: doing x >>= 1 is extremely efficient for things like Int64 values but not that efficient for things like BigInts.
Using julia's profiling tools I can see that Base.bin is calling a C function from libGMP, which has all sorts of machine specific optimizations (somewhere here is mpn_get_str that is being called).
#profile bin1(x,1000001)
Profile.print()
Profile.clear()
#profile bin2(x,1000001,a,true)
Profile.print()
Profile.clear()
I could also see a huge difference in bytes allocates (bin1:1000106, bin2:62648125016) which would require some more profiling and tunning, but I guess the previous paragraph is enough for an answer.

Call function for all elements in an array

Let's say I have a function, like:
function [result] = Square( x )
result = x * x;
end
And I have an array like the following,
x = 0:0.1:1;
I want to have an y array, which stores the squares of x's using my Square function. Sure, one way would be the following,
y = zeros(1,10);
for i = 1:10
y(i) = Square(x(i));
end
However, I guess there should be a more elegant way of doing it. I tried some of my insights and made some search, however couldn't find any solution. Any suggestions?
For the example you give:
y = x.^2; % or
y = x.*x;
in which .* and .^ are the element-wise versions of * and ^. This is the simplest, fastest way there is.
More general:
y = arrayfun(#Square, x);
which can be elegant, but it's usually pretty slow compared to
y = zeros(size(x));
for ii = 1:numel(x)
y(ii) = Square(x(ii)); end
I'd actually advise to stay away from arrayfun until profiling has showed that it is faster than a plain loop. Which will be seldom, if ever.
In new Matlab versions (R2008 and up), the JIT accelerates loops so effectively that things like arrayfun might actually disappear in a future release.
As an aside: note that I've used ii instead of i as the loop variable. In Matlab, i and j are built-in names for the imaginary unit. If you use it as a variable name, you'll lose some performance due to the necessary name resolution required. Using anything other than i or j will prevent that.
You want arrayfun.
arrayfun(#Square, x)
See help arrayfun
(tested only in GNU Octave, I do not have MATLAB)
Have you considered the element-by-element operator .*?
See the documentation for arithmetic operators.
I am going to assume that you will not be doing something as simple as a square operation and what you are trying to do is not already vectorised in MATLAB.
It is better to call the function once, and do the loop in the function. As the number of elements increase, you will notice significant increase in operation time.
Let our functions be:
function result = getSquare(x)
result = x*x; % I did not use .* on purpose
end
function result = getSquareVec(x)
result = zeros(1,numel(x));
for idx = 1:numel(x)
result(:,idx) = x(idx)*x(idx);
end
end
And let's call them from a script:
y = 1:10000;
tic;
for idx = 1:numel(y)
res = getSquare(y(idx));
end
toc
tic;
res = getSquareVec(y);
toc
I ran the code a couple of times and turns out calling the function only once is at least twice as fast.
Elapsed time is 0.020524 seconds.
Elapsed time is 0.008560 seconds.
Elapsed time is 0.019019 seconds.
Elapsed time is 0.007661 seconds.
Elapsed time is 0.022532 seconds.
Elapsed time is 0.006731 seconds.
Elapsed time is 0.023051 seconds.
Elapsed time is 0.005951 seconds.

Addition as binary operations

I'm adding a pair of unsigned 32bit binary integers (including overflow). The addition is expressive rather than actually computed, so there's no need for an efficient algorithm, but since each component is manually specified in terms of individual bits, I need one with a compact representation. Any suggestions?
Edit: In terms of boolean operators. So I'm thinking that carry = a & b; sum = a ^ b; for the first bit, but the other 31?
Oh, and subtraction!
You can not perform addition with simple boolean operators, you need an adder. (Of course the adder can be built using some more complex boolean operators.)
The adder adds two bits plus carry, and passes carry out to next bit.
Pseudocode:
carry = 0
for i = 31 to 0
sum = a[i] + b[i] + carry
result[i] = sum & 1
carry = sum >> 1
next i
Here is an implementation using the macro language of VEDIT text editor.
The two numbers to be added are given as ASCII strings, one on each line.
The results are inserted on the third line.
Reg_Empty(10) // result as ASCII string
#0 = 0 // carry bit
for (#9=31; #9>=0; #9--) {
#1 = CC(#9)-'0' // a bit from first number
#2 = CC(#9+34)-'0' // a bit from second number
#3 = #0+#1+#2 // add with carry
#4 = #3 & 1 // resulting bit
#0 = #3 >> 1 // new carry
Num_Str(#4, 11, LEFT) // convert bit to ASCII
Reg_Set(10, #11, INSERT) // insert bit to start of string
}
Line(2)
Reg_Ins(10) IN
Return
Example input and output:
00010011011111110101000111100001
00110110111010101100101101110111
01001010011010100001110101011000
Edit:
Here is pseudocode where the adder has been implemented with boolean operations:
carry = 0
for i = 31 to 0
sum[i] = a[i] ^ b[i] ^ carry
carry = (a[i] & b[i]) | (a[i] & carry) | (b[i] & carry)
next i
Perhaps you can begin by stating addition for two 1-bit numbers, with overflow (=carry):
A | B | SUM | CARRY
===================
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1
To generalize this further, you need a "full adder" which also takes a carry as an input, from the preceding stage. Then you can express the 32-bit addition as a chain of 32 such full adders (with the first stage's carry input tied to 0).
Regarding data structure part to represent these numbers. There are 4 ways
1) Bit Array
A bit array is an array data structure that compactly stores individual bits.
They are also known as bitmap, bitset or bitstring.
2) Bit Field
A bit field is a common idiom used in computer programming to compactly store multiple logical values as a short series of bits where each of the single bits can be addressed separately.
3) Bit Plane
A bit plane of a digital discrete signal (such as image or sound) is a set of bits corresponding to a given bit position in each of the binary numbers representing the signal.
4) Bit Board
A bitboard or bit field is a format that stuffs a whole group of related boolean variables into the same integer, typically representing positions on a board game.
Regarding implementation, you can check that at each step, we have following
S = a xor b xor c
S is result of sum of current bits a an b
c is input carry
Cout - output carry is (a & b) xor (c & (a xor b))

Translation from Complex-FFT to Finite-Field-FFT

Good afternoon!
I am trying to develop an NTT algorithm based on the naive recursive FFT implementation I already have.
Consider the following code (coefficients' length, let it be m, is an exact power of two):
/// <summary>
/// Calculates the result of the recursive Number Theoretic Transform.
/// </summary>
/// <param name="coefficients"></param>
/// <returns></returns>
private static BigInteger[] Recursive_NTT_Skeleton(
IList<BigInteger> coefficients,
IList<BigInteger> rootsOfUnity,
int step,
int offset)
{
// Calculate the length of vectors at the current step of recursion.
// -
int n = coefficients.Count / step - offset / step;
if (n == 1)
{
return new BigInteger[] { coefficients[offset] };
}
BigInteger[] results = new BigInteger[n];
IList<BigInteger> resultEvens =
Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset);
IList<BigInteger> resultOdds =
Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset + step);
for (int k = 0; k < n / 2; k++)
{
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
}
return results;
}
It worked for complex FFT (replace BigInteger with a complex numeric type (I had my own)). It doesn't work here even though I changed the procedure of finding the primitive roots of unity appropriately.
Supposedly, the problem is this: rootsOfUnity parameter passed originally contained only the first half of m-th complex roots of unity in this order:
omega^0 = 1, omega^1, omega^2, ..., omega^(n/2)
It was enough, because on these three lines of code:
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
I originally made use of the fact, that at any level of recursion (for any n and i), the complex root of unity -omega^(i) = omega^(i + n/2).
However, that property obviously doesn't hold in finite fields. But is there any analogue of it which would allow me to still compute only the first half of the roots?
Or should I extend the cycle from n/2 to n and pre-compute all the m-th roots of unity?
Maybe there are other problems with this code?..
Thank you very much in advance!
I recently wanted to implement NTT for fast multiplication instead of DFFT too. Read a lot of confusing things, different letters everywhere and no simple solution, and also my finite fields knowledge is rusty , but today i finally got it right (after 2 days of trying and analog-ing with DFT coefficients) so here are my insights for NTT:
Computation
X(i) = sum(j=0..n-1) of ( Wn^(i*j)*x(i) );
where X[] is NTT transformed x[] of size n where Wn is the NTT basis. All computations are on integer modulo arithmetics mod p no complex numbers anywhere.
Important values
Wn = r ^ L mod p is basis for NTT
Wn = r ^ (p-1-L) mod p is basis for INTT
Rn = n ^ (p-2) mod p is scaling multiplicative constant for INTT ~(1/n)
p is prime that p mod n == 1 and p>max'
max is max value of x[i] for NTT or X[i] for INTT
r = <1,p)
L = <1,p) and also divides p-1
r,L must be combined so r^(L*i) mod p == 1 if i=0 or i=n
r,L must be combined so r^(L*i) mod p != 1 if 0 < i < n
max' is the sub-result max value and depends on n and type of computation. For single (I)NTT it is max' = n*max but for convolution of two n sized vectors it is max' = n*max*max etc. See Implementing FFT over finite fields for more info about it.
functional combination of r,L,p is different for different n
this is important, you have to recompute or select parameters from table before each NTT layer (n is always half of the previous recursion).
Here is my C++ code that finds the r,L,p parameters (needs modular arithmetics which is not included, you can replace it with (a+b)%c,(a-b)%c,(a*b)%c,... but in that case beware of overflows especial for modpow and modmul) The code is not optimized yet there are ways to speed it up considerably. Also prime table is fairly limited so either use SoE or any other algo to obtain primes up to max' in order to work safely.
DWORD _arithmetics_primes[]=
{
2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,
179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409,
419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,
661,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883,887,907,911,919,929,937,941,
947,953,967,971,977,983,991,997,1009,1013,1019,1021,1031,1033,1039,1049,1051,1061,1063,1069,1087,1091,1093,1097,1103,1109,1117,1123,1129,1151,
0}; // end of table is 0, the more primes are there the bigger numbers and n can be used
// compute NTT consts W=r^L%p for n
int i,j,k,n=16;
long w,W,iW,p,r,L,l,e;
long max=81*n; // edit1: max num for NTT for my multiplication purposses
for (e=1,j=0;e;j++) // find prime p that p%n=1 AND p>max ... 9*9=81
{
p=_arithmetics_primes[j];
if (!p) break;
if ((p>max)&&(p%n==1))
for (r=2;r<p;r++) // check all r
{
for (l=1;l<p;l++)// all l that divide p-1
{
L=(p-1);
if (L%l!=0) continue;
L/=l;
W=modpow(r,L,p);
e=0;
for (w=1,i=0;i<=n;i++,w=modmul(w,W,p))
{
if ((i==0) &&(w!=1)) { e=1; break; }
if ((i==n) &&(w!=1)) { e=1; break; }
if ((i>0)&&(i<n)&&(w==1)) { e=1; break; }
}
if (!e) break;
}
if (!e) break;
}
}
if (e) { error; } // error no combination r,l,p for n found
W=modpow(r, L,p); // Wn for NTT
iW=modpow(r,p-1-L,p); // Wn for INTT
and here is my slow NTT and INTT implementations (i havent got to fast NTT,INTT yet) they are both tested with Schönhage–Strassen multiplication successfully.
//---------------------------------------------------------------------------
void NTT(long *dst,long *src,long n,long m,long w)
{
long i,j,wj,wi,a,n2=n>>1;
for (wj=1,j=0;j<n;j++)
{
a=0;
for (wi=1,i=0;i<n;i++)
{
a=modadd(a,modmul(wi,src[i],m),m);
wi=modmul(wi,wj,m);
}
dst[j]=a;
wj=modmul(wj,w,m);
}
}
//---------------------------------------------------------------------------
void INTT(long *dst,long *src,long n,long m,long w)
{
long i,j,wi=1,wj=1,rN,a,n2=n>>1;
rN=modpow(n,m-2,m);
for (wj=1,j=0;j<n;j++)
{
a=0;
for (wi=1,i=0;i<n;i++)
{
a=modadd(a,modmul(wi,src[i],m),m);
wi=modmul(wi,wj,m);
}
dst[j]=modmul(a,rN,m);
wj=modmul(wj,w,m);
}
}
//---------------------------------------------------------------------------
dst is destination array
src is source array
n is array size
m is modulus (p)
w is basis (Wn)
hope this helps to someone. If i forgot something please write ...
[edit1: fast NTT/INTT]
Finally I manage to get fast NTT/INTT to work. Was little bit more tricky than normal FFT:
//---------------------------------------------------------------------------
void _NFTT(long *dst,long *src,long n,long m,long w)
{
if (n<=1) { if (n==1) dst[0]=src[0]; return; }
long i,j,a0,a1,n2=n>>1,w2=modmul(w,w,m);
// reorder even,odd
for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
for ( j=1;i<n ;i++,j+=2) dst[i]=src[j];
// recursion
_NFTT(src ,dst ,n2,m,w2); // even
_NFTT(src+n2,dst+n2,n2,m,w2); // odd
// restore results
for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w,m))
{
a0=src[i];
a1=modmul(src[j],w2,m);
dst[i]=modadd(a0,a1,m);
dst[j]=modsub(a0,a1,m);
}
}
//---------------------------------------------------------------------------
void _INFTT(long *dst,long *src,long n,long m,long w)
{
long i,rN;
rN=modpow(n,m-2,m);
_NFTT(dst,src,n,m,w);
for (i=0;i<n;i++) dst[i]=modmul(dst[i],rN,m);
}
//---------------------------------------------------------------------------
[edit3]
I have optimized my code (3x times faster than code above),but still i am not satisfied with it so i started new question with it. There I have optimized my code even further (about 40x times faster than code above) so its almost the same speed as FFT on floating point of the same bit size. Link to it is here:
Modular arithmetics and NTT (finite field DFT) optimizations
To turn Cooley-Tukey (complex) FFT into modular arithmetic approach, i.e. NTT, you must replace complex definition for omega. For the approach to be purely recursive, you also need to recalculate omega for each level based on current signal size. This is possible because min. suitable modulus decreases as we move down in the call tree, so modulus used for root is suitable for lower layers. Additionally, as we are using same modulus, the same generator may be used as we move down the call tree. Also, for inverse transform, you should take additional step to take recalculated omega a and instead use as omega: b = a ^ -1 (via using inverse modulo operation). Specifically, b = invMod(a, N) s.t. b * a == 1 (mod N), where N is the chosen prime modulus.
Rewriting an expression involving omega by exploiting periodicity still works in modular arithmetic realm. You also need to find a way to determine the modulus (a prime) for the problem and a valid generator.
We note that your code works, though it is not a MWE. We extended it using common sense, and got correct result for a polynomial multiplication application. You just have to provide correct values of omega raised to certain powers.
While your code works, though, like from many other sources, you double spacing for each level. This does not lead to recursion that is as clean, though; this turns out to be identical to recalculating omega based on current signal size because the power for omega definition is inversely proportional to signal size. To reiterate: halving signal size is like squaring omega, which is like giving doubled powers for omega (which is what one would do for doubling of spacing). The nice thing about the approach that deals with recalculating of omega is that each subproblem is more cleanly complete in its own right.
There is a paper that shows some of the math for modular approach; it is a paper by Baktir and Sunar from 2006. See the paper at the end of this post.
You do not need to extend the cycle from n / 2 to n.
So, yes, some sources which say to just drop in a different omega definition for modular arithmetic approach are sweeping under the rug many details.
Another issue is that it is important to acknowledge that the signal size must be large enough if we are to not have overflow for result time-domain signal if we are performing convolution. Additionally, it may be useful to find certain implementations for exponentiation subject to modulus exist that are fast, even if the power is quite large.
References
Baktir and Sunar - Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform (2006)
You must make sure that roots of unity actually exist. In R there are only 2 roots of unity: 1 and -1, since only for them x^n=1 can be true.
In C you have infinitely many roots of unity: w=exp(2*pi*i/N) is a primitive N-th roots of unity and all w^k for 0<=k
Now to your problem: you have to make sure the ring you're working in offers the same property: enough roots of unity.
Schönhage and Strassen (http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) use integers modulo 2^N+1. This ring has enough roots of unity. 2^N == -1 is a 2nd root of unity, 2^(N/2) is a 4th root of unity and so on. Furthermore, these roots of unity have the advantage that they are powers of two and can be implemented as binary shifts (with a modulo operation afterwards, which comes down to a add/subtract).
I think QuickMul (http://www.cs.nyu.edu/exact/doc/qmul.ps) works modulo 2^N-1.