How to look for factorization of one integer in linear sieve algorithm without divisions? - prime-factoring

I learned an algorithm called the "linear sieve" (https://cp-algorithms.com/algebra/prime-sieve-linear.html) that is able to get all prime numbers smaller than N in linear time.
As a by-product, this algorithm fills an array lp[x] that stores the minimal prime factor of each number x.
So we can follow lp[x] to find the first prime factor, and continue dividing to get all the factors.
The article also mentions that with the help of one extra array we can get all factors without division. How can that be achieved?
The article said: "... Moreover, using just one extra array will allow us to avoid divisions when looking for factorization."

The algorithm is due to Pritchard. It is a variant of Algorithm 3.3 in Paul Pritchard: "Linear Prime-Number Sieves: a Family Tree", Science of Computer Programming, vol. 9 (1987), pp. 17-35.
Here's the code with an unnecessary test removed, and an extra vector used to store the factor:
for (int i=2; i <= N; ++i) {
    if (lp[i] == 0) {
        lp[i] = i;
        pr.push_back(i);
    }
    for (int j=0; i*pr[j] <= N; ++j) {
        lp[i*pr[j]] = pr[j];
        factor[i*pr[j]] = i;
        if (pr[j] == lp[i]) break;
    }
}
Afterwards, to get all the prime factors of a number x, take lp[x] as the first prime factor, then repeat with factor[x] in place of x, stopping once lp[x] == x. E.g. with x=20: lp[20]=2 and factor[20]=10; then lp[10]=2 and factor[10]=5; then lp[5]=5 and we stop. So the prime factorization is 20 = 2*2*5.
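The procedure above can be sketched in Python as follows. This is only an illustration under assumptions: the names linear_sieve and prime_factors are made up here, and factor[p] is set to 1 for primes (the C++ code leaves it at 0) so that the walk can stop at 1 instead of testing lp[x] == x.

```python
def linear_sieve(n):
    """Return (lp, factor) for 0..n; factor[x] == x // lp[x]."""
    lp = [0] * (n + 1)      # lp[x]: least prime factor of x
    factor = [0] * (n + 1)  # factor[x]: cofactor of lp[x], stored so no division is needed later
    primes = []
    for i in range(2, n + 1):
        if lp[i] == 0:       # i was never marked, so it is prime
            lp[i] = i
            factor[i] = 1    # assumption of this sketch: lets the walk below end at 1
            primes.append(i)
        for p in primes:
            if i * p > n:
                break
            lp[i * p] = p
            factor[i * p] = i
            if p == lp[i]:   # each composite is marked exactly once
                break
    return lp, factor


def prime_factors(x, lp, factor):
    """Walk the lp/factor arrays; uses lookups only, no division."""
    out = []
    while x > 1:
        out.append(lp[x])
        x = factor[x]
    return out


lp, factor = linear_sieve(1000)
print(prime_factors(20, lp, factor))
```

The extra array costs one more integer per number, which is the trade-off for replacing each division by a table lookup.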


How to interpret FFT data for making a spectrum visualizer

I am trying to visualize a spectrum where the frequency range is divided into N bars, either linearly or logarithmically. The FFT seems to work fine, but I am not sure how to interpret the values in order to decide the max height for the visualization.
I am using FMODAudio, a wrapper for C#. It's set up correctly.
In the case of a linear spectrum, the bars are defined as follows:
public int InitializeSpectrum(int windowSize = 1024, int maxBars = 16)
{
numSamplesPerBar_Linear.Clear();
int barSamples = (windowSize / 2) / maxBars;
for (int i = 0; i < maxBars; ++i)
{
numSamplesPerBar_Linear.Add(barSamples);
}
IsInitialized = true;
Data = new float[numSamplesPerBar_Linear.Count];
return numSamplesPerBar_Linear.Count;
}
Data is the array which holds the spectrum values received from the update loop.
The update looks like this:
public unsafe void UpdateSpectrum(ref ParameterFFT* fftData)
{
    int length = fftData->Length / 2;
    if (length > 0)
    {
        int indexFFT = 0;
        for (int index = 0; index < numSamplesPerBar_Linear.Count; ++index)
        {
            for (int frec = 0; frec < numSamplesPerBar_Linear[index]; ++frec)
            {
                for (int channel = 0; channel < fftData->ChannelCount; ++channel)
                {
                    var floatspectrum = fftData->GetSpectrum(channel); //this is a readonlyspan<float> by default.
                    Data[index] += floatspectrum[indexFFT];
                }
                ++indexFFT;
            }
            Data[index] /= (float)(numSamplesPerBar_Linear[index] * fftData->ChannelCount); // average of both channels for more meaningful values.
        }
    }
}
The values I get when testing a song are very low across the bands.
A randomly chosen moment when playing a song gives these values:
16 bars = 0,0326 0,0031 0,001 0,0003 0,0004 0,0003 0,0001 0,0002 0,0001 0,0001 0,0001 0 0 0 0 0
I realize it's more useful to use a logarithmic spectrum in many cases, and I intend to, but I still need to figure out how to find the max values for each bar so that I can set up the visualization on a proper scale.
Q: How can I know the potential max values for each bar based on this setup (it's not 1.0)?
The output of an FFT call is an array in which each element is a complex number (A + Bi), where A is the real component and B the imaginary component. Element zero of this array represents frequency zero, i.e. DC, which is the offset bias and can typically be ignored. As you iterate across the elements of this array, the frequency increases by a fixed increment per element. This increment is calculated using
Audio_samples <-- array of raw audio samples in PCM format which gets
fed into FFT call
num_fft_bins := float64(len(Audio_samples)) / 2.0 // using Nyquist theorem
freq_incr_per_bin := (input_audio_sample_rate / 2.0) / num_fft_bins
So, to answer your question: the output array of the FFT call is a linear progression, evenly spaced by the frequency increment computed above.
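A minimal Python sketch of that bin-to-frequency mapping. The function name and the sample values (48 kHz sample rate, window of 1024 samples) are assumptions chosen for illustration, not values taken from the question.

```python
def bin_frequencies(sample_rate, window_size):
    """Start frequency in Hz of each usable FFT bin (0 .. Nyquist)."""
    num_bins = window_size // 2                 # only the first half is unique for real input
    step = (sample_rate / 2.0) / num_bins       # same as sample_rate / window_size
    return [k * step for k in range(num_bins)]


freqs = bin_frequencies(48000, 1024)
print(len(freqs), freqs[1])   # number of bins and the spacing between them
```

With 16 linear bars over these 512 bins, each bar would cover 32 consecutive bins, i.e. 32 times the bin spacing.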
Depends on your input data to the FFT, and the scaling that your particular FFT implementation uses (not all FFTs use the same scale factor).
With an energy-preserving forward FFT, Parseval's theorem applies, so the energy (sum of squares) of the input vector equals the energy of the FFT result vector. Note that for a pure tone, i.e. a sinusoid with an integer number of periods in the FFT aperture, all that energy can appear in a single FFT result element (for real-valued input it is shared between that element and its mirror-image bin). So if you know the maximum possible input energy, you can use that to compute the maximum possible result element magnitude for scaling purposes.
The range is often large enough that visualizers commonly need to use log scaling, or else typical input can get pixel quantized to a graph of all zeros.
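As a rough check of the Parseval argument, here is a small Python sketch using a naive, unnormalized DFT. The window size, tone frequency and amplitude are assumptions for illustration, and the scale factor that FMOD applies to its spectrum is not stated here, so the numbers only show the principle.

```python
import cmath
import math


def dft(x):
    """Naive unnormalized DFT, O(n^2); fine for a small demonstration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


n, cycles, amplitude = 64, 5, 1.0   # assumed values: 5 full periods in the window
tone = [amplitude * math.sin(2 * math.pi * cycles * t / n) for t in range(n)]
spectrum = dft(tone)

energy_time = sum(v * v for v in tone)
energy_freq = sum(abs(c) ** 2 for c in spectrum) / n   # 1/n because this DFT is unnormalized
peak = max(abs(c) for c in spectrum[: n // 2])

print(energy_time, energy_freq, peak)
```

For this unnormalized transform the peak bin of a full-scale pure tone comes out at amplitude * n / 2, which would be the natural upper bound to scale the bars against; an implementation that divides by n or by n / 2 shifts that bound accordingly.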

Free-Pascal Implementation of the Sieve of Eratosthenes

My teacher gave me an assignment like this:
Using the number n given, find the largest prime number p with p<=n and n<=10^9.
I tried doing this by using the following function:
Const amax=1000000000;
Var i,j,n:longint;
    a:array [1..amax] of boolean;
Function lp(n:longint):longint;
Var max:longint;
Begin
  For i:=1 to n do a[i]:=true;
  For i:=2 to round(sqrt(n)) do
    If (a[i]=true) then
      For j:=1 to n div i do
        If (i*i+(j-1)*i<=n) then
          a[i*i+(j-1)*i]:=false;
  max:=0;
  i:=n;
  While max=0 do
  Begin
    If a[i]=true then max:=i;
    i:=i-1;
  End;
  lp:=max;
End;
This code worked flawlessly for numbers such as 1 million, but when I tried n=10^9, the program took a long time to print the output. So here's my question: are there any ways to improve my code so that it runs faster? Or maybe a different approach?
The most important aspect here is that the greatest prime that is not greater than n must be fairly close to n. A quick look at The Gaps Between Primes (at The Prime Pages - always worth a look for everything to do with primes) shows that for 32-bit numbers the gaps between primes cannot be greater than 335. This means that the greatest prime not greater than n must be in the range [n - 335, n]. In other words, at most 336 candidates need to be checked - for example via trial division - and this is bound to be lots faster than sieving a billion numbers.
Trial division is a reasonable choice for tasks of this kind, because the range to be scanned is so small. In my answer to Prime sieve implementation (using trial division) in C++ I analysed a couple of ways for speeding it up.
The Sieve of Eratosthenes is also a good choice, it just needs to be modified to sieve only the range of interest instead of all numbers from 1 to n. This is called a 'windowed sieve' because it sieves only a window. Since the window will most likely not contain all the primes up to the square root of n (i.e. all the primes that could be potential least prime factors of composites in the range to be scanned) it is best to sieve the factor primes via a separate, simple Sieve of Eratosthenes.
First I'm showing a simple rendition of normal (non-windowed) sieve, as a baseline for comparing the windowed code to. I'm using C# in order to show the algorithm more clearly than would be possible with Pascal.
List<uint> small_primes_up_to (uint n)
{
if (n == uint.MaxValue)
throw new ArgumentOutOfRangeException("n", "n must be less than UINT32_MAX");
var eliminated = new bool[n + 1]; // +1 because indexed by numbers
eliminated[0] = true;
eliminated[1] = true;
for (uint i = 2, sqrt_n = (uint)Math.Sqrt(n); i <= sqrt_n; ++i)
if (!eliminated[i])
for (uint j = i * i; j <= n; j += i)
eliminated[j] = true;
return remaining_unmarked_numbers(eliminated, 0);
}
The function has 'small' in its name because it is not really suited for sieving big ranges; I use similar code (with a few bells and whistles) only for sieving the small factor primes needed by more advanced sieves.
The code for extracting the sieved primes is equally simple:
List<uint> remaining_unmarked_numbers (bool[] eliminated, uint sieve_base)
{
var result = new List<uint>();
for (uint i = 0, e = (uint)eliminated.Length; i < e; ++i)
if (!eliminated[i])
result.Add(sieve_base + i);
return result;
}
Now, the windowed version. One difference is that the potential least factor primes need to be sieved separately (by the function just shown) as explained earlier. Another difference is that the starting point of the crossing-off sequence for a given prime may lie outside the range to be sieved. If the starting point lies before the start of the window then a bit of modulo magic is necessary to find the first 'hop' that lands in the window. From then on everything proceeds as usual.
List<uint> primes_between (uint m, uint n)
{
    m = Math.Max(m, 2);
    if (m > n)
        return new List<uint>(); // empty range -> no primes
    // index overflow in the inner loop unless `(sieve_bits - 1) + stride <= UINT32_MAX`
    if (n - m > uint.MaxValue - 65521) // highest prime not greater than sqrt(UINT32_MAX)
        throw new ArgumentOutOfRangeException("n", "(n - m) must be <= UINT32_MAX - 65521");
    uint sieve_bits = n - m + 1;
    var eliminated = new bool[sieve_bits];
    foreach (uint prime in small_primes_up_to((uint)Math.Sqrt(n)))
    {
        uint start = prime * prime, stride = prime;
        if (start >= m)
            start -= m;
        else
            start = (stride - 1) - (m - start - 1) % stride;
        for (uint j = start; j < sieve_bits; j += stride)
            eliminated[j] = true;
    }
    return remaining_unmarked_numbers(eliminated, m);
}
The two '-1' terms in the modulo calculation may seem strange, but they bias the logic down by 1 to eliminate the inconvenient case stride - foo % stride == stride that would need to be mapped to 0.
With this, the greatest prime not exceeding n could be computed like this:
uint greatest_prime_not_exceeding (uint n)
{
return primes_between(n - Math.Min(n, 335), n).Last();
}
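For readers who want to try the idea outside of C#, here is a compact Python sketch of the same windowed sieve. It is not a translation of the code above: it finds the first multiple in the window with a ceiling division instead of the modulo trick, it assumes n >= 2, and the bound of 335 relies on n fitting in 32 bits as discussed earlier.

```python
import math


def small_primes_up_to(n):
    """Plain Sieve of Eratosthenes, used only for the small factor primes."""
    if n < 2:
        return []
    composite = [False] * (n + 1)
    for i in range(2, math.isqrt(n) + 1):
        if not composite[i]:
            for j in range(i * i, n + 1, i):
                composite[j] = True
    return [i for i in range(2, n + 1) if not composite[i]]


def primes_between(m, n):
    """Primes in [m, n], sieving only that window."""
    m = max(m, 2)
    if m > n:
        return []
    composite = [False] * (n - m + 1)      # index k stands for the number m + k
    for p in small_primes_up_to(math.isqrt(n)):
        first_in_window = (m + p - 1) // p * p   # smallest multiple of p that is >= m
        start = max(p * p, first_in_window)
        for j in range(start, n + 1, p):
            composite[j - m] = True
    return [m + k for k, c in enumerate(composite) if not c]


def greatest_prime_not_exceeding(n):
    # assumes 2 <= n < 2**32, so a prime exists in [n - 335, n]
    return primes_between(max(n - 335, 2), n)[-1]


print(greatest_prime_not_exceeding(10**9))
```

Only the 336-element window and the primes up to sqrt(n) (about 31622 for n = 10^9) are ever touched, which is why this is so much cheaper than sieving the full range.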
This takes less than a millisecond all told, including the sieving of the factor primes and so on, even though the code contains no optimisations whatsoever. A good overview of applicable optimisations can be found in my answer to prime number summing still slow after using sieve; with the techniques shown there the whole range up to 10^9 can be sieved in about half a second.

Is there an efficient way to optimize my serialized code?

This question lacks details, so I decided to create another question instead of editing this one. The new question is here: Can i parallelize my code or it is not worth?
I have a program running in CUDA, where one piece of the code runs within a loop (serialized, as you can see below). This piece of code is a search within an array that contains addresses and/or NULL pointers. All the threads execute the code below.
while (i < n) {
if (array[i] != NULL) {
return array[i];
}
i++;
}
return NULL;
Where n is the size of array and array is in shared memory. I'm only interested in the first address that is different from NULL (first match).
The whole code (I've posted only a piece, the whole code is big) runs fast, but the "heart" of the code (i.e., the part that is repeated most) is serialized, as you can see. I want to know if I can parallelize this part (the search) with some optimized algorithm.
Like I said, the program is already in CUDA (and the array is on the device), so there will be no memory transfers from host to device or vice versa.
My problem is: n is not big. It will rarely be greater than 8.
I've tried to parallelize it, but my "new" code took more time than the code above.
I was studying reduction and min operations, but I found that they are only useful when n is big.
So, any tips? Can I parallelize it efficiently, i.e., with low overhead?
Keeping things simple, one of the major limiting factors of GPGPU code is memory management. In most computers copying memory to the device (GPU) is a slow process.
As illustrated by http://www.ncsa.illinois.edu/~kindr/papers/ppac09_paper.pdf:
"The key requirement for obtaining effective
acceleration from GPU subroutine libraries is minimization of
I/O between the host and the GPU."
This is because I/O operations between host and device are SLOW!
Tying this back to your problem, it doesn't really make sense to run on the GPU since the amount of data you mention is so small. You would spend more time running the memcpy routines than it would take to run on the CPU in the first place - especially since you mention you are only interested in the first match.
One common misconception that many people have is that 'if I run it on the GPU, it has more cores so will run faster' and this just isn't the case.
When deciding if it is worth porting to CUDA or OpenCL you must think about if the process is inherently parallel or not - are you processing very large amounts of data etc.?
Since you say the array is a shared memory resource, the result of this search is the same for each thread of a block. This means a first and simple optimization would be to only let a single thread do the search. This will free all but the first warp of the block from doing any work (they still need to wait for the result, yet don't have to waste any computing resources):
__shared__ void *result;   // __shared__ variables cannot have an initializer
if(tid == 0)
{
    result = NULL;
    for(unsigned int i=0; i<n; ++i)
    {
        if (array[i] != NULL)
        {
            result = array[i];
            break;
        }
    }
}
__syncthreads();
return result;
A step further would then be to let the threads perform the search in parallel as a classic intra-block reduction. If you can guarantee n to always be <= 64, you can do this in a single warp and don't need any synchronization during the search (except for the complete synchronization at the end, of course).
for(unsigned int i=n/2; i>32; i>>=1)
{
    if(tid < i && !array[tid])
        array[tid] = array[tid+i];
    __syncthreads();
}
if(tid < 32)
{
    if(n > 32 && !array[tid]) array[tid] = array[tid+32];
    if(n > 16 && !array[tid]) array[tid] = array[tid+16];
    if(n > 8 && !array[tid]) array[tid] = array[tid+8];
    if(n > 4 && !array[tid]) array[tid] = array[tid+4];
    if(n > 2 && !array[tid]) array[tid] = array[tid+2];
    if(n > 1 && !array[tid]) array[tid] = array[tid+1];
}
__syncthreads();
return array[0];
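Of course the example assumes n to be a power of two (and the array to be padded with NULLs accordingly), but feel free to tune it to your needs and optimize this further. Note that this reduction returns some non-NULL entry, not necessarily the one with the lowest index: with array = {NULL, A, B, NULL} the stride-2 step copies B into array[0], so B is returned although A comes first. If you need the strict first match, the single-thread search above is the safer choice. Also, on GPUs with independent thread scheduling (Volta and later) the unsynchronized warp-level steps are no longer guaranteed to run in lockstep, so a __syncwarp() between the steps is needed there.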
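To see which element such a stride-halving reduction ends up selecting, the steps can be simulated sequentially in Python. This is only a model of the data flow, not CUDA code; the function names are made up for the illustration, and None stands in for NULL.

```python
def stride_reduce(slots):
    """Simulate the reduction: each 'thread' tid < i pulls from tid + i if its own slot is empty."""
    a = list(slots)          # length is assumed to be a power of two
    i = len(a) // 2
    while i >= 1:
        for tid in range(i): # reads touch indices >= i, writes touch indices < i, so order does not matter
            if a[tid] is None:
                a[tid] = a[tid + i]
        i //= 2
    return a[0]


def first_match(slots):
    """The serial search from the question."""
    for value in slots:
        if value is not None:
            return value
    return None


slots = [None, "A", "B", None]
print(stride_reduce(slots), first_match(slots))
```

Both functions agree on whether a non-empty entry exists, but they can disagree on which one is returned, which matters if the first match specifically is required.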

Space complexity confusion

I'm a bit confused about analyzing space complexity in general. I'm not sure about the meaning of "extra space taken up by the algorithm". What counts as space of 1?
In the example here
int findMin(int[] x) {
int k = 0; int n = x.length;
for (int i = 1; i < n; i++) {
if (x[i] < x[k]) {
k = i;
}
}
return k;
}
The space complexity is O(n), and I'm guessing it's due to an array of size n.
But for something like heapsort, it takes O(1). Wouldn't an in-place heapsort also need to have an array of size n (n being the size of the input)? Or are we assuming the input is already in an array? Why is heapsort's space complexity O(1)?
Thanks!
Heapsort requires only a constant amount of auxiliary storage, hence O(1). The space used by the input to be sorted is of course O(n).
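To make "constant auxiliary storage" concrete, here is a sketch of an in-place heapsort in Python with an iterative sift-down. Apart from the input list itself it only uses a handful of integer variables, regardless of n. The function names are illustrative, not from the question.

```python
def heapsort(a):
    """Sort list a in place; extra storage is a few index variables only."""

    def sift_down(root, end):
        # Iterative, so no recursion stack grows with the input size.
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            if child + 1 < end and a[child] < a[child + 1]:
                child += 1               # pick the larger child
            if a[root] >= a[child]:
                return
            a[root], a[child] = a[child], a[root]
            root = child

    n = len(a)
    for start in range(n // 2 - 1, -1, -1):   # build a max-heap
        sift_down(start, n)
    for end in range(n - 1, 0, -1):           # move the max to the end, shrink the heap
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)


data = [5, 3, 8, 1, 9, 2]
heapsort(data)
print(data)
```

By the same reasoning, findMin from the question needs only k, n and i besides its input, so its auxiliary space is O(1) as well; O(n) only appears if the input array is counted.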
Extra space is the space an algorithm uses other than its input. It generally includes the stack space needed for recursive function calls: if recursion is present in the algorithm, it will use the stack to store call frames until the termination condition is reached.
The size of the stack will be O(height of the recursion tree).
Hope this is helpful!!

Translation from Complex-FFT to Finite-Field-FFT

Good afternoon!
I am trying to develop an NTT algorithm based on the naive recursive FFT implementation I already have.
Consider the following code (coefficients' length, let it be m, is an exact power of two):
/// <summary>
/// Calculates the result of the recursive Number Theoretic Transform.
/// </summary>
/// <param name="coefficients"></param>
/// <returns></returns>
private static BigInteger[] Recursive_NTT_Skeleton(
IList<BigInteger> coefficients,
IList<BigInteger> rootsOfUnity,
int step,
int offset)
{
// Calculate the length of vectors at the current step of recursion.
// -
int n = coefficients.Count / step - offset / step;
if (n == 1)
{
return new BigInteger[] { coefficients[offset] };
}
BigInteger[] results = new BigInteger[n];
IList<BigInteger> resultEvens =
Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset);
IList<BigInteger> resultOdds =
Recursive_NTT_Skeleton(coefficients, rootsOfUnity, step * 2, offset + step);
for (int k = 0; k < n / 2; k++)
{
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
}
return results;
}
It worked for complex FFT (replace BigInteger with a complex numeric type (I had my own)). It doesn't work here even though I changed the procedure of finding the primitive roots of unity appropriately.
Supposedly, the problem is this: rootsOfUnity parameter passed originally contained only the first half of m-th complex roots of unity in this order:
omega^0 = 1, omega^1, omega^2, ..., omega^(n/2)
It was enough, because on these three lines of code:
BigInteger bfly = (rootsOfUnity[k * step] * resultOdds[k]) % NTT_MODULUS;
results[k] = (resultEvens[k] + bfly) % NTT_MODULUS;
results[k + n / 2] = (resultEvens[k] - bfly) % NTT_MODULUS;
I originally made use of the fact, that at any level of recursion (for any n and i), the complex root of unity -omega^(i) = omega^(i + n/2).
However, that property obviously doesn't hold in finite fields. But is there any analogue of it which would allow me to still compute only the first half of the roots?
Or should I extend the cycle from n/2 to n and pre-compute all the m-th roots of unity?
Maybe there are other problems with this code?..
Thank you very much in advance!
I recently wanted to implement NTT for fast multiplication instead of DFFT too. I read a lot of confusing things, with different letters everywhere and no simple solution, and my finite fields knowledge is rusty, but today I finally got it right (after 2 days of trying and drawing analogies with the DFT coefficients), so here are my insights for NTT:
Computation
X(i) = sum(j=0..n-1) of ( Wn^(i*j)*x(j) );
where X[] is the NTT of x[], both of size n, and Wn is the NTT basis. All computations are done in integer modular arithmetic mod p; there are no complex numbers anywhere.
Important values
Wn = r ^ L mod p is basis for NTT
Wn = r ^ (p-1-L) mod p is basis for INTT
Rn = n ^ (p-2) mod p is scaling multiplicative constant for INTT ~(1/n)
p is prime that p mod n == 1 and p>max'
max is max value of x[i] for NTT or X[i] for INTT
r = <1,p)
L = <1,p) and also divides p-1
r,L must be combined so r^(L*i) mod p == 1 if i=0 or i=n
r,L must be combined so r^(L*i) mod p != 1 if 0 < i < n
max' is the sub-result max value and depends on n and type of computation. For single (I)NTT it is max' = n*max but for convolution of two n sized vectors it is max' = n*max*max etc. See Implementing FFT over finite fields for more info about it.
The functional combination of r,L,p is different for different n.
This is important: you have to recompute or select the parameters from a table before each NTT layer (n is always half of that of the previous recursion level).
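The conditions in the list above can be sketched as a small Python search. This is a simplification under stated assumptions: it looks for the basis W directly instead of for the pair r,L (with W = r^L mod p), it uses trial division for primality, and it assumes n >= 2. The function names are hypothetical.

```python
def is_prime(x):
    """Trial division; adequate for the small moduli of a demonstration."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True


def has_order(w, n, p):
    """True if w^i != 1 (mod p) for 0 < i < n and w^n == 1 (mod p)."""
    t = 1
    for _ in range(1, n):
        t = t * w % p
        if t == 1:
            return False
    return t * w % p == 1


def find_ntt_params(n, max_value):
    """Smallest prime p > max_value with p % n == 1, plus the smallest basis W of order n."""
    p = max_value + 1
    while True:
        if p % n == 1 and is_prime(p):
            for w in range(2, p):
                if has_order(w, n, p):
                    return p, w
        p += 1


p, w = find_ntt_params(4, 10)
print(p, w)
```

Here max_value plays the role of max' from the list, so for a convolution it would have to be chosen as n*max*max rather than as the largest input digit.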
Here is my C++ code that finds the r,L,p parameters (it needs modular arithmetic which is not included; you can replace it with (a+b)%c,(a-b)%c,(a*b)%c,... but in that case beware of overflows, especially for modpow and modmul). The code is not optimized yet; there are ways to speed it up considerably. Also, the prime table is fairly limited, so either use SoE or any other algorithm to obtain primes up to max' in order to work safely.
DWORD _arithmetics_primes[]=
{
2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173,
179,181,191,193,197,199,211,223,227,229,233,239,241,251,257,263,269,271,277,281,283,293,307,311,313,317,331,337,347,349,353,359,367,373,379,383,389,397,401,409,
419,421,431,433,439,443,449,457,461,463,467,479,487,491,499,503,509,521,523,541,547,557,563,569,571,577,587,593,599,601,607,613,617,619,631,641,643,647,653,659,
661,673,677,683,691,701,709,719,727,733,739,743,751,757,761,769,773,787,797,809,811,821,823,827,829,839,853,857,859,863,877,881,883,887,907,911,919,929,937,941,
947,953,967,971,977,983,991,997,1009,1013,1019,1021,1031,1033,1039,1049,1051,1061,1063,1069,1087,1091,1093,1097,1103,1109,1117,1123,1129,1151,
0}; // end of table is 0, the more primes are there the bigger numbers and n can be used
// compute NTT consts W=r^L%p for n
int i,j,k,n=16;
long w,W,iW,p,r,L,l,e;
long max=81*n; // edit1: max num for NTT for my multiplication purposes
for (e=1,j=0;e;j++) // find prime p that p%n=1 AND p>max ... 9*9=81
{
p=_arithmetics_primes[j];
if (!p) break;
if ((p>max)&&(p%n==1))
for (r=2;r<p;r++) // check all r
{
for (l=1;l<p;l++)// all l that divide p-1
{
L=(p-1);
if (L%l!=0) continue;
L/=l;
W=modpow(r,L,p);
e=0;
for (w=1,i=0;i<=n;i++,w=modmul(w,W,p))
{
if ((i==0) &&(w!=1)) { e=1; break; }
if ((i==n) &&(w!=1)) { e=1; break; }
if ((i>0)&&(i<n)&&(w==1)) { e=1; break; }
}
if (!e) break;
}
if (!e) break;
}
}
if (e) { error; } // error no combination r,l,p for n found
W=modpow(r, L,p); // Wn for NTT
iW=modpow(r,p-1-L,p); // Wn for INTT
And here are my slow NTT and INTT implementations (I haven't got to the fast NTT, INTT yet). They are both tested successfully with Schönhage–Strassen multiplication.
//---------------------------------------------------------------------------
void NTT(long *dst,long *src,long n,long m,long w)
{
long i,j,wj,wi,a,n2=n>>1;
for (wj=1,j=0;j<n;j++)
{
a=0;
for (wi=1,i=0;i<n;i++)
{
a=modadd(a,modmul(wi,src[i],m),m);
wi=modmul(wi,wj,m);
}
dst[j]=a;
wj=modmul(wj,w,m);
}
}
//---------------------------------------------------------------------------
void INTT(long *dst,long *src,long n,long m,long w)
{
long i,j,wi=1,wj=1,rN,a,n2=n>>1;
rN=modpow(n,m-2,m);
for (wj=1,j=0;j<n;j++)
{
a=0;
for (wi=1,i=0;i<n;i++)
{
a=modadd(a,modmul(wi,src[i],m),m);
wi=modmul(wi,wj,m);
}
dst[j]=modmul(a,rN,m);
wj=modmul(wj,w,m);
}
}
//---------------------------------------------------------------------------
dst is destination array
src is source array
n is array size
m is modulus (p)
w is basis (Wn)
Hope this helps someone. If I forgot something please write ...
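For a quick round-trip check of the slow transform, here is a Python sketch of the same O(n^2) sums. The parameters p = 17, n = 4, w = 4 are assumptions picked by hand (4 has order 4 mod 17), and unlike the C++ INTT above, this intt_naive takes the forward basis and inverts it itself.

```python
def ntt_naive(src, w, p):
    """X[j] = sum_i w^(i*j) * x[i]  (mod p), straight from the definition."""
    n = len(src)
    return [sum(pow(w, i * j, p) * src[i] for i in range(n)) % p
            for j in range(n)]


def intt_naive(src, w, p):
    """Inverse transform: same sums with w^-1, then scaling by n^-1 (p must be prime)."""
    n = len(src)
    w_inv = pow(w, p - 2, p)     # Fermat inverse, plays the role of iW above
    n_inv = pow(n, p - 2, p)     # plays the role of rN above
    return [v * n_inv % p for v in ntt_naive(src, w_inv, p)]


p, w = 17, 4                     # assumed demo parameters: 17 % 4 == 1 and 4^2 == -1 (mod 17)
x = [1, 2, 3, 4]
X = ntt_naive(x, w, p)
print(X, intt_naive(X, w, p))
```

The round trip only reproduces x because every entry of x is smaller than p; larger values would come back reduced mod p.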
[edit1: fast NTT/INTT]
Finally I managed to get the fast NTT/INTT to work. It was a little bit more tricky than the normal FFT:
//---------------------------------------------------------------------------
void _NFTT(long *dst,long *src,long n,long m,long w)
{
if (n<=1) { if (n==1) dst[0]=src[0]; return; }
long i,j,a0,a1,n2=n>>1,w2=modmul(w,w,m);
// reorder even,odd
for (i=0,j=0;i<n2;i++,j+=2) dst[i]=src[j];
for ( j=1;i<n ;i++,j+=2) dst[i]=src[j];
// recursion
_NFTT(src ,dst ,n2,m,w2); // even
_NFTT(src+n2,dst+n2,n2,m,w2); // odd
// restore results
for (w2=1,i=0,j=n2;i<n2;i++,j++,w2=modmul(w2,w,m))
{
a0=src[i];
a1=modmul(src[j],w2,m);
dst[i]=modadd(a0,a1,m);
dst[j]=modsub(a0,a1,m);
}
}
//---------------------------------------------------------------------------
void _INFTT(long *dst,long *src,long n,long m,long w)
{
long i,rN;
rN=modpow(n,m-2,m);
_NFTT(dst,src,n,m,w);
for (i=0;i<n;i++) dst[i]=modmul(dst[i],rN,m);
}
//---------------------------------------------------------------------------
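The same recursive scheme can be sketched in Python, where % always returns a non-negative result, so no separate modsub is needed. This version allocates new lists instead of reusing src/dst, and the demo parameters (p = 17, w = 4, n = 4) are assumptions chosen so the small product fits below p.

```python
def ntt(a, w, p):
    """Recursive radix-2 NTT; len(a) must be a power of two and w a root of that order."""
    n = len(a)
    if n == 1:
        return list(a)
    w2 = w * w % p                      # basis for the half-size subproblems
    even = ntt(a[0::2], w2, p)
    odd = ntt(a[1::2], w2, p)
    out = [0] * n
    t = 1                               # running power w^k
    for k in range(n // 2):
        b = t * odd[k] % p
        out[k] = (even[k] + b) % p
        out[k + n // 2] = (even[k] - b) % p   # uses w^(k + n/2) == -w^k (mod p)
        t = t * w % p
    return out


def intt(a, w, p):
    """Inverse NTT: forward transform with w^-1, scaled by n^-1 (p prime)."""
    n_inv = pow(len(a), p - 2, p)
    return [v * n_inv % p for v in ntt(a, pow(w, p - 2, p), p)]


def poly_multiply(a, b, w, p):
    """Cyclic convolution via pointwise product in the transform domain."""
    fa, fb = ntt(a, w, p), ntt(b, w, p)
    return intt([x * y % p for x, y in zip(fa, fb)], w, p)


# (1 + 2x) * (3 + 4x), zero-padded to length 4
print(poly_multiply([1, 2, 0, 0], [3, 4, 0, 0], 4, 17))
```

Squaring w on the way down corresponds to the w2=modmul(w,w,m) step in the C++ code, so one valid basis for the top level is enough.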
[edit3]
I have optimized my code (3x faster than the code above), but I am still not satisfied with it, so I started a new question about it. There I have optimized my code even further (about 40x faster than the code above), so it is almost the same speed as a floating-point FFT of the same bit size. The link to it is here:
Modular arithmetics and NTT (finite field DFT) optimizations
To turn the Cooley-Tukey (complex) FFT into its modular arithmetic counterpart, i.e. the NTT, you must replace the complex definition of omega. For the approach to be purely recursive, you also need to recalculate omega at each level based on the current signal size. This is possible because the minimum suitable modulus decreases as we move down the call tree, so the modulus used for the root is also suitable for the lower layers. Additionally, as we are using the same modulus, the same generator may be used as we move down the call tree. For the inverse transform, take one additional step: instead of the recalculated omega a, use its modular inverse b = a ^ -1 as omega. Specifically, b = invMod(a, N) s.t. b * a == 1 (mod N), where N is the chosen prime modulus.
Rewriting an expression involving omega by exploiting periodicity still works in modular arithmetic realm. You also need to find a way to determine the modulus (a prime) for the problem and a valid generator.
We note that your code works, though it is not a MWE. We extended it using common sense, and got correct result for a polynomial multiplication application. You just have to provide correct values of omega raised to certain powers.
Your code works, but, like code from many other sources, it doubles the spacing at each level. This does not lead to a recursion that is as clean, even though it turns out to be identical to recalculating omega based on the current signal size, because the power in the definition of omega is inversely proportional to the signal size. To reiterate: halving the signal size is like squaring omega, which is like doubling the powers of omega (which is what doubling the spacing does). The nice thing about recalculating omega is that each subproblem is more cleanly complete in its own right.
There is a paper that shows some of the math for modular approach; it is a paper by Baktir and Sunar from 2006. See the paper at the end of this post.
You do not need to extend the cycle from n / 2 to n.
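The reason the half-length loop still suffices can be checked numerically: in a prime field, a primitive n-th root of unity omega (n even) satisfies omega^(n/2) == -1, so -omega^i == omega^(i + n/2) holds just as in the complex case. A small Python check, with p = 17, n = 8 and omega = 2 as assumed example values:

```python
p, n, omega = 17, 8, 2        # assumed example: 2 has order 8 modulo 17

half = pow(omega, n // 2, p)  # omega^(n/2), expected to be p - 1, i.e. -1 mod p

# Compare -omega^i with omega^(i + n/2) for every i.
negation_holds = all(
    (-pow(omega, i, p)) % p == pow(omega, i + n // 2, p)
    for i in range(n)
)

print(half, negation_holds)
```

One practical difference from the complex version is the sign handling: in languages where % can return a negative remainder, the subtraction in the butterfly needs an extra "+ modulus" step to stay in the range 0..p-1.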
So, yes, some sources which say to just drop in a different omega definition for modular arithmetic approach are sweeping under the rug many details.
Another issue is that the signal size must be large enough to avoid overflow in the resulting time-domain signal when performing convolution. Additionally, it may be useful to look for a fast implementation of modular exponentiation, one that stays fast even when the power is quite large.
References
Baktir and Sunar - Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform (2006)
You must make sure that roots of unity actually exist. In R there are only 2 roots of unity: 1 and -1, since only for them x^n=1 can be true.
In C you have infinitely many roots of unity: w=exp(2*pi*i/N) is a primitive N-th root of unity and all w^k for 0<=k<N are N-th roots of unity.
Now to your problem: you have to make sure the ring you're working in offers the same property: enough roots of unity.
Schönhage and Strassen (http://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) use integers modulo 2^N+1. This ring has enough roots of unity. 2^N == -1 is a 2nd root of unity, 2^(N/2) is a 4th root of unity and so on. Furthermore, these roots of unity have the advantage that they are powers of two and can be implemented as binary shifts (with a modulo operation afterwards, which comes down to an add/subtract).
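A small Python sketch of that shift-based multiplication, assuming the modulus 2^N + 1 with N = 8 for the demonstration. The function name is made up, and the final % only normalizes a small signed sum; the bulk of the reduction is done by adding and subtracting N-bit chunks, using 2^N == -1.

```python
def mul_pow2_mod_fermat(x, k, N):
    """Return x * 2^k mod (2^N + 1) using a shift plus alternating add/subtract."""
    m = (1 << N) + 1
    mask = (1 << N) - 1
    y = x << k                 # multiplication by the root of unity is just a shift
    result = 0
    sign = 1
    while y:
        result += sign * (y & mask)   # chunk i carries weight (2^N)^i == (-1)^i
        sign = -sign
        y >>= N
    return result % m          # bring the small signed sum into 0..m-1


N = 8
m = (1 << N) + 1               # 257
print(mul_pow2_mod_fermat(200, 5, N), (200 * 2**5) % m)
```

In this ring 2 itself is a 2N-th root of unity, so transforms of length up to 2N can use pure shifts as twiddle factors.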
I think QuickMul (http://www.cs.nyu.edu/exact/doc/qmul.ps) works modulo 2^N-1.