kmer counts with cython implementation - cython

I have this function implemented in Cython:
def count_kmers_cython(str string, list alphabet, int kmin, int kmax):
    """
    Count occurrence of kmers in a given string.
    """
    counter = {}
    cdef int i
    cdef int j
    cdef int N = len(string)
    limits = range(kmin, kmax + 1)
    for i in range(0, N - kmax + 1):
        for j in limits:
            kmer = string[i:i+j]
            counter[kmer] = counter.get(kmer, 0) + 1
    return counter
Can I do better with Cython? Is there any way to improve it?
I am new to Cython; this is my first attempt.
I will use this to count kmers in DNA, with the alphabet restricted to 'ACGT'. The input strings are typically bacterial genomes (130 kb to over 14 Mb, where 1 kb = 1000 bp).
The kmer length k will satisfy 3 < k < 16.
I would also like to know whether I could go further and use Cython in this function:
import math
from termcolor import colored  # assumed source of colored(); third-party package

def compute_kmer_stats(kmer_list, counts, len_genome, max_e):
    """
    This function computes the z_score to find under/over represented kmers
    according to a cut off e-value.
    Inputs:
        kmer_list - a list of kmers
        counts - a dictionary-type with k-mers as keys and counts as values.
        len_genome - the total length of the sequence(s).
        max_e - cut off e-values to report under/over represented kmers.
    Outputs:
        results - a list of lists as [k-mer, observed count, expected count, z-score, p-value, e-value]
    """
    print(colored('Starting to compute the kmer statistics...\n',
                  'red',
                  attrs=['bold']))
    results = []
    # number of tests, used to convert p-value to e-value.
    n = len(list(kmer_list))
    for kmer in kmer_list:
        k = len(kmer)
        prefix, sufix, center = counts[kmer[:-1]], counts[kmer[1:]], counts[kmer[1:-1]]
        # avoid zero division error
        if center == 0:
            expected = 0
        else:
            expected = (prefix * sufix) // center
        observed = counts[kmer]
        sigma = math.sqrt(expected * (1 - expected / (len_genome - k + 1)))
        # avoid zero division error
        if sigma == 0.0:
            z_score = 0.0
        else:
            z_score = ((observed - expected) / sigma)
        # pvalue for all kmers/palindromes under represented
        p_value_under = (math.erfc(-z_score / math.sqrt(2)) / 2)
        # pvalue for all kmers/palindromes over represented
        p_value_over = (math.erfc(z_score / math.sqrt(2)) / 2)
        # evalue for all kmers/palindromes under represented
        e_value_under = (n * p_value_under)
        # evalue for all kmers/palindromes over represented
        e_value_over = (n * p_value_over)
        if e_value_under <= max_e:
            results.append([kmer, observed, expected, z_score, p_value_under, e_value_under])
        elif e_value_over <= max_e:
            results.append([kmer, observed, expected, z_score, p_value_over, e_value_over])
    return results
OBS - Thank you, CodeSurgeon, for the help. I know there are other tools that count kmers efficiently, but I am learning Python, so I am trying to write my own functions and code.
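One detail worth checking in count_kmers_cython: the outer loop stops at N - kmax, so for every length j shorter than kmax the last kmax - j windows of that length are never counted. The pure-Python sketch below counts each length separately; the name count_kmers_by_length is hypothetical, and it is meant as a slow reference to validate a Cython version against, not as an optimized implementation.

```python
from collections import Counter


def count_kmers_by_length(sequence, kmin, kmax):
    """Count every k-mer of length kmin..kmax, including those near the end."""
    counts = Counter()
    n = len(sequence)
    for k in range(kmin, kmax + 1):
        # A sequence of length n has exactly n - k + 1 windows of length k.
        for i in range(n - k + 1):
            counts[sequence[i:i + k]] += 1
    return counts


counts = count_kmers_by_length("ACGTACGT", 2, 3)
print(counts["GT"], counts["CGT"], sum(counts.values()))
```

On this toy input the final "GT" (starting at index 6) is included here, whereas a loop bounded by N - kmax would stop before reaching it.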

Julia CUDA - Reduce matrix columns

Consider the following kernel, which reduces along the rows of a 2-D matrix:
function row_sum!(x, ncol, out)
    """out = sum(x, dims=2)"""
    row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:ncol
        @inbounds out[row_idx] += x[row_idx, i]
    end
    return
end
N = 1024
x = CUDA.rand(Float64, N, 2*N)
out = CUDA.zeros(Float64, N)
@cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out)
isapprox(out, sum(x, dims=2)) # true
How do I write a similar kernel except for reducing along the columns (of a 2-D matrix)? In particular, how do I get the index of each column, similar to how we got the index of each row with row_idx?
Here is the code:
function col_sum!(x, nrow, out)
    """out = sum(x, dims=1)"""
    col_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:nrow
        @inbounds out[col_idx] += x[i, col_idx]
    end
    return
end
N = 1024
x = CUDA.rand(Float64, N, 2N)
out = CUDA.zeros(Float64, 2N)
@cuda threads=256 blocks=8 col_sum!(x, size(x, 1), out)
And here is the test:
julia> isapprox(out, vec(sum(x, dims=1)))
true
As you can see, the result vector now has length 2N instead of N, so the number of blocks had to be adapted accordingly (doubled, from 4 to 8).
More materials can be found here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/
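To make the index arithmetic concrete, here is a small CPU-only emulation in Python of how each (block, thread) pair maps to one column. It is only an illustrative sketch: global_index and col_sum_emulated are hypothetical names, the loops run sequentially rather than in parallel, and the bounds guard is an addition that the kernels above do not need because their launch sizes divide the matrix dimensions exactly.

```python
def global_index(block_idx, block_dim, thread_idx):
    """1-based global index, as in (blockIdx().x-1) * blockDim().x + threadIdx().x."""
    return (block_idx - 1) * block_dim + thread_idx


def col_sum_emulated(x, threads, blocks):
    """Sum each column of x (a list of rows), one emulated thread per column."""
    nrow, ncol = len(x), len(x[0])
    out = [0.0] * ncol
    for block_idx in range(1, blocks + 1):
        for thread_idx in range(1, threads + 1):
            col = global_index(block_idx, threads, thread_idx)
            if col > ncol:
                # Bounds guard (an addition): needed when threads * blocks > ncol.
                continue
            for i in range(nrow):
                out[col - 1] += x[i][col - 1]
    return out


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
print(col_sum_emulated(x, threads=2, blocks=2))
```

With 2 threads and 2 blocks there are 4 emulated threads for 3 columns, so the fourth one is skipped by the guard.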

How many binary numbers with N bits if no more than M zeros/ones in a row

Is there an equation I can use for arbitrary M and N?
Example, N=3 and M=2:
3 bits allow for 8 different combinations, but only 2 of them contain no run of 2 or more identical symbols:
000 - Fails
001 - Fails
010 - OK
011 - Fails
100 - Fails
101 - OK
110 - Fails
111 - Fails
One way to frame the problem is as follows: we would like to count binary words of length n without runs of length m or larger. Let g(n, m) denote the number of such words. In the example, n = 3 and m = 2.
If n < m, every binary word works, and we get g(n, m) = 2^n words in total.
When n >= m, the word can start with a run of 1, 2, ..., m-1 repeated values, followed by g(n-1, m), g(n-2, m), ..., g(n-m+1, m) choices for the remainder, respectively. Combined, we get the following recursion (in Python):
from functools import lru_cache
@lru_cache(None) # memoization
def g(n, m):
    if n < m:
        return 2 ** n
    else:
        return sum(g(n-j, m) for j in range(1, m))
To test for correctness, we can compute the number of such binary sequences directly:
from itertools import product, groupby
def brute_force(n, k):
    # generate all binary sequences of length n
    products = product([0,1], repeat=n)
    count = 0
    for prod in products:
        has_run = False
        # group consecutive digits
        for _, gp in groupby(prod):
            gp_size = sum(1 for _ in gp)
            if gp_size >= k:
                # there are k or more consecutive digits in a row
                has_run = True
                break
        if not has_run:
            count += 1
    return count
assert 2 == g(3, 2) == brute_force(3, 2)
assert 927936 == g(20, 7) == brute_force(20, 7)
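The memoized recursion nests roughly n calls deep, so very large n can run into Python's recursion limit. A bottom-up variant of the same recurrence avoids that; the sketch below (g_iterative is a hypothetical name) fills a table from 0 up to n and is otherwise term-for-term the recursion above.

```python
def g_iterative(n, m):
    """Binary words of length n with no run of length m or more (bottom-up)."""
    values = []
    for length in range(n + 1):
        if length < m:
            # Too short to contain a run of length m: every word qualifies.
            values.append(2 ** length)
        else:
            # First run has length j in 1..m-1; the rest has length - j symbols.
            values.append(sum(values[length - j] for j in range(1, m)))
    return values[n]


print(g_iterative(3, 2), g_iterative(20, 7))
```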

How to calculate a probability vector and an observation count vector for a range of bins?

I want to test the hypothesis that a sample of 30 observations fits a Poisson distribution.
#GNU Octave
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; #30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; #each bin can be single value or multiple values
I am trying to use Pearson's chi-square statistic here and wrote the function below. I want one vector holding the Poisson probability of each bin and another holding the number of observations in each bin. I feel the loop is rather redundant and ugly. Can you please let me know how I can refactor the function without the loop and make the whole calculation cleaner and more vectorized?
function result= poissonGoodnessOfFit(bins, observed)
  assert(iscell(bins), "bins should be a cell array");
  assert(all(cellfun("ismatrix", bins)) == 1, "bin entries either scalars or matrices");
  assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix");
  lambda_head = mean(observed); #poisson lambda parameter estimate
  k = length(bins); #number of bin groups
  n = length(observed); #number of observations
  poisson_probability = []; #variable for poisson probability for each bin
  observations = []; #variable for observation counts for each bin
  for i=1:k
    if isscalar(bins{1,i}) #this bin contains a single value
      poisson_probability(1,i) = poisspdf(bins{1, i}, lambda_head);
      observations(1, i) = histc(observed, bins{1, i});
    else #this bin contains a range of values
      inner_bins = bins{1, i}; #retrieve the range
      inner_bins_k = length(inner_bins); #number of values inside
      inner_poisson_probability = []; #variable to store individual probability of each value inside this bin
      inner_observations = []; #variable to store observation counts of each value inside this bin
      for j=1:inner_bins_k
        inner_poisson_probability(1,j) = poisspdf(inner_bins(1, j), lambda_head);
        inner_observations(1, j) = histc(observed, inner_bins(1, j));
      endfor
      poisson_probability(1, i) = sum(inner_poisson_probability, 2); #assign over the sum of all inner probabilities
      observations(1, i) = sum(inner_observations, 2); #assign over the sum of all inner observation counts
    endif
  endfor
  expected = n .* poisson_probability; #expected observations if indeed poisson using lambda_head
  chisq = sum((observations - expected).^2 ./ expected, 2); #Pearson Chi-Square statistics
  pvalue = 1 - chi2cdf(chisq, k-1-1);
  result = struct("actual", observations, "expected", expected, "chi2", chisq, "pvalue", pvalue);
  return;
endfunction
There are a couple of things worth noting in the code.
First, the 'scalar' case in your if block is actually identical to your 'range' case, since a scalar is simply a range of 1 element. So no special treatment is needed for it.
Second, you don't need to create such explicit subranges, your bin groups seem to be amenable to being used as indices into a larger result (as long as you add 1 to convert from 0-indexed to 1-indexed indices).
Therefore my approach would be to calculate the expected and observed numbers over the entire domain of interest (as inferred from your bin groups), and then use the bin groups themselves as 1-indices to obtain the desired subgroups, summing accordingly.
Here's some example code, written in the Octave/MATLAB-compatible subset of both languages:
function Result = poissonGoodnessOfFit( BinGroups, Observations )
  % POISSONGOODNESSOFFIT( BinGroups, Observations) calculates the [... etc, etc.]
  pkg load statistics; % only needed in octave; for matlab buy statistics toolbox.
  assert( iscell( BinGroups ), 'Bins should be a cell array' );
  assert( all( cellfun( @ismatrix, BinGroups ) ) == 1, 'Bin entries either scalars or matrices' );
  assert( ismatrix( Observations ) && rows( Observations ) == 1, 'Observed data should be a 1xn matrix' );
  % Define helpful variables
  RangeMin = min( cellfun( @min, BinGroups ) );
  RangeMax = max( cellfun( @max, BinGroups ) );
  Domain = RangeMin : RangeMax;
  LambdaEstimate = mean( Observations );
  NBinGroups = length( BinGroups );
  NObservations = length( Observations );
  % Get expected and observed numbers per 'bin' (i.e. discrete value) over the *entire* domain.
  Expected_Domain = NObservations * poisspdf( Domain, LambdaEstimate );
  Observed_Domain = histc( Observations, Domain );
  % Apply BinGroup values as indices
  Expected_byBinGroup = cellfun( @(c) sum( Expected_Domain(c+1) ), BinGroups );
  Observed_byBinGroup = cellfun( @(c) sum( Observed_Domain(c+1) ), BinGroups );
  % Perform a Chi-Square test on the Bin-wise Expected and Observed outputs
  O = Observed_byBinGroup; E = Expected_byBinGroup ; df = NBinGroups - 1 - 1;
  ChiSquareTestStatistic = sum( (O - E) .^ 2 ./ E );
  PValue = 1 - chi2cdf( ChiSquareTestStatistic, df );
  Result = struct( 'actual', O, 'expected', E, 'chi2', ChiSquareTestStatistic, 'pvalue', PValue );
end
Running with your example gives:
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; % 30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; % each bin can be single value or multiple values
Result = poissonGoodnessOfFit( bins, X )
% Result =
% scalar structure containing the fields:
% actual = 6 5 8 6 5
% expected = 1.2643 4.0037 13.0304 8.6522 3.0493
% chi2 = 21.989
% pvalue = 0.000065574
A general comment about the code; it is always preferable to write self-explainable code, rather than code that does not make sense by itself in the absence of a comment. Comments generally should only be used to explain the 'why', rather than the 'how'.
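For readers who want to cross-check the numbers outside Octave, the same "sum over each bin group" idea can be sketched in plain Python. This is only an illustrative translation under stated assumptions: the names poisson_pmf and binned_poisson_chi2 are hypothetical, the bin groups are written as explicit lists of integers, and the p-value is left out because the Python standard library has no chi-square CDF.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing the integer count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def binned_poisson_chi2(bin_groups, observations):
    """Return (observed, expected, chi2) per bin group for a fitted Poisson."""
    n = len(observations)
    lam = sum(observations) / n  # estimate of the Poisson mean
    observed = [sum(observations.count(v) for v in group) for group in bin_groups]
    expected = [n * sum(poisson_pmf(v, lam) for v in group) for group in bin_groups]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, chi2


X = [8, 0, 0, 1, 3, 4, 0, 2, 12, 5, 1, 8, 0, 2, 0, 1, 9, 3, 4, 5,
     3, 3, 4, 7, 4, 0, 1, 2, 1, 2]
bins = [[0], [1], [2, 3], [4, 5], list(range(6, 21))]
observed, expected, chi2 = binned_poisson_chi2(bins, X)
print(observed)
print([round(e, 4) for e in expected], round(chi2, 3))
```

The observed counts and the chi-square statistic should agree with the Octave output shown above, up to rounding.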

Tweaking a Function in Python

I am trying to get the following code to do a few more tricks:
class App(Frame):
    def __init__(self, master):
        Frame.__init__(self, master)
        self.grid()
        self.create_widgets()
    def create_widgets(self):
        self.answerLabel = Label(self, text="Output List:")
        self.answerLabel.grid(row=2, column=1, sticky=W)
    def psiFunction(self):
        j = int(self.indexEntry.get())
        valueList = list(self.listEntry.get())
        x = map(int, valueList)
        if x[0] != 0:
            x.insert(0, 0)
        rtn = []
        for n2 in range(0, len(x) * j - 2):
            n = n2 / j
            r = n2 - n * j
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
        self.answer = Label(self, text=rtn)
        self.answer.grid(row=2, column=2, sticky=W)
if __name__ == "__main__":
    root = Tk()
In particular, I am trying to get it to calculate len(x) * j - 1 terms, and to work for a variety of parameter values. If you try running it you should find that you get errors for larger parameter values. For example with a list 0,1,2,3,4 and a parameter j=3 we should run through the program and get 0123456789101112. However, I get an error that the last value is 'out of range' if I try to compute it.
I believe it's an issue with my function as defined. It seems the issue with parameters has something to do with the way it ties the parameter to the n value. Consider 0123. It works great if I use 2 as my parameter (called index in the function) but fails if I use 3.
EDIT:
def psi_j(x, j):
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 / j
        r = n2 - n * j
        if r == 0:
            rtn.append(j * x[n])
        else:
            rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
        print 'n2 =', n2, ': n =', n, ' r =' , r, ' rtn =', rtn
    return rtn
For example if we have psi_j(x,3) with x = [0,1,2,3,4] we will be able to get [0,1,2,3,4,5,6,7,8,9,10,11] with an error on 12.
The idea though is that we should be able to calculate that last term. It is the 12th term of our output sequence, and 12 = 3*4+0 => 3*x[4] + 0*(x[n+1]-x[n]). Now, there is no 5th term to calculate so that's definitely an issue but we do not need that term since the second part of the equation is zero. Is there a way to write this into the equation?
If we think about the example data [0, 1, 2, 3] and a j of 3, the problem is that we're trying to get x[4] in the last iteration.
len(x) * j - 2 for this data is 10
range(0, 10) is 0 through 9.
Manually processing our last iteration allows us to resolve the code to this.
n = 3 # or 9 / 3
r = 0 # or 9 - 3 * 3
rtn.append(3 * x[3] + 0 * (x[3 + 1] - x[3]))
We have code trying to reach x[3 + 1], which doesn't exist when we only have indices 0 through 3.
To fix this, we could rewrite the code like this.
n = n2 / j
r = n2 - n * j
if r == 0:
    rtn.append(j * x[n])
else:
    rtn.append(j * x[n] + r * (x[n + 1] - x[n]))
If r is 0, then (x[n + 1] - x[n]) is irrelevant.
Please correct me if my math is wrong on that. I can't see a case where n >= len(x) and r != 0, but if that's possible, then my solution is invalid.
Without understanding what the purpose of the function is (is it a kind of filter, or a smoothing function?), I pulled it out of the GUI stuff and tested it alone:
def psiFunction(j, valueList):
    x = map(int, valueList)
    if x[0] != 0:
        x.insert(0, 0)
    rtn = []
    for n2 in range(0, len(x) * j - 2):
        n = n2 / j
        r = n2 - n * j
        print "n =", n, "max_n2 =", len(x) * j - 2, "n2 =", n2, "lx =", len(x), "r =", r
        val = j * x[n] + r * (x[n + 1] - x[n])
        rtn.append(val)
        print j * x[n], r * (x[n + 1] - x[n]), val
    return rtn
if __name__ == '__main__':
    print psiFunction(3, [0, 1, 2, 3, 4])
Calling this module leads to some debugging output and, at the end, the mentioned error message.
Obviously, your x[n + 1] access fails, as n is 4 there, so n + 1 is 5, one too many for accessing the x array, which has length 5 and thus indexes from 0 to 4.
EDIT: Your psi_j() gives me the same behaviour.
Let me continue guessing: Whatever we want to do, we have to ensure that n + 1 stays below len(x). So maybe a
for n2 in range(0, (len(x) - 1) * j):
would be helpful. It only produces the numbers 0..11, but I think this is the only thing which can be expected out of it: the last items only can be
3*3 + 0*(4-3)
3*3 + 1*(4-3)
3*3 + 2*(4-3)
and stop. And this is achieved with the limit I mention here.
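The two suggestions in this thread (skip the x[n + 1] lookup when r is 0, and bound the loop by the last valid index) can be combined. The sketch below is written for Python 3 and makes one assumption that is not confirmed by the question: that the intended output ends exactly at j * x[-1], which gives (len(x) - 1) * j + 1 terms.

```python
def psi_j(x, j):
    """Interpolate j * x with j steps between neighbours, ending at j * x[-1]."""
    out = []
    last = len(x) - 1
    # Assumed bound: the final term is n = last, r = 0, so x[n + 1] is never needed.
    for n2 in range(last * j + 1):
        n, r = divmod(n2, j)
        if r == 0:
            out.append(j * x[n])
        else:
            out.append(j * x[n] + r * (x[n + 1] - x[n]))
    return out


print(psi_j([0, 1, 2, 3, 4], 3))
```

For the list 0,1,2,3,4 with j=3 this yields the values 0 through 12, matching the sequence 0123456789101112 the question asks for.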

Getting a specific digit from a ratio expansion in any base (nth digit of x/y)

Is there an algorithm that can calculate the digits of a repeating-decimal ratio without starting at the beginning?
I'm looking for a solution that doesn't use arbitrarily sized integers, since this should work for cases where the decimal expansion may be arbitrarily long.
For example, 33/59 expands to a repeating decimal with 58 digits. If I wanted to verify that, how could I calculate the digits starting at the 58th place?
Edited - with the ratio 2124679 / 2147483647, how to get the hundred digits in the 2147484600th through 2147484700th places.
OK, 3rd try's a charm :)
I can't believe I forgot about modular exponentiation.
So to steal/summarize from my 2nd answer, the nth digit of x/y is the 1st digit of (10^(n-1) * x mod y)/y = floor(10 * (10^(n-1) * x mod y) / y) mod 10.
The part that takes all the time is the 10^(n-1) mod y, but we can do that with fast (O(log n)) modular exponentiation. With this in place, it's not worth trying to do the cycle-finding algorithm.
However, you do need the ability to do (a * b mod y) where a and b are numbers that may be as large as y. (if y requires 32 bits, then you need to do 32x32 multiply and then 64-bit % 32-bit modulus, or you need an algorithm that circumvents this limitation. See my listing that follows, since I ran into this limitation with Javascript.)
So here's a new version.
function abmody(a,b,y)
{
  var x = 0;
  // binary fun here
  while (a > 0)
  {
    if (a & 1)
      x = (x + b) % y;
    b = (2 * b) % y;
    a >>>= 1;
  }
  return x;
}
function digits2(x,y,n1,n2)
{
  // the nth digit of x/y = floor(10 * (10^(n-1)*x mod y) / y) mod 10.
  var m = n1-1;
  var A = 1, B = 10;
  while (m > 0)
  {
    // loop invariant: 10^(n1-1) = A*(B^m) mod y
    if (m & 1)
    {
      // A = (A * B) % y but javascript doesn't have enough sig. digits
      A = abmody(A,B,y);
    }
    // B = (B * B) % y but javascript doesn't have enough sig. digits
    B = abmody(B,B,y);
    m >>>= 1;
  }
  x = x % y;
  // A = (A * x) % y;
  A = abmody(A,x,y);
  var answer = "";
  for (var i = n1; i <= n2; ++i)
  {
    var digit = Math.floor(10*A/y)%10;
    answer += digit;
    A = (A * 10) % y;
  }
  return answer;
}
(You'll note that the structures of abmody() and the modular exponentiation are the same; both are based on Russian peasant multiplication.)
And results:
js>digits2(2124679,214748367,214748300,214748400)
20513882650385881630475914166090026658968726872786883636698387559799232373208220950057329190307649696
js>digits2(122222,990000,100,110)
65656565656
js>digits2(1,7,1,7)
1428571
js>digits2(1,7,601,607)
1428571
js>digits2(2124679,2147483647,2147484600,2147484700)
04837181235122113132440537741612893408915444001981729642479554583541841517920532039329657349423345806
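For comparison, the same idea is very short in Python, because the built-in three-argument pow(base, exp, mod) performs modular exponentiation and Python integers do not overflow, so no abmody-style helper is needed. This is a sketch rather than a port of the listing above: nth_digit and digit_run are hypothetical names, and the base parameter is an added generalization.

```python
def nth_digit(x, y, n, base=10):
    """Return the n-th digit (1-based) after the radix point of x/y."""
    remainder = (pow(base, n - 1, y) * x) % y  # base^(n-1) * x mod y
    return (base * remainder) // y


def digit_run(x, y, n1, n2, base=10):
    """Return digits n1..n2 (inclusive, 1-based) of x/y as a list of ints."""
    remainder = (pow(base, n1 - 1, y) * x) % y
    digits = []
    for _ in range(n1, n2 + 1):
        remainder *= base
        digits.append(remainder // y)  # next digit
        remainder %= y                 # carry the remainder forward
    return digits


print("".join(map(str, digit_run(1, 7, 601, 607))))
print(nth_digit(33, 59, 61))
```

Only the jump to position n1 uses modular exponentiation; the remaining digits come from ordinary long division on the remainder.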
edit: (I'm leaving post here for posterity. But please don't upvote it anymore: it may be theoretically useful but it's not really practical. I have posted another answer which is much more useful from a practical point of view, doesn't require any factoring, and doesn't require the use of bignums.)
@Daniel Bruckner has the right approach, I think. (with a few additional twists required)
Maybe there's a simpler method, but the following will always work:
Let's use the examples q = x/y = 33/57820 and 44/65 in addition to 33/59, for reasons that may become clear shortly.
Step 1: Factor the denominator (specifically factor out 2's and 5's)
Write q = x/y = x/(2^a2 * 5^a5 * z). Factors of 2 and 5 in the denominator do not cause repeated decimals. So the remaining factor z is coprime to 10. In fact, the next step requires factoring z, so you might as well factor the whole thing.
Calculate a10 = max(a2, a5), which is the smallest exponent of 10 that is a multiple of the factors of 2 and 5 in y.
In our example 57820 = 2 * 2 * 5 * 7 * 7 * 59, so a2 = 2, a5 = 1, a10 = 2, z = 7 * 7 * 59 = 2891.
In our example 33/59, 59 is a prime and contains no factors of 2 or 5, so a2 = a5 = a10 = 0.
In our example 44/65, 65 = 5*13, and a2 = 0, a5 = a10 = 1.
Just for reference I found a good online factoring calculator here. (even does totients which is important for the next step)
Step 2: Use Euler's Theorem or Carmichael's Theorem.
What we want is a number n such that 10^n - 1 is divisible by z, or in other words, 10^n ≡ 1 (mod z). Euler's function φ(z) and Carmichael's function λ(z) will both give you valid values for n, with λ(z) giving you the smaller number and φ(z) being perhaps a little easier to calculate. This isn't too hard, it just means factoring z and doing a little math.
φ(2891) = 7 * 6 * 58 = 2436
λ(2891) = lcm(7*6, 58) = 1218
This means that 10^2436 ≡ 10^1218 ≡ 1 (mod 2891).
For the simpler fraction 33/59, φ(59) = λ(59) = 58, so 10^58 ≡ 1 (mod 59).
For 44/65 = 44/(5*13), φ(13) = λ(13) = 12.
So what? Well, the period of the repeating decimal must divide both φ(z) and λ(z), so they effectively give you upper bounds on the period of the repeating decimal.
Step 3: More number crunching
Let's use n = λ(z). If we subtract Q' = 10^a10 * x/y from Q'' = 10^(a10 + n) * x/y, we get:
m = 10^a10 * (10^n - 1) * x/y
which is an integer because 10^a10 is a multiple of the factors of 2 and 5 of y, and 10^n - 1 is a multiple of the remaining factors of y.
What we've done here is to shift left the original number q by a10 places to get Q', and shift left q by a10 + n places to get Q'', which are repeating decimals, but the difference between them is an integer we can calculate.
Then we can rewrite x/y as m / 10^a10 / (10^n - 1).
Consider the example q = 44/65 = 44/(5*13)
a10 = 1, and λ(13) = 12, so Q' = 10^1 * q and Q'' = 10^(12+1) * q.
m = Q'' - Q' = (10^12 - 1) * 10^1 * (44/65) = 153846153846*44 = 6769230769224
so q = 6769230769224 / 10 / (10^12 - 1).
The other fractions 33/57820 and 33/59 lead to larger fractions.
Step 4: Find the nonrepeating and repeating decimal parts.
Notice that for k between 1 and 9, k/9 = 0.kkkkkkkkkkkkk...
Similarly, note that for a 2-digit number kl between 1 and 99, kl/99 = 0.klklklklklkl...
This generalizes: for k-digit patterns abc...ij, this number abc...ij/(10^k - 1) = 0.abc...ijabc...ijabc...ij...
If you follow the pattern, you'll see that what we have to do is to take this (potentially) huge integer m we got in the previous step, and write it as m = s*(10^n - 1) + r, where 1 ≤ r < 10^n - 1.
This leads to the final answer:
s is the non-repeating part
r is the repeating part (zero-padded on the left if necessary to ensure that it is n digits)
With a10 = 0, the decimal point is between the nonrepeating and repeating part; if a10 > 0 then it is located a10 places to the left of the junction between s and r.
For 44/65, we get 6769230769224 = 6 * (10^12 - 1) + 769230769230
s = 6, r = 769230769230, and 44/65 = 0.6(769230769230), where the parentheses here designate the repeated part.
You can make the numbers smaller by finding the smallest value of n in step 2, by starting with the Carmichael function λ(z) and seeing if any of its factors lead to values of n such that 10^n ≡ 1 (mod z).
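When z is small enough to loop over, the smallest such n can also be found without factoring anything, by multiplying by 10 modulo z until the value returns to 1. The sketch below does that after stripping the factors of 2 and 5 as in step 1; split_denominator and repetend_length are hypothetical names, and the lengths it reports assume x/y is already in lowest terms.

```python
def split_denominator(y):
    """Return (a2, a5, z) with y = 2^a2 * 5^a5 * z and z coprime to 10."""
    a2 = a5 = 0
    while y % 2 == 0:
        y //= 2
        a2 += 1
    while y % 5 == 0:
        y //= 5
        a5 += 1
    return a2, a5, y


def repetend_length(y):
    """Return (a10, n): nonrepeating length and smallest n with 10^n = 1 mod z."""
    a2, a5, z = split_denominator(y)
    a10 = max(a2, a5)
    if z == 1:
        return a10, 0  # terminating decimal, nothing repeats
    n, power = 1, 10 % z
    while power != 1:
        power = (power * 10) % z
        n += 1
    return a10, n


print(repetend_length(65), repetend_length(59), repetend_length(57820))
```

For 44/65 this gives n = 6 rather than λ(13) = 12, which is why the repeating part found above is the 6-digit block 769230 written twice.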
update: For the curious, the Python interpreter seems to be the easiest way to calculate with bignums. (pow(x,y) calculates x^y, and // and % are integer division and remainder, respectively.) Here's an example:
>>> N = pow(10,12)-1
>>> m = N*pow(10,1)*44//65
>>> m
6769230769224
>>> r=m%N
>>> r
769230769230
>>> s=m//N
>>> s
6
>>> 44/65
0.67692307692307696
>>> N = pow(10,58)-1
>>> m=N*33//59
>>> m
5593220338983050847457627118644067796610169491525423728813
>>> r=m%N
>>> r
5593220338983050847457627118644067796610169491525423728813
>>> s=m//N
>>> s
0
>>> 33/59
0.55932203389830504
>>> N = pow(10,1218)-1
>>> m = N*pow(10,2)*33//57820
>>> m
57073676928398478035281909373919059149083362158422691110342442061570390868211691
45624351435489450017295053614666205465236942234520927014873746108612936700103770
32168799723279142165340712556208924247665167762020062262193012798339674852992044
27533725354548599100657212037357315807679003804911795226565202352127291594603943
27222414389484607402282947077135939121411276374956762365963334486336907644413697
68246281563472846765824974057419578000691802144586648218609477689380837080594949
84434451746800415081286751988931165686613628502248356969906606710480802490487720
51193358699411968177101349014181943964026288481494292632307160152196471809062608
09408509166378415773088896575579384296091317883085437564856451054998270494638533
37945347630577654790729851262538913870632998962296783120027672085783465928744379
10757523348322379799377378069872016603251470079557246627464545140089934278796264
26841923209961950882047734347976478727084053960567277758561051539259771705292286
40608785887236250432376340366655136630923555863023175371843652715323417502594258
04219993081978554133517813905223106191629194050501556554825319958491871324801106
88343133863714977516430300933932895191975095122794880664130058803182289865098581
80560359737115185
>>> r=m%N
>>> r
57073676928398478035281909373919059149083362158422691110342442061570390868211691
45624351435489450017295053614666205465236942234520927014873746108612936700103770
32168799723279142165340712556208924247665167762020062262193012798339674852992044
27533725354548599100657212037357315807679003804911795226565202352127291594603943
27222414389484607402282947077135939121411276374956762365963334486336907644413697
68246281563472846765824974057419578000691802144586648218609477689380837080594949
84434451746800415081286751988931165686613628502248356969906606710480802490487720
51193358699411968177101349014181943964026288481494292632307160152196471809062608
09408509166378415773088896575579384296091317883085437564856451054998270494638533
37945347630577654790729851262538913870632998962296783120027672085783465928744379
10757523348322379799377378069872016603251470079557246627464545140089934278796264
26841923209961950882047734347976478727084053960567277758561051539259771705292286
40608785887236250432376340366655136630923555863023175371843652715323417502594258
04219993081978554133517813905223106191629194050501556554825319958491871324801106
88343133863714977516430300933932895191975095122794880664130058803182289865098581
80560359737115185
>>> s=m//N
>>> s
0
>>> 33/57820
0.00057073676928398479
with the overloaded Python % string operator usable for zero-padding, to see the full set of repeated digits:
>>> "%01218d" % r
'0570736769283984780352819093739190591490833621584226911103424420615703908682116
91456243514354894500172950536146662054652369422345209270148737461086129367001037
70321687997232791421653407125562089242476651677620200622621930127983396748529920
44275337253545485991006572120373573158076790038049117952265652023521272915946039
43272224143894846074022829470771359391214112763749567623659633344863369076444136
97682462815634728467658249740574195780006918021445866482186094776893808370805949
49844344517468004150812867519889311656866136285022483569699066067104808024904877
20511933586994119681771013490141819439640262884814942926323071601521964718090626
08094085091663784157730888965755793842960913178830854375648564510549982704946385
33379453476305776547907298512625389138706329989622967831200276720857834659287443
79107575233483223797993773780698720166032514700795572466274645451400899342787962
64268419232099619508820477343479764787270840539605672777585610515392597717052922
86406087858872362504323763403666551366309235558630231753718436527153234175025942
58042199930819785541335178139052231061916291940505015565548253199584918713248011
06883431338637149775164303009339328951919750951227948806641300588031822898650985
8180560359737115185'
As a general technique, rational fractions have a non-repeating part followed by a repeating part, like this:
nnn.xxxxxxxxrrrrrr
xxxxxxxx is the nonrepeating part and rrrrrr is the repeating part.
Determine the length of the nonrepeating part.
If the digit in question is in the nonrepeating part, then calculate it directly using division.
If the digit in question is in the repeating part, calculate its position within the repeating sequence (you now know the lengths of everything), and pick out the correct digit.
The above is a rough outline and would need more precision to implement in an actual algorithm, but it should get you started.
AHA! caffiend: your comment to my other (longer) answer (specifically "duplicate remainders") leads me to a very simple solution that is O(n) where n = the sum of the lengths of the nonrepeating + repeating parts, and requires only integer math with numbers between 0 and 10*y where y is the denominator.
Here's a Javascript function to get the nth digit to the right of the decimal point for the rational number x/y:
function digit(x,y,n)
{
  if (n == 0)
    return Math.floor(x/y)%10;
  return digit(10*(x%y),y,n-1);
}
It's recursive rather than iterative, and is not smart enough to detect cycles (the 10000th digit of 1/3 is obviously 3, but this keeps on going until it reaches the 10000th iteration), but it works at least until the stack runs out of memory.
Basically this works because of two facts:
the nth digit of x/y is the (n-1)th digit of 10x/y (example: the 6th digit of 1/7 is the 5th digit of 10/7 is the 4th digit of 100/7 etc.)
the nth digit of x/y is the nth digit of (x%y)/y (example: the 5th digit of 10/7 is also the 5th digit of 3/7)
We can tweak this to be an iterative routine and combine it with Floyd's cycle-finding algorithm (which I learned as the "rho" method from a Martin Gardner column) to get something that shortcuts this approach.
Here's a javascript function that computes a solution with this approach:
function digit(x,y,n,returnstruct)
{
  function kernel(x,y) { return 10*(x%y); }
  var period = 0;
  var x1 = x;
  var x2 = x;
  var i = 0;
  while (n > 0)
  {
    n--;
    i++;
    x1 = kernel(x1,y); // iterate once
    x2 = kernel(x2,y);
    x2 = kernel(x2,y); // iterate twice
    // have both 1x and 2x iterations reached the same state?
    if (x1 == x2)
    {
      period = i;
      n = n % period;
      i = 0;
      // start again in case the nonrepeating part gave us a
      // multiple of the period rather than the period itself
    }
  }
  var answer=Math.floor(x1/y);
  if (returnstruct)
    return {period: period, digit: answer,
      toString: function()
      {
        return 'period='+this.period+',digit='+this.digit;
      }};
  else
    return answer;
}
And an example of running the nth digit of 1/700:
js>1/700
0.0014285714285714286
js>n=10000000
10000000
js>rs=digit(1,700,n,true)
period=6,digit=4
js>n%6
4
js>rs=digit(1,700,4,true)
period=0,digit=4
Same thing for 33/59:
js>33/59
0.559322033898305
js>rs=digit(33,59,3,true)
period=0,digit=9
js>rs=digit(33,59,61,true)
period=58,digit=9
js>rs=digit(33,59,61+58,true)
period=58,digit=9
And 122222/990000 (long nonrepeating part):
js>122222/990000
0.12345656565656565
js>digit(122222,990000,5,true)
period=0,digit=5
js>digit(122222,990000,7,true)
period=6,digit=5
js>digit(122222,990000,9,true)
period=2,digit=5
js>digit(122222,990000,9999,true)
period=2,digit=5
js>digit(122222,990000,10000,true)
period=2,digit=6
Here's another function that finds a stretch of digits:
// find digits n1 through n2 of x/y
function digits(x,y,n1,n2,returnstruct)
{
  function kernel(x,y) { return 10*(x%y); }
  var period = 0;
  var x1 = x;
  var x2 = x;
  var i = 0;
  var answer='';
  while (n2 >= 0)
  {
    // time to print out digits?
    if (n1 <= 0)
      answer = answer + Math.floor(x1/y);
    n1--,n2--;
    i++;
    x1 = kernel(x1,y); // iterate once
    x2 = kernel(x2,y);
    x2 = kernel(x2,y); // iterate twice
    // have both 1x and 2x iterations reached the same state?
    if (x1 == x2)
    {
      period = i;
      if (n1 > period)
      {
        var jumpahead = n1 - (n1 % period);
        n1 -= jumpahead, n2 -= jumpahead;
      }
      i = 0;
      // start again in case the nonrepeating part gave us a
      // multiple of the period rather than the period itself
    }
  }
  if (returnstruct)
    return {period: period, digits: answer,
      toString: function()
      {
        return 'period='+this.period+',digits='+this.digits;
      }};
  else
    return answer;
}
I've included the results for your answer (assuming that Javascript #'s didn't overflow):
js>digits(1,7,1,7,true)
period=6,digits=1428571
js>digits(1,7,601,607,true)
period=6,digits=1428571
js>1/7
0.14285714285714285
js>digits(2124679,214748367,214748300,214748400,true)
period=1759780,digits=20513882650385881630475914166090026658968726872786883636698387559799232373208220950057329190307649696
js>digits(122222,990000,100,110,true)
period=2,digits=65656565656
Ad hoc I have no good idea. Maybe continued fractions can help. I am going to think a bit about it ...
UPDATE
From Fermat's little theorem and because 39 is prime the following holds. (= indicates congruence)
10^39 = 10 (39)
Because 10 is coprime to 39.
10^(39 - 1) = 1 (39)
10^38 - 1 = 0 (39)
[to be continued tomorrow]
I was too tired to recognize that 39 is not prime ... ^^ I am going to update the answer in the next few days and present the whole idea. Thanks for noting that 39 is not prime.
The short answer for a/b with a < b and an assumed period length p ...
calculate k = (10^p - 1) / b and verify that it is an integer, otherwise a/b does not have a period of p
calculate c = k * a
convert c to its decimal representation and left pad it with zeros to a total length of p
the i-th digit after the decimal point is the (i mod p)-th digit of the padded decimal representation (i = 0 is the first digit after the decimal point - we are developers)
Example
a = 3
b = 7
p = 6
k = (10^6 - 1) / 7
= 142,857
c = 142,857 * 3
= 428,571
Padding is not required and we conclude.
3     ______
- = 0.428571
7
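The four steps of this short answer translate almost line for line into Python. The sketch below uses a hypothetical function name, digit_from_period, and assumes, as the answer does, that a < b and that the expansion is purely periodic (b coprime to 10). Note that it relies on Python's arbitrary-size integers for 10^p - 1, so it does not avoid bignums.

```python
def digit_from_period(a, b, p, i):
    """Return the i-th digit (0-based) after the decimal point of a/b."""
    k, rem = divmod(10 ** p - 1, b)
    if rem != 0:
        raise ValueError("a/b does not have a period of p")
    c = k * a
    padded = str(c).zfill(p)   # left pad with zeros to length p
    return int(padded[i % p])  # the digits repeat every p positions


print([digit_from_period(3, 7, 6, i) for i in range(8)])
```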