How to reduce calculation of average to sub-sets in a general way? - language-agnostic

Edit: Since it appears nobody is reading the original question this links to, let me bring in a synopsis of it here.
The original problem, as asked by someone else, was that, given a large number of values, where the sum would exceed what a data type of Double would hold, how can one calculate the average of those values.
There were several answers that said to calculate in sets, like taking 50 and 50 numbers, calculating the average inside those sets, and then finally taking the average of all those sets and combining those to get the final average value.
My position was that unless you can guarantee that all those values can be split into a number of equally sized sets, you cannot use this approach. Someone dared me to ask the question here, in order to provide the answer, so here it is.
Basically, given an arbitrary number of values, where:
I know the number of values beforehand (but again, how would your answer change if you didn't?)
I cannot gather up all the numbers, nor can I sum them (the sum will be too big for a normal data type in your programming language)
how can I calculate the average?
The rest of the question here outlines how, and the problems with, the approach to split into equally sized sets, but I'd really just like to know how you can do it.
Note that I know enough math to know that, in mathematical terms, calculating the sum of A[1..N]/N will give me the average. Let's assume that there are reasons why it isn't that simple, that I need to split up the workload, and that the number of values isn't necessarily going to be divisible by 3, 7, 50, 1000 or whatever.
In other words, the solution I'm after will have to be general.
From this question:
What is a good solution for calculating an average where the sum of all values exceeds a double’s limits?
my position was that splitting the workload up into sets is no good, unless you can ensure that the size of those sets are equal.
Edit: The original question was about the upper limit that a particular data type could hold, and since he was summing up a lot of numbers (count that was given as example was 10^9), the data type could not hold the sum. Since this was a problem in the original solution, I'm assuming (and this is a prerequisite for my question, sorry for missing that) that the numbers are too big to give any meaningful answers.
So, dividing by the total number of values directly is out. The original reason for why a normal SUM/COUNT solution was out was that SUM would overflow, but let's assume, for this question that SET-SET/SET-SIZE will underflow, or whatever.
The important part is that I cannot simply sum, I cannot simply divide by the number of total values. If I cannot do that, will my approach work, or not, and what can I do to fix it?
Let me outline the problem.
Let's assume you're going to calculate the average of the numbers 1 through 6, but you cannot (for whatever reason) do so by summing the numbers, counting the numbers, and then dividing the sum by the count. In other words, you cannot simply do (1+2+3+4+5+6)/6.
In other words, SUM(1..6)/COUNT(1..6) is out. We're not considering NULL's (as in database NULL's) here.
Several of the answers to that question alluded to being able to split the numbers being averaged into sets, say 3 or 50 or 1000 numbers, then calculating some number for that, and then finally combining those values to get the final average.
My position is that this is not possible in the general case, since this will make some numbers, the ones appearing in the final set, more or less valuable than all the ones in the previous sets, unless you can split all the numbers into equally sized sets.
For instance, to calculate the average of 1-6, you can split it up into sets of 3 numbers like this:
/ 1   2   3 \   / 4   5   6 \
| - + - + - | + | - + - + - |
\ 3   3   3 /   \ 3   3   3 /   <-- 3 because 3 numbers in the set
 -----------     -----------
      2               2         <-- 2 because 2 equally sized groups
Which gives you this:
2   5
- + - = 3.5
2   2
(note: (1+2+3+4+5+6)/6 = 3.5, so this is correct here)
However, my point is that once the number of values cannot be split into a number of equally sized sets, this method falls apart. For instance, what about the sequence 1-7, which contains a prime number of values?
Can a similar approach, that won't sum all the values, and count all the values, in one go, work?
So, is there such an approach? How do I calculate the average of an arbitrary number of values in which the following holds true:
I cannot do a normal sum/count approach, for whatever reason
I know the number of values beforehand (what if I don't, will that change the answer?)

Well, suppose you added three numbers and divided by three, and then added two numbers and divided by two. Can you get the average from these?
x = (a + b + c) / 3
y = (d + e) / 2
z = (f + g) / 2
And you want
r = (a + b + c + d + e + f + g) / 7
That is equal to
r = (3 * (a + b + c) / 3 + 2 * (d + e) / 2 + 2 * (f + g) / 2) / 7
r = (3 * x + 2 * y + 2 * z) / 7
Both lines above overflow, of course, but since division is distributive, we do
r = (3.0 / 7.0) * x + (2.0 / 7.0) * y + (2.0 / 7.0) * z
Which guarantees that you won't overflow, as I'm multiplying x, y and z by fractions less than one.
This is the fundamental point here: I am neither dividing all the numbers by the total count beforehand, nor am I ever exceeding the overflow limit.
So... if you keep adding to an accumulator, keep track of how many numbers you have added, and always test whether the next number will cause an overflow, you can then get partial averages and compute the final average.
And no, if you don't know the number of values beforehand, it doesn't change anything (provided that you can count them as you sum them).
Here is a Scala function that does it. It's not idiomatic Scala, so that it can be more easily understood:
def avg(input: List[Double]): Double = {
  // Each entry is (average of a partial run, number of values in that run)
  var partialAverages: List[(Double, Int)] = Nil
  var inputLength = 0
  var currentSum = 0.0
  var currentCount = 0
  var numbers = input

  while (numbers.nonEmpty) {
    val number = numbers.head
    val rest = numbers.tail
    // Close the current run if adding the next number would overflow
    if (number > 0 && currentSum > 0 && Double.MaxValue - currentSum < number) {
      partialAverages = (currentSum / currentCount, currentCount) :: partialAverages
      currentSum = 0
      currentCount = 0
    } else if (number < 0 && currentSum < 0 && Double.MinValue - currentSum > number) {
      partialAverages = (currentSum / currentCount, currentCount) :: partialAverages
      currentSum = 0
      currentCount = 0
    }
    currentSum += number
    currentCount += 1
    inputLength += 1
    numbers = rest
  }
  partialAverages = (currentSum / currentCount, currentCount) :: partialAverages

  // Weight each partial average by its share of the total count
  var result = 0.0
  while (partialAverages.nonEmpty) {
    val ((partialAverage, partialCount) :: rest) = partialAverages
    result += partialAverage * (partialCount.toDouble / inputLength)
    partialAverages = rest
  }

  result
}
EDIT:
Won't multiplying with 2, and 3, get me back into the range of "not supported by the data type"?
No. If you were dividing by 7 at the end, absolutely. But here you are dividing at each step of the sum. Even in your real case the weights (2/7 and 3/7) would be in the range of manageable numbers (e.g. 1/10 ~ 1/10000) which wouldn't make a big difference compared to your weight (i.e. 1).
PS: I wonder why I'm working on this answer instead of writing mine where I can earn my rep :-)

If you know the number of values beforehand (say it's N), you just add 1/N + 2/N + 3/N etc, supposing that you had values 1, 2, 3. You can split this into as many calculations as you like, and just add up your results. It may lead to a slight loss of precision, but this shouldn't be an issue unless you also need a super-accurate result.
If you don't know the number of items ahead of time, you might have to be more creative. But you can, again, do it progressively. Say the list is 1, 2, 3, 4. Start with mean = 1. Then mean = mean*(1/2) + 2*(1/2). Then mean = mean*(2/3) + 3*(1/3). Then mean = mean*(3/4) + 4*(1/4) etc. It's easy to generalize, and you just have to make sure the bracketed quantities are calculated in advance, to prevent overflow.
Of course, if you want extreme accuracy (say, more than 0.001% accuracy), you may need to be a bit more careful than this, but otherwise you should be fine.
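A minimal Python sketch of the progressive update described above, assuming the values arrive one at a time. The name `running_mean` is illustrative and not from the original answer. Both weights are computed before they touch the data, so no intermediate value grows beyond the magnitude of the inputs.

```python
def running_mean(values):
    """Mean of an iterable without ever forming the full sum."""
    mean = 0.0
    for k, x in enumerate(values, start=1):
        w_old = (k - 1) / k   # weight of the mean of the first k-1 values
        w_new = 1 / k         # weight of the newly seen value
        mean = mean * w_old + x * w_new
    return mean


print(running_mean([1, 2, 3, 4]))
# Four values near the top of the double range: the plain sum would be inf,
# the running mean stays finite.
print(running_mean([1e308] * 4))
```

Each step costs two multiplications, so this trades some speed and a little rounding error for never holding a large sum.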

Let X be your sample set. Partition it into two sets A and B in any way that you like. Define delta = m_B - m_A where m_S denotes the mean of a set S. Then
m_X = m_A + delta * |B| / |X|
where |S| denotes the cardinality of a set S. Now you can repeatedly apply this to partition and calculate the mean.
Why is this true? Let s = 1 / |A| and t = 1 / |B| and u = 1 / |X| (for convenience of notation) and let aSigma and bSigma denote the sum of the elements in A and B respectively so that:
m_A + delta * |B| / |X|
= s * aSigma + u * |B| * (t * bSigma - s * aSigma)
= s * aSigma + u * (bSigma - |B| * s * aSigma)
= s * aSigma + u * bSigma - u * |B| * s * aSigma
= s * aSigma * (1 - u * |B|) + u * bSigma
= s * aSigma * (u * |X| - u * |B|) + u * bSigma
= s * u * aSigma * (|X| - |B|) + u * bSigma
= s * u * aSigma * |A| + u * bSigma
= u * aSigma + u * bSigma
= u * (aSigma + bSigma)
= u * (xSigma)
= xSigma / |X|
= m_X
The proof is complete.
From here it is obvious how to use this to either recursively compute a mean (say by repeatedly splitting a set in half) or how to use this to parallelize the computation of the mean of a set.
The well-known on-line algorithm for calculating the mean is just a special case of this. This is the algorithm that if m is the mean of {x_1, x_2, ... , x_n} then the mean of {x_1, x_2, ..., x_n, x_(n+1)} is m + (x_(n+1) - m) / (n + 1). So with X = {x_1, x_2, ..., x_(n+1)}, A = {x_1, x_2, ..., x_n}, and B = {x_(n+1)} we have delta = x_(n+1) - m, |B| = 1 and |X| = n + 1, and we recover the on-line algorithm.
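As a rough illustration, the identity m_X = m_A + delta * |B| / |X| can be applied recursively in Python. The function names here are made up for this sketch, and it assumes that delta itself does not overflow (that is, the two partial means are not huge values of opposite sign).

```python
def merge_means(mean_a, count_a, mean_b, count_b):
    """Combine the means of two disjoint groups: m_X = m_A + delta * |B| / |X|."""
    total = count_a + count_b
    delta = mean_b - mean_a
    return mean_a + delta * (count_b / total), total


def mean_and_count(values):
    """Mean by recursive halving; the two halves need not be the same size."""
    if not values:
        raise ValueError("mean of an empty sequence is undefined")
    if len(values) == 1:
        return float(values[0]), 1
    half = len(values) // 2
    mean_a, count_a = mean_and_count(values[:half])
    mean_b, count_b = mean_and_count(values[half:])
    return merge_means(mean_a, count_a, mean_b, count_b)


# Seven values: a prime count, so no split into equally sized sets exists.
print(mean_and_count([1, 2, 3, 4, 5, 6, 7]))
```

Because `merge_means` only needs the two means and the two counts, the halves could just as well be computed on different machines and merged afterwards.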

Thinking outside the box: Use the median instead. It's much easier to calculate - there are tons of algorithms out there (e.g. using queues), you can often construct good arguments as to why it's more meaningful for data sets (less swayed by extreme values; etc) and you will have zero problems with numerical accuracy. It will be fast and efficient. Plus, for large data sets (which it sounds like you have), unless the distributions are truly weird, the values for the mean and median will be similar.

When you split the numbers into sets you're just dividing by the total number or am I missing something?
You have written it as
/ 1   2   3 \   / 4   5   6 \
| - + - + - | + | - + - + - |
\ 3   3   3 /   \ 3   3   3 /
 -----------     -----------
      2               2
but that's just
/ 1   2   3 \   / 4   5   6 \
| - + - + - | + | - + - + - |
\ 6   6   6 /   \ 6   6   6 /
so for the numbers from 1 to 7 one possible grouping is just
/ 1   2   3 \   / 4   5   6 \   / 7 \
| - + - + - | + | - + - + - | + | - |
\ 7   7   7 /   \ 7   7   7 /   \ 7 /

Average of x_1 .. x_N
= (Sum(i=1,N,x_i)) / N
= (Sum(i=1,M,x_i) + Sum(i=M+1,N,x_i)) / N
= (Sum(i=1,M,x_i)) / N + (Sum(i=M+1,N,x_i)) / N
This can be repeatedly applied, and is true regardless of whether the summations are of equal size. So:
1. Keep adding terms until both:
   - adding another one will overflow (or otherwise lose precision)
   - dividing by N will not underflow
2. Divide the sum by N
3. Add the result to the average-so-far
There's one obvious awkward case, which is that there are some very small terms at the end of the sequence, such that you run out of values before you satisfy the condition "dividing by N will not underflow". In which case just discard those values - if their contribution to the average cannot be represented in your floating type, then it is in particular smaller than the precision of your average. So it doesn't make any difference to the result whether you include those terms or not.
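The steps above can be sketched in Python as follows. This is a simplified sketch, not the answer's exact procedure: it assumes all inputs are finite, it detects overflow by checking whether the next addition would produce infinity, and it leaves out the underflow check. The name `chunked_mean` is illustrative.

```python
import math


def chunked_mean(values, n):
    """Average of n values, dividing each partial sum by n before it can overflow."""
    average = 0.0
    partial = 0.0
    for x in values:
        if math.isinf(partial + x):   # adding x would overflow: flush the chunk first
            average += partial / n
            partial = 0.0
        partial += x
    return average + partial / n      # flush whatever is left


print(chunked_mean([1, 2, 3, 4, 5, 6, 7], 7))
print(chunked_mean([1e308] * 4, 4))   # chunks of unequal, data-dependent size
```

The chunk boundaries fall wherever an overflow is about to happen, so the chunks are generally not the same size, and that does not matter because every chunk is divided by the same N.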
There are also some less obvious awkward cases to do with loss of precision on individual summations. For example, what's the average of the values:
10^100, 1, -10^100
Mathematics says it's 1/3, but floating-point arithmetic says it depends on the order in which you add up the terms, and in 4 of the 6 possible orders it's 0, because (10^100) + 1 = 10^100. But I think that the non-associativity of floating-point addition is a different and more general problem than this question. If sorting the input is out of the question, I think there are things you can do where you maintain lots of accumulators of different magnitudes, and add each new value to whichever one of them will give best precision. But I don't really know.
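One existing implementation of the "several accumulators" idea is `math.fsum` in the Python standard library, which tracks multiple intermediate partial sums to avoid losing precision. The sketch below compares it with a plain left-to-right loop on the three values from the example. It only addresses precision loss, not overflow, and `naive_sum` is a helper written for this illustration.

```python
import math
from itertools import permutations


def naive_sum(values):
    """Plain left-to-right floating-point addition."""
    total = 0.0
    for x in values:
        total += x
    return total


values = [1e100, 1.0, -1e100]
for order in permutations(values):
    naive_mean = naive_sum(order) / len(order)
    exact_mean = math.fsum(order) / len(order)
    print(order, naive_mean, exact_mean)
```

The loop is written out by hand rather than using the built-in `sum`, because newer Python versions apply compensated summation inside `sum` for floats, which would hide the effect being shown.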

Here's another approach. You're 'receiving' numbers one-by-one from some source, but you can keep track of the mean at each step.
First, I will write out the formula for mean at step n+1:
mean[n+1] = mean[n] - (mean[n] - x[n+1]) / (n+1)
with the initial condition:
mean[1] = x[1]
(the index starts at one, so that mean[n] is the mean of the first n values).
The first equation can be simplified to:
mean[n+1] = n * mean[n] / (n+1) + x[n+1]/(n+1)
The idea is that you keep track of the mean, and when you 'receive' the next value in your sequence, you figure out its offset from the current mean, and divide it equally between the n+1 samples seen so far, and adjust your mean accordingly. If your numbers don't have a lot of variance, your running mean will need to be adjusted very slightly with the new numbers as n becomes large.
Obviously, this method works even if you don't know the total number of values when you start. It has an additional advantage that you know the value of the current mean at all times. One disadvantage that I can think of is that it probably gives more 'weight' to the numbers seen in the beginning (not in a strict mathematical sense, but because of floating point representations).
Finally, all such calculations are bound to run into floating-point 'errors' if one is not careful enough. See my answer to another question for some of the problems with floating point calculations and how to test for potential problems.
As a test, I generated N=100000 normally distributed random numbers with mean zero and variance 1. Then I calculated their mean by three methods.
sum(numbers) / N, call it m1,
my method above, call it m2,
sort the numbers, and then use my method above, call it m3.
Here's what I found: m1 − m2 ∼ −4.6×10^−17, m1 − m3 ∼ −3×10^−15, m2 − m3 ∼ −3×10^−15. So, if your numbers are sorted, the error might not be small enough for you. (Note however that even the worst error is 10^−15 parts in 1 for 100000 numbers, so it might be good enough anyway.)

Some of the mathematical solutions here are very good. Here's a simple technical solution.
Use a larger data type. This breaks down into two possibilities:
Use a high-precision floating point library. One who encounters a need to average a billion numbers probably has the resources to purchase, or the brain power to write, a 128-bit (or longer) floating point library.
I understand the drawbacks here. It would certainly be slower than using intrinsic types. You still might over/underflow if the number of values grows too high. Yada yada.
If your values are integers or can be easily scaled to integers, keep your sum in a list of integers. When you overflow, simply add another integer. This is essentially a simplified implementation of the first option. A simple (untested) example in C# follows
class BigMeanSet{
List<uint> list = new List<uint>();
public double GetAverage(IEnumerable<uint> values){
list.Clear();
list.Add(0);
uint count = 0;
foreach(uint value in values){
Add(0, value);
count++;
}
return DivideBy(count);
}
void Add(int listIndex, uint value){
if((list[listIndex] += value) < value){ // then overflow has occurred
if(list.Count == listIndex + 1)
list.Add(0);
Add(listIndex + 1, 1);
}
}
double DivideBy(uint count){
const double shift = 4.0 * 1024 * 1024 * 1024;
double rtn = 0;
long remainder = 0;
for(int i = list.Count - 1; i >= 0; i--){
rtn *= shift;
remainder <<= 32;
rtn += Math.DivRem(remainder + list[i], count, out remainder);
}
rtn += remainder / (double)count;
return rtn;
}
}
Like I said, this is untested—I don't have a billion values I really want to average—so I've probably made a mistake or two, especially in the DivideBy function, but it should demonstrate the general idea.
This should provide as much accuracy as a double can represent and should work for any number of 32-bit elements, up to 2^32 - 1. If more elements are needed, then the count variable will need to be expanded and the DivideBy function will increase in complexity, but I'll leave that as an exercise for the reader.
In terms of efficiency, it should be as fast or faster than any other technique here, as it only requires iterating through the list once, only performs one division operation (well, one set of them), and does most of its work with integers. I didn't optimize it, though, and I'm pretty certain it could be made slightly faster still if necessary. Ditching the recursive function call and list indexing would be a good start. Again, an exercise for the reader. The code is intended to be easy to understand.
If anybody more motivated than I am at the moment feels like verifying the correctness of the code, and fixing whatever problems there might be, please be my guest.
I've now tested this code, and made a couple of small corrections (a missing pair of parentheses in the List<uint> constructor call, and an incorrect divisor in the final division of the DivideBy function).
I tested it by first running it through 1000 sets of random length (ranging between 1 and 1000) filled with random integers (ranging between 0 and 2^32 - 1). These were sets for which I could easily and quickly verify accuracy by also running a canonical mean on them.
I then tested with 100* large series, with random length between 10^5 and 10^9. The lower and upper bounds of these series were also chosen at random, constrained so that the series would fit within the range of a 32-bit integer. For any series, the results are easily verifiable as (lowerbound + upperbound) / 2.
*Okay, that's a little white lie. I aborted the large-series test after about 20 or 30 successful runs. A series of length 10^9 takes just under a minute and a half to run on my machine, so half an hour or so of testing this routine was enough for my tastes.
For those interested, my test code is below:
static IEnumerable<uint> GetSeries(uint lowerbound, uint upperbound){
for(uint i = lowerbound; i <= upperbound; i++)
yield return i;
}
static void Test(){
Console.BufferHeight = 1200;
Random rnd = new Random();
for(int i = 0; i < 1000; i++){
uint[] numbers = new uint[rnd.Next(1, 1000)];
for(int j = 0; j < numbers.Length; j++)
numbers[j] = (uint)rnd.Next();
double sum = 0;
foreach(uint n in numbers)
sum += n;
double avg = sum / numbers.Length;
double ans = new BigMeanSet().GetAverage(numbers);
Console.WriteLine("{0}: {1} - {2} = {3}", numbers.Length, avg, ans, avg - ans);
if(avg != ans)
Debugger.Break();
}
for(int i = 0; i < 100; i++){
uint length = (uint)rnd.Next(100000, 1000000001);
uint lowerbound = (uint)rnd.Next(int.MaxValue - (int)length);
uint upperbound = lowerbound + length;
double avg = ((double)lowerbound + upperbound) / 2;
double ans = new BigMeanSet().GetAverage(GetSeries(lowerbound, upperbound));
Console.WriteLine("{0}: {1} - {2} = {3}", length, avg, ans, avg - ans);
if(avg != ans)
Debugger.Break();
}
}

Related

Time complexity of this function?

algo(n)
for i in 0 to n {
for 0 to 8^i {
}
}
for i to 8^d {
}
Any kind of analysis or information about the time complexity of this algorithm will be useful. Worst case, best case, lower/upper bounds, theta/omega/big-o, recurrence relation, etc.
Your algorithm runs in exponential time (T ∈ Θ(c^n), c>1). You can analyse the number of iterations of the inner for loop (... for 0 to 8^i) using Sigma notation: Σ_{i=0}^{n} 8^i = (8^(n+1) - 1) / 7, which is in Θ(8^n).
Since your algorithm is in Θ(8^n), it is also in O(8^n) (upper asymptotic bound) and Ω(8^n) (lower asymptotic bound).
The above analysis is performed under the assumption that the d in the final for loop analysis is less or equal to n, in which case the nested two for loop prior to it will dominate (and hence we needn't analyze the last non-dominant for loop explicitly).
algo(n) is basically made of two parts:
for i in 0 to n
for 0 to 8^i
and
for i to 8^d
Let's start with the first. Assuming each iteration of the inner loop takes constant time, its complexity is C*8^i.
Now, if we sum it across possible values of i we get:
8^0 + 8^1 + 8^2 + ... + 8^(n-1)
This is the sum of a geometric series with a=1, r=8 and n terms, and its sum is:
1 * (1 - 8^n) / (1 - 8) = (8^n - 1) / 7
For n->infinity, this can be approximated as 8^n/7, and we can conclude the complexity is Θ(8^n/7) = Θ(8^n)
As for the 2nd part, it is pretty straight forward and is 8^d.
This gives total complexity of T(n) is in Θ(8^d + 8^n)
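As a quick sanity check of the nested-loop part, the following Python sketch counts the inner-loop iterations directly and compares them with the closed form (8^n - 1) / 7 for the sum 8^0 + 8^1 + ... + 8^(n-1). It assumes, as in the derivation above, that i runs from 0 to n-1; the function name is illustrative.

```python
def inner_iterations(n):
    """Count how often the innermost loop body runs for i = 0 .. n-1."""
    count = 0
    for i in range(n):
        for _ in range(8 ** i):
            count += 1
    return count


for n in range(1, 7):
    counted = inner_iterations(n)
    closed_form = (8 ** n - 1) // 7
    print(n, counted, closed_form, counted == closed_form)
```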

as3 Number precision

I am trying to round some numbers to two decimal places and I ran into some bizarre behavior.
please try the following code:
var num:Number = 30.25
for (var i = 0 ; i < 100 ; i++){
var a:Number = (Math.round(num * 100) / 100)
var b:Number = (Math.round(num * 100) * 0.01 )
trace (num.toString() + " -- " + a.toString() + " -- " + b.toString())
num += 0.999;
}
x = y /100 and x = y * 0.01 should be equal.
(And x = y * 0.01 should be faster).
But if I run the above code the result is not always equal.
I get for example
46.23400000000003 -- 46.23 -- 46.230000000000004
47.23300000000003 -- 47.23 -- 47.230000000000004
48.232000000000035 -- 48.23 -- 48.230000000000004
49.23100000000004 -- 49.23 -- 49.230000000000004
while x=y/100 is always correct x=y*0.01 sometimes adds a small value like 0.000000000000004 at the end.
Am I doing something wrong?
Has anyone else observed this behavior?
In general, in floating point computations you should try to avoid numbers of really different magnitude in the same calculation. That's precisely the issue with these types: the point "floats", so you want to keep the point of one number of the computation close to the point of the other number.
Your question is simply put as
Why is 4623/100 == 46.23 but 4623*0.01 == 46.230000000000004?
For the specific reason, you can dig in the specific of floating point computation, for example here.
4623 is 4.623*10^3 while 0.01 is 1*10^{-2}; notice how different the exponents are (5 orders of magnitude apart), while 100 is just 1*10^{2}, much "closer" to 4.623*10^3.
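A further point that may help here: 100 is stored exactly in a binary double, but 0.01 is not, so multiplying by 0.01 carries the representation error of 0.01 into the result. The Python sketch below shows this with the `decimal` module; it assumes Python floats behave like AS3's Number type, which holds on typical platforms since both are IEEE 754 doubles.

```python
from decimal import Decimal

y = 4623

print(y / 100)            # division by an exactly representable value
print(y * 0.01)           # multiplication by the nearest double to 0.01

# Decimal(float) shows the exact binary value that is actually stored.
print(Decimal(100.0))     # exactly 100
print(Decimal(0.01))      # close to, but not exactly, 0.01
```

So if the goal is two-decimal rounding, dividing by 100 is the safer of the two forms, whatever their relative speed.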

Most efficient way to search a sorted matrix?

I have an assignment to write an algorithm (not in any particular language, just pseudo-code) that receives a matrix [size: M x N] that is sorted in such a way that all of its rows are sorted and all of its columns are sorted individually, and finds a certain value within this matrix. I need to write the most time-efficient algorithm I can think of.
The matrix looks something like:
1 3 5
4 6 8
7 9 10
My idea is to start at the first row and last column and simply check the value, if it's bigger go down and if it's smaller than go left and keep doing so until the value is found or until the indexes are out of bounds (in case the value does not exist). This algorithm works at linear complexity O(m+n). I've been told that it's possible to do so with a logarithmic complexity. Is it possible? and if so, how?
Your matrix looks like this:
a ..... b ..... c
.       .       .
.   1   .   2   .
.       .       .
d ..... e ..... f
.       .       .
.   3   .   4   .
.       .       .
g ..... h ..... i
and has following properties:
a,c,g < i
a,b,d < e
b,c,e < f
d,e,g < h
e,f,h < i
So the value in the lower-right corner (e.g. i) is always the biggest in the whole matrix
and this property is recursive if you divide matrix into 4 equal pieces.
So we could try to use binary search:
1. probe for value,
2. divide into pieces,
3. choose correct piece (somehow),
4. goto 1 with new piece.
Hence algorithm could look like this:
input: X - value to be searched
until found
divide matrix into 4 equal pieces
get e,f,h,i as shown on picture
if (e or f or h or i) equals X then
return found
if X < e then quarter := 1
if X < f then quarter := 2
if X < h then quarter := 3
if X < i then quarter := 4
if no quarter assigned then
return not_found
make smaller matrix from chosen quarter
This looks to me like O(log n), where n is the number of elements in the matrix. It is a kind of binary search, but in two dimensions. I cannot prove it formally, but it resembles a typical binary search.
and that's how the sample input looks? Sorted by diagonals? That's an interesting sort, to be sure.
Since the following row may have a value that's lower than any value on this row, you can't assume anything in particular about a given row of data.
I would (if asked to do this over a large input) read the matrix into a list-struct that took the data as one pair of a tuple, and the mxn coord as the part of the tuple, and then quicksort the matrix once, then find it by value.
Alternately, if the value of each individual location is unique, toss the MxN data into a dictionary keyed on the value, then jump to the dictionary entry of the MxN based on the key of the input (or the hash of the key of the input).
EDIT:
Notice that the answer I give above is valid if you're going to look through the matrix more than once. If you only need to parse it once, then this is as fast as you can do it:
for (int i = 0; i<M; i++)
for (int j=0; j<N; j++)
if (mat[i][j] == value) return tuple(i,j);
Apparently my comment on the question should go down here too :|
@sagar but that's not the example given by the professor. Otherwise he had the fastest method above (check the end of the row first, then proceed). Additionally, checking the end of the middlest row first would be faster, a bit of a binary search.
Checking the end of each row (and starting on the end of the middle row) to find a number higher than the checked for number on an in memory array would be fastest, then doing a binary search on each matching row till you find it.
in log M you can get a range of rows able to contain the target (binary search on the first value of rows, binary search on last value of rows, keep only those rows whose first <= target and last >= target) two binary searches is still O(log M)
then in O(log N) you can explore each of these rows, with again, a binary search!
that makes it O(logM x logN)
tadaaaa
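public static boolean find(int a[][], int rows, int cols, int x) {
    int m = 0;            // start at the top-right corner
    int n = cols - 1;
    while (m < rows && n >= 0) {
        if (a[m][n] == x)
            return true;
        else if (a[m][n] > x)
            n--;          // everything below in this column is even larger
        else
            m++;          // everything to the left in this row is even smaller
    }
    return false;         // walked off the matrix without finding x
}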
what about getting the diagonal out, then binary search over the diagonal, start bottom right check if it is above, if yes take the diagonal array position as the column it is in, if not then check if it is below. each time running a binary search on the column once you have a hit on the diagonal (using the array position of the diagonal as the column index). I think this is what was stated by #user942640
you could get the running time of the above and when required (at some point) swap the algo to do a binary search on the initial diagonal array (this is taking into consideration its n * n elements and getting x or y length is O(1) as x.length = y.length. even on a million * million binary search the diagonal if it is less then half step back up the diagonal, if it is not less then binary search back towards where you where (this is a slight change to the algo when doing a binary search along the diagonal). I think the diagonal is better than the binary search down the rows, Im just to tired at the moment to look at the maths :)
by the way I believe running time is slightly different to analysis which you would describe in terms of best/worst/avg case, and time against memory size etc. so the question would be better stated as in 'what is the best running time in worst case analysis', because in best case you could do a brute linear scan and the item could be in the first position and this would be a better 'running time' than binary search...
Here is a lower bound of n. Start with an unsorted array A of length n. Construct a new matrix M according to the following rule: the secondary diagonal contains the array A, everything above it is minus infinity, everything below it is plus infinity. The rows and columns are sorted, and looking for an entry in M is the same as looking for an entry in A.
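The construction can be written out in a few lines of Python to check that it really yields sorted rows and columns for an arbitrary, unsorted array. The names `embed` and `is_sorted` are illustrative, and `math.inf` stands in for plus and minus infinity.

```python
import math


def embed(array):
    """n x n matrix with `array` on the secondary diagonal, -inf above it, +inf below it."""
    n = len(array)
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            if i + j < n - 1:
                row.append(-math.inf)   # above the secondary diagonal
            elif i + j > n - 1:
                row.append(math.inf)    # below the secondary diagonal
            else:
                row.append(array[i])    # on the secondary diagonal
        matrix.append(row)
    return matrix


def is_sorted(seq):
    return all(a <= b for a, b in zip(seq, seq[1:]))


m = embed([7, 2, 9, 4])
print(all(is_sorted(row) for row in m))          # rows are sorted
print(all(is_sorted(col) for col in zip(*m)))    # columns are sorted
```

Since the array on the diagonal is in no particular order, any search over such a matrix has to be able to inspect each diagonal entry in the worst case.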
This is in the vein of Michal's answer (from which I will steal the nice graphic).
Matrix:
min ..... b ..... c
 .        .       .
 .   II   .   I   .
 .        .       .
 d ..... mid .... f
 .        .       .
 .  III   .  IV   .
 .        .       .
 g ...... h .... max
Min and max are the smallest and largest values, respectively. "mid" is not necessarily the average/median/whatever value.
We know that the value at mid is >= all values in quadrant II, and <= all values in quadrant IV. We cannot make such claims for quadrants I and III. If we recurse, we can eliminate one quadrant at each level.
Thus, if the target value is less than mid, we must search quadrants I, II, and III. If the target value is greater than mid, we must search quadrants I, III, and IV.
The space reduces to 3/4 its previous at each step:
n * (3/4)^x = 1
n = (4/3)^x
x = log_{4/3}(n)
Logarithms differ by a constant factor, so this is O(log(n)).
find(min, max, target)
if min is max
if target == min
return min
else
return not found
else if target < min or target > max
return not found
else
set mid to average of min and max
if target == mid
return mid
else
find(b, f, target), return if found
find(d, h, target), return if found
if target < mid
return find(min, mid, target)
else
return find(mid, max, target)
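A runnable Python sketch of the same quadrant-elimination idea, working on index ranges instead of corner values. It is a variant rather than a literal translation of the pseudocode: after comparing with the middle element it discards the one quadrant that cannot contain the target and recurses into two rectangles that cover the rest. All names are illustrative.

```python
def find(matrix, target):
    """Return (row, col) of target in a row- and column-sorted matrix, or None."""

    def search(r0, r1, c0, c1):
        # Search rows r0..r1 and columns c0..c1 (inclusive bounds).
        if r0 > r1 or c0 > c1:
            return None
        if target < matrix[r0][c0] or target > matrix[r1][c1]:
            return None  # outside the min/max of this rectangle
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        mid = matrix[rm][cm]
        if mid == target:
            return (rm, cm)
        if target < mid:
            # Everything at or below-right of (rm, cm) is >= mid > target.
            return search(r0, rm - 1, c0, c1) or search(rm, r1, c0, cm - 1)
        # Everything at or above-left of (rm, cm) is <= mid < target.
        return search(r0, rm, cm + 1, c1) or search(rm + 1, r1, c0, c1)

    if not matrix or not matrix[0]:
        return None
    return search(0, len(matrix) - 1, 0, len(matrix[0]) - 1)


grid = [[1, 3, 5],
        [4, 6, 8],
        [7, 9, 10]]
print(find(grid, 9), find(grid, 4), find(grid, 2))
```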
JavaScript solution:
//start from the top right corner
//if value = el, element is found
//if value < el, move to the next row, element can't be in that row since row is sorted
//if value > el, move to the previous column, element can't be in that column since column is sorted
function find(matrix, el) {
//some error checking
if (!matrix[0] || !matrix[0].length){
return false;
}
if (typeof el !== "number" || isNaN(el)){ // a plain !el check would also reject a search for 0
return false;
}
var row = 0; //first row
var col = matrix[0].length - 1; //last column
while (row < matrix.length && col >= 0) {
if (matrix[row][col] === el) { //element is found
return true;
} else if (matrix[row][col] < el) {
row++; //move to the next row
} else {
col--; //move to the previous column
}
}
return false;
}
this is wrong answer
I am really not sure if any of the answers are the optimal answers. I am going at it.
1. Binary search the first row and the first column to find the row and column where "x" could be. You will get 0,j and i,0. If x is not found in this step, it will be on row i or column j.
2. Binary search row i and column j found in step 1.
I think the time complexity is 2* (log m + log n).
You can reduce the constant, if the input array is a square (n * n), by binary searching along the diagonal.

Least amount of voters, given two halves

One of my former students sent me a message about this interview question he got while applying for a job as a Junior Developer.
There are two candidates running for president in a mock classroom election. Given the two percentages of voters, find out the least amount of possible voters in the classroom.
Examples:
Input: 50.00,50.00
Output: 2
Input: 25.00,75.00
Output: 4
Input: 53.23, 46.77
Output: 124 // The first value, 1138 was wrong. Thanks to Loïc for the correct value
Note: The sum of the input percentages are always 100.00%, two decimal places
The last example got me scratching my head. It was the first time I heard about this problem, and I'm kindof stumped on how to solve this.
EDIT: I called my student about the problem, and he told me that he was not sure about the last value. He said, and I quote, "It was an absurdly large number output" :( sorry! I should've researched more before posting it online~ I'm guessing 9,797 is the output on the last example though..
You can compute these values by using the best rational approximations of the voter percentages. Wikipedia describes how to obtain these values from the continued fraction (which can be computed using the Euclidean algorithm). The desired result is the first approximation which is within 0.005% of the expected value.
Here's an example with 53.23%:
10000 = 1 * 5323 + 4677
5323 = 1 * 4677 + 646
4677 = 7 * 646 + 155
646 = 4 * 155 + 26
155 = 5 * 26 + 25
26 = 1 * 25 + 1
25 = 25* 1 + 0
Approximations:
1: 1 / 1
-> 1 = 100%
2: 1 / (1 + 1/1)
-> 1/2 = 50%
2.5: 1 / (1 + 1 / (1 + 1/6))
-> 7/13 = 53.85%
3: 1 / (1 + 1 / (1 + 1/7))
-> 8/15 = 53.33%
3.5: 1 / (1 + 1 / (1 + 1 / (7 + 1/3)))
-> 25/47 = 53.19%
4: 1 / (1 + 1 / (1 + 1 / (7 + 1/4)))
-> 33/62 = 53.23%
The reason we have extra values before the 3rd and 4th convergents is that their last terms (7 and 4 respectively) are greater than 1, so we must test the approximation with the last term decremented.
The desired result is the denominator of the first value which rounds to the desired value, which in this case is 62.
Sample Ruby implementation available here (using the formulae from the Wikipedia page here, so it looks slightly different to the above example).
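For readers who want to experiment, here is a minimal Python sketch of the continued-fraction idea described above. It is not the linked Ruby implementation. The function names are made up for illustration, percentages are passed as integers in hundredths (5323 for 53.23%) to avoid floating-point rounding, and, as a cautious assumption on my part, it tests every intermediate fraction between two convergents rather than only the one with the last term decremented.

```python
def continued_fraction(num, den):
    """Terms of the continued fraction of num/den, via the Euclidean algorithm."""
    terms = []
    while den:
        quotient, remainder = divmod(num, den)
        terms.append(quotient)
        num, den = den, remainder
    return terms


def rounded_hundredths(p, q):
    """p/q as a percentage in hundredths, rounded half up, using integers only."""
    return (20000 * p + q) // (2 * q)


def min_voters_cf(target):
    """target is the percentage in hundredths, e.g. 5323 for 53.23%."""
    h0, h1 = 0, 1  # numerators of the two previous convergents
    k0, k1 = 1, 0  # denominators of the two previous convergents
    for term in continued_fraction(target, 10000):
        # Walk the intermediate fractions up to the next convergent (a == term).
        for a in range(1, term + 1):
            p, q = a * h1 + h0, a * k1 + k0
            if rounded_hundredths(p, q) == target:
                return q
        h0, h1 = h1, term * h1 + h0
        k0, k1 = k1, term * k1 + k0
    return 1  # only reached for target == 0 (0 votes out of 1 voter)
```

Calling `min_voters_cf(5323)` walks through the same fractions as the worked example (1/1, 1/2, ..., 8/15, ..., 25/47, 33/62) and stops at the denominator 62.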
First you can notice that a trivial solution is to have 10,000 voters. Now let's try to find something lower than that.
For each value of N starting at 1
  For each value of i from 1 to N
    If round(100*i/N, 2) = 46.77
      return N
Always choose the minimum of the two percentages to be faster.
Or faster:
For each value of N starting at 1
  i = floor(N*46.77/100)
  For j = i or i+1
    If round(100*j/N, 2) = 46.77 and round(100*(N-j)/N, 2) = 53.23
      return N
For the third example :
605/1138 = .5316344464
(1138-605)/1138 = .4683655536
but
606/1138 = .5325131810
(1138-606)/1138 = .4674868190
It can't be 1138...
But 62 works:
33/62 = .5322580645
(62-33)/62 = .4677419355
Rounded, these give the correct values.
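A rough Python sketch of the faster loop above, under a few assumptions of mine: percentages are passed as integers in hundredths (4677 for 46.77%), rounding is half up and done in integer arithmetic, and the function names are only illustrative.

```python
def rounded_hundredths(votes, voters):
    """votes/voters as a percentage in hundredths, rounded half up."""
    return (20000 * votes + voters) // (2 * voters)


def min_voters_brute(target):
    """Smallest classroom size that can produce target (in hundredths of a percent)."""
    other = 10000 - target
    for n in range(1, 10001):  # 10,000 voters always works
        low = n * target // 10000  # floor(N * percentage / 100)
        for j in (low, low + 1):
            if (j <= n
                    and rounded_hundredths(j, n) == target
                    and rounded_hundredths(n - j, n) == other):
                return n
    return None  # not reached when both percentages are valid two-decimal values
```

With this, `min_voters_brute(4677)` returns 62, and `rounded_hundredths(605, 1138)` and `rounded_hundredths(606, 1138)` give 5316 and 5325, which is why 1138 cannot be right.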
(After some extensive edits:)
If you only have 2 voters, then you can only generate the following percentages for candidates A and B:
0+100, 100+0, or 50+50
If you have 3 voters, then you have
0+100, 100+0, 33.33+66.67, 66.67+33.33 [notice the rounding]
So this is a fun problem about fractions.
If you can make 25% then you have to have at least 4 people (so you can do 1/4, since 1/2 and 1/3 won't cut it). You can do it with more (i.e. 2/8 = 25%) but the problem asks for the least.
However, more interesting fractions require numbers greater than 1 in the numerator:
2/5 = 40%
Since you can't get that with anything but a 2 or more in the numerator (1/x will never cut it).
You can compare at each step and increase either the numerator or denominator, which is much more efficient than iterating over the whole sample space for j and then incrementing i;
i.e. if you have a percentage of 3%, checking solutions all the way up in the fashion of 96/99, 97/99, 98/99 before even getting to x/100 is a waste of time. Instead, you can increment the numerator or denominator based on how well your current guess is doing (greater than or less than) like so
static final int MAX = 10000; // 10,000 voters can always produce any two-decimal percentage
public static int minVoters(double onePercentage) {
    double checkPercentage = onePercentage;
    if (onePercentage > 50.0)
        checkPercentage = 100 - onePercentage; // get the smaller percentage value
    long target = Math.round(checkPercentage * 100); // target in hundredths of a percent
    int i = 1; // numerator
    int j = 1; // denominator
    while (j <= MAX) {
        // i/j as a percentage rounded to two decimals, expressed in hundredths
        long temp = Math.round(i * 10000.0 / j);
        if (temp == target)
            return j;
        else if (temp > target) // we passed up our value and need to increase the denominator
            j++;
        else // we are too low and increase the numerator
            i++;
    }
    return 0; // no such solution
}
Step-wise example for finding the denominator that can yield 55%
55/100 = 11/20
100-55 = 45 = 9/20 (checkPercentage will be 45.0)
1/1 100.0%
1/2 50.00%
1/3 33.33%
2/3 66.67%
2/4 50.00%
2/5 40.00%
3/5 60.00%
3/6 50.00%
3/7 42.86% (too low, increase numerator)
4/7 57.14% (too high, increase denominator)
4/8 50.00%
4/9 44.44%
5/9 55.56%
5/10 50.00%
5/11 45.45%
5/12 41.67%
6/12 50.00%
6/13 46.15%
6/14 42.86%
7/14 50.00%
7/15 46.67%
7/16 43.75%
8/16 50.00%
8/17 47.06%
8/18 44.44%
9/18 50.00%
9/19 47.37%
9/20 45.00% <-bingo
The nice thing about this method is that it will only take (i+j) steps where i is the numerator and j is the denominator.
I cannot see the relevance of this question to a position as junior developer.
The answer that jumped into my head was more of a brute-force approach. There can be at most 5001 unique answers because there are 5001 distinct two-decimal values between 00.00 and 50.00. Consequently, why not create and save a lookup table? Obviously, there won't be 5001 unique answers, because some answers will be repeated. The point is, there are only 5001 valid fractions because we are rounding to two digits.
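int[] minPossible = new int[5001]; // indexed by the smaller percentage in hundredths (0..5000)
int numSolutionsFound = 0;
int N = 2;
while (numSolutionsFound < 5001) {
    for (int i = 0; i <= N / 2; i++) {
        // compute i/N as a percentage rounded to two decimals, in hundredths
        int idx = (int) Math.round(i * 10000.0 / N);
        // if the corresponding table entry is not set, write N there
        if (minPossible[idx] == 0) {
            minPossible[idx] = N;
            numSolutionsFound++;
        }
    }
    N++;
}
// Save answer here, e.g. minPossible[4677] is the entry for 46.77% / 53.23%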
Now the solution is merely a table look up.
FWIW, I realize the Euclidean solution is "correct". But I'd NEVER come up with that mid-interview. I'd know something like that was possible, but I wouldn't be able to whip it out on the spot.

Howto convert decimal (xx.xx) to binary

This isn't necessarily a programming question, but I'm sure you folks know how to do it. How would I convert floating point numbers into binary?
The number I am looking at is 27.625.
27 would be 11011, but what do I do with the .625?
On paper, a good algorithm to convert the fractional part of a decimal number is the "repeated multiplication by 2" algorithm (see details at http://www.exploringbinary.com/base-conversion-in-php-using-bcmath/, under the heading "dec2bin_f()"). For example, 0.8125 converts to binary as follows:
1. 0.8125 * 2 = 1.625
2. 0.625 * 2 = 1.25
3. 0.25 * 2 = 0.5
4. 0.5 * 2 = 1.0
The integer parts are stripped off and saved at each step, forming the binary result: 0.1101.
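The steps above can be sketched in a few lines of Python. This is only an illustration: the function name and the `max_bits` cap are my own additions, and the cap is there because many decimal fractions (0.1, for example) have very long binary expansions.

```python
def fraction_to_binary(frac, max_bits=32):
    """Convert a fraction in [0, 1) to a binary string by repeated doubling."""
    bits = []
    while frac > 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # the integer part is the next binary digit
        bits.append(str(bit))
        frac -= bit            # strip the integer part and continue
    return "0." + ("".join(bits) or "0")
```

For the number in the question, `bin(27)[2:] + fraction_to_binary(0.625)[1:]` joins the integer and fractional parts into `11011.101`.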
If you want a tool to do these kinds of conversions automatically, see my decimal/binary converter.
Assuming you are not thinking about inside a PC, just thinking about binary vs decimal as physically represented on a piece of paper:
You know .1 in binary should be .5 in decimal, so the .1's place is worth .5 (1/2)
the .01 is worth .25 (1/4) (half of the previous one)
the .001 is worth .125 (1/8) (half of 1/4)
Notice how the denominator progresses just like the whole-number places to the left of the point do: the standard powers-of-2 pattern. The next should be 1/16...
So you start with your .625, is it higher than .5? Yes, so set the first bit and subtract the .5
.1 binary with a decimal remainder of .125
Now you have the next spot, it's worth .25dec, is that less than your current remainder of .125? No, so you don't have enough decimal "Money" to buy that second spot, it has to be a 0
.10 binary, still .125 remainder.
Now go to the third position, etc. (Hint: I don't think there will be too much etc.)
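The "buy each place if you can afford it" procedure above could look roughly like this in Python. The function name and the `places` limit are assumptions for the sake of the example.

```python
def fraction_to_binary_greedy(frac, places=8):
    """Build the binary digits of a fraction in [0, 1) by subtracting place values."""
    bits = ""
    place_value = 0.5              # the first place after the point is worth 1/2
    for _ in range(places):
        if frac == 0:
            break                  # nothing left to spend
        if frac >= place_value:
            bits += "1"            # we can afford this place
            frac -= place_value
        else:
            bits += "0"            # too expensive, leave it empty
        place_value /= 2           # the next place is worth half as much
    return bits
```

For .625 this yields `101`: the .5 place is set, the .25 place is not, and the .125 place uses up the remainder.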
There are several different ways to encode a non-integral number in binary. By far the most common type are floating point representations, especially the one codified in IEEE 754.
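To see what the IEEE 754 encoding looks like for the number in the question, here is a small Python sketch that splits a double into its sign, exponent and fraction fields using the standard `struct` module. The helper name `double_bits` is made up for this example.

```python
import struct


def double_bits(x):
    """Return (sign, exponent, fraction) bit strings of x as an IEEE 754 double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))  # reinterpret the 8 bytes as an integer
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]


sign, exponent, fraction = double_bits(27.625)
```

For 27.625 the stored exponent decodes to 4 once the bias of 1023 is removed, and the fraction field starts with 1011101, which matches 11011.101 written as 1.1011101 x 2^4.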
The code that works for me is below. You can use it to convert non-negative double values:
private static String doubleToBinaryString(double n) {
    // Assumes 0 <= n <= Integer.MAX_VALUE, since the integer part goes through an int
    String val = Integer.toBinaryString((int) n) + "."; // Setting up string for result
    n = n - (int) n; // Keep only the fractional part (parsing the decimal string breaks on exponent notation)
    while (n > 0) { // While the fraction is greater than zero (not equal or less than zero)
        double r = n * 2; // Multiply current fraction (n) by 2
        if (r >= 1) { // If the ones-place digit >= 1
            val += "1"; // Concat a "1" to the end of the result string (val)
            n = r - 1; // Remove the 1 from the current fraction (n)
        } else { // If the ones-place digit == 0
            val += "0"; // Concat a "0" to the end of the result string (val)
            n = r; // Set the current fraction (n) to the new fraction
        }
    }
    return val; // Return the string result with all appended binary values
}