Generating Uniform Random Deviates within a given range - language-agnostic

I'd like to generate uniformly distributed random integers over a given range. The interpreted language I'm using has a built-in fast random number generator that returns a floating-point number in the range 0 (inclusive) to 1 (inclusive). Unfortunately this means I can't use the standard solution seen in another SO question (which assumes an RNG that returns numbers between 0 (inclusive) and 1 (exclusive)) for generating uniformly distributed random integers in a given range:
result=Int((highest - lowest + 1) * RNG() + lowest)
The only sane method I can see at the moment is, in the rare case that the random number generator returns 1, simply to ask for a new number.
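(Sketched in C++-style code for concreteness, with RNG() standing in for the built-in generator described above, that retry idea would look something like this:)

// Retry until the generator returns something strictly below 1.
double rng01_exclusive() {
    double x;
    do {
        x = RNG();
    } while (x == 1.0); // rare: reject the inclusive upper bound
    return x;
}

// Now the standard formula from the linked answer applies.
int uniformInt(int lowest, int highest) {
    return (int)((highest - lowest + 1) * rng01_exclusive()) + lowest;
}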
But if anyone knows a better method I'd be glad to hear it.
Rob
NB: Converting an existing random number generator to this language would result in something infeasibly slow so I'm afraid that's not a viable solution.
Edit: To link to the actual SO answer.

Presumably you are desperately interested in speed, or else you would just suck up the conditional test with every RNG call. Any other alternative is probably going to be slower than the branch anyway...
...unless you know exactly what the internal structure of the RNG is. In particular, what are its return values? If they're not IEEE-754 floats or doubles, you have my sympathies. If they are, how many real bits of randomness are in them? You would expect 24 for floats and 53 for doubles (the number of mantissa bits).

If those are naively generated, you may be able to use shifts and masks to hack together a plain old random integer generator out of them, and then use that in your function (depending on the size of your range, you may be able to use more shifts and masks to avoid any branching if you have such a generator).

If you have a high-quality generator that produces full-quality 24- or 53-bit random numbers, then with a single multiply you can convert them from [0,1] to [0,1): just multiply by the largest generatable floating-point number that is less than 1, and your range problem is gone. This trick will still work if the mantissas aren't fully populated with random bits, but you'll need to do a bit more work to find the right multiplier.
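(For illustration, a minimal C++ sketch of that last multiply trick, assuming a hypothetical rng_closed() that returns a full-quality double in the closed interval [0,1]:)

#include <cmath>

// Squeeze a closed [0,1] result into [0,1) with a single multiply.
// rng_closed() is a stand-in for the built-in generator described above.
double rng_half_open() {
    static const double squeeze = std::nextafter(1.0, 0.0); // largest double < 1
    return rng_closed() * squeeze; // an input of exactly 1.0 now yields squeeze
}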
You may want to look at the C source to the Mersenne Twister to see their treatment of similar problems.

I don't see why the + 1 is needed. If the random number generator delivers a uniform distribution of values in the [0,1] interval then...
result = lowest + (rng() * (highest - lowest))
should give you a uniform distribution of values between lowest
rng() == 0, result = lowest + 0 = lowest
and highest
rng() == 1, result = lowest + highest - lowest = highest
Including + 1 means that the upper bound on the generated number can be above highest
rng() == 1, result = lowest + highest - lowest + 1 = highest + 1.
The resulting distribution of values will be identical to the distribution of the random numbers, so uniformity depends on the quality of your random number generator.
Following on from your comment below, you are right to point out that Int() will be the source of a lop-sided distribution at the tails. Better to use Round() to the nearest integer, or whatever equivalent your scripting language has.
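(For illustration, a hedged sketch of that rounding variant, with rng() assumed to return a double in the closed interval [0,1] as in the question:)

#include <cmath>

// The Round() variant suggested above; rng() is a stand-in for the
// closed-interval [0,1] generator from the question.
int uniformInt(int lowest, int highest) {
    return lowest + (int)std::lround(rng() * (highest - lowest));
}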

Related

Generate unique serial from id number

I have a database that increases id incrementally. I need a function that converts that id to a unique number between 0 and 1000. (the actual max is much larger but just for simplicity's sake.)
1 => 3301,
2 => 0234,
3 => 7928,
4 => 9821
The number generated cannot have duplicates.
It cannot be incremental.
Need it generated on the fly (not create a table of uniform numbers to read from)
I thought of a hash function, but there is a possibility of collisions.
Random numbers could also have duplicates.
I need a minimal perfect hash function but cannot find a simple solution.
Since the criteria are sort of vague (good enough to fool the average person), I am unsure exactly which route to take. Here are some ideas:
You could use a Pearson hash. According to the Wikipedia page:
Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.
You could just use a complicated-looking one-to-one mathematical function. The drawback of this is that it would be difficult to make one that was not strictly increasing or strictly decreasing due to the one-to-one requirement. If you did something like (id ^ 2) + id * 2 (id squared plus twice the id), the interval between ids would change and it wouldn't be immediately obvious what the function was without knowing the original ids.
You could do something like this:
new_id = (old_id << 4) + arbitrary_4bit_hash(old_id);
This would give the unique IDs and it wouldn't be immediately obvious that the low 4 bits are just garbage (especially when reading the numbers in decimal format). Like the last option, the new IDs would be in the same order as the old ones. I don't know if that would be a problem. (A sketch of this option appears below, after the last option.)
You could just hardcode all ID conversions by making a lookup array full of "random" numbers.
You could use some kind of hash function generator like gperf.
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
You could encrypt the ids with a key using a cryptographically secure mechanism.
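(To make the shift-plus-hash option concrete, here is a hedged C++ sketch; arbitrary_4bit_hash is a made-up stand-in, and any function of the id will do, since uniqueness is carried entirely by the high bits:)

#include <cstdint>

// Hypothetical 4-bit mixer; the exact function doesn't matter.
uint32_t arbitrary_4bit_hash(uint32_t id) {
    return (id * 2654435761u) >> 28; // top 4 bits of a multiplicative mix
}

// Unique as long as old_id < 2^28: old_id is recoverable via new_id >> 4.
uint32_t make_new_id(uint32_t old_id) {
    return (old_id << 4) | arbitrary_4bit_hash(old_id);
}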
Hopefully one of these works for you.
Update
Here is the rotational shift the OP requested:
function map($number)
{
    // Shift the high bits down to the low end and the low bits
    // down to the high end.
    // Also, mask out all but 10 bits. This allows unique mappings
    // from 0-1023 to 0-1023.
    $high_bits = 0b0000001111111000 & $number;
    $new_low_bits = $high_bits >> 3;
    $low_bits = 0b0000000000000111 & $number;
    $new_high_bits = $low_bits << 7;

    // Recombine bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}

function demap($number)
{
    // Shift the high bits down to the low end and the low bits
    // down to the high end.
    $high_bits = 0b0000001110000000 & $number;
    $new_low_bits = $high_bits >> 7;
    $low_bits = 0b0000000001111111 & $number;
    $new_high_bits = $low_bits << 3;

    // Recombine bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}
This method has its advantages and disadvantages. The main disadvantage that I can think of (besides the security aspect) is that for lower IDs consecutive numbers will be exactly the same (multiplicative) interval apart until digits start wrapping around. That is to say
map(1) * 2 == map(2)
map(1) * 3 == map(3)
This happens, of course, because with lower numbers, all the higher bits are 0, so the map function is equivalent to just shifting. This is why I suggested using pseudo-random data for the lower bits rather than the higher bits of the number. It would make the regular interval less noticeable. To help mitigate this problem, the function I wrote shifts only the first 3 bits and rotates the rest. By doing this, the regular interval will be less noticeable for all IDs greater than 7.
It seems that it doesn't have to be numerical? What about an MD5 hash?
select md5(id+rand(10000)) from ...

Will this random double generator work?

Intuitively one might write a random double generator as follows:
double randDouble(double lowerBound, double upperBound)
{
    double range = upperBound - lowerBound;
    return lowerBound + range * rand();
}
Suppose we assume that rand() returns an evenly distributed pseudorandom double on the interval [0, 1).
Is this method guaranteed to return a random double within [lowerBound, upperBound) with a uniform probability distribution? I'm specifically interested in whether the nature of floating point calculations might cause spikes or dips in the final distribution for some ranges.
First, rand() generates pseudo-random numbers and not truly random. Thus, I will assume you are asking if your function generates pseudo-random numbers within the specified range.
Second, like Oli Charlesworth said, many rand implementations return a number between 0 and RAND_MAX, where RAND_MAX is the largest possible value it can take. In these cases, you can obtain a value in [0, 1) with
double r = rand()/((double)RAND_MAX+1);
the +1 is there so that r can't be 1.
Other languages have a rand that returns a value between 0 and 1, in which case you don't need to do the above division. Either way, it turns out that your function returns a decent approximation of a uniform distribution. See the following link for more details: http://www.thinkage.ca/english/gcos/expl/c/lib/rand.html Note that this link gives you slightly different functions which they claim work a bit better, but the one you have probably works well enough.
If your upper and lower bounds are adjacent powers of two, then your resulting distribution will be as good as the one you get from rand(), since you're effectively just altering the exponent of what rand() gives you, without altering the mantissa.
If you want to stretch the range to cover more than one power of two, then there will be valid floating point numbers in the lower half of your range that will never be generated by your method. (You're effectively shifting one or more bits of the mantissa into the exponent, leaving the least significant bit(s) of the mantissa as non-random.)
If you use the method on a more general range (such that the mantissa is modified by the calculation), then you also run into the same non-uniformity you get when trying to convert a random integer to a random integer modulo n without using rejection sampling.
Any correct method for generating a uniform distribution of floating point numbers has to take into account that the interval of real numbers that round to any given floating point number is not always the same width. In the lower part of a range, floating point numbers will be more dense, so each individual floating point number in that part of the range should be selected less often than the larger numbers.
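(For contrast, a hedged sketch of the common compromise, which the Mersenne Twister reference code also uses: take 53 random bits and scale them, producing equally spaced doubles in [0,1) rather than the density-aware ideal described above:)

#include <cstdint>
#include <random>

// 53 random bits scaled by 2^-53: equally spaced doubles in [0,1).
// This dodges the density problem rather than solving it.
double uniform53(std::mt19937_64& gen) {
    return (gen() >> 11) * (1.0 / 9007199254740992.0); // 9007199254740992 = 2^53
}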
Well, no. rand() returns a number between 0 and RAND_MAX; this quantisation will leave big holes in your distribution; in fact, almost all floating-point values in-between lowerBound and upperBound will never be selected.

How to "check" if a function really gives a random result?

How can one be sure that a function is really random or as close to the notion as possible? Also, what is the distinction between random and pseudo-random? Finally, what algorithms/sources can be used to generate random numbers?
P.S: Also asking this because a MySQL statement using ORDER BY RAND() LIMIT 1 isn't giving convincing results.
The thing about being random is that you can't tell if the return from a random function is random or not.
Proper random uses something that can truly be random, such as white noise. Pseudo random numbers are generally calculated from mathematical formulae or precomputed tables. The Linear congruential generator is a popular method of generating them.
To get a real random number, you generally want to interface with an outside source where something has been generated organically. This is called a True Random Number Generator.
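(As a hedged illustration of the linear congruential method mentioned above, using the widely published Numerical Recipes constants:)

#include <cstdint>

// Minimal LCG sketch: state advances as state = a*state + c (mod 2^32).
uint32_t lcg_next(uint32_t& state) {
    state = 1664525u * state + 1013904223u;
    return state;
}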
Aloha!
There are several methods and tools for testing for randomness. These are applied on a set of numbers collected from the generator to be tested. That is, you test the generator based on a set of data generated.
In computing, esp for IT-security we normally want to have a generator that conforms to a uniform random process. There are many different processes, but I'm guessing that it is a uniform process you are aiming for.
NIST has published several documents with recommendations on both pseudo random number generators as well how to test them. Look at NIST documents SP 800-22 and SP 800-20.
As somebody else pointed out, if you want a True Random Number Generator (TRNG) you need to gather physical entropy. Examples of such sources are radioactive decay, cosmic radiation, lava lamps etc. Preferably you want sources that are hard to manipulate. IETF has an RFC with some good recommendations, see RFC 4086 - Randomness Requirements for Security:
https://www.rfc-editor.org/rfc/rfc4086
What you normally do is collect entropy from one or more (preferably more than one) sources. The collected data is then filtered (whitening) and finally used to periodically seed a good PRNG. With different seeds, naturally.
This is how most modern, good random generators work: an entropy collector feeding a PRNG built from cryptographic primitives such as symmetric ciphers (AES, for example) or hash functions. See for example the Yarrow/Fortuna random generator by Schneier, which in modified form is used in FreeBSD.
Coming back to your question about testing: as somebody pointed out, Marsaglia produced a good set of tests, which were codified in the DIEHARD tests. There is now an even more expanded set of tests in the Dieharder suite:
http://www.phy.duke.edu/~rgb/General/dieharder.php
Dieharder is a good tool that will give you good confidence that the huge pile of numbers supplied to it (collected from your generator) is random (of good quality) or not. Running Dieharder is easy, but will take some time.
In-situ testing of randomness is hard. You don't normally want to implement Dieharder inside your system. What you can do instead is implement some simple detectors that should catch pathological cases. I usually suggest:
Equal value length. A simple counter that is reset whenever two consecutive values generated by the RNG differ. You then need to define a threshold at which the counter says the RNG is broken. If you see 10 million equal values, and the value space is greater than the one value you are seeing, your RNG is probably not working all that well, especially if that value is one of the edge values, for example 0x00000.... or 0xfffff... (see the sketch after this list).
Median value. If, after generating a million values from a supposedly uniform distribution, you have a median value that leans heavily towards one of the edges of the value space rather than sitting close to the middle, something is probably amiss.
Variance. If, after generating millions of values, you haven't seen values close to the MIN and MAX of the value space, but instead a narrow band of generated values, something is also amiss.
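(A hedged C++ sketch of the first detector, the equal-value run counter; the threshold is an arbitrary choice for illustration:)

#include <cstdint>

// Count consecutive identical RNG outputs; flag the RNG once a run
// exceeds a threshold. 'last' and 'run' persist between calls.
const uint64_t RUN_THRESHOLD = 10000000; // e.g. 10 million equal values

bool rngLooksBroken(uint32_t value, uint32_t& last, uint64_t& run) {
    if (value == last) {
        ++run;
    } else {
        last = value;
        run = 1;
    }
    return run >= RUN_THRESHOLD;
}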
Finally, since you hopefully are using a good PRNG (based on AES, for example), the in-situ tests suggested above might instead be applied to the entropy source.
I hope that helped in some ways.
There are statistical tests you can apply to see how likely it is that a given sequence of numbers was produced by independent, identically distributed (iid) random variables.
Take a look at A Current View of Random Number Generators by George Marsaglia. In particular, take a look at sections 6-12. This provides an introduction to such tests followed by several that you can apply.
True, we cannot guarantee that a generated number is actually random.
About pseudo-random numbers: yes, they only seem to be random. Pseudo-random functions were originally motivated by cryptography: when encrypted text is sent and an eavesdropper intercepts the message, the ciphertext looks random to them, but it was in fact computed by a deterministic function, and the same function and key will always produce the same output. So pseudo-random values are not random at all; they merely look random because you cannot recover the original text or number from which they were generated. Examples are hash functions (MD5, SHA-1) and encryption techniques (DES, AES, etc.).
For a number to be random, it must not be possible to predict it. So, any algorithm that generates "random" numbers generates pseudo-random numbers, as it is always possible to reproduce the same sequence of "random" numbers by using the previously used seed or value that drives the "randomizing". A truly random number can be generated by, for example, a dice roll, but not by a computer algorithm.
Theoretical computer science teaches that a computer is a deterministic machine. Every algorithm always runs the same way, so you have to vary your seed. But where should a computer get a random seed from? From an external device? The CPU temperature (which would not vary much)?
To test a function that returns random numbers you should call it many times and see how many times each number is returned.
For example
for i := 1 to 1000000 do // Test the function 1,000,000 times
begin
  RandomNumber := Rand(9); // Random numbers from 0 to 9
  case RandomNumber of
    0: Returned0 := Returned0 + 1;
    1: Returned1 := Returned1 + 1;
    2: Returned2 := Returned2 + 1;
    3: Returned3 := Returned3 + 1;
    4: Returned4 := Returned4 + 1;
    5: Returned5 := Returned5 + 1;
    6: Returned6 := Returned6 + 1;
    7: Returned7 := Returned7 + 1;
    8: Returned8 := Returned8 + 1;
    9: Returned9 := Returned9 + 1;
  end;
end;
WriteLn('0: ', Returned0);
WriteLn('1: ', Returned1);
WriteLn('2: ', Returned2);
WriteLn('3: ', Returned3);
WriteLn('4: ', Returned4);
WriteLn('5: ', Returned5);
WriteLn('6: ', Returned6);
WriteLn('7: ', Returned7);
WriteLn('8: ', Returned8);
WriteLn('9: ', Returned9);
A perfect output should be equal numbers for each random output. Something like:
0: 100000
1: 100000
2: 100000
3: 100000
4: 100000
5: 100000
6: 100000
7: 100000
8: 100000
9: 100000

Detecting repetition with infinite input

What is the optimal way to find repetition in an infinite sequence of integers?
i.e. if in the infinite sequence the number '5' appears twice then we will return 'false' the first time and 'true' the second time.
In the end what we need is a function that returns 'true' if the integer appeared before and 'false' if the function received the integer the first time.
If there are two solutions, one optimal space-wise and the other time-wise, then mention both.
I will write my solution in the answers, but I don't think it is the optimal one.
edit: Please don't assume the trivial cases (i.e. no repetitions, a constantly rising sequence). What interests me is how to reduce the space complexity of the non-trivial case (random numbers with repetitions).
I'd use the following approach:
Use a hash table as your data structure. For every number read, store it in the hash table. If it's already stored, you have found a repetition.
If n is the number of elements in the sequence from start to the repetition, then this only requires O(n) time and space. Time complexity is optimal, as you need to at least read the input sequence's elements up to the repetition point.
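(A hedged C++ sketch of the hash-table approach, using std::unordered_set:)

#include <unordered_set>

// Returns true the second (and any later) time a value is seen;
// unordered_set::insert reports whether the value was newly inserted.
bool seenBefore(int value, std::unordered_set<int>& seen) {
    return !seen.insert(value).second;
}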
How long a sequence are we talking about (before the repetition occurs)? Is a repetition even guaranteed at all? For extreme cases the space complexity might become problematic. But to improve it you will probably need to know more structural information about your sequence.
Update: If the sequence is, as you say, very long with seldom repetitions and you have to cut down on the space requirement, then you might (given sufficient structural information on the sequence) be able to cut down the space cost.
As an example: let's say you know that your infinite sequence has a general tendency to return numbers that fit within the current range of witnessed min-max numbers. Then you will eventually have whole intervals that have already been contained in the sequence. In that case you can save space by storing such intervals instead of all the elements contained within it.
A BitSet for int values (2^32 numbers) would consume 512 MB. This may be OK if the BitSets are not allocated too often, are fast enough, and the memory is available.
An alternative is compressed BitSets, which work best for sparse bit sets.
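(A hedged sketch of the plain 512 MB bit-set variant for 32-bit values:)

#include <cstdint>
#include <vector>

// One bit per possible 32-bit value: 2^32 bits = 2^26 64-bit words = 512 MB.
struct SeenBits {
    std::vector<uint64_t> bits = std::vector<uint64_t>(1ull << 26, 0);

    bool seenBefore(uint32_t v) {
        uint64_t mask = 1ull << (v & 63);
        bool seen = (bits[v >> 6] & mask) != 0;
        bits[v >> 6] |= mask;
        return seen;
    }
};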
Actually, even if the range of possible values is very large, you can use any lossless compression algorithm for a monochrome bitmap. If you imagine a square with at least as many pixels as the number of possible values, you can map each value to a pixel (with a few to spare). Then you can colour the pixels of values that have appeared white and the rest black, and use any compression algorithm if space is at a premium (this is certainly a problem that has been studied).
You can also store blocks. The worst case for space is the same, O(n), but that worst case requires the numbers that have appeared to be exactly 1 apart. As more numbers appear, the storage actually decreases:
I will write pseudocode and I will use a List, but you can always use a different structure
List changes // global, sorted boundary values at which "appeared" flips

boolean addNumber(int number):
    boolean appeared = false
    it = changes.begin()
    while it.hasNext():
        if it.get() < number:
            appeared = !appeared // we crossed a block boundary
            it = it.next()
        else if it.get() == number:
            if !appeared: return true // number is the start of an existing block
            // number sits just past the end of the current block: extend it
            if it.next().get() == number + 1:
                it.next().remove() // join the two blocks
            else:
                it.insertAfter(number + 1) // new end boundary for the extended block
            it.remove()
            return false
        else: // it.get() > number
            if appeared: return true // number falls inside the current block
            it.insertBefore(number)
            if it.get() == number + 1:
                it.remove() // extend the next block downwards
            else:
                it.insertBefore(number + 1) // create the new block [number, number + 1)
            return false
    // number is greater than every stored boundary: append a new block
    changes.append(number)
    changes.append(number + 1)
    return false
What this code does is the following: it stores a list of blocks. For each number you add, it iterates over the list, maintaining blocks of numbers that have appeared and numbers that haven't. Let me illustrate with an example; I will use [) notation to show which numbers are in a block: the first number is included, the last is not (in the pseudocode this is tracked by the boolean appeared). For instance, if you get 5, 9, 6, 8, 7 (in this order) you will have the following lists after each call:
[5,6)
[5,6),[9,10)
[5,7),[9,10)
[5,7),[8,10)
[5,10)
After the last value you keep a block of 5 numbers using only 2 stored boundaries.
Return TRUE
If the sequence is infinite then there will be repetition of every conceivable pattern.
If what you want to know is the first place in the sequence when there is a repeated digit that's another matter, but there's some difference between your question and your example.
Well, it seems obvious that any solution will need to save the numbers that have already appeared, so space-wise we always have a worst case of O(N), where N <= the number of possible values of our number type (i.e. 2^32 for a C# int). This is problematic over a long time if the sequence really is infinite / rarely repeats itself.
For saving the numbers that have already appeared I would use a hash table and then check it each time I receive a new number.

What is the proper method of constraining a pseudo-random number to a smaller range?

What is the best way to constrain the values of a PRNG to a smaller range? If you use modulus and the old max is not evenly divisible by the new max, you bias toward 0 through (old_max % new_max - 1). I assume the best way would be something like this (this is floating point, not integer math):
random_num = PRNG() / max_original_range * max_smaller_range
But something in my gut makes me question that method (maybe floating point implementation and representation differences?).
The random number generator will produce consistent results across hardware and software platforms, and the constraint needs to as well.
I was right to doubt the pseudocode above (but not for the reasons I was thinking). MichaelGG's answer got me thinking about the problem in a different way. I can model it using smaller numbers and test every outcome. So, let's assume we have a PRNG that produces a random number between 0 and 31 and you want the smaller range to be 0 to 9. If you use modulus you bias toward 0 and 1 (32 mod 10 leaves a remainder of 2). If you use the pseudocode above you bias toward 0 and 5. I don't think there can be a good way to map one set into the other. The best that I have come up with so far is to regenerate the random numbers that are greater than old_max/new_max, but that has deep problems as well (reducing the period, time to generate new numbers until one is in the right range, etc.).
I think I may have naively approached this problem. It may be time to start some serious research into the literature (someone has to have tackled this before).
I know this might not be a particularly helpful answer, but I think the best way would be to conceive of a few different methods, try each of them out a few million times, and check the result sets.
When in doubt, try it yourself.
EDIT
It should be noted that many languages (like C#) have built-in range limiting in their functions:
int maximumvalue = 20;
Random rand = new Random();
rand.Next(maximumvalue);
And whenever possible, you should use those rather than any code you would write yourself. Don't Reinvent The Wheel.
This problem is akin to rolling a k-sided die given only a p-sided die, without wasting randomness.
In this sense, by Lemma 3 in "Simulating a dice with a dice" by B. Kloeckner, this waste is inevitable unless "every prime number dividing k also divides p". Thus, for example, if p is a power of 2 (and any block of random bits is the same as rolling a die with a power of 2 number of faces) and k has prime factors other than 2, the best you can do is get arbitrarily close to no waste of randomness, such as by batching multiple rolls of the p-sided die until p^n is "close enough" to a power of k.
Let me also go over some of your concerns about regenerating random numbers:
"Reducing the period": Besides batching of bits, this concern can be dealt with in several ways:
Use a PRNG with a bigger "period" (maximum cycle length).
Add a Bays–Durham shuffle to the PRNG's implementation.
Use a "true" random number generator; this is not trivial.
Employ randomness extraction, which is discussed in Devroye and Gravel 2015-2020 and in my Note on Randomness Extraction. However, randomness extraction is pretty involved.
Ignore the problem, especially if it isn't a security application or serious simulation.
"Time to generate new numbers until one is in the right range": If you want unbiased random numbers, then any algorithm that does so will generally have to run forever in the worst case. Again, by Lemma 3, the algorithm will run forever in the worst case unless "every prime number dividing k also divides p", which is not the case if, say, k is 10 and p is 32.
See also the question: How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?, especially my answer there.
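(A hedged C++ sketch of the rejection approach discussed above: unbiased, at the price of an unbounded worst-case number of retries. prng32() is a stand-in for a generator returning a uniform uint32_t:)

#include <cstdint>

// Draw uniformly from [0, n) without modulus bias by rejecting the
// incomplete final block of the generator's range.
uint32_t uniformBelow(uint32_t n) {
    uint32_t limit = UINT32_MAX - UINT32_MAX % n; // a multiple of n
    uint32_t x;
    do {
        x = prng32();
    } while (x >= limit);
    return x % n;
}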
If PRNG() is generating uniformly distributed random numbers then the above looks good. In fact (if you want to scale the mean etc.) the above should be fine for all purposes. I guess you need to ask what the error associated with the original PRNG() is, and whether further manipulation will add to that substantially.
If in doubt, generate an appropriately sized sample set and look at the results in Excel or similar (to check your mean / std. dev. etc. against what you'd expect).
If you have access to a PRNG function (say, random()) that'll generate numbers in the range 0 <= x < 1, can you not just do:
random_num = (int) (random() * max_range);
to give you numbers in the range 0 to max_range?
Here's how the CLR's Random class works when limited (as per Reflector):
long num = maxValue - minValue;
if (num <= 0x7fffffffL) {
    return (((int) (this.Sample() * num)) + minValue);
}
return (((int) ((long) (this.GetSampleForLargeRange() * num))) + minValue);
Even if you're given a positive int, it's not hard to get it to a double. Just multiply the random int by (1/maxint). Going from a 32-bit int to a double should provide adequate precision. (I haven't actually tested a PRNG like this, so I might be missing something with floats.)
Pseudo-random number generators essentially produce a random series of 1s and 0s, which, when appended to each other, form an infinitely large number in base two. Each time you consume a bit from your PRNG, you are dividing that number by two and keeping the modulus. You can do this forever without wasting a single bit.
If you need a number in the range [0, N), then you need the same, but instead of base two, you need base N. It's basically trivial to convert the bases. Consume the number of bits you need, and return the remainder of those bits back to your PRNG to be used the next time a number is needed.