Hashing - Collisions on purpose - function

Is there a way to design a hash such that certain subsets of the keys map to the same value (collisions) on purpose?
For example, if I wanted all (popcount>=4) subsets of some 64 bit int A to be mapped to X, and the same for B mapping to Y, etc.
I figured I could save some memory if I only had just enough keys due to collisions.

Use the modulus operator, and divide your 64-bit value by size of the set of hashKeys, like this:
hashKey = _64BitValue mod popcount
I don't know what language you are working with, but many modern languages use % for the modulus operator, some use "mod".
For instance, in java it would look like:
long hashKey = _64BitValue % popcount;
This will evenly distribute random 64-bit values across your keys.

Related

How do interpreters load their values?

I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 24 bit integer. If I wanted to store it in the next instruction, I'm limited to a 32 bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
Optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
Absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
int x;
int y;
};
class B {
int a;
int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean values in the executable file loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
// Define a constant for the literal.
int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
compiler->parser->currentString, compiler->parser->currentStringLength));
// Compile the code to load the constant.
emit(compiler, CODE_CONSTANT);
emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
PUSH(frame->fn->constants[READ_ARG()]);
DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.

Is there a "native" way to convert from numbers to dB in Tcl

dB or decibel is a unit that is used to show ratio in logarithmic scale, and specifecly, the definition of dB that I'm interested in is X(dB) = 20log(x) where x is the "normal" value, and X(dB) is the value in dB. When wrote a code converted between mil. and mm, I noticed that if I use the direct approach, i.e., multiplying by the ratio between the units, I got small errors on the opposite conversion, i.e.: to_mil [to_mm val_in_mil] wasn't equal to val_in_mil and the same with mm. The library units has solved this problem, as the conversions done by it do not have that calculation error. But the specifically doesn't offer (or I didn't find) the option to convert a number to dB in the library.
Is there another library / command that can transform numbers to dB and dB to numbers without calculation errors?
I did an experiment with using the direct math conversion, and I what I got is:
>> set a 0.005
0.005
>> set b [expr {20*log10($a)}]
-46.0205999133
>> expr {pow(10,($b/20))}
0.00499999999999
It's all a matter of precision. We often tend to forget that floating point numbers are not real numbers (in the mathematical sense of ℝ).
How many decimal digit do you need?
If you, for example, would only need 5 decimal digits, rounding 0.00499999999999 will give you 0.00500 which is what you wanted.
Since rounding fp numbers is not an easy task and may generate even more troubles, you might just change the way you determine if two numbers are equal:
>> set a 0.005
0.005
>> set b [expr {20*log10($a)}]
-46.0205999133
>> set c [expr {pow(10,($b/20))}]
0.00499999999999
>> expr {abs($a - $c) < 1E-10}
1
>> expr {abs($a - $c) < 1E-20}
0
>> expr {$a - $c}
8.673617379884035e-19
The numbers in your examples can be considered "equal" up to an error or 10-18. Note that this is just a rough estimate, not a full solution.
If you're really dealing with problems that are sensitive to numerical errors propagation you might look deeper into "numerical analysis". The article What Every Computer Scientist Should Know About Floating-Point Arithmetic or, even better, this site: http://floating-point-gui.de might be a start.
In case you need a larger precision you should drop your "native" requirement.
You may use the BigFloat offered by tcllib (http://tcllib.sourceforge.net/doc/bigfloat.html or even use GMP (the GNU multiple precision arithmetic library) through ffidl (http://elf.org/ffidl). There's an interface already defined for it: gmp.tcl
With the way floating point numbers are stored, every log10(...) can't correspond to exactly one pow(10, ...). So you lose precision, just like the integer divisions 89/7 and 88/7 both are 12.
When you put a value into floating point format, you should forget the ability to know it's exact value anymore unless you keep the old, exact value too. If you want exactly 1/200, store it as the integer 1 and the integer 200. If you want exactly the ten-logarithm of 1/200, store it as 1, 200 and the info that a ten-logarithm has been done on it.
You can fill your entire memory with the first x decimal digits of the square root of 2, but it still won't be the square root of 2 you store.

What exactly is a datatype?

I understand what a datatype is (intuitively). But I need the formal definition. I don't understand if it is a set or it's the names 'int' 'float' etc. The formal definition found on wikipedia is confusing.
In computer programming, a data type is a classification identifying one of various types of data, such as floating-point, integer, or Boolean, that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of that type can be stored.
Can anyone help me with that?
Yep. What that's saying is that a data type has three pieces:
The various possible values. So, for example, an eight bit signed integer might have -127..128. This of that as a set of values V.
The operations: so an 8-bit signed integer might have +, -, * (multiply), and / (divide). The full definition would define those as functions from V into V, or possible as a function from V into float for division.
The way it's stored -- I sort of gave it away when I said "eight bit signed integer". The other detail is that I'm assuming a specific representation by the way I showed the range of values.
You might, if you're into object oriented programming, notice that this is very much like the definition of a class, which is defined by the storage used by each object, adn the methods of the class. Providing those parts for some arbitrary thing, but not inheritance rules, gives you what's called an abstract data type.
Update
#Appy, there's some room for differences in the formalities. I was a little subtle because it was late and I was suddenly uncertain if I'd assumed one's complement or two's complement -- of course it's two's complement. So interpretation is included in my description. Abstractly, though, you'd say it is a algebraic structure T=(V,O) where V is a set of values, O a set of functions from V into some arbitrary type -- remember '==' for example will be a function eq:V × V → {0,1} so you can't expect every operation to be into V.
I can define it as a classification of a particular type of information. It is easy for humans to distinguish between different types of data. We can usually tell at a glance whether a number is a percentage, a time, or an amount of money. We do this through special symbols %, :, and $.
Basically it's the concept that I am sure you grock. For computers however a data type is defined and has various associated attributes, like size, like a definition keywork (sometimes), the values it can take (numbers or characters for example) and operations that can be done on it like add subtract for numbers and append on string or compare on a character, etc. These differ from language to language and even from environment to env. (16 - 32 bit ints/ 32 - 64 envs./ etc).
If there is anything I am missing or needs refining please ask as this is fairly open ended.

Give an unique 6 or 9 digit number to each row

Is it possible to assign an unique 6 or 9 digit number to each new row only with MySQL.
Example :
id1 : 928524
id2 : 124952
id3 : 485920
...
...
P.S : I can do that with php's rand() function, but I want a better way.
MySQL can assign unique continuous keys by itself. If you don't want to use rand(), maybe this is what you meant?
I suggest you manually set the ID of the first row to 100000, then tell the database to auto increment. Next row should then be 100001, then 100002 and so on. Each unique.
Don't know why you would ever want to do this but you will have to use php's rand function, see if its already in the database, if it is start from the beginning again, if its not then use it for the id.
Essentially you want a cryptographic hash that's guaranteed not to have a collision for your range of inputs. Nobody seems to know the collision behavior of MD5, so here's an algorithm that's guaranteed not to have any: Choose two large numbers M and N that have no common divisors-- they can be two very large primes, or 2**64 and 3**50, or whatever. You will be generating numbers in the range 0..M-1. Use the following hashing function:
H(k) = k*N (mod M)
Basic number theory guarantees that the sequence has no collisions in the range 0..M-1. So as long as the IDs in your table are less than M, you can just hash them with this function and you'll have distinct hashes. If you use unsigned 64-bit integer arithmetic, you can let M = 2**64. N can then be any odd number (I'd choose something large enough to ensure that k*N > M), and you get the modulo operation for free as arithmetic overflow!
I wrote the following in comments but I'd better repeat it here: This is not a good way to implement access protection. But it does prevent people from slurping all your content, if M is sufficiently large.

Compressing a binary matrix

We were asked to find a way to compress a square binary matrix as much as possible, and if possible, to add redundancy bits to check and maybe correct errors.
The redundancy thing is easy to implement in my opinion. The complicated part is compressing the matrix. I thought about using run-length after reshaping the matrix to a vector because there will be more zeros than ones, but I only achieved a 40bits compression (we are working on small sizes) although I thought it'd be better.
Also, after run-length an idea was Huffman coding the matrix, but a dictionary must be sent in order to recover the original information.
I'd like to know what would be the best way to compress a binary matrix?
After reading some comments, yes #Adam you're right, the 14x14 matrix should be compressed in 128bits, so if I only use the coordinates (rows&cols) for each non-zero element, still it would be 160bits (since there are twenty ones). I'm not looking for an exact solution but for a useful idea.
You can only talk about compressing something if you have a distribution and a representation. That's the issue of the dictionary you have to send along: you always need some sort of dictionary of protocol to uncompress something. It just so happens that things like .zip and .mpeg already have those dictionaries/codecs. Even something as simple as Huffman-encoding is an algorithm; on the other side of the communication channel (you can think of compression as communication), the other person already has a bit of code (the dictionary) to perform the Huffman decompression scheme.
Thus you cannot even begin to talk about compressing something without first thinking "what kinds of matrices do I expect to see?", "is the data truly random, or is there order?", and if so "how can I represent the matrices to take advantage of order in the data?".
You cannot compress some matrices without increasing the size of other objects (by at least 1 bit). This is bad news if all matrices are equally probable, and you care equally about them all.
Addenda:
The answer to use sparse matrix machinery is not necessarily the right answer. The matrix could for example be represented in python as [[(r+c)%2 for c in range (cols)] for r in range(rows)] (a checkerboard pattern), and a sparse matrix wouldn't compress it at all, but the Kolmogorov complexity of the matrix is the above program's length.
Well, I know every matrix will have the same number of ones, so this is kind of deterministic. The only think I don't know is where the 1's will be. Also, if I transmit the matrix with a dictionary and there are burst errors, maybe the dictionary gets affected so... wouldnt be the resulting information corrupted? That's why I was trying to use lossless data compression such as run-length, the decoder just doesnt need a dictionary. --original poster
How many 1s does the matrix have as a fraction of its size, and what is its size (NxN -- what is N)?
Furthermore, this is an incorrect assertion and should not be used as a reason to desire run-length encoding (which still requires a program); when you transmit data over a channel, you can always add error-correction to this data. "Data" is just a blob of bits. You can transmit both the data and any required dictionaries over the channel. The error-correcting machinery does not care at all what the bits you transmit are for.
Addendum 2:
There are (14*14) choose 20 possible arrangements, which I assume are randomly chosen. If this number was larger than 128^2 what you're trying to do would be impossible. Fortunately log_2((14*14) choose 20) ~= 90bits < 128bits so it's possible.
The simple solution of writing down 20 numbers like 32,2,67,175,52,...,168 won't work because log_2(14*14)*20 ~= 153bits > 128bits. This would be equivalent to run-length encoding. We want to do something like this but we are on a very strict budget and cannot afford to be "wasteful" with bits.
Because you care about each possibility equally, your "dictionary"/"program" will simulate a giant lookup table. Matlab's sparse matrix implementation may work but is not guaranteed to work and is thus not a correct solution.
If you can create a bijection between the number range [0,2^128) and subsets of size 20, you're good to go. This corresponds to enumerating ways to descend the pyramid in http://en.wikipedia.org/wiki/Binomial_coefficient to the 20th element of row 196. This is the same as enumerating all "k-combinations". See http://en.wikipedia.org/wiki/Combination#Enumerating_k-combinations
Fortunately I know that Mathematica and Sage and other CAS software can apparently generate the "5th" or "12th" or arbitrarily numbered k-subset. Looking through their documentation, we come upon a function called "rank", e.g. http://www.sagemath.org/doc/reference/sage/combinat/subset.html
So then we do some more searching, and come across some arcane Fortran code like http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_rank.m and http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_unrank.m
We could reverse-engineer it, but it's kind of dense. But now we have enough information to search for k-subset rank unrank, which leads us to http://www.site.uottawa.ca/~lucia/courses/5165-09/GenCombObj.pdf -- see the section
"Generating k-subsets (of an n-set): Lexicographical
Ordering" and the rank and unrank algorithms on the next few pages.
In order to achieve the exact theoretically optimal compression, in the case of a uniformly random distribution of 1s, we must thus use this technique to biject our matrices to our output number of range <2^128. It just so happens that combinations have a natural ordering, known as ranking and unranking of combinations. You assign a number to each combination (ranking), and if you know the number you automatically know the combination (unranking). Googling k-subset rank unrank will probably yield other algorithms.
Thus your solution would look like this:
serialize the matrix into a list
e.g. [[0,0,1][0,1,1][1,0,0]] -> [0,0,1,0,1,1,1,0,0]
take the indices of the 1s:
e.g. [0,0,1,0,1,1,1,0,0] -> [3,5,6,7]
1 2 3 4 5 6 7 8 9 a k=4-subset of an n=9 set
take the rank
e.g. compressed = rank([3,5,6,7], n=9)
compressed==412 (or something, I made that up)
you're done!
e.g. 412 -binary-> 110011100 (at most n=9bits, less than 2^n=2^9=512)
to uncompress, unrank it
I'll get to 128 bits in a sec, first here's how you fit a 14x14 boolean matrix with exactly 20 nonzeros into 136 bits. It's based on the CSC sparse matrix format.
You have an array c with 14 4-bit counters that tell you how many nonzeros are in each column.
You have another array r with 20 4-bit row indices.
56 bits (c) + 80 bits (r) = 136 bits.
Let's squeeze 8 bits out of c:
Instead of 4-bit counters, use 2-bit. c is now 2*14 = 28 bits, but can't support more than 3 nonzeros per column. This leaves us with 128-80-28 = 20 bits. Use that space for array a4c with 5 4-bit elements that "add 4 to an element of c" specified by the 4-bit element. So, if a4c={2,2,10,15, 15} that means c[2] += 4; c[2] += 4 (again); c[10] += 4;.
The "most wasteful" distribution of nonzeros is one where the column count will require an add-4 to support 1 extra nonzero: so 5 columns with 4 nonzeros each. Luckily we have exactly 5 add-4s available.
Total space = 28 bits (c) + 20 bits
(a4c) + 80 bits (r) = 128 bits.
Your input is a perfect candidate for a sparse matrix. You said you're using Matlab, so you already have a good sparse matrix built for you.
spm = sparse(dense_matrix)
Matlab's sparse matrix implementation uses Compressed Sparse Columns, which has memory usage on the order of 2*(# of nonzeros) + (# of columns), which should be pretty good in your case of 20 nonzeros and 14 columns. Storing 20 values sure is better than storing 196...
Also remember that all matrices in Matlab are going to be composed of doubles. Just because your matrix can be stored as a 1-bit boolean doesn't mean Matlab won't stick it into a 64-bit floating point value... If you do need it as a boolean you're going to have to make your own type in C and use .mex files to interface with Matlab.
After thinking about this again, if all your matrices are going to be this small and they're all binary, then just store them as a binary vector (bitmask). Going off your 14x14 example, that requires 196 bits or 25 bytes (plus n, m if your dimensions are not constant). That same vector in Matlab would use 64 bits per element, or 1568 bytes. So storing the matrix as a bitmask takes as much space as 4 elements of the original matrix in Matlab, for a compression ratio of 62x.
Unfortunately I don't know if Matlab supports bitmasks natively or if you have to resort to .mex files. If you do get into C++ you can use STL's vector<bool> which implements a bitmask for you.