Hash index algorithm in MySQL

I was reading an article on hash indexing, and it seems that it is similar to the md5 function of PHP, in that both take a string value and return an integer representing that string, and this representation is consistent. Is this similarity really there, or am I missing something? Also, does anybody know which hashing algorithm MySQL employs for its hash-based index structure?

I'm not pretending to give a complete description of MySQL's algorithm, but a few things can be inferred.
First of all, the Wikipedia article on hash tables is a must-read. Then we have this note from the MySQL documentation:
They are used only for equality comparisons that use the = or <=> operators (but are very fast). They are not used for comparison operators such as < that find a range of values. Systems that rely on this type of single-value lookup are known as "key-value stores"; to use MySQL for such applications, use hash indexes wherever possible.
The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.)
MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
This points to the following (rather common) properties:
The MySQL hash function operates on a fixed-length "full-key" record (it is a question, though, how VARCHARs are treated; e.g. they might be padded with zeros up to the maximum length).
There is a max_heap_table_size global value and a MAX_ROWS parameter that the engine likely uses when estimating an upper bound on the row count for the hash function.
MySQL allows non-unique keys, but warns about proportional slowdowns. This at least suggests there is no second hash function, just a plain linked list used for collision resolution.
As for the actual function used, I don't think there is much to tell. MySQL may even use different functions according to some key heuristics (e.g. one for mostly sequential data such as IDs, another for CHARs), and of course its output is adjusted according to the estimated row count. However, you should only consider hash indexes when a BTREE cannot give you good enough performance, or when you never use any of its advantages, which is, I suppose, a rare case.
UPDATE
A quick look into the sources: /storage/heap/hp_hash.c contains a few implementations of hash functions. At least the assumption that different techniques are used for different types was right, as far as TEXT and VARCHAR go:
/*
* Fowler/Noll/Vo hash
*
* The basis of the hash algorithm was taken from an idea sent by email to the
* IEEE Posix P1003.2 mailing list from Phong Vo (kpv#research.att.com) and
* Glenn Fowler (gsf#research.att.com). Landon Curt Noll (chongo#toad.com)
* later improved on their algorithm.
*
* The magic is in the interesting relationship between the special prime
* 16777619 (2^24 + 403) and 2^32 and 2^8.
*
* This hash produces the fewest collisions of any function that we've seen so
* far, and works well on both numbers and strings.
*/
I'll try to give a simplified explanation.
ulong nr= 1, nr2= 4;
for (seg=keydef->seg,endseg=seg+keydef->keysegs ; seg < endseg ; seg++)
Every part of a compound key is processed separately; the result is accumulated in nr.
if (seg->null_bit)
{
if (rec[seg->null_pos] & seg->null_bit)
{
nr^= (nr << 1) | 1;
continue;
}
}
NULL values are treated separately.
if (seg->type == HA_KEYTYPE_TEXT)
{
uint char_length= seg->length; /* TODO: fix to use my_charpos() */
seg->charset->coll->hash_sort(seg->charset, pos, char_length,
&nr, &nr2);
}
else if (seg->type == HA_KEYTYPE_VARTEXT1) /* Any VARCHAR segments */
{
uint pack_length= seg->bit_start;
uint length= (pack_length == 1 ? (uint) *(uchar*) pos : uint2korr(pos));
seg->charset->coll->hash_sort(seg->charset, pos+pack_length,
length, &nr, &nr2);
}
TEXT and VARCHAR are also treated specially: hash_sort is presumably some other function that takes the collation into account, and VARCHARs carry a 1- or 2-byte length prefix.
else
{
uchar *end= pos+seg->length;
for ( ; pos < end ; pos++)
{
nr *=16777619;
nr ^=(uint) *pos;
}
}
And every other type is processed byte by byte with multiplication and XOR.
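As a rough illustration (not MySQL's exact code), the byte-by-byte loop above corresponds to a 32-bit FNV-1-style hash. A minimal Python sketch, assuming a 32-bit ulong and MySQL's seed of 1 (rather than the standard FNV offset basis):

```python
def fnv1_32(data: bytes, seed: int = 1) -> int:
    """FNV-1-style hash: multiply by the FNV prime, then XOR in the next byte.

    The & 0xFFFFFFFF emulates unsigned 32-bit overflow of the C ulong
    arithmetic (an assumption; on some platforms ulong is 64-bit).
    """
    h = seed
    for byte in data:
        h = (h * 16777619) & 0xFFFFFFFF  # nr *= 16777619
        h ^= byte                        # nr ^= (uint) *pos
    return h
```

For example, `fnv1_32(b"")` returns the seed 1, and hashing a single zero byte yields the FNV prime 16777619 itself.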

Related

SQL string literal hexadecimal key to binary and back

After extensive searching, I am resorting to Stack Overflow's wisdom to help me.
Problem:
I have a database table that should store values of the form (UserKey, data0, data1, ...), where the UserKey should serve as the primary key, or at least as an index. The UserKey itself (externally defined) is a string of 32 characters representing a checksum, which happens to be a (very big) hexadecimal number, i.e. it looks like this: UserKey = "000000003abc4f6e000000003abc4f6e".
Now I can certainly store this UserKey in a CHAR(32) field, but that feels mighty inefficient, as I would be storing a series of in-principle arbitrary characters, i.e. reserving more space per character than the 4 bits I need to store a hexadecimal digit (0-9, A-F).
So my thought was to convert this string literal into the hex number it really represents, and store that. But this number (32 * 4 bits = 16 bytes) is much too big to store or handle, as SQL only handles BIGINTs of 8 bytes.
My second thought was to convert it into a BINARY(16) representation, which should be compact and memory-efficient. However, I do not know how to efficiently convert between these two formats, as SQL also internally only handles numbers up to a maximum of 8 bytes.
Maybe there is a way to convert this string to binary block by block and stitch the binary together somehow, in the way of:
UserKey == concat( stringblock1, stringblock2, ..)
UserKey_binary = concat( toBinary( stringblock1 ), toBinary( stringblock2 ), ..)
So my question is: is there any such mechanism in SQL that would solve this for me? What would a custom solution look like? (I find it hard to believe that I am the first to encounter such a problem, as it has become quite common to use ridiculously long hash keys in many applications.)
Also, the UserKey_binary should then act as the relational key for the table, so I hope for a bit of speed from this more compact representation, as differences can be determined on a minimal number of bits. Additionally, I would like to do any conversion, if possible, on the server side, so that user scripts do not have to be altered (the user side should, if possible, still transmit a string literal, not [partially] converted values, in the INSERT statement).
Contrary to my earlier assumption, it seems that MySQL's UNHEX() function does convert a string block by block and concatenate, much as I described above, so the method also works for hex literal values bigger than BIGINT's 8-byte limitation. Here is an example table that illustrates this:
CREATE TABLE `testdb`.`tab` (
`hexcol_binary` BINARY(16) GENERATED ALWAYS AS (UNHEX(charcol)) STORED,
`charcol` CHAR(32) NOT NULL,
PRIMARY KEY (`hexcol_binary`));
The primary key is a generated column, so updates to charcol are the designated way of interacting with the table using string literals from the outside:
REPLACE into tab (charcol) VALUES ('1010202030304040A0A0B0B0C0C0D0D0');
SELECT HEX(hexcol_binary) as HEXstring, tab.* FROM tab;
As seen above, building keys and indexes on hexcol_binary works as intended.
To verify the speedup, take
ALTER TABLE `testdb`.`tab`
ADD INDEX `charkey` (`charcol` ASC);
EXPLAIN SELECT * from tab where hexcol_binary = UNHEX('1010202030304040A0A0B0B0C0C0D0D0') #keylength 16
EXPLAIN SELECT * from tab where charcol = '1010202030304040A0A0B0B0C0C0D0D0' #keylength 97
The lookup on the hexcol_binary column performs much better, especially if it is additionally made unique.
Note: the hex conversion does not care whether the hex characters A through F are capitalized, but charcol comparisons will be sensitive to this.
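Outside of SQL, the same block-wise conversion is what any hex codec does. A quick round-trip check (Python here just to demonstrate the idea, not server-side code):

```python
key = "000000003abc4f6e000000003ABC4F6E"  # mixed case on purpose

# Pack the 32 hex characters into 16 raw bytes, analogous to UNHEX()
packed = bytes.fromhex(key)
assert len(packed) == 16  # half the storage of CHAR(32)

# Unpack back to a hex string, analogous to HEX(); case is normalized,
# mirroring the case-insensitivity note above
assert packed.hex().upper() == key.upper()
```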

Generate unique serial from id number

I have a database with incrementally increasing ids. I need a function that converts such an id to a unique number between 0 and 1000 (the actual maximum is much larger; this is just for simplicity's sake).
1 => 3301,
2 => 0234,
3 => 7928,
4 => 9821
The generated numbers cannot have duplicates.
They cannot be incremental.
They need to be generated on the fly (not read from a pre-generated table of numbers).
I thought of a hash function, but there is the possibility of collisions.
Random numbers could also produce duplicates.
I need a minimal perfect hash function, but cannot find a simple solution.
Since the criteria are sort of vague (good enough to fool the average person), I am unsure exactly which route to take. Here are some ideas:
You could use a Pearson hash. According to the Wikipedia page:
Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.
You could just use a complicated looking one-to-one mathematical function. The drawback of this is that it would be difficult to make one that was not strictly increasing or strictly decreasing due to the one-to-one requirement. If you did something like (id ^ 2) + id * 2, the interval between ids would change and it wouldn't be immediately obvious what the function was without knowing the original ids.
You could do something like this:
new_id = (old_id << 4) + arbitrary_4bit_hash(old_id);
This would give unique IDs, and it wouldn't be immediately obvious that the low 4 bits are just garbage (especially when reading the numbers in decimal format). Like the last option, the new IDs would be in the same order as the old ones. I don't know if that would be a problem.
You could just hardcode all ID conversions by making a lookup array full of "random" numbers.
You could use some kind of hash function generator like gperf.
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
You could encrypt the ids with a key using a cryptographically secure mechanism.
Hopefully one of these works for you.
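The third option above (shifting the old id and filling the freed low bits with a small hash) can be sketched as follows. The mixer here is a hypothetical choice of mine; any function of the id works, since the high bits alone already guarantee uniqueness:

```python
def obfuscate(old_id: int) -> int:
    # hypothetical 4-bit hash: multiply by a large odd constant,
    # keep 4 bits from the middle of the product as "garbage"
    garbage = ((old_id * 2654435761) >> 16) & 0xF
    return (old_id << 4) | garbage

def deobfuscate(new_id: int) -> int:
    # the low 4 bits are noise; the real id lives in the high bits
    return new_id >> 4
```

The mapping is trivially reversible, which is exactly why it only fools a casual observer.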
Update
Here is the rotational shift the OP requested:
function map($number)
{
    // Shift the high bits down to the low end and the low bits
    // up to the high end.
    // Also, mask out all but 10 bits. This allows unique mappings
    // from 0-1023 to 0-1023.
    $high_bits = 0b0000001111111000 & $number;
    $new_low_bits = $high_bits >> 3;

    $low_bits = 0b0000000000000111 & $number;
    $new_high_bits = $low_bits << 7;

    // Recombine bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}

function demap($number)
{
    // Shift the high bits down to the low end and the low bits
    // up to the high end.
    $high_bits = 0b0000001110000000 & $number;
    $new_low_bits = $high_bits >> 7;

    $low_bits = 0b0000000001111111 & $number;
    $new_high_bits = $low_bits << 3;

    // Recombine bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}
This method has its advantages and disadvantages. The main disadvantage that I can think of (besides the security aspect) is that for lower IDs consecutive numbers will be exactly the same (multiplicative) interval apart until digits start wrapping around. That is to say
map(1) * 2 == map(2)
map(1) * 3 == map(3)
This happens, of course, because with lower numbers, all the higher bits are 0, so the map function is equivalent to just shifting. This is why I suggested using pseudo-random data for the lower bits rather than the higher bits of the number. It would make the regular interval less noticeable. To help mitigate this problem, the function I wrote shifts only the first 3 bits and rotates the rest. By doing this, the regular interval will be less noticeable for all IDs greater than 7.
It seems that it doesn't have to be numerical? What about an MD5 hash?
select md5(id+rand(10000)) from ...

Give a unique 6 or 9 digit number to each row

Is it possible to assign a unique 6 or 9 digit number to each new row using only MySQL?
Example :
id1 : 928524
id2 : 124952
id3 : 485920
...
...
P.S.: I can do that with PHP's rand() function, but I want a better way.
MySQL can assign unique continuous keys by itself. If you don't want to use rand(), maybe this is what you meant?
I suggest you manually set the ID of the first row to 100000, then tell the database to auto increment. Next row should then be 100001, then 100002 and so on. Each unique.
I don't know why you would ever want to do this, but you would have to use PHP's rand function: check whether the value is already in the database; if it is, start over; if not, use it as the id.
Essentially you want a cryptographic hash that's guaranteed not to have a collision for your range of inputs. Nobody seems to know the collision behavior of MD5, so here's an algorithm that's guaranteed not to have any: Choose two large numbers M and N that have no common divisors-- they can be two very large primes, or 2**64 and 3**50, or whatever. You will be generating numbers in the range 0..M-1. Use the following hashing function:
H(k) = k*N (mod M)
Basic number theory guarantees that the sequence has no collisions in the range 0..M-1. So as long as the IDs in your table are less than M, you can just hash them with this function and you'll have distinct hashes. If you use unsigned 64-bit integer arithmetic, you can let M = 2**64. N can then be any odd number (I'd choose something large enough to ensure that k*N > M), and you get the modulo operation for free as arithmetic overflow!
I wrote the following in comments but I'd better repeat it here: This is not a good way to implement access protection. But it does prevent people from slurping all your content, if M is sufficiently large.
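A sketch of this scheme with M = 2**64 and an arbitrary odd N (any odd number is coprime with a power of two, so the mapping is a bijection on 0..M-1; the specific constant below is my choice, not from the answer):

```python
M = 2 ** 64
N = 0x9E3779B97F4A7C15  # an arbitrary large odd constant

def h(k: int) -> int:
    # H(k) = k * N (mod M); with M = 2**64 this is exactly what
    # unsigned 64-bit overflow gives you for free in C
    return (k * N) % M

def h_inverse(x: int) -> int:
    # because gcd(N, M) = 1, N has a modular inverse, so H is reversible
    return (x * pow(N, -1, M)) % M
```

The invertibility is also why this provides obfuscation, not access protection: anyone who recovers N can walk the whole sequence.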

Creating a hash of a string thats sortable

Is there any way to create hashes of strings such that the hashes can be sorted and give the same order as if the strings themselves were sorted?
This won't be possible, at least if you allow strings longer than the hash size. You have 256^(max. string size) possible strings mapped to 256^(hash size) hash values, so you'll end up with some of the strings unsorted.
Just imagine the simplest hash: Truncating every string to (hash size) bytes.
Yes. It's called using the entire input string as the hash.
As others have pointed out it's not practical to do exactly what you've asked. You'd have to use the string itself as the hash which would constrain the lengths of strings that could be "hashed" and so on.
The obvious approach to maintaining a "sorted hash" data structure would be to maintain both a sorted list (heap or binary tree, for example) and a hashed mapping of the data. Inserts and removals would be O(log(n)) while retrievals would be O(1). Off hand I'm not sure how often this would be worth the additional complexity and overhead.
If you had a particularly large data set, mostly read-only and such that logarithmic time retrieval was overly expensive then I suppose it might be useful. Note that the cost of updates is actually the sum of the constant time (hash) and the logarithmic time (binary tree or heap) operations. However O(1) + O(log(n)) reduces to the larger of the two terms during asymptotic analysis. (The underlying cost is still there --- relevant to any implementation effort regardless of its theoretical irrelevance).
For a significant range of data set sizes, the cost of maintaining this hypothetical hybrid data structure could be estimated as "twice" the cost of maintaining either of the pure ones. (In other words, many implementations of a binary tree can scale to billions of elements (2^~32 or so) in a time cost comparable to that of typical hash functions.) So I'd be hard-pressed to convince myself that the added code complexity and run-time cost of a hybrid data structure would actually benefit a given project.
(Note: I saw that Python 3.1.1 added the notion of "ordered" dictionaries ... and this is similar to being sorted, but not quite the same. From what I gather, the ordered dictionary preserves the order in which elements were inserted into the collection. I also seem to remember some talk of "views" ... objects in the language which can access keys of a dictionary in some particular manner (sorted, reversed, reverse sorted, ...) at (possibly) lower cost than passing the set of keys through the built-in "sorted()" and "reversed()". I haven't used these, nor have I looked at the implementation details. I would guess that one of these "views" would be something like a lazily evaluated index, performing the necessary sorting on call, and storing the results with some sort of flag or trigger (observer pattern or listener) that's reset when the back-end source collection is updated. In that scheme a call to the "view" would update its index; subsequent calls would be able to use those results so long as no insertions nor deletions had been made to the dictionary. Any call to the view subsequent to key changes would incur the cost of updating the view. However, this is all pure speculation on my part. I mention it because it might also provide insight into some alternative ways to approach the question.)
Not unless there are fewer strings than hashes, and the hashes are perfect. Even then you still have to ensure the hash order is the same as the string order, this is probably not possible unless you know all the strings ahead of time.
No. The hash would have to contain the same amount of information as the string it is replacing. Otherwise, if two strings mapped to the same hash value, how could you possibly sort them?
Another way of thinking about it is this: If I have two strings, "a" and "b", then I hash both of them with this sort preserving hash function and get f(a) and f(b). However, there are an infinite number of strings that are greater than "a" but less than "b". This would require hashing the strings to arbitrary precision Real values (because of cardinality). In the end, you would basically just have the string encoded as a number.
You're essentially asking if you can compress the key strings into smaller keys while preserving their collation order. So it depends on your data. If your strings are composed of only hexadecimal digits, for example, they can be replaced with 4-bit codes.
But for the general case, it can't be done. You'd end up "hashing" each source key into itself.
I stumbled upon this, and although everyone is correct with their answers, I needed a solution exactly like this to use in Elasticsearch (don't ask why). Sometimes we don't need a perfect solution for all cases, we just need one that works within acceptable constraints. My solution generates a sortable hash code for the first n characters of the string; I did some preliminary tests and didn't have any collisions. You need to define beforehand the charset that is used, play with n until you reach an acceptable number of leading characters to sort by, and try to keep the resulting hash code within the positive range of the chosen type. In my case, for Java's long type, I could go up to 13 characters.
Below is my code in Java, hopefully, it will help someone else that needs this.
String charset = "abcdefghijklmnopqrstuvwxyz";

public long orderedHash(final String s, final String charset, final int n) {
    long hash = 0L;
    if (s.isEmpty() || n == 0)
        return hash;
    long charIndex = charset.indexOf(s.charAt(0));
    if (charIndex == -1)
        return hash;
    for (int i = 1; i < n; i++)
        hash += (long) (charIndex * Math.pow(charset.length(), i));
    hash += charIndex + 1 + orderedHash(s.substring(1), charset, n - 1);
    return hash;
}
Examples:
orderedHash("a", charset, 13) // 1
orderedHash("abc", charset, 13) // 4110785825426312
orderedHash("b", charset, 13) // 99246114928149464
orderedHash("google", charset, 13) // 651008600709057847
orderedHash("stackoverflow", charset, 13) // 1858969664686174756
orderedHash("stackunderflow", charset, 13) // 1858969712216171093
orderedHash("stackunderflo", charset, 13) // 1858969712216171093 same, 13 chars limitation
orderedHash("z", charset, 13) // 2481152873203736576
orderedHash("zzzzzzzzzzzzz", charset, 13) // 2580398988131886038
orderedHash("zzzzzzzzzzzzzz", charset, 14) // -4161820175519153195 no good, overflow
orderedHash("ZZZZZZZZZZZZZ", charset, 13) // 0 no good, not in charset
If more precision is needed, use an unsigned type or a composite one made of two longs for example and compute the hashcode with substrings.
Edit: Although the previous algorithm sufficed for my use, I noticed that it did not order strings correctly when they were shorter than the chosen n. The new algorithm above should handle this correctly.
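The same idea can also be written iteratively. This sketch (mine, not a translation of the Java above) treats the first n characters as digits in base len(charset)+1, reserving digit 0 for "past end of string" so that shorter strings sort before their extensions; characters outside the charset are assumed not to occur:

```python
def ordered_hash(s: str,
                 charset: str = "abcdefghijklmnopqrstuvwxyz",
                 n: int = 13) -> int:
    base = len(charset) + 1  # digit 0 means "no character here"
    h = 0
    for i in range(n):
        # position in charset, shifted by 1; 0 pads strings shorter than n
        digit = charset.index(s[i]) + 1 if i < len(s) else 0
        h = h * base + digit
    return h
```

Python integers are unbounded, so there is no 13-character overflow limit here; in Java you would still need to cap n so 27^n stays within a signed long.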

MySQL bitwise operations, bloom filter

I'd like to implement a bloom filter using MySQL (or a suggested alternative).
The problem is as follows:
Suppose I have a table that stores 8 bit integers, with these following values:
1: 10011010
2: 00110101
3: 10010100
4: 00100110
5: 00111011
6: 01101010
I'd like to find all rows whose value, bitwise-ANDed with the following mask, yields the mask itself:
00011000
The results should be rows 1 and 5.
However, in my problem, they aren't 8 bit integers, but rather n-bit integers. How do I store this, and how do I query? Speed is key.
Create a table with an int column (use this link to pick the right int size). Don't store numbers as sequences of 0s and 1s.
For your data it will look like this:
number
154
53
148
38
59
106
and you need to find all entries matching 24.
Then you can run a query like
SELECT * FROM test WHERE number & 24 = 24
If you want to avoid converting to base-10 numbers in your application, you can hand it over to MySQL:
INSERT INTO test SET number = b'00110101';
and search like this
SELECT bin(number) FROM test WHERE number & b'00011000' = b'00011000'
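To verify the logic against the sample rows in the question, the WHERE number & 24 = 24 test can be reproduced directly (Python, just as a sanity check):

```python
# the six sample rows from the question, as integers
rows = {1: 0b10011010, 2: 0b00110101, 3: 0b10010100,
        4: 0b00100110, 5: 0b00111011, 6: 0b01101010}
mask = 0b00011000  # 24 in decimal

# number & mask == mask keeps rows where every bit of the mask is set
matches = [row_id for row_id, value in rows.items() if value & mask == mask]
print(matches)  # [1, 5], the expected result
```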
Consider not using MySQL for this.
First off, there probably isn't a built-in way to handle values wider than 64 bits. You'd have to resort to user-defined functions written in C.
Second, each query is going to require a full table scan, because MySQL can't use an index for your query. So, unless your table is very small, this will not be fast.
Switch to PostgreSQL and use bit(n)
Bloom filters by their nature require table scans to evaluate matches. In MySQL, there is no bloom filter type. The simple solution is to map the bytes of the bloom filter onto BIGINTs (8-byte words) and perform the check in the query. So assuming the bloom filter is 8 bytes or fewer (a very small filter), you could execute a prepared statement like:
SELECT * FROM test WHERE CAST(filter AS UNSIGNED) & CAST(? AS UNSIGNED) = CAST(? AS UNSIGNED)
and replace the parameters with the value you are looking for. However, for larger filters, you have to create multiple filter columns and split your target filter into multiple words. You have to cast to unsigned for the check to behave correctly.
Since many reasonable bloom filters are in the Kilo- to Megabyte range in size it makes sense to use blobs to store them. Once you switch to blobs there are no native mechanisms to perform the byte level comparisons. And pulling an entire table of large blobs across the network to do the filter in code locally does not make much sense.
The only reasonable solution I have found is a UDF. The UDF should accept a char* and iterate over it, casting the char* to an unsigned char* and performing the target & candidate == target check. The code would look something like:
my_bool bloommatch(UDF_INIT *initid, UDF_ARGS *args, char *result,
                   unsigned long *length, char *is_null, char *error)
{
    if (args->lengths[0] > args->lengths[1])
        return 0;

    char *b1 = args->args[0];
    char *b2 = args->args[1];
    int limit = args->lengths[0];
    unsigned char a;
    unsigned char b;
    int i;

    for (i = 0; i < limit; i++)
    {
        a = (unsigned char) b1[i];
        b = (unsigned char) b2[i];
        if ((a & b) != a)
            return 0;
    }
    return 1;
}
This solution is implemented and is available here
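The byte-level check the UDF performs is simple to state in any language. Here is the same target & candidate == target containment test in Python (a model of the logic, not a drop-in replacement for the UDF):

```python
def bloom_match(target: bytes, candidate: bytes) -> bool:
    # every bit set in the target filter must also be set in the candidate
    if len(target) > len(candidate):
        return False
    return all((t & c) == t for t, c in zip(target, candidate))
```

With the 8-bit example from the question, `bloom_match(bytes([0b00011000]), bytes([0b10011010]))` is True (row 1 matches), while row 2's 0b00110101 does not.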
For up to 64 bits, you can use a MySQL integer type: TINYINT (8 bits), SMALLINT (16), MEDIUMINT (24), INT (32), or BIGINT (64). Use the unsigned variants.
Above 64b, use the MySQL (VAR)BINARY type. Those are raw byte buffers.
For example BINARY(16) is good for 128 bits.
To prevent table scans you need an index per useful bit, and/or an index per set of related bits. You can create virtual columns for that, and put an index on each of them.
To implement a Bloom filter using a database, I'd think about it differently.
I'd do a two-level filter. Use a single multi-bit hash function to generate an id (this would be more like a hash table bucket index) and then use bits within the row for the remaining k-1 hash functions of the more classical kind. Within the row, it could be (say) 100 bigint columns (I'd compare performance vs BLOBs too).
It would effectively be N separate Bloom filters, where N is the domain of your first hash function. The idea is to reduce the size of the Bloom filter required by choosing a hash bucket. It wouldn't have the full efficiency of an in-memory Bloom filter, but could still greatly reduce the amount of data needing to be stored compared to putting all the values in the database and indexing them. Presumably the reason for using a database in the first place is lack of memory for a full Bloom filter.
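A toy model of this two-level scheme (the constants and the hashing are illustrative assumptions, not from the answer): the first hash picks a bucket, i.e. a row, and the remaining hashes set bits within that row's bit array.

```python
import hashlib

N_BUCKETS = 1024   # domain of the first hash function (assumed)
ROW_BITS = 6400    # e.g. 100 BIGINT columns * 64 bits per row

rows = [0] * N_BUCKETS  # each int models one row's bit array

def _positions(value: str, k: int = 3):
    # derive the bucket and k bit positions from a single digest (illustrative)
    d = hashlib.sha256(value.encode()).digest()
    bucket = int.from_bytes(d[:4], "big") % N_BUCKETS
    bits = [int.from_bytes(d[4 + 4 * i:8 + 4 * i], "big") % ROW_BITS
            for i in range(k)]
    return bucket, bits

def add(value: str) -> None:
    bucket, bits = _positions(value)
    for b in bits:
        rows[bucket] |= 1 << b

def might_contain(value: str) -> bool:
    # may return a false positive, never a false negative (Bloom semantics)
    bucket, bits = _positions(value)
    return all(rows[bucket] >> b & 1 for b in bits)
```

In a database, `rows` would be the table and each lookup would touch a single row instead of scanning the whole filter.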