I'd like to implement a bloom filter using MySQL (or a suggested alternative).
The problem is as follows:
Suppose I have a table that stores 8-bit integers, with the following values:
1: 10011010
2: 00110101
3: 10010100
4: 00100110
5: 00111011
6: 01101010
I'd like to find all rows whose value, bitwise ANDed with the following mask, equals the mask:
00011000
The results should be rows 1 and 5.
However, in my problem, they aren't 8 bit integers, but rather n-bit integers. How do I store this, and how do I query? Speed is key.
Create a table with an integer column (pick the right integer size for your data). Don't store the numbers as sequences of 0s and 1s.
For your data it will look like this:
number
154
53
148
38
59
106
and you need to find all entries matching 24.
Then you can run a query like
SELECT * FROM test WHERE number & 24 = 24
If you want to avoid converting to base-10 numbers in your application, you can hand it over to MySQL:
INSERT INTO test SET number = b'00110101';
and search like this
SELECT bin(number) FROM test WHERE number & b'00011000' = b'00011000'
Consider not using MySQL for this.
First off, there probably isn't a built-in way to handle values wider than 64 bits. You'd have to resort to user-defined functions written in C.
Second, each query is going to require a full table scan, because MySQL can't use an index for your query. So, unless your table is very small, this will not be fast.
Switch to PostgreSQL and use bit(n)
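A minimal sketch of the PostgreSQL approach, assuming an illustrative table named test and a small 16-bit filter (bit(n) works the same way for any fixed width):

CREATE TABLE test (
  id serial PRIMARY KEY,
  filter bit(16) NOT NULL
);

INSERT INTO test (filter) VALUES (B'0001100000000000');

-- every bit of the mask must also be set in the stored filter
SELECT * FROM test
WHERE (filter & B'0001100000000000') = B'0001100000000000';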
Bloom filters by their nature require table scans to evaluate matches. In MySQL, there is no bloom filter type. The simple solution is to map the bytes of the bloom filter onto BIGINT (8-byte words) and perform the check in the query. So, assuming the bloom filter is 8 bytes or fewer (a very small filter), you could execute a prepared statement like:
SELECT * FROM test WHERE CAST(filter AS UNSIGNED) & CAST(? AS UNSIGNED) = CAST(? AS UNSIGNED)
and replace the parameters with the value you are looking for. However, for larger filters, you have to create multiple filter columns and split your target filter into multiple words. You have to cast to unsigned to do the check properly.
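For example, a rough sketch for a 16-byte filter split into two BIGINT UNSIGNED words (table and column names are illustrative); the two ? placeholders per word are both bound to the corresponding word of the target filter:

CREATE TABLE test (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  filter_hi BIGINT UNSIGNED NOT NULL,
  filter_lo BIGINT UNSIGNED NOT NULL
);

SELECT * FROM test
WHERE (filter_hi & ?) = ?
  AND (filter_lo & ?) = ?;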
Since many reasonable bloom filters are in the kilobyte-to-megabyte range in size, it makes sense to use BLOBs to store them. Once you switch to BLOBs, there are no native mechanisms to perform the byte-level comparisons. And pulling an entire table of large BLOBs across the network to do the filtering locally in code does not make much sense.
The only reasonable solution I have found is a UDF. The UDF should accept the target and candidate filters as char* arguments, cast each byte to unsigned char, and perform the target & candidate = target check. The code would look something like:
#include <mysql.h>

/* Row-level function of an INTEGER UDF: returns 1 when every set bit of
   the target filter (argument 0) is also present in the candidate (argument 1). */
long long bloommatch(UDF_INIT *initid, UDF_ARGS *args, char *is_null, char *error)
{
    /* the target cannot be contained in a shorter candidate */
    if (args->lengths[0] > args->lengths[1])
    {
        return 0;
    }
    char* b1 = args->args[0];   /* target */
    char* b2 = args->args[1];   /* candidate */
    int limit = args->lengths[0];
    unsigned char a;
    unsigned char b;
    int i;
    for (i = 0; i < limit; i++)
    {
        a = (unsigned char) b1[i];
        b = (unsigned char) b2[i];
        if ((a & b) != a)
        {
            return 0;           /* a target bit is missing from the candidate */
        }
    }
    return 1;
}
This solution is implemented and is available here
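Assuming the UDF above is compiled into a shared library called bloommatch.so and the filters live in a BLOB column named filter (both names are illustrative), registration and use would look roughly like:

CREATE FUNCTION bloommatch RETURNS INTEGER SONAME 'bloommatch.so';

-- ? is the target filter; a row matches when all of its set bits are present
SELECT * FROM test WHERE bloommatch(?, filter) = 1;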
For up to 64 bits, you can use a MySQL integer type: TINYINT (8 bits), SMALLINT (16 bits), MEDIUMINT (24 bits), INT (32 bits) or BIGINT (64 bits). Use the unsigned variants.
Above 64 bits, use the MySQL (VAR)BINARY types. Those are raw byte buffers.
For example BINARY(16) is good for 128 bits.
To prevent table scans you need an index per useful bit, and/or an index per set of related bits. You can create virtual columns for that, and put an index on each of them.
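As a rough sketch (assuming a BINARY(16) filter column; all names and the choice of bit are illustrative), a virtual column exposing bit 3 of the first byte could be indexed like this:

CREATE TABLE test (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  filter BINARY(16) NOT NULL,
  bit3 TINYINT GENERATED ALWAYS AS ((ASCII(SUBSTRING(filter, 1, 1)) >> 3) & 1) VIRTUAL,
  KEY idx_bit3 (bit3)
);

-- queries filtering on bit3 can use idx_bit3 instead of scanning the table
SELECT * FROM test WHERE bit3 = 1;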
To implement a Bloom filter using a database, I'd think about it differently.
I'd do a two-level filter. Use a single multi-bit hash function to generate an id (this would be more like a hash table bucket index) and then use bits within the row for the remaining k-1 hash functions of the more classical kind. Within the row, it could be (say) 100 bigint columns (I'd compare performance vs BLOBs too).
It would effectively be N separate Bloom filters, where N is the domain of your first hash function. The idea is to reduce the size of the Bloom filter required by choosing a hash bucket. It wouldn't have the full efficiency of an in-memory Bloom filter, but could still greatly reduce the amount of data needing to be stored compared to putting all the values in the database and indexing them. Presumably the reason for using a database in the first place is lack of memory for a full Bloom filter.
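A hedged sketch of that idea, with illustrative names and a deliberately tiny filter of 4 x 64 = 256 bits per bucket (a real table would use far more words, e.g. the 100 BIGINT columns mentioned above); the application computes bucket_id from the first hash and (word, bit) positions from the remaining k-1 hashes:

CREATE TABLE bloom_bucket (
  bucket_id INT UNSIGNED NOT NULL PRIMARY KEY,  -- value of the first hash
  w0 BIGINT UNSIGNED NOT NULL DEFAULT 0,
  w1 BIGINT UNSIGNED NOT NULL DEFAULT 0,
  w2 BIGINT UNSIGNED NOT NULL DEFAULT 0,
  w3 BIGINT UNSIGNED NOT NULL DEFAULT 0
);

-- adding an element: OR the addressed bits into the bucket row
INSERT INTO bloom_bucket (bucket_id, w1, w3)
VALUES (12345, 1 << 17, 1 << 5)
ON DUPLICATE KEY UPDATE w1 = w1 | (1 << 17), w3 = w3 | (1 << 5);

-- membership test: every addressed bit must be set
SELECT ((w1 >> 17) & 1) AND ((w3 >> 5) & 1) AS maybe_present
FROM bloom_bucket
WHERE bucket_id = 12345;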
Related
I need to store an array of unsigned TINYINT in a MySQL table for each user.
The array length is constant and I don't need to do any search or sorting in it.
Only its values are changed over time.
My goal is to store the values in a way that keeps the data size as close as possible to N x TINYINT per row, and ideally readable.
I was considering 2 solutions:
Solution 1:
| user_id | TINYINT_1 | TINYINT_... | TINYINT_N |
Solution 2:
| user_id | JSON array [TINYINT_1, TINYINT_..., TINYINT_N] |
The second seems cleaner as I don't need to invent N useless column names, but from what I understand I have no control over the type used to store values in a JSON array, and I'm afraid it would increase the final storage size to much more than N x TINYINT per row.
Is there a way to control the type of the values, or some other smarter way to do this?
Thanks for your advice.
One TINYINT takes one byte. The only way to ensure that storing N TINYINTs takes N x TINYINT bytes is to store them as BINARY(N) for N up to 255, or a BLOB if it's longer. That is, each TINYINT gets one byte in a binary string. That's not readable at all, but it is the most compact way to store it.
Because you would be responsible for interpreting this string byte-by-byte, there is no way it could be misinterpreted, or the elements treated as some other data type. That's entirely up to your code that unpacks the string.
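As a small sketch, assuming N = 4 values per user (names are illustrative), the application packs each TINYINT into one byte; UNHEX/HEX make the packed string easy to write and inspect from SQL:

CREATE TABLE user_values (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  vals BINARY(4) NOT NULL
);

INSERT INTO user_values VALUES (1, UNHEX('0A14FF03'));  -- the values 10, 20, 255, 3
SELECT user_id, HEX(vals) FROM user_values;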
MySQL does not have an array type (like for example PostgreSQL does).
If you want the array of TINYINT to be stored in a readable fashion, you could store it as a string of hex digits, with two digits per TINYINT. This takes exactly twice the space of the BINARY(N) solution.
You can also store the numbers as a string of comma-separated text numbers in base 10, which is more readable, but takes more space.
You can also use JSON, which allows for an array of numbers, but it takes even more space because it stores numbers in base 10 and needs the [ ] array delimiters. And you already thought of the problem that JSON allows arbitrary types for the array elements. MySQL supports JSON Schema validation, but not automatically; you'd have to write a CHECK constraint. There's an example here: https://dev.mysql.com/doc/refman/8.0/en/json-validation-functions.html
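As a rough sketch (MySQL 8.0.17 or later; table and constraint names are illustrative), a CHECK constraint using JSON_SCHEMA_VALID could restrict the column to an array of integers in the TINYINT UNSIGNED range:

CREATE TABLE user_json (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  vals JSON NOT NULL,
  CONSTRAINT vals_schema CHECK (
    JSON_SCHEMA_VALID(
      '{"type": "array", "items": {"type": "integer", "minimum": 0, "maximum": 255}}',
      vals))
);

INSERT INTO user_json VALUES (1, '[10, 20, 255, 3]');  -- accepted
-- INSERT INTO user_json VALUES (2, '[10, "x"]');      -- rejected by the CHECK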
After extensive searching, I am resorting to Stack Overflow's wisdom to help me.
Problem:
I have a database table that should effectively store values of the format (UserKey, data0, data1, ..) where the UserKey is to be handled as primary key but at least as an index. The UserKey itself (externally defined) is a string of 32 characters representing a checksum, which happens to be (a very big) hexadecimal number, i.e. it looks like this UserKey = "000000003abc4f6e000000003abc4f6e".
Now I can certainly store this UserKey in a CHAR(32) field, but that feels mighty inefficient, as I would be storing a series of in-principle arbitrary characters, i.e. reserving more space per character than the 4 bits I need to store each hexadecimal digit (0-9, A-F).
So my thought was to convert this string literal into the hex number it really represents and store that. But this number (32 * 4 bits = 16 bytes) is much too big to store and handle, as SQL only handles BIGINTs of 8 bytes.
My second thought was to convert it into a BINARY(16) representation, which should be compact and memory-efficient. However, I do not know how to efficiently convert between these two formats, as SQL also internally only handles numbers up to a maximum of 8 bytes.
Maybe there is a way to convert this string to binary block by block and stitch the binary together somehow, in the way of:
UserKey == concat( stringblock1, stringblock2, ..)
UserKey_binary = concat( toBinary( stringblock1 ), toBinary( stringblock2 ), ..)
So my question is: is there any such mechanism in SQL that would solve this for me? What would a custom solution look like? (I find it hard to believe that I am the first to encounter such a problem, as it has become quite common to use ridiculously long hash keys in many applications.)
Also, the UserKey_binary should then act as the relational key for the table, so I hope to gain some speed from this more compact representation, as comparisons have to examine fewer bytes. Additionally, I would like to do any conversion, if possible, on the server side, so that user scripts don't have to be altered (the client should, if possible, still transmit a string literal, not [partially] converted values, in the INSERT statement).
Contrary to my previous statement, it seems that MySQL's UNHEX() function does convert the string block by block and concatenate the results, much as I described above, so the method also works for hex literals bigger than BIGINT's 8-byte limit. Here is an example table that illustrates this:
CREATE TABLE `testdb`.`tab` (
`hexcol_binary` BINARY(16) GENERATED ALWAYS AS (UNHEX(charcol)) STORED,
`charcol` CHAR(32) NOT NULL,
PRIMARY KEY (`hexcol_binary`));
The primary key is a generated column, so updates to charcol are the designated way of interacting with the table using string literals from the outside:
REPLACE into tab (charcol) VALUES ('1010202030304040A0A0B0B0C0C0D0D0');
SELECT HEX(hexcol_binary) as HEXstring, tab.* FROM tab;
As seen, building keys and indexes on hexcol_binary works as intended.
To verify the speedup, take
ALTER TABLE `testdb`.`tab`
ADD INDEX `charkey` (`charcol` ASC);
EXPLAIN SELECT * from tab where hexcol_binary = UNHEX('1010202030304040A0A0B0B0C0C0D0D0') #keylength 16
EXPLAIN SELECT * from tab where charcol = '1010202030304040A0A0B0B0C0C0D0D0' #keylength 97
The lookup on the hexcol_binary column performs much better, especially if it is additionally made unique.
Note: the hex conversion does not care whether the hex characters A through F are upper- or lower-case, but comparisons against charcol will be sensitive to this.
I have a table of user entries, and for every entry I have an array of (2-byte) integers to store (15-25, sporadically even more). The array elements will be written and read all at the same time, it is never needed to update or to access them individually. Their order matters. It makes sense to think of this as an array object.
I have many millions of these user entries and want to store this with the minimum possible amount of disk space. I'm however struggling with MySQL's lack of Array datatype.
I've been considering the following options.
Do it the MySQL way. Make a table my_data with columns user_id, data_id and data_int. To make this efficient, one needs an index on user_id, totalling well over 10 bytes per integer.
Store the array in text format. This takes ~6.5 bytes per integer.
Making 35-40 columns ("enough") and having -32768 be 'empty' (since this value cannot occur in my data). This takes 3.5-4 bytes per integer, but is somewhat ugly (as I have to impose a strict limit on the number of elements in the array).
Is there a better way to do this in MySQL? I know MySQL has an efficient varchar type, so ideally I'd store my 2-byte integers as 2-byte chars in a varchar (or a similar approach with blob), but I'm not sure how to do that. Is this possible? How should this be done?
You could store them as separate SMALLINT NULL columns.
In MyISAM this uses 2 bytes of data + 1 bit of null indicator for each value.
In InnoDB, the null indicators are encoded into the column's field start offset, so they don't take any extra space, and null values are not actually stored in the row data. If the rows are small enough that all the offsets are 1 byte, then this uses 3 bytes for every existing value (1 byte offset, 2 bytes data), and 1 byte for every nonexistent value.
Either of these would be better than using INT with a special value to indicate that it doesn't exist, since that would be 4 bytes of data for every value.
See NULL in MySQL (Performance & Storage)
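A minimal sketch with four slots (a real table would have as many SMALLINT columns as the maximum array length; names are illustrative):

CREATE TABLE user_data (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  v1 SMALLINT NULL,
  v2 SMALLINT NULL,
  v3 SMALLINT NULL,
  v4 SMALLINT NULL
);

-- missing trailing elements are simply left NULL
INSERT INTO user_data VALUES (1, 120, -5, 30000, NULL);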
The best answer was given in the comments, so I'll repost it here with some use-ready code, for further reference.
MySQL has a varbinary type that works really well for this: you can simply use PHP's pack/unpack functions to convert them to and from binary form, and store that binary form in the database using varbinary. Example code for the conversion is below.
function pack24bit($n) { // input: 24-bit integer, output: binary string of 3 bytes
    $b3 = $n % 256;                 // low byte
    $b2 = intdiv($n, 256) % 256;    // middle byte
    $b1 = intdiv($n, 65536) % 256;  // high byte
    return pack('CCC', $b1, $b2, $b3);
}
function unpack24bit($packed) { // input: binary string of 3 bytes, output: 24-bit int
    $arr = unpack('C3b', $packed);  // yields keys b1, b2, b3
    return 256*(256*$arr['b1'] + $arr['b2']) + $arr['b3'];
}
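On the storage side, a sketch of the table this pairs with, assuming at most 25 of the 3-byte values produced by the functions above (names are illustrative):

CREATE TABLE user_entries (
  user_id INT UNSIGNED NOT NULL PRIMARY KEY,
  data VARBINARY(75) NOT NULL
);

The PHP code packs the array into a binary string that is inserted into data and unpacked again after reading.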
I was reading an article on hash indexing, and it seems that it is similar to PHP's md5 function, in that both take a string value and return an integer representing that string, and this representation is consistent. Is this similarity really there, or am I missing anything? Also, does anybody know which hashing algorithm MySQL employs for its hash-based index structure?
I'm not pretending to give a complete description of MySQL's algorithm, but there are a few things that can be guessed.
First of all, the Hash table article on Wikipedia is a must-read. Then we have this note from the MySQL documentation:
They are used only for equality comparisons that use the = or <=> operators (but are very fast). They are not used for comparison operators such as < that find a range of values. Systems that rely on this type of single-value lookup are known as “key-value stores”; to use MySQL for such applications, use hash indexes wherever possible.
The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.)
MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
This points to the following (rather common) properties:
The MySQL hash function operates on a fixed-length "full-key" record (it is an open question how VARCHARs are treated, e.g. they might be padded with zeros up to the maximum length).
There is a max_heap_table_size global value and a MAX_ROWS parameter that the engine is likely to use when guessing the upper row count for the hash function.
MySQL allows non-unique keys, but warns about proportional slowdowns. At least this suggests that there is no second hash function, but rather a simple linked list used for collision resolution.
As for the actual function used, I don't think there is much to tell. MySQL may even use different functions according to some key heuristics (e.g. one for mostly sequential data, such as IDs, another for CHARs), and of course its output changes according to the estimated row count. However, you should only consider hash indexes when a BTREE cannot give you good enough performance, or when you never use any of its advantages, which is, I suppose, a rare case.
UPDATE
Digging a bit into the sources: /storage/heap/hp_hash.c contains a few hash function implementations. At least the assumption that different techniques are used for different types was right, as far as TEXT and VARCHAR are concerned:
/*
* Fowler/Noll/Vo hash
*
* The basis of the hash algorithm was taken from an idea sent by email to the
* IEEE Posix P1003.2 mailing list from Phong Vo (kpv#research.att.com) and
* Glenn Fowler (gsf#research.att.com). Landon Curt Noll (chongo#toad.com)
* later improved on their algorithm.
*
* The magic is in the interesting relationship between the special prime
* 16777619 (2^24 + 403) and 2^32 and 2^8.
*
* This hash produces the fewest collisions of any function that we've seen so
* far, and works well on both numbers and strings.
*/
I'll try to give a simplified explanation.
ulong nr= 1, nr2= 4;
for (seg=keydef->seg,endseg=seg+keydef->keysegs ; seg < endseg ; seg++)
Every part of a compound key is processed separately; the result is accumulated in nr.
if (seg->null_bit)
{
if (rec[seg->null_pos] & seg->null_bit)
{
nr^= (nr << 1) | 1;
continue;
}
}
NULL values are treated separately.
if (seg->type == HA_KEYTYPE_TEXT)
{
uint char_length= seg->length; /* TODO: fix to use my_charpos() */
seg->charset->coll->hash_sort(seg->charset, pos, char_length,
&nr, &nr2);
}
else if (seg->type == HA_KEYTYPE_VARTEXT1) /* Any VARCHAR segments */
{
uint pack_length= seg->bit_start;
uint length= (pack_length == 1 ? (uint) *(uchar*) pos : uint2korr(pos));
seg->charset->coll->hash_sort(seg->charset, pos+pack_length,
length, &nr, &nr2);
}
So TEXT and VARCHAR are handled via their collation: hash_sort is presumably another function that takes the collation into account. VARCHARs have a 1- or 2-byte length prefix.
else
{
uchar *end= pos+seg->length;
for ( ; pos < end ; pos++)
{
nr *=16777619;
nr ^=(uint) *pos;
}
}
And every other type is treated byte by byte, with multiplication and XOR.
Is it possible to assign a unique 6- or 9-digit number to each new row using only MySQL?
Example :
id1 : 928524
id2 : 124952
id3 : 485920
...
...
P.S.: I can do that with PHP's rand() function, but I want a better way.
MySQL can assign unique continuous keys by itself. If you don't want to use rand(), maybe this is what you meant?
I suggest you manually set the ID of the first row to 100000, then let the database auto-increment. The next row will then be 100001, then 100002, and so on, each unique.
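A minimal sketch (table and column names are illustrative):

CREATE TABLE items (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50) NOT NULL
) AUTO_INCREMENT = 100000;

INSERT INTO items (name) VALUES ('first'), ('second');
-- the first row gets id 100000, the next 100001, and so on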
I don't know why you would ever want to do this, but you would have to use PHP's rand() function, check whether the value is already in the database, and if it is, start again from the beginning; if it's not, use it for the ID.
Essentially you want a cryptographic hash that's guaranteed not to have a collision for your range of inputs. Nobody seems to know the collision behavior of MD5, so here's an algorithm that's guaranteed not to have any: choose two large numbers M and N that have no common divisors; they can be two very large primes, or 2**64 and 3**50, or whatever. You will be generating numbers in the range 0..M-1. Use the following hashing function:
H(k) = k*N (mod M)
Basic number theory guarantees that the sequence has no collisions in the range 0..M-1. So as long as the IDs in your table are less than M, you can just hash them with this function and you'll have distinct hashes. If you use unsigned 64-bit integer arithmetic, you can let M = 2**64. N can then be any odd number (I'd choose something large enough to ensure that k*N > M), and you get the modulo operation for free as arithmetic overflow!
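A hedged sketch for 9-digit IDs, with illustrative constants: M = 10^9 and N = 387420489 (3^18) share no common divisor, so k*N mod M is distinct for every k in 0..M-1:

SELECT LPAD((k * 387420489) % 1000000000, 9, '0') AS public_id
FROM (SELECT 1 AS k UNION ALL SELECT 2 UNION ALL SELECT 3) AS t;
-- 387420489, 774840978, 162261467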
I wrote the following in comments but I'd better repeat it here: This is not a good way to implement access protection. But it does prevent people from slurping all your content, if M is sufficiently large.