MySQL binary versus non-binary for hash IDs

Assuming that I want to use a hash as an ID instead of a numeric value, would there be a performance advantage to storing it as BINARY rather than non-binary?
CREATE TABLE `test`.`foobar` (
`id` CHAR(32) BINARY CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
PRIMARY KEY (`id`)
)
CHARACTER SET ascii;

Yes. Often a hash digest is stored as the ASCII representation of hex digits; for example, the MD5 of the word 'hash' is:
0800fc577294c34e0b28ad2839435945
This is a 32-character ASCII string.
But MD5 really produces a 128-bit binary hash value. Stored as raw binary instead of hex digits, it requires only 16 bytes, so you can gain some space efficiency by using binary strings.
CREATE TABLE test.foobar (
id BINARY(16) NOT NULL PRIMARY KEY
);
INSERT INTO test.foobar (id) VALUES (UNHEX(MD5('hash')));
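To read the value back in a human-readable form, HEX() reverses UNHEX():
SELECT HEX(id) FROM test.foobar;
-- returns '0800FC577294C34E0B28AD2839435945' (HEX() outputs uppercase digits)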
Re. your comments that you are more concerned about performance than space efficiency:
I don't know of any reason that the BINARY data type would be speedier than CHAR.
Being half as large can be an advantage for performance if you use cache buffers effectively. That is, a given amount of cache memory can store twice as many rows worth of BINARY data if the string is half the size of the CHAR needed to store the same value in hex. Likewise the cache memory for the index on that column can store twice as much.
The result is a more effective cache, because a random query has a greater chance of hitting the cached data or index, instead of requiring a disk access. Cache efficiency is important for most database applications, because usually the bottleneck is disk I/O. If you can use cache memory to reduce frequency of disk I/O, it's a much bigger bang for the buck than the choice between one data type or another.
As for the difference between a hash string stored in BINARY versus a BIGINT, I would choose BIGINT. The cache efficiency will be even greater, and also on 64-bit processors integer arithmetic and comparisons should be very fast.
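For instance, one way to derive a BIGINT key is to keep only the leading 64 bits of the digest (a sketch; truncating a hash increases the collision risk, so this is only reasonable where collisions are tolerable or checked):
CREATE TABLE test.foobar_int (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY
);
-- keep only the first 16 hex digits (64 bits) of the MD5 digest
INSERT INTO test.foobar_int (id)
    VALUES (CAST(CONV(LEFT(MD5('hash'), 16), 16, 10) AS UNSIGNED));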
I don't have measurements to support the claims above. The net benefit of choosing one data type over another depends a lot on data patterns and types of queries in your database and application. To get the most precise answer, you must try both solutions and measure the difference.
Re. your supposition that binary string comparison is quicker than default case-insensitive string comparison, I tried the following test:
mysql> SELECT BENCHMARK(100000000, 'foo' = 'FOO');
1 row in set (5.13 sec)
mysql> SELECT BENCHMARK(100000000, 'foo' = BINARY 'FOO');
1 row in set (4.23 sec)
So binary string comparison is 17.5% faster than case-insensitive string comparison. But notice that after evaluating this expression 100 million times, the total difference is still less than 1 second. While we can measure the relative difference in speed, the absolute difference in speed is really insignificant.
So I'll reiterate:
Measure, don't guess or suppose. Your educated guesses will be wrong a lot of the time. Measure before and after every change you make, so you know how much it helped.
Invest your time and attention where you get the greatest bang for the buck.
Don't sweat the small stuff. Of course, a tiny difference adds up with enough iterations, but given those iterations, a performance improvement with greater absolute benefit is still preferable.

From the manual:
The BINARY and VARBINARY types are similar to CHAR and VARCHAR, except
that they contain binary strings rather than non-binary strings. That is,
they contain byte strings rather than character strings. This means that
they have no character set, and sorting and comparison are based on the
numeric values of the bytes in the values.
Note that CHAR(32) BINARY does not create a BINARY(32) column under the hood; it is shorthand for the binary (_bin) collation of the column's character set (here, ascii_bin). The benefit is that sorting and comparison use plain byte values, so sorting by that column takes less time, and finding corresponding rows is probably faster if the column is indexed.

Related

Store UUID v4 in MySQL

I'm generating UUIDs using PHP, per the function found here
Now I want to store that in a MySQL database. What is the best/most efficient MySQL field format for storing UUID v4?
I currently have varchar(256), but I'm pretty sure that's much larger than necessary. I've found lots of almost-answers, but they're generally ambiguous about what form of UUID they're referring to, so I'm asking for the specific format.
Store it as VARCHAR(36) if you're looking to have an exact fit, or VARCHAR(255) which is going to work out with the same storage cost anyway. There's no reason to fuss over bytes here.
Remember VARCHAR fields are variable length, so the storage cost is proportional to how much data is actually in them, not how much data could be in them.
Storing it as BINARY is extremely annoying: the values are unprintable and can show up as garbage when running queries. There's rarely a reason to use the literal binary representation; human-readable values can be copy-pasted and worked with easily.
Some other platforms, like Postgres, have a proper UUID column which stores it internally in a more compact format, but displays it as human-readable, so you get the best of both approaches.
If you always have a UUID for each row, you could store it as CHAR(36) and save 1 byte per row over VARCHAR(36).
uuid CHAR(36) CHARACTER SET ascii
In contrast to CHAR, VARCHAR values are stored as a 1-byte or 2-byte
length prefix plus data. The length prefix indicates the number of
bytes in the value. A column uses one length byte if values require no
more than 255 bytes, two length bytes if values may require more than
255 bytes.
https://dev.mysql.com/doc/refman/5.7/en/char.html
Though be careful with CHAR: it always consumes the full defined length, even if the field is left empty. Also, make sure to use the ascii character set, because CHAR otherwise plans for the worst-case scenario (i.e. 3 bytes per character in utf8, 4 in utf8mb4):
[...] MySQL must reserve four bytes for each character in a CHAR
CHARACTER SET utf8mb4 column because that is the maximum possible
length. For example, MySQL must reserve 40 bytes for a CHAR(10)
CHARACTER SET utf8mb4 column.
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
The question is about storing a UUID in MySQL.
Since version 8.0, MySQL lets you use BINARY(16) with automatic conversion via the UUID_TO_BIN/BIN_TO_UUID functions:
https://mysqlserverteam.com/mysql-8-0-uuid-support/
Be aware that MySQL also has a fast way to generate UUIDs as a primary key:
INSERT INTO t VALUES(UUID_TO_BIN(UUID(), true))
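If you insert with the swap flag set to true, remember to read back with the same flag (a sketch; assuming the column is named uuid_col):
SELECT BIN_TO_UUID(uuid_col, true) FROM t;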
Most efficient is definitely BINARY(16): storing the human-readable characters uses over double the storage space and means bigger indices and slower lookup. If your data is small enough that storing it as text doesn't hurt performance, you probably don't need UUIDs over boring integer keys.
Storing the raw bytes is really not as painful as others suggest, because any decent DB admin tool will display/dump the octets as hexadecimal rather than literal bytes of "text". You shouldn't need to be looking up UUIDs manually in the DB; if you have to, HEX() and x'deadbeef01' literals are your friends.
It is trivial to write a function in your app, like the one you referenced, to deal with this for you. You could probably even do it in the database as virtual columns and stored procedures, so the app never bothers with the raw data.
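For instance, a manual lookup against a BINARY(16) column stays perfectly readable with exactly those helpers (a sketch; the table name is hypothetical):
SELECT HEX(id) FROM t WHERE id = UNHEX('0800fc577294c34e0b28ad2839435945');
-- or, equivalently, with a hex literal:
SELECT HEX(id) FROM t WHERE id = x'0800fc577294c34e0b28ad2839435945';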
I would separate the UUID generation logic from the display logic to ensure that existing data are never changed and errors are detectable:
function guidv4($prettify = false)
{
    // static initializers must be constant expressions in PHP < 8.3,
    // so the feature test is done with a plain assignment
    $native = function_exists('random_bytes');
    $data = $native ? random_bytes(16) : openssl_random_pseudo_bytes(16);
    $data[6] = chr(ord($data[6]) & 0x0f | 0x40); // set version to 0100 (v4)
    $data[8] = chr(ord($data[8]) & 0x3f | 0x80); // set variant bits 6-7 to 10
    if ($prettify) {
        return guid_pretty($data);
    }
    return $data;
}
function guid_pretty($data)
{
    return strlen($data) == 16 ?
        vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($data), 4)) :
        false;
}
function guid_ugly($data)
{
    // strip everything that is not a hex digit, then decode
    $data = preg_replace('/[^[:xdigit:]]+/', '', $data);
    return strlen($data) == 32 ? hex2bin($data) : false;
}
Edit: If you only need the column pretty when reading the database, a statement like the following is sufficient:
ALTER TABLE test ADD uuid_pretty CHAR(36) GENERATED ALWAYS AS (CONCAT_WS('-', LEFT(HEX(uuid_ugly), 8), SUBSTR(HEX(uuid_ugly), 9, 4), SUBSTR(HEX(uuid_ugly), 13, 4), SUBSTR(HEX(uuid_ugly), 17, 4), RIGHT(HEX(uuid_ugly), 12))) VIRTUAL;
This works like a charm for me in MySQL 8.0.26:
create table t (
uuid BINARY(16) default (UUID_TO_BIN(UUID()))
);
When querying you may use
select BIN_TO_UUID(uuid) uuid from t;
The result is:
# uuid
'8c45583a-0e1f-11ec-804d-005056219395'
The most space-efficient would be BINARY(16) or two BIGINT UNSIGNED.
The former might give you headaches because manual queries do not (in a straightforward way) give you readable/copyable values.
The latter might give you headaches because of having to map between one value and two columns.
If this is a primary key, I would definitely not waste any space on it, as it becomes part of every secondary index as well. In other words, I would choose one of these types.
For performance, the randomness of UUIDv4 values will hurt severely. This applies when the UUID is your primary key or if you do a lot of range queries on it. Your insertions into the primary index will be all over the place rather than all at (or near) the end, and your data loses temporal locality, which is a helpful property in various cases.
My main improvement would be to use something similar to a UUID v1, which uses a timestamp as part of its data, and ensure that the timestamp is in the highest bits. For example, the UUID might be composed something like this:
Timestamp | Machine Identifier | Counter
This way, we get a locality similar to auto-increment values.
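Note that MySQL 8.0's UUID_TO_BIN() can do this reordering for you: its optional swap flag moves the high timestamp bits of the v1 UUIDs produced by UUID() to the front. A sketch (table and column names are mine; DEFAULT expressions require 8.0.13+):
CREATE TABLE events (
    id BINARY(16) NOT NULL DEFAULT (UUID_TO_BIN(UUID(), 1)) PRIMARY KEY,
    created_at DATETIME NOT NULL
);
-- read back with the same swap flag to recover the original UUID text
SELECT BIN_TO_UUID(id, 1) FROM events;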
This could be useful if you use the BINARY(16) data type (the table name is backticked because TABLE is a reserved word):
INSERT INTO `table` (UUID) VALUES
(UNHEX(REPLACE(UUID(), '-', '')));
I just found a nice article that goes into more depth on these topics: https://www.xaprb.com/blog/2009/02/12/5-ways-to-make-hexadecimal-identifiers-perform-better-on-mysql/
It covers the storage of values, with the same options already expressed in the different answers on this page:
One: watch out for character set
Two: use fixed-length, non-nullable values
Three: Make it BINARY
But also adds some interesting insight about indexes:
Four: use prefix indexes
In many but not all cases, you don’t need to index the full length of
the value. I usually find that the first 8 to 10 characters are
unique. If it’s a secondary index, this is generally good enough. The
beauty of this approach is that you can apply it to existing
applications without any need to modify the column to BINARY or
anything else—it’s an indexing-only change and doesn’t require the
application or the queries to change.
Note that the article doesn't tell you how to create such a "prefix" index. Looking at MySQL documentation for Column Indexes we find:
[...] you can create an index that uses only the first N characters of the
column. Indexing only a prefix of column values in this way can make
the index file much smaller. When you index a BLOB or TEXT column, you
must specify a prefix length for the index. For example:
CREATE TABLE test (blob_col BLOB, INDEX(blob_col(10)));
[...] the prefix length in
CREATE TABLE, ALTER TABLE, and CREATE INDEX statements is interpreted
as number of characters for nonbinary string types (CHAR, VARCHAR,
TEXT) and number of bytes for binary string types (BINARY, VARBINARY,
BLOB).
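Applied to the hash columns discussed on this page, such a prefix index is a one-line change (a sketch with hypothetical table/column names; 10 characters follows the article's 8-to-10 suggestion):
ALTER TABLE foobar ADD INDEX idx_id_prefix (id(10));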
Five: build hash indexes
What you can do is generate a checksum of the values and index that.
That’s right, a hash-of-a-hash. For most cases, CRC32() works pretty
well (if not, you can use a 64-bit hash function). Create another
column. [...] The CRC column isn’t guaranteed to be unique, so you
need both criteria in the WHERE clause or this technique won’t work.
Hash collisions happen quickly; you will probably get a collision with
about 100k values, which is much sooner than you might think—don’t
assume that a 32-bit hash means you can put 4 billion rows in your
table before you get a collision.
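A sketch of that hash-of-a-hash technique, with hypothetical names:
ALTER TABLE urls
    ADD COLUMN url_crc INT UNSIGNED NOT NULL DEFAULT 0,
    ADD INDEX idx_url_crc (url_crc);
UPDATE urls SET url_crc = CRC32(url);
-- the CRC is not unique, so the WHERE clause must repeat both criteria
SELECT * FROM urls
WHERE url_crc = CRC32('http://example.com')
  AND url = 'http://example.com';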
This is a fairly old post, but still relevant and it comes up in search results often, so I will add my answer to the mix. Since you already have to use a trigger or your own call to UUID() in your query, here are a pair of functions that I use to keep the UUID as text for easy viewing in the database while reducing the footprint from 36 down to 24 characters (a 33% saving).
delimiter //
DROP FUNCTION IF EXISTS `base64_uuid`//
DROP FUNCTION IF EXISTS `uuid_from_base64`//
CREATE DEFINER='root'@'localhost' FUNCTION base64_uuid() RETURNS varchar(24)
DETERMINISTIC
BEGIN
    /* converting into base64 is easy: just turn the uuid into binary and base64-encode it */
    return to_base64(unhex(replace(uuid(),'-','')));
END//
CREATE DEFINER='root'@'localhost' FUNCTION uuid_from_base64(base64_uuid varchar(24)) RETURNS varchar(36)
DETERMINISTIC
BEGIN
    /* getting the uuid back from the base64 version requires a little more work, as we need to put the dashes back */
    set @hex = hex(from_base64(base64_uuid));
    return lower(concat(substring(@hex,1,8),'-',substring(@hex,9,4),'-',substring(@hex,13,4),'-',substring(@hex,17,4),'-',substring(@hex,-12)));
END//
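Back at the default delimiter, a quick round-trip check of the pair:
delimiter ;
SELECT base64_uuid();                   -- a 24-character compact ID
SELECT uuid_from_base64(base64_uuid()); -- recovers a canonical 36-character UUID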

Does always using 255 chars for VARCHAR fields decrease performance?

I usually allow the maximum number of chars possible for VARCHAR fields, so in most cases I set 255 even though a column may only ever hold 16 chars...
Does this decrease performance for my database?
When it comes to storage, a VARCHAR(255) column will take up 1 byte to store the length of the actual value plus the bytes required to store the actual value.
For a latin1 VARCHAR(255) column, that's at most 256 bytes. For a UTF8 column, where each character can take up to 3 bytes (though rarely), the maximum size is 766 bytes. Since the maximum index length for a single column in InnoDB is 767 bytes, this is perhaps why some declare 255 as the maximum supported column length.
So, again, when storing the value, it only takes up as much room as is actually needed.
However, if the column is indexed, the index automatically allocates the maximum possible size so that each node in the index has enough room to store any possible value. When searching through an index, MySQL loads the nodes in chunks of a specific byte size. Larger nodes mean fewer nodes per read, which means it takes longer to search the index.
MySQL will also use the maximum size when storing the values in a temp table for sorting.
So, even if you aren't using indexes, but are ever performing a query that can't utilize an index for sorting, you will get a performance hit.
Therefore, if performance is your goal, setting any VARCHAR column to 255 characters should not be a rule of thumb. Instead, you should use the minimum required.
There may be edge cases where you'd rather suffer the performance every day so that you never have to lock a table completely to increase the size of a column, but I don't think that's the norm.
One possible exception is if you are joining on a VARCHAR column between two tables. MySQL says:
MySQL can use indexes on columns more efficiently if they are declared
as the same type and size.
In that case, you might use the max size between the two.
Whenever you're talking about "performance" you can only find out one way: Benchmarking.
In theoretical terms there's no difference between VARCHAR(20) and VARCHAR(255) if they're both populated with the same data. Keep in mind that if you get your length wrong you will have massive truncation problems, and in non-strict mode MySQL chops data to fit without raising an error.
I try to avoid setting limits on VARCHAR columns unless the data would be completely invalid if it were longer. For instance, two-character ISO country codes can be stored in VARCHAR(2) because longer strings are meaningless. For other things, especially names or phone numbers, limiting the length is potentially harmful.
Still, you will want to test any schema you create to be sure it meets your performance requirements. I expect you'd have a hard time detecting any difference at all between VARCHAR(25) and VARCHAR(16).
There are two ways in which this will decrease performance.
If you're loading those columns many, many times, performing a join on the column, or doing anything else that means they are accessed a large number of times. The number of times depends on your machine, but think on the order of millions.
If you're always filling the field (using all 20 chars in a VARCHAR(20)), then the length checks add a little overhead whenever you perform an insert.
The best way to determine this, though, is to benchmark your database.

Which is better: storing the hash value, or the BIGINT from which the hash value is generated?

I have a table in which a column stores an image src, which is a hash value generated from microtime(). Now I have two choices: store the hash value directly in the database, or store the BIGINT microtime from which the image name is derived. Which would make my DB faster?
We have to analyze this from all sides to assess what speed penalties are incurred.
I will make a few assumptions:
this data will be used as an identifier (primary key, unique key, composite key);
this data is used for searches and joins;
you are using a hashing algorithm such as SHA1, which yields a 40-character string of hex-encoded data (MD5 yields a 32-character string of hex-encoded data; everything said below can be adapted to MD5 if that's what you're using);
you may be interested in converting the hex values of the hash into binary to reduce the storage required by half and to improve comparison speed;
Inserting and Updating on the application side:
As @Namphibian stated, this is composed of 2 operations for the BIGINT versus 3 operations for the CHAR.
But the speed difference, in my opinion, really isn't that big. You can run 10,000,000 continuous calculations (in a while loop) and benchmark them to find out the real difference between them.
Also a speed difference in the application code affects users linearly, while speed differences in the DB affect users nonlinearly when traffic increases because overlapping writes have to wait for each other and some reads have to wait for writes to finish.
Inserting and Updating on the DB side:
This is almost the same for a BIGINT as it is for a CHAR(40) or a BINARY(20), because the more serious time consumption is waiting for access to the disk rather than actually writing to it.
Selecting and Joining on the DB side:
This is always faster for a BIGINT compared to a CHAR(40) or a BINARY(20) for two reasons:
BIGINT is stored in 8 bytes while CHAR(40) is stored in 40 bytes and BINARY(20) in 20 bytes;
BIGINT's serially increasing nature makes it predictable and easy to compare and sort.
The second best option is BINARY(20), because it saves some space and is easier to compare due to its reduced length.
Both BINARY(20) and CHAR(40) are the result of the hashing mechanism and are randomized, so comparing and sorting take longer on average, because randomized data in a B-tree index needs more tree traversals to fetch (I mean that in the context of multiple values, not for one single value).
An important scientific principle may apply here: don't lose the original data. You never know what you may need it for.
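In that spirit, a sketch that stores the original value and lets the database derive the hash, so nothing is lost (MySQL 5.7+ generated column; all names here are hypothetical):
CREATE TABLE images (
    id BIGINT UNSIGNED NOT NULL PRIMARY KEY,                           -- original microtime-based value
    src_hash BINARY(20) GENERATED ALWAYS AS (UNHEX(SHA1(id))) STORED   -- derived 160-bit digest
);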

What is the most efficient MySQL data type for long bitstrings?

I have a need to store lengthy bit-strings, which could be as long as 32768 bits, in a MySQL table. This data will not need to be indexed or full-text searched at any time. If I have read correctly, this size should be well within both my max_packet_size as well as the row-size limit of ~65k.
Ideally I would like to store the strings (and INSERT them) in 0b format, but this is not a requirement...anything that will give me essentially 1:1 data/size on disk would be great.
BLOBs do not seem to do the job well enough, as a string comprised of only ones and zeroes ('010101010101') is treated no differently than normal text and costs me L + 2 bytes. BIT() would be perfect, but is limited to a maximum length of 64 bits.
Though much of the data (90%+) would be sufficiently represented within an unsigned Bigint, the remaining 10% of rows entice me to find a more elegant solution than splitting them up logically (i.e., searching a secondary table if not found in the first, secondary table using BLOBs for remaining 10% rows, etc.).
An added bonus would be any type that permits bitwise operations, but if not, this is just as easily done outside the MySQL server.
What is the most efficient data type for this purpose?
I would say it mainly depends on your access pattern. If you can afford to read/write the whole bitstring at each access, then a varbinary(4096) will work fine and be quite compact (only 2 bytes of overhead for the whole field). In this model, one bit on application side is really represented by one bit in the data storage, and it is up to the client application to interpret it as a bitstring (performing bitwise operations, etc ...)
If you want to optimize a bit more, you can imagine a table with a bigint and a varbinary(4096):
create table dummy (
dummykey int not null,
bit1 bigint null,
bit2 varbinary(4096) null,
primary key(dummykey)
);
Only one of the two fields is not null for a given record. If bit1 is not null, then it can store 64 bits. For larger bitstrings, bit1 is null, and bit2 is used instead. The client application has to be smart enough to handle all bitwise operations (paying special attention to signed/unsigned issues with bit1).
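Usage might look like this; the application picks the column based on the bitstring's length:
INSERT INTO dummy (dummykey, bit1, bit2) VALUES (1, b'101010', NULL);                 -- 64 bits or fewer
INSERT INTO dummy (dummykey, bit1, bit2) VALUES (2, NULL, UNHEX(REPEAT('AB', 4096))); -- 4096 bytes = 32768 bits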
I guess the BLOB type is what you need. It can store binary strings of up to 2^16 - 1 bytes and has an overhead of 2 bytes per record (if L is the length in bytes of the value, L + 2 bytes is its size on disk).
Then, if you really want to optimize, use two tables, one with BLOB and the other with TINYBLOB (strings up to 2^8 - 1 bytes, 1 byte of overhead), then UNION them together in a VIEW or during a SELECT.
If you want to optimize even more, use a third table with BIGINT (this will allow storing binary strings up to 58 bits, since the remaining 6 will be needed to store the length of the binary string).
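A sketch of that packing with MySQL's bit operators, putting the 6-bit length in the high bits:
SET @packed = (6 << 58) | b'101010';                            -- length 6, data bits 101010
SELECT @packed >> 58 AS len, @packed & ((1 << 58) - 1) AS bits; -- unpack length and data again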

Decimal VS Int in MySQL?

Is there any performance difference between the DECIMAL(10,0) UNSIGNED type and the INT(10) UNSIGNED type?
It may depend on the version of MySQL you are using. See here.
Prior to MySQL 5.0.3, the DECIMAL type was stored as a string and would typically be slower.
However, since MySQL 5.0.3 the DECIMAL type is stored in a binary format, so with a DECIMAL of the size above there may not be much difference in performance.
The main performance issue would have been the amount of space taken up by the different types (with DECIMAL being slower). With MySQL 5.0.3+ this appears to be less of an issue, but if you will be performing numeric calculations on the values as part of the query, there may be some performance difference. This may be worth testing, as there is no indication in the documentation that I can see.
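BENCHMARK() offers the same kind of rough comparison used earlier on this page (the absolute numbers depend entirely on your hardware and version):
SELECT BENCHMARK(10000000, CAST(123456789 AS DECIMAL(10,0)) + 1);
SELECT BENCHMARK(10000000, CAST(123456789 AS UNSIGNED) + 1);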
Edit:
With regard to int(10) unsigned, I took this at face value as just being a 4-byte INT. However, this has a maximum value of 4294967295, which strictly doesn't provide the same range of numbers as a DECIMAL(10,0) unsigned.
As @Unreason pointed out, you would need to use a BIGINT to cover the full range of 10-digit numbers, pushing the size up to 8 bytes.
A common mistake when specifying numeric column types in MySQL is to think the number in the brackets has an impact on the size of the number that can be stored. It doesn't. The number range is based purely on the column type and whether it is signed or unsigned. The number in the brackets is a display width for results and has no impact on the values stored in the column; it won't even affect the display of results unless you also specify the ZEROFILL option on the column.
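For example, the bracketed number changes nothing about what fits in the column (a sketch; note that integer display widths are deprecated as of MySQL 8.0):
CREATE TABLE widths (a INT(4) UNSIGNED, b INT(10) UNSIGNED);
INSERT INTO widths VALUES (4294967295, 4294967295);  -- both columns accept the full INT UNSIGNED maximum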
According to the MySQL data storage requirements, your DECIMAL will need:
DECIMAL(10,0): 4 bytes for 9 digits and 1 byte for the remaining 10th digit, so five bytes in total (assuming my reading of the documentation is correct).
INT(10): will actually need a BIGINT, which is 8 bytes.
The difference is that DECIMAL is packed, and some operations on that type might be slower than on normal INT types, which map directly to machine-represented numbers.
Still, run your own tests to confirm the above reasoning.
EDIT:
I noticed that I did not elaborate on the obvious point: assuming the logic above is sound, the BIGINT variant needs 60% more space (8 bytes versus 5).
However this does not directly translate to penalties due to the fact that data is normally not written byte by byte. In case of selects/updates of many rows you should see the performance loss/gain, but in case of selecting/updating a small number of rows the filesystem will fetch blocks from the disk(s) which will normally get/write multiple columns anyway.
The size (and speed) of indexes might be more directly impacted.
However, the question on how the packing influences various operations still remains open.
According to this similar question, yes, there is potentially a big performance hit because of the difference in the way DECIMAL and INT values are handled by the CPU when doing calculations.
See: Is there a performance hit using decimal data types (MySQL / Postgres)
I doubt such a difference can be performance-related at all.
Most performance issues are tied to proper database design and indexing, with server/hardware tuning as the next level.