MySQL CAST() causes significant performance hit

So I ran the following in the MySQL console as a control test to see what was holding back the speed of my query.
SELECT bbva_deductions.ded_code, SUBSTRING_INDEX(bbva_deductions.employee_id, '-', -1) AS tt_emplid,
       bbva_job.paygroup, bbva_job.file_nbr, bbva_deductions.ded_amount
FROM bbva_deductions
LEFT JOIN bbva_job
  ON CAST(SUBSTRING_INDEX(bbva_deductions.employee_id, '-', -1) AS UNSIGNED) = bbva_job.emplid
LIMIT 500
It consistently took around 4 seconds to run, which seems very high for only 500 rows. Simply removing the CAST within the JOIN reduced that to just 0.01 seconds.
In this context, why on earth is CAST so slow?
[EXPLAIN output for this query, for the query without the CAST, and EXPLAIN EXTENDED output were attached as screenshots.]

As documented under How MySQL Uses Indexes:
MySQL uses indexes for these operations:
[ deletia ]
To retrieve rows from other tables when performing joins. MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. For example, VARCHAR(10) and CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.
Comparison of dissimilar columns may prevent use of indexes if values cannot be compared directly without conversion. Suppose that a numeric column is compared to a string column. For a given value such as 1 in the numeric column, it might compare equal to any number of values in the string column such as '1', ' 1', '00001', or '01.e1'. This rules out use of any indexes for the string column.
In your case, you are attempting to join on a comparison between a substring (of a string column in one table) and a string column in another table. An index can be used for this operation; however, the comparison is performed lexicographically (i.e. treating the operands as strings, even if they represent numbers).
By explicitly casting one side to an integer, the comparison is performed numerically (as desired), but this requires MySQL to implicitly convert the values in the string column, and it is therefore unable to use that column's index.
You have hit this road bump because your schema is poorly designed. You should strive to ensure that all columns:
are encoded using the data types that are most relevant to their content; and
contain only a single piece of information — see Is storing a delimited list in a database column really that bad?
At the very least, your bbva_job.emplid should be an integer; and your bbva_deductions.employee_id should be split so that its parts are stored in separate (appropriately-typed) columns. With appropriate indexes, your query will then be considerably more performant.
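As a minimal sketch of that redesign, assuming the trailing part of employee_id is always numeric (the new column name is hypothetical):
ALTER TABLE bbva_job MODIFY emplid INT UNSIGNED NOT NULL;
ALTER TABLE bbva_deductions
    ADD COLUMN emplid INT UNSIGNED,
    ADD INDEX idx_emplid (emplid);
UPDATE bbva_deductions
    SET emplid = CAST(SUBSTRING_INDEX(employee_id, '-', -1) AS UNSIGNED);
The join then becomes ON bbva_deductions.emplid = bbva_job.emplid, with no per-row conversion and full use of the indexes.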

Related

Performance (not space) differences between storing dates in DATE vs. VARCHAR fields

We have a MySQL-based system that stores date values in VARCHAR(10) fields, as SQL-format strings (e.g. '2021-09-07').
I am aware of the large additional space requirements of using this format over the native DATE format, but I have been unable to find any information about performance characteristics and how the two formats would differ in terms of speed. I would assume that working with strings would be slower for non-indexed fields. However, if the field is indexed I could imagine that the indexes on string fields could potentially yield much larger improvements than those on date fields, and possibly even overtake them in terms of performance, for certain tasks.
Can anyone advise on the speed implications of choosing one format over the other, in relation to the following situations (assuming the field is indexed):
Use of the field in a JOIN condition
Use of the field in a WHERE clause
Use of the field in an ORDER BY clause
Updating the index on INSERT/UPDATE
I would like to migrate the fields to the more space-efficient format but want to get some information about any potential performance implications (good or bad) that may apply. I plan to do some profiling of my own if I go down this route, but don't want to waste my time if there is a clear, known advantage of one format over the other.
Note that I am also interested in the same question for VARCHAR(19) vs. DATETIME, particularly if it yields a different answer.
Additional space is a performance issue. Databases work by reading data pages (and index pages) from disk. Bigger records require more data pages to store. And that has an effect on performance.
In other words, your date column is 11 bytes (VARCHAR(10) plus a length byte) versus 3 bytes for DATE. If you had a table with only ten such columns, that would be 110 bytes versus 30, requiring more than three times as much time to scan the data.
As for your operations, if you have indexes that are being used, then the indexes will also be larger. Because of the way that MySQL handles collations for columns, comparing string values is (generally) going to be less efficient than comparing binary values.
Certain operations such as extracting the date components are probably comparable (a string function versus a date function). However, other operations such as extracting the day of the week or the month name probably require converting to a date and then to a string, which is more expensive.
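As a rough illustration of that last point, with hypothetical table and column names:
-- VARCHAR storage: each value must be interpreted as a date before the function applies
SELECT DAYNAME(CAST(date_str AS DATE)) FROM events;
-- native DATE storage: the function applies directly
SELECT DAYNAME(date_col) FROM events;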
More bytes being compared --> slightly longer to do the comparison.
More bytes in the column --> more space used --> slightly slower operations because of more bulk on disk, in cache, etc.
This applies to either 1+10 or 1+19 bytes for VARCHAR versus 3 for DATE or 5 for DATETIME. The "1" is for the 'length' of VARCHAR; if the strings are a consistent length, you could switch to CHAR.
A BTree is essentially the same whether the value is an Int, Float, Decimal, Enum, Date, or Char. VARCHAR is slightly different in that it has a 'length'; I don't see this as a big issue in the grand scheme of things.
The number of rows that need to be fetched is the most important factor in the speed of a query. Other things, such as datatype size, come further down the list of what slows things down.
There are lots of DATETIME functions that you won't be able to use. This may lead to a table scan instead of using an INDEX, which would have a big impact on performance. Let's see some specific SQL queries.
Using CHARACTER SET ascii COLLATE ascii_bin for your date/datetime columns would make comparisons slightly faster.
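If the migration goes ahead, the column change itself is a one-liner, since valid 'YYYY-MM-DD' strings coerce directly to DATE during the conversion; a sketch with hypothetical names (invalid strings would abort the ALTER under strict mode, so validate the data first):
ALTER TABLE events MODIFY event_date DATE NOT NULL;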

Impact on MySQL SELECT query if we search a string with length greater than column's length

I have a column "name" of type VARCHAR(80).
If I search for a string longer than 80 characters, what is the behaviour of the SELECT operation in this case?
Apart from the round trip to the DB, will it scan the whole table, or just return immediately because the search string is already longer than 80 characters?
SELECT * FROM TABLE WHERE name = "longgg... ...string";
I need this knowledge because, in the existing codebase, there is a variable which is used in all layers of the MVC. This variable is meant for a different column with a different length. To save time and avoid code redundancy, I want to reuse the same variable, which is validated against a length greater than 80.
After the discussion, I am going to add length validation in the application instead of depending on DB validation, since all the rows are scanned!
One more wise thought from the comments: if the column is indexed, the whole table is not scanned. I verified this using EXPLAIN.
All rows are scanned because there is no index. A CHAR/VARCHAR index stores the length as well (it is included in EXPLAIN's key_len). So if an index is created on the column, the query should not need to compare anything: based on the index alone, MySQL can tell that no rows match the criterion.
I don't have a documentation reference for this, but from my observations while testing, MySQL will cast both sides of the equality expression to the length of the string literal on the RHS. That is, MySQL will execute the following query:
SELECT *
FROM yourTable
WHERE CAST(name AS CHAR(100)) = CAST('longgg... ...string' AS CHAR(100));
-- assuming it has a length of 100
This comparison would generally fail, since MySQL will not pad a string on the LHS which is less than 100 characters, meaning that the lengths would not generally even match.
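A quick way to observe the indexed case yourself, with hypothetical names; with the index in place, EXPLAIN should show access via idx_name rather than a full table scan for the over-length literal:
CREATE TABLE t (name VARCHAR(80), INDEX idx_name (name));
EXPLAIN SELECT * FROM t WHERE name = REPEAT('x', 100);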

Store UUID v4 in MySQL

I'm generating UUIDs using PHP, per the function found here
Now I want to store that in a MySQL database. What is the best/most efficient MySQL field format for storing UUID v4?
I currently have varchar(256), but I'm pretty sure that's much larger than necessary. I've found lots of almost-answers, but they're generally ambiguous about what form of UUID they're referring to, so I'm asking for the specific format.
Store it as VARCHAR(36) if you're looking to have an exact fit, or VARCHAR(255) which is going to work out with the same storage cost anyway. There's no reason to fuss over bytes here.
Remember VARCHAR fields are variable length, so the storage cost is proportional to how much data is actually in them, not how much data could be in them.
Storing it as BINARY is extremely annoying: the values are unprintable and can show up as garbage when running queries. There's rarely a reason to use the literal binary representation. Human-readable values can be copy-pasted and worked with easily.
Some other platforms, like Postgres, have a proper UUID column which stores it internally in a more compact format, but displays it as human-readable, so you get the best of both approaches.
If you always have a UUID for each row, you could store it as CHAR(36) and save 1 byte per row over VARCHAR(36).
uuid CHAR(36) CHARACTER SET ascii
In contrast to CHAR, VARCHAR values are stored as a 1-byte or 2-byte
length prefix plus data. The length prefix indicates the number of
bytes in the value. A column uses one length byte if values require no
more than 255 bytes, two length bytes if values may require more than
255 bytes.
https://dev.mysql.com/doc/refman/5.7/en/char.html
Be careful with CHAR, though: it will always consume the full defined length, even if the field is left empty. Also, make sure to use the ASCII character set, as CHAR would otherwise plan for the worst-case scenario (i.e. 3 bytes per character in utf8, 4 in utf8mb4):
[...] MySQL must reserve four bytes for each character in a CHAR
CHARACTER SET utf8mb4 column because that is the maximum possible
length. For example, MySQL must reserve 40 bytes for a CHAR(10)
CHARACTER SET utf8mb4 column.
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
The question is about storing a UUID in MySQL.
Since version 8.0 of MySQL you can use BINARY(16) with automatic conversion via the UUID_TO_BIN/BIN_TO_UUID functions:
https://mysqlserverteam.com/mysql-8-0-uuid-support/
Be aware that MySQL also has a fast way to generate UUIDs as a primary key:
INSERT INTO t VALUES(UUID_TO_BIN(UUID(), true))
Most efficient is definitely BINARY(16): storing the human-readable characters uses over double the storage space, and means bigger indices and slower lookup. If your data is small enough that storing it as text doesn't hurt performance, you probably don't need UUIDs over boring integer keys. Storing the raw value is really not as painful as others suggest, because any decent db admin tool will display/dump the octets as hexadecimal rather than literal bytes of "text". You shouldn't need to be looking up UUIDs manually in the db; if you have to, HEX() and x'deadbeef01' literals are your friends. It is trivial to write a function in your app, like the one you referenced, to deal with this for you. You could probably even do it in the database as virtual columns and stored procedures, so the app never bothers with the raw data.
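For example, a manual lookup against a hypothetical BINARY(16) column using the helpers mentioned above (the value shown is an arbitrary example):
SELECT HEX(uuid) FROM t WHERE uuid = x'f6d608579d334b169b107bcb420e382f';
-- or, equivalently:
SELECT HEX(uuid) FROM t WHERE uuid = UNHEX('f6d608579d334b169b107bcb420e382f');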
I would separate the UUID generation logic from the display logic to ensure that existing data are never changed and errors are detectable:
function guidv4($prettify = false)
{
    static $native = function_exists('random_bytes');
    $data = $native ? random_bytes(16) : openssl_random_pseudo_bytes(16);
    $data[6] = chr(ord($data[6]) & 0x0f | 0x40); // set version to 0100
    $data[8] = chr(ord($data[8]) & 0x3f | 0x80); // set bits 6-7 to 10
    if ($prettify) {
        return guid_pretty($data);
    }
    return $data;
}

function guid_pretty($data)
{
    return strlen($data) == 16 ?
        vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($data), 4)) :
        false;
}

function guid_ugly($data)
{
    $data = preg_replace('/[^[:xdigit:]]+/', '', $data);
    return strlen($data) == 32 ? hex2bin($data) : false;
}
Edit: If you only need the column pretty when reading the database, a statement like the following is sufficient:
ALTER TABLE test ADD uuid_pretty CHAR(36) GENERATED ALWAYS AS (CONCAT_WS('-', LEFT(HEX(uuid_ugly), 8), SUBSTR(HEX(uuid_ugly), 9, 4), SUBSTR(HEX(uuid_ugly), 13, 4), SUBSTR(HEX(uuid_ugly), 17, 4), RIGHT(HEX(uuid_ugly), 12))) VIRTUAL;
This works like a charm for me in MySQL 8.0.26:
create table t (
    uuid BINARY(16) default (UUID_TO_BIN(UUID()))
);
When querying you may use
select BIN_TO_UUID(uuid) uuid from t;
The result is:
# uuid
'8c45583a-0e1f-11ec-804d-005056219395'
The most space-efficient would be BINARY(16) or two BIGINT UNSIGNED.
The former might give you headaches because manual queries do not (in a straightforward way) give you readable/copyable values.
The latter might give you headaches because of having to map between one value and two columns.
If this is a primary key, I would definitely not waste any space on it, as it becomes part of every secondary index as well. In other words, I would choose one of these types.
For performance, the randomness of UUID v4 will hurt severely. This applies when the UUID is your primary key or if you do a lot of range queries on it. Your insertions into the primary index will be all over the place rather than all at (or near) the end. Your data loses temporal locality, which was a helpful property in various cases.
My main improvement would be to use something similar to a UUID v1, which uses a timestamp as part of its data, and ensure that the timestamp is in the highest bits. For example, the UUID might be composed something like this:
Timestamp | Machine Identifier | Counter
This way, we get a locality similar to auto-increment values.
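MySQL 8.0's UUID_TO_BIN/BIN_TO_UUID functions (mentioned in an earlier answer) implement exactly this idea for v1 UUIDs when their swap flag is set, moving the timestamp bits to the front; a minimal sketch (the expression default requires 8.0.13+):
CREATE TABLE t (
    id BINARY(16) PRIMARY KEY DEFAULT (UUID_TO_BIN(UUID(), 1))
);
-- read it back in dashed, human-readable form:
SELECT BIN_TO_UUID(id, 1) FROM t;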
This could be useful if you use the BINARY(16) data type:
INSERT INTO table (UUID) VALUES
(UNHEX(REPLACE(UUID(), "-","")))
I just found a nice article going into more depth on these topics: https://www.xaprb.com/blog/2009/02/12/5-ways-to-make-hexadecimal-identifiers-perform-better-on-mysql/
It covers the storage of values, with the same options already expressed in the different answers on this page:
One: watch out for character set
Two: use fixed-length, non-nullable values
Three: Make it BINARY
But also adds some interesting insight about indexes:
Four: use prefix indexes
In many but not all cases, you don’t need to index the full length of
the value. I usually find that the first 8 to 10 characters are
unique. If it’s a secondary index, this is generally good enough. The
beauty of this approach is that you can apply it to existing
applications without any need to modify the column to BINARY or
anything else—it’s an indexing-only change and doesn’t require the
application or the queries to change.
Note that the article doesn't tell you how to create such a "prefix" index. Looking at the MySQL documentation for Column Indexes, we find:
[...] you can create an index that uses only the first N characters of the
column. Indexing only a prefix of column values in this way can make
the index file much smaller. When you index a BLOB or TEXT column, you
must specify a prefix length for the index. For example:
CREATE TABLE test (blob_col BLOB, INDEX(blob_col(10)));
[...] the prefix length in
CREATE TABLE, ALTER TABLE, and CREATE INDEX statements is interpreted
as number of characters for nonbinary string types (CHAR, VARCHAR,
TEXT) and number of bytes for binary string types (BINARY, VARBINARY,
BLOB).
Five: build hash indexes
What you can do is generate a checksum of the values and index that.
That’s right, a hash-of-a-hash. For most cases, CRC32() works pretty
well (if not, you can use a 64-bit hash function). Create another
column. [...] The CRC column isn’t guaranteed to be unique, so you
need both criteria in the WHERE clause or this technique won’t work.
Hash collisions happen quickly; you will probably get a collision with
about 100k values, which is much sooner than you might think—don’t
assume that a 32-bit hash means you can put 4 billion rows in your
table before you get a collision.
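A minimal sketch of that hash-index technique, with hypothetical table and column names:
ALTER TABLE lookup
    ADD COLUMN uuid_crc INT UNSIGNED NOT NULL,
    ADD INDEX idx_uuid_crc (uuid_crc);
UPDATE lookup SET uuid_crc = CRC32(uuid);
-- both criteria are required, since CRC32 values can collide:
SET @u = 'f6d60857-9d33-4b16-9b10-7bcb420e382f';
SELECT * FROM lookup WHERE uuid_crc = CRC32(@u) AND uuid = @u;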
This is a fairly old post but still relevant and comes up in search results often, so I will add my answer to the mix. Since you already have to use a trigger or your own call to UUID() in your query, here is a pair of functions that I use to keep the UUID as text for easy viewing in the database, while reducing the footprint from 36 down to 24 characters (a 33% saving).
delimiter //
DROP FUNCTION IF EXISTS `base64_uuid`//
DROP FUNCTION IF EXISTS `uuid_from_base64`//
CREATE definer='root'@'localhost' FUNCTION base64_uuid() RETURNS varchar(24)
DETERMINISTIC
BEGIN
    /* converting into base 64 is easy: just turn the uuid into binary and base64-encode it */
    return to_base64(unhex(replace(uuid(),'-','')));
END//
CREATE definer='root'@'localhost' FUNCTION uuid_from_base64(base64_uuid varchar(24)) RETURNS varchar(36)
DETERMINISTIC
BEGIN
    /* getting the uuid back from the base 64 version requires a little more work, as we need to put the dashes back */
    set @hex = hex(from_base64(base64_uuid));
    return lower(concat(substring(@hex,1,8),'-',substring(@hex,9,4),'-',substring(@hex,13,4),'-',substring(@hex,17,4),'-',substring(@hex,-12)));
END//
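Usage, once the functions above are installed (the values will vary per call):
SELECT base64_uuid() INTO @id;
SELECT uuid_from_base64(@id);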

Which one will be faster in MySQL: with BINARY or without BINARY?

Please explain: which one will be faster in MySQL for the following queries?
SELECT * FROM `userstatus` where BINARY Name = 'Raja'
[OR]
SELECT * FROM `userstatus` where Name = 'raja'
The DB entry for the Name field is 'Raja'.
I have 10,000 records in my DB. I tried with EXPLAIN, but both show the same execution time.
Your question does not make sense.
The collation of a column determines the layout of the index and whether tests will be case-sensitive or not.
If you cast a column, the cast will take time.
So logically the uncasted operation should be faster.
However, if the cast results in fewer rows being matched, then the casted operation will be faster, or the other way round.
This of course changes the whole problem and makes the comparison invalid.
A cast to BINARY makes the comparison case-sensitive, changing the nature of the test and very probably the number of hits.
My advice
Never worry about the speed of collations; the percentages are so small it is never worth bothering about.
The speed penalty from using SELECT * (a big no-no) will far outweigh the collation issues.
Start with putting in an index. That's a factor-10,000 speedup with a million rows.
Assuming that the Names field is a simple latin-1 text type, and there's no index on it, then the BINARY version of the query will be faster. By default, MySQL does case-insensitive comparisons, which means the field values and the value you're comparing against both get smashed into a single case (either all-upper or all-lower) and then compared. Doing a binary comparison skips the case conversion and does a raw 1:1 numeric comparison of each character value, making it a case-sensitive comparison.
Of course, that's just one very specific scenario, and it's unlikely to be met in your case. Too many other factors affect this, especially the presence of an index.
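If case-sensitive matching is actually the requirement, an alternative worth sketching is to give the column itself a binary collation, so ordinary comparisons become case-sensitive and can still use an index (hypothetical column definition; adjust the length and character set to the real schema):
ALTER TABLE userstatus MODIFY Name VARCHAR(80) CHARACTER SET latin1 COLLATE latin1_bin;
-- now case-sensitive and index-friendly:
SELECT * FROM userstatus WHERE Name = 'Raja';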

Using char index to find numeric values

I have a column on a mysql table that stores mostly numeric values, but sometimes strings. It's defined as VARCHAR(20). There is an index on this column for the first four characters.
ADD INDEX `refNumber` USING BTREE(`refNumber`(4));
Since the field is mostly numeric, it is useful for the user to be able to query for values that fall within a numeric range (e.g., 100 to 2000). If I use a numeric comparison, this index is not used.
WHERE refNumber BETWEEN 100 AND 2000
If I use a string comparison, I get some values I don't want (e.g., 10000 comes back when querying for a range of 100 to 2000).
WHERE refNumber BETWEEN '100' AND '2000'
Is there a good solution to this?
Note: there are some values that are recorded with zeros padded on the front like 0234, which should be returned when looking for values between 100 and 2000.
Three possibilities:
1) Separate the numeric values into their own column.
2) If you MUST keep things as they are, decide on a maximum length for the numbers and zero- or blank-pad them to that length.
3) I don't know if MySQL supports function-based indexes, but that might be an option. If so, write a function that returns the extracted numeric value and use that as the basis of the index (see the sketch below).
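On possibility 3: older MySQL versions have no function-based indexes, but 5.7+ can index a generated column, which achieves the same effect. A sketch with hypothetical names; the REGEXP guard stores NULL for non-numeric values, which a bare CAST would reject under strict mode:
ALTER TABLE refs
    ADD COLUMN refNumber_num INT UNSIGNED
        GENERATED ALWAYS AS (IF(refNumber REGEXP '^[0-9]+$', CAST(refNumber AS UNSIGNED), NULL)) STORED,
    ADD INDEX idx_refNumber_num (refNumber_num);
-- zero-padded values such as '0234' index as 234, matching the requirement:
SELECT * FROM refs WHERE refNumber_num BETWEEN 100 AND 2000;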
You can try using the string comparison first, so the index is used, and still apply the numeric comparison afterwards. It shouldn't slow things down much, since the second filter only applies to the small subset of rows that pass the first. Note, though, that the string bounds must cover every format you expect: zero-padded values like '0234' sort before '100', so the string range may need widening to include them.
WHERE refNumber BETWEEN '100' AND '2000' AND CAST(refNumber AS SIGNED INTEGER) BETWEEN 100 AND 2000