SQL: Is selecting Substring of TEXT field faster than whole value - mysql

If I have a TEXT column (up to 65,535 characters), is taking the first 500 characters faster than selecting the whole column? There is a LEFT function that can be used in this situation, but I'm wondering whether it improves performance or degrades it, since it is a function call after all.

In addition to Bill Karwin's answer..
You can easily check that with the BENCHMARK() function, which shows that there is no difference.
SET @a:= repeat(uuid(), 65535 / 32);
Query OK, 0 rows affected (0,001 sec)
SELECT BENCHMARK(10000000, LEFT(@a,500));
1 row in set (16,017 sec)
SELECT BENCHMARK(10000000, @a);
1 row in set (16,031 sec)
If you run a real SELECT against a table and return LEFT() of the column instead of the whole column, the query will of course be much faster overall, since the network traffic (and also the memory usage) is much smaller.
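To make the network-traffic point concrete, here is a minimal sketch. The articles table and its columns are hypothetical names chosen for illustration; they do not come from the question.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE articles (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body TEXT NOT NULL
);

-- Sends at most 500 characters of body per row to the client.
SELECT id, LEFT(body, 500) AS body_preview
FROM articles;

-- Sends the full TEXT value of every row to the client.
SELECT id, body
FROM articles;
```

Both statements make the server read the full TEXT value; the difference is only in how much data is returned to the client.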

The InnoDB storage engine has to read the full TEXT content regardless. The LEFT() function operates on the full string. It doesn't have any way of telling the storage engine to read only part of the string.
In an RDBMS where functions had intimate knowledge of the storage format, string functions like LEFT() could be optimized in clever ways.
But MySQL has a distinct plugin architecture to implement a variety of storage engines. The storage code is separate from storage-independent things like builtin string functions. So a string function has no opportunity to request part of a TEXT column.
The code that implements MySQL's LEFT() function is here: https://github.com/mysql/mysql-server/blob/8.0/sql/item_strfunc.cc#L1443-L1461
The only optimization is that it checks the length of the string. If the string is already shorter than the requested substring, it just returns the whole string. This implies that the full string must be available to check the length.
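The length check is easy to observe from any client. A small illustration with arbitrary string literals:

```sql
-- Requested length exceeds the string length: the whole string is returned.
SELECT LEFT('abc', 500);    -- 'abc'

-- Requested length is shorter than the string: only the prefix is returned.
SELECT LEFT('abcdef', 3);   -- 'abc'
```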

Related

Will a binary operator used in group by prevent the use of an index for optimization?

i.e. there is an index for invoke_statistics.method
SELECT MAX(`t0`.`method`) AS `d0`,
SUM(`t0`.`success`) AS `m0`
FROM `invoke_statistics` AS `t0`
GROUP BY BINARY `t0`.`method`
LIMIT 20000
Will the BINARY operator used in the GROUP BY clause prevent the use of the index for optimization?
If yes, what is the recommended way to group by a string field using strict string comparison instead of BINARY, considering I have no permission to change the table definition?
This is a bit long for a comment.
I don't understand this query. Why not write:
SELECT BINARY t0.method d0, SUM(t0.success) AS m0
FROM invoke_statistics as t0
GROUP BY BINARY t0.method;
The initial MAX() shouldn't be doing anything (how can two values in a column be the same in the binary representation but different in their actual representation?).
Then, to answer your question, MySQL does take collation into account when creating indexes -- it has to, because collations define ordering. Because BINARY changes the collation, I would expect it to preclude index usage. This is not a 100% certainty; it is an expectation.
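Since this is only an expectation, the practical way to settle it is to compare the query plans. A sketch using the table and columns from the question; what EXPLAIN reports depends on the MySQL version and the actual index definition, so no output is shown here.

```sql
-- Plan for grouping on the BINARY-converted value.
EXPLAIN
SELECT BINARY t0.method AS d0, SUM(t0.success) AS m0
FROM invoke_statistics AS t0
GROUP BY BINARY t0.method;

-- Plan for grouping on the column with its own collation, for comparison.
EXPLAIN
SELECT t0.method AS d0, SUM(t0.success) AS m0
FROM invoke_statistics AS t0
GROUP BY t0.method;
```

If the first plan shows "Using temporary" (possibly with "Using filesort") in the Extra column while the second does not, the index is not being used for the grouping.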

MySQL hex() vs unhex()

When storing binary data in MySQL I use the hex() and unhex() functions. But there are two ways I can search on binary data:
Method 1
select * from tbl
where
id=unhex('ABCDABCDABCDABCDABCDABCDABCDABCD')
Method 2
select * from tbl
where
hex(id)='ABCDABCDABCDABCDABCDABCDABCDABCD'
Both methods work, but my instinct is that method 1 is better as only the input value is worked on by the unhex function, whereas in method 2 every value in the id column of the table will be put through the hex function.
Is this reasoning correct, or would MySQL optimise the query to prevent this? Are there any other reasons for choosing one method over the other?
When you use any functions on columns, using indexes becomes hard or impossible. I'm not sure if MySQL supports indexes with functions, but it's still more complicated than using just the column.
Also as you say the function has to be run for each row, whereas in the other only once for input data.
For these reasons do use the form with unhex.
If there is an index on the id column, the first method is much faster. If there is no index, the first method is still more efficient.
With the first method, the UNHEX function is called just once, and if there is an index, it is used. The second method calls the function for each row of the table and does not use the index.
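A quick way to confirm this on your own data is to compare the two plans. The schema below is an assumption (the question does not show the table definition): id is taken to be a BINARY(16) primary key, and payload is a made-up column.

```sql
-- Hypothetical definition of tbl, for illustration only.
CREATE TABLE tbl (
  id      BINARY(16)   NOT NULL PRIMARY KEY,
  payload VARCHAR(255) NULL
);

INSERT INTO tbl (id, payload)
VALUES (UNHEX('ABCDABCDABCDABCDABCDABCDABCDABCD'), 'example row');

-- Method 1: UNHEX() is applied to the constant, so the column is compared as-is.
EXPLAIN
SELECT * FROM tbl
WHERE id = UNHEX('ABCDABCDABCDABCDABCDABCDABCDABCD');

-- Method 2: HEX() wraps the column, so every row has to be evaluated.
EXPLAIN
SELECT * FROM tbl
WHERE HEX(id) = 'ABCDABCDABCDABCDABCDABCDABCDABCD';
```

With a schema like this, the first plan would typically show a lookup on the PRIMARY key and the second a scan of all rows.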

MySQL - query by number or letter?

I need to store values in a "Yes or No" column named STATUS, and I'm thinking about 2 methods.
method 1 (use letter): set value Y/N then find all rows that have value Y in field STATUS by a query like:
SELECT * FROM post WHERE status="Y"
method 2 (use number): set value 1/0 then find all rows that have value 1 in field STATUS by a query like:
SELECT * FROM post WHERE status=1
Should I use method 1 or method 2? Which one is faster? Which one is better?
The two are essentially equivalent, so this becomes a question of which is better for your application.
If you are concerned about space, then the smallest space for one character is char(1), using 8 bits. With a number, you can use the bit or set types to pack multiple flags. But this only makes a difference if you have lots of flags.
The store-it-as-a-number approach has a slight advantage, where you can count the "Yes" values by doing:
select sum(status)
(Of course, in MySQL, this is only a marginal improvement on sum(status = 'Y').)
The store-it-as-a-letter approach has a slight advantage if you decide to include "Maybe" or other values at some point in the future.
Finally, any difference in performance in different ways of representing these values is going to be very, very minimal. You would need a table with millions and millions of rows to start to notice a problem. So, use the mechanism that works best for your application and way of representing the value.
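For reference, here is how the counting shortcut looks under each method. The two table names are hypothetical variants of the post table, one per method.

```sql
-- Hypothetical variants of the post table, for illustration only.
CREATE TABLE post_flag (id INT PRIMARY KEY, status TINYINT NOT NULL);  -- method 2: 1/0
CREATE TABLE post_char (id INT PRIMARY KEY, status CHAR(1) NOT NULL);  -- method 1: Y/N

-- Numeric flag: summing the column counts the "Yes" rows directly.
SELECT SUM(status) AS yes_count FROM post_flag;

-- Letter flag: in MySQL a comparison evaluates to 1 or 0, so it can be summed too.
SELECT SUM(status = 'Y') AS yes_count FROM post_char;
```

Note that SUM() returns NULL rather than 0 when the table has no rows.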
The second one is definitely faster, primarily because whenever you put something within quotes, it is meaningless to SQL. It would be better to use non-string types in order to get better performance. I would suggest using METHOD 2.
The fastest way would be:
SELECT * FROM post WHERE `status` = FIND_IN_SET(`status`,'y');
I think you should create the column as ENUM('n','y'). MySQL stores this type in an optimal way. It will also ensure that only allowed values are stored in the field.
You can also make it more human-friendly with ENUM('no','yes') without affecting performance, because the strings 'no' and 'yes' are stored only once, in the ENUM definition. MySQL stores only the index of the value per row.
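A minimal sketch of the ENUM approach. The post_enum table is a hypothetical stand-in for the post table in the question.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE post_enum (
  id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  status ENUM('no','yes') NOT NULL DEFAULT 'no'
);

INSERT INTO post_enum (status) VALUES ('yes'), ('no');

-- Queries use the readable labels.
SELECT * FROM post_enum WHERE status = 'yes';

-- Adding 0 exposes the stored index: 'no' is 1 and 'yes' is 2.
SELECT id, status, status + 0 AS status_index FROM post_enum;
```

In strict SQL mode, inserting a value that is not in the list (for example 'maybe') is rejected with an error.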
I think method 1 is better if you are concerned about storage.
Storing an integer (i.e. 1/0) as INT takes 4 bytes, whereas a single character takes only 1 byte, so it's better to use method 1.
This may improve performance somewhat.

MYSQL - test VARCHAR for ZERO Equivalents

I have a MySQL table that gets populated from a flat file with 20 million rows daily. I have one field, 'app_value', that is a varchar(24) holding a mix of text and numbers. I want to run a batch task every day to normalize all the values that are equivalent to zero. Checking the database, I have seen at least the following zero-equivalent values, but I think there are others.
0
0.0
0.000000
My plan was to cast to decimal and check whether the result of that cast was equal to 0. If so, I will update the value to '0'. To test my theory I ran
SELECT d.id, d.app_value, CAST(d.app_value AS DECIMAL(20,6))
It seemed to work okay on the zero-equivalent numbers; however, I was not that surprised to see that when 'app_value' contains non-numeric characters it is also cast to 0.000000. Is there a better way to do this? I need to protect against nulls, blanks and non-numeric characters. I also need to be concerned about efficiency, as I have to do this against 20 million rows every day.
You could match against a regular expression:
WHERE d.app_value RLIKE '^0+(\\.0*)?$'
Of course, this would not be particularly efficient (as it will require a full table scan on every invocation: indexes are of no help). If at all possible, I'd suggest checking for zeroes when loading data into the table (either directly within LOAD DATA itself, or using triggers, or else through some external preprocessing).
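Both options can be sketched as follows. The table name daily_values, the file path, the field terminator and the column list are assumptions, since the question does not show them.

```sql
-- Option 1: daily batch normalization (requires a full table scan).
-- NULLs and blanks do not match the pattern, so they are left untouched.
UPDATE daily_values AS d
SET d.app_value = '0'
WHERE d.app_value RLIKE '^0+(\\.0*)?$'
  AND d.app_value <> '0';   -- skip rows that are already normalized

-- Option 2: normalize while loading, so no second pass is needed.
-- The raw field is read into a user variable and rewritten in the SET clause.
LOAD DATA INFILE '/path/to/daily_file.csv'
INTO TABLE daily_values
FIELDS TERMINATED BY ','
(id, @raw_value)
SET app_value = IF(@raw_value RLIKE '^0+(\\.0*)?$', '0', @raw_value);
```

The pattern is the one given above, so it only covers values such as 0, 0.0 and 0.000000; other zero-equivalent spellings would need the pattern extended.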
A second way to do it is with a NOT IN subquery.
This may scale better than the regex engine on larger datasets. Regex engine startup and matching are relatively costly for the CPU compared to normal string matching.
Note that this one is a bit hacky, because we rely on MySQL's automatic type conversion.
select
*
from
data
where
number not in (
select
number
from
data
where
number >= 'a'
) and number = 0
see demo http://sqlfiddle.com/#!2/7118e7/2

Which one will be faster in MySQL with BINARY or without Binary?

Please explain, which one will be faster in Mysql for the following query?
SELECT * FROM `userstatus` where BINARY Name = 'Raja'
[OR]
SELECT * FROM `userstatus` where Name = 'raja'
Db entry for Name field is 'Raja'
I have 10000 records in my db. I tried an "explain" query, but both show the same execution time.
Your question does not make sense.
The collation of a column determines the layout of the index and whether tests will be case-sensitive or not.
If you cast a column, the cast will take time.
So logically the uncast operation should be faster....
However, if the cast makes the query find fewer rows, then the cast operation will be faster, or the other way round.
This of course changes the whole problem and makes the comparison invalid.
A cast to BINARY makes the comparison to be case-sensitive, changing the nature of the test and very probably the number of hits.
My advice
Never worry about speed of collations, the percentages are so small it is never worth bothering about.
The speed penalty from using select * (a big no no) will far outweigh the collation issues.
Start with putting in an index. That's a factor 10,000 speedup with a million rows.
Assuming that the Name field is a simple latin-1 text type, and there's no index on it, then the BINARY version of the query will be faster. By default, MySQL does case-insensitive comparisons, which means the field values and the value you're comparing against both get smashed into a single case (either all-upper or all-lower) and then compared. Doing a binary comparison skips the case conversion and does a raw 1:1 numeric comparison of each character value, making it a case-sensitive comparison.
Of course, that's just one very specific scenario, and it's unlikely to be met in your case. Too many other factors affect this, especially the presence of an index.
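Since the presence of an index matters far more than the comparison type, a reasonable experiment is to add one and compare the plans. This is a sketch, assuming Name uses a case-insensitive collation; the index name is made up.

```sql
-- Hypothetical index name; the table and column come from the question.
CREATE INDEX idx_userstatus_name ON userstatus (Name);

-- Case-insensitive lookup: can use the index directly.
EXPLAIN SELECT * FROM userstatus WHERE Name = 'Raja';

-- Case-sensitive lookup via BINARY on the column: compare this plan with the one above.
EXPLAIN SELECT * FROM userstatus WHERE BINARY Name = 'Raja';

-- Case-sensitive result that still lets the index narrow the rows first:
-- the plain comparison finds candidates, the BINARY comparison filters them.
SELECT * FROM userstatus
WHERE Name = 'Raja'
  AND BINARY Name = 'Raja';
```

The last query is a common workaround rather than anything specific to this table: the first predicate narrows the candidates and the second only has to check those few rows.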