About the SQL performance of SELECT ... IN - MySQL

MySQL 5.7.21
I use a connection pool to connect to the database and run the SQL:
let mysql = require('mysql');
let pool = mysql.createPool(db);

pool.getConnection((err, conn) => {
    if (err) {
        ...
    } else {
        console.log('allConnections:' + pool._allConnections.length);
        let q = conn.query(sql, val, (err, rows, fields) => {
            ...
I have a table with around 1,000,000 records, and I wrote a SELECT to fetch some of them:
select * from tableA where trackingNo in (?)
I send the tracking numbers as an array parameter; the array holds around 20,000 values. I have also created an index on the trackingNo column (a varchar column, not unique, which can be null, blank, or any other value).
The problem is that I find it takes around 5 minutes to get the results, and that is purely backend SQL time. Matching 20,000 values against 1,000,000 records should not be this slow. Do you have any suggestions for SELECT ... IN?
Explain SQL:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE tableA null ALL table_tracking_no_idx null null null 999507 50 Using where

You could consider populating a table with the tracking numbers you want to match. Then, you could use an inner join instead of your current WHERE IN approach:
SELECT *
FROM tableA a
INNER JOIN tbl b
ON a.trackingNo = b.trackingNo;
This has the advantage that you may index the new tbl table on the trackingNo column to make the join lookup extremely fast.
This assumes that tbl would have a single column trackingNo which contains the 20K+ values you need to consider.
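A minimal sketch of that approach (the VARCHAR length and batching are assumptions; the INSERT would be driven from your application):
CREATE TABLE tbl (
    trackingNo VARCHAR(255),
    INDEX (trackingNo)
);

INSERT INTO tbl (trackingNo) VALUES (?), (?), (?);  -- one placeholder per value, sent in batches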

MySQL creates a binary search tree for IN lists that are composed of constants. As explained in the documentation:
If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
In general, creating a separate table with constants does not provide much improvement in performance.
I suppose there could be some subtle issue with type compatibility -- such as collations -- that interferes with this process.
This type of query probably requires a full table scan. If the rows are wide, then the combination of the scan and returning the data may be accounting for the performance. I do agree that five minutes is a long time, but it could be entirely due to the network connection between the app/GUI and the database.
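If you want to rule the network in or out (a diagnostic sketch, not part of the original answer), run the same predicate through an aggregate so that no wide rows are sent back; if this returns quickly, the time is going into transferring the result set rather than into the scan:
SELECT COUNT(*) FROM tableA WHERE trackingNo IN (?);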

Related

Making a MySQL query return/stop executing after finding the first match

I have a simple query SELECT * FROM `TableName` WHERE `Id` = #Id LIMIT 1 that executes on a table with hundreds of thousands of records.
Id is a unique auto-incrementing primary column, so there can only be one row with Id equal to 33 or 69, for example.
How do I make this query return/stop executing upon finding the first match? Or does it do that automatically?
For example, like this C# code:
foreach (var entry in entries)
{
    if (entry.Id == RequiredId)
        return entry;
}
To answer your question: the query planner knows it can stop after the first match because of the LIMIT keyword, so it will not continue the full table scan.
If you want to take it one step further, create a unique index or primary key on the Id column. That makes the query even faster, because the index removes the need for a table scan entirely and allows much faster lookup algorithms.
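For example (a sketch, using the table and column names from the question):
ALTER TABLE `TableName` ADD PRIMARY KEY (`Id`);
-- or, if a primary key already exists on another column:
CREATE UNIQUE INDEX `idx_tablename_id` ON `TableName` (`Id`);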

MySQL not using index with selective INT key

I have created the table measurements as listed below.
This table is written to periodically and will rapidly grow to contain millions of rows after a few days.
On read: I only need the precise time of the measurement and its value (unix_epoch and value).
To improve performance, I have added the column date_from_epoch, which is the day extracted from unix_epoch (the measurement's precise time) in the format yyyymmdd. It should have good selectivity once multiple days of measurements have been written, and I am using it as the key for an index. On read, I am hoping to scan only the days I ask for, not every day in the table (for example, after 10 days at 1,000,000 rows per day, I want to scan only the 1,000,000 rows of one day, not all 10,000,000).
I have also:
used innoDB for the engine
partitioned the table by hash into 10 files to help with I/O
made sure the type used in my query is the same as the column type (or did I get this verification wrong?).
Question:
I ran a test after measurements had been trickling into the measurements table for 2 days.
Using EXPLAIN, I see that my read query does not use the index. Why is the query not using the index?
Table is created with:
CREATE TABLE measurements (
    date_from_epoch INT UNSIGNED,
    unix_epoch INT UNSIGNED,
    application_name VARCHAR(255),
    environment VARCHAR(255),
    metric_name VARCHAR(255),
    host_name VARCHAR(1024),
    value FLOAT(38,3)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY HASH(unix_epoch)
PARTITIONS 10;

CREATE TRIGGER write_epoch_day
BEFORE INSERT ON measurements
FOR EACH ROW
SET NEW.date_from_epoch = FROM_UNIXTIME(NEW.unix_epoch, '%Y%m%d');

ALTER TABLE measurements ADD INDEX (date_from_epoch);
The query is:
EXPLAIN SELECT * FROM measurements
WHERE date_from_epoch >= 20150615 AND date_from_epoch <= 20150615
AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
AND BINARY application_name = 'all'
AND BINARY environment = 'prod'
AND BINARY metric_name = 'Internet availability'
AND (BINARY host_name = 'kitkat' )
ORDER BY unix_epoch ASC;
Explain gives:
id select_type table type possible_keys key key_len ref rows Extra
-------------------------------------------------------------------------------------------------------------------------------------------------------
1 SIMPLE measurements ALL date_from_epoch NULL NULL NULL 118011 Using where; Using filesort
Thanks for reading and head-scratching!
There is an option to use FORCE INDEX in MySQL.
Refer to the documentation on index hints for a better understanding.
Thanks Sashi!
I have modified the query to
EXPLAIN SELECT * FROM measurements FORCE INDEX (date_from_epoch)
WHERE date_from_epoch >= 20150615 AND date_from_epoch <= 20150615
AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
AND BINARY application_name = 'all'
AND BINARY environment = 'prod'
AND BINARY metric_name = 'Internet availability'
AND BINARY host_name = 'kitkat'
ORDER BY unix_epoch ASC;
EXPLAIN still says "Using where; Using filesort", but the number of rows scanned is now down to 67,906 from the 118,011 scanned initially (which is great).
However, the number of rows with date_from_epoch = 20150615 is 113,182, so I am now wondering why the number of rows scanned is not 113,182 (not that I want it to go up; I would just like to understand what MySQL did to optimize the execution even further).
A lot of things need fixing:
Don't use PARTITION BY HASH; it does not help.
Since you have a range across the partition key, it must touch all partitions. See EXPLAIN PARTITIONS SELECT ....
Don't bother with the extra date_from_epoch column and Trigger; just do comparisons on unix_epoch. (See the manual on the conversion routines needed.)
Don't use BINARY. Instead, specify the columns as COLLATION utf8_bin. Performance will be much better.
Normalize (or turn into an ENUM) these fields: application_name, environment, metric_name, host_name. What you have is unnecessarily bulky for millions of rows. (I am assuming there are only a few distinct values for those fields.) The space savings will make the SELECT run much faster.
FLOAT(38, 3) has an extra (unnecessary?) rounding. Simply use FLOAT.
(After making the above changes) INDEX(application_name, environment, metric_name, host_name, unix_epoch) would be quite helpful, at least for that one query, and it would be significantly better than the index you are asking about; a sketch follows below.
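Putting those suggestions together, a sketch of the index and the reworked query might look like this (the index name is illustrative, and it assumes the name columns have first been shrunk or normalized as suggested, since indexing a 1024-character utf8 host_name would exceed InnoDB's key length limit):
ALTER TABLE measurements
    ADD INDEX ix_meas_lookup (application_name, environment, metric_name, host_name, unix_epoch);

SELECT unix_epoch, value
FROM measurements
WHERE application_name = 'all'
  AND environment = 'prod'
  AND metric_name = 'Internet availability'
  AND host_name = 'kitkat'
  AND unix_epoch >= 1434423478 AND unix_epoch <= 1434430678
ORDER BY unix_epoch ASC;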

Distinct (or group by) using filesort and temp table

I know there are similar questions on this, but I have a specific question about why this query
EXPLAIN SELECT DISTINCT RSubdomain FROM R_Subdomains WHERE EmploymentState IN (0,1) AND RPhone='7853932120'
gives me this EXPLAIN output
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE RSubdomains index NULL RSubdomain 767 NULL 3278 Using where
with an index on RSubdomain,
but if I add a composite index on EmploymentState/RPhone,
I get this EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE RSubdomains range EmploymentState EmploymentState 67 NULL 2 Using where; Using temporary
If I take away the DISTINCT on RSubdomain, the "Using temporary" drops out of the EXPLAIN output. What I don't get is why, when I add the composite key (while keeping the key on RSubdomain), the DISTINCT ends up using a temp table, and which index schema is better here. I see that the number of rows scanned with the combined key is far smaller, but that query is of type range and it is also slower.
Q: why ... does the distinct end up using a temp table?
MySQL is doing a range scan on the index (i.e. reading index blocks) to locate the rows that satisfy the predicates (the WHERE clause). Then MySQL has to look up the value of the RSubdomain column from the underlying table (it's not available in the index). To eliminate duplicates, MySQL needs to scan the values of RSubdomain that were retrieved. The "Using temporary" indicates that MySQL is materializing a resultset, which is processed in a subsequent step. (Likely, that's the set of RSubdomain values that was retrieved; given the DISTINCT, it's likely that MySQL is actually creating a temporary table with RSubdomain as a primary or unique key, and only inserting non-duplicate values.)
In the first case, it looks like the rows are being retrieved in order by RSubdomain (likely, that's the first column in the cluster key). That means MySQL needn't compare all of the RSubdomain values against each other; it only needs to check whether the last retrieved value matches the currently retrieved value to determine whether the value can be "skipped."
Q: which index schema is better here?
The optimum index for your query is likely a covering index:
... ON R_Subdomains (RPhone, EmploymentState, RSubdomain)
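Written out as a complete statement (the index name here is hypothetical):
CREATE INDEX R_Subdomains_IX1 ON R_Subdomains (RPhone, EmploymentState, RSubdomain);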
But with only 3278 rows, you aren't likely to see any performance difference.
FOLLOWUP
Unfortunately, MySQL does not provide the type of instrumentation provided in other RDBMS (like the Oracle event 10046 sql trace, which gives actual timings for resources and waits.)
Since MySQL is choosing to use the index when it is available, that is probably the most efficient plan. For the best efficiency, I'd perform an OPTIMIZE TABLE operation (for InnoDB tables and MyISAM tables with dynamic format, if there have been a significant number of DML changes, especially DELETEs and UPDATEs that modify the length of the row...) At the very least, it would ensure that the index statistics are up to date.
You might want to compare the plan of an equivalent statement that does a GROUP BY instead of a DISTINCT, i.e.
SELECT r.RSubdomain
FROM R_Subdomains r
WHERE r.EmploymentState IN (0,1)
AND r.RPhone = '7853932120'
GROUP BY r.RSubdomain
For optimum performance, I'd go with a covering index with RPhone as the leading column. That's based on an assumption about the cardinality of the RPhone column (close to unique values), as opposed to only a few distinct values in the EmploymentState column. That covering index will give the best performance, i.e. the quickest elimination of rows that need to be examined.
But again, with only a couple thousand rows, it's going to be hard to see any performance difference. If the query were examining millions of rows, that's when you'd likely see a difference, and the key to good performance would be limiting the number of rows that need to be inspected.

Query huge MySQL DB

I have a MySQL DB with two columns, 'Key' and 'Used'. Key is a string; Used is an integer. Is there a very fast way to search for a specific Key and then return its Used value in a huge MySQL DB with 6,000,000 rows of data?
You can make it fast by creating an index on the key field:
CREATE INDEX mytable_key_idx ON mytable (`key`);
You can actually make it even faster for reading by creating a covering index on both (key, used) fields:
CREATE INDEX mytable_key_used_idx ON mytable (`key`, `used`);
In this case, when reading, MySQL can retrieve the used value from the index itself, without reading the table (an index-only scan). However, if you have a lot of write activity, the covering index may slow things down, because every write now has to update both the index and the actual table.
The normative SQL for that would be:
SELECT t.key, t.used FROM mytable t WHERE t.key = 'particularvalue' ;
The output from
EXPLAIN
SELECT t.key, t.used FROM mytable t WHERE t.key = 'particularvalue' ;
Would give details about the access plan, what indexes are being considered, etc.
The output from a
SHOW CREATE TABLE mytable ;
would give information about the table, the engine being used and the available indexes, as well as the datatypes.
Slow performance on a query like this usually indicates a suboptimal access plan, either because suitable indexes are not available or because they are not being used. Sometimes a character-set mismatch between the column datatype and the literal in the predicate can make an index "unusable" for a particular query.
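As a hypothetical illustration of that last point: if the key column were stored in latin1 but compared against a utf8 literal, MySQL may have to convert the column side of the comparison rather than the constant, which prevents it from using the index on that predicate:
SELECT t.key, t.used FROM mytable t WHERE t.key = _utf8'particularvalue' ;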

(Bitwise) Supersets and Subsets in MySQL

Are the following queries effective in MySQL:
SELECT * FROM table WHERE field & number = number;
# to find values with superset of number's bits
SELECT * FROM table WHERE field | number = number;
# to find values with subset of number's bits
...if an index for the field has been created?
If not, is there a way to make it run faster?
Update:
See this entry in my blog for performance details:
Bitwise operations and indexes
SELECT * FROM table WHERE field & number = number
SELECT * FROM table WHERE field | number = number
An index on field can be effective in two ways:
To avoid early table scans (since the value to compare is contained in the index itself)
To limit the range of values examined.
Neither condition in the queries above is sargable; that is, the index will not be used for a range scan (with the conditions as they are now).
However, point 1 still holds, and the index can be useful.
If your table contains, say, 100 bytes per row on average and 1,000,000 records, a table scan will need to read 100 MB of data.
If you have an index (with a 4-byte key, a 6-byte row pointer, and some internal overhead), the query will need to scan only 10 MB of data, plus additional data from the table for the rows that pass the filter.
The table scan is more efficient if your condition is not selective (there is a high probability of matching the condition).
The index scan is more efficient if your condition is selective (there is a low probability of matching the condition).
Both these queries will require scanning the whole index.
But by rewriting the AND query you can benefit from a range scan on the index too.
This condition:
field & number = number
can only match values of field in which the highest set bits of number are set too.
And you should just provide this extra condition to the query:
SELECT *
FROM table
WHERE field & number = number
AND field >= 0xFFFFFFFF & ~((2 << FLOOR(LOG(2, 0xFFFFFFFF & ~number))) - 1)
This will use the range for coarse filtering and the condition for fine filtering.
The longer the run of set bits at the top of number, the higher this lower bound, and the better the coarse filter works.
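To make the range bound concrete, here is a worked example with a hypothetical number = 0xC0000003 (the two highest and the two lowest bits set):
-- ~0xC0000003 & 0xFFFFFFFF   = 0x3FFFFFFC   (highest unset bit is bit 29)
-- FLOOR(LOG(2, 0x3FFFFFFC))  = 29
-- (2 << 29) - 1              = 0x3FFFFFFF
-- 0xFFFFFFFF & ~0x3FFFFFFF   = 0xC0000000
SELECT *
FROM table
WHERE field & 0xC0000003 = 0xC0000003
  AND field >= 0xC0000000;  -- the range scan can skip every row below 0xC0000000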
I doubt the optimizer would figure that one...
Maybe you can run EXPLAIN on these queries and confirm my pessimistic guess (remembering, of course, that many query-plan decisions are based on the specific instance of a given database; variable amounts of data, or merely data with a different statistical profile, may produce distinct plans).
Assuming that the table has a significant number of rows and that the "bitwised" criteria remain selective enough, a possible optimization is to avoid a bitwise operation on every single row by rewriting the query with an IN construct (or with a JOIN).
Something like this (conceptual, i.e. not tested):
CREATE TEMPORARY TABLE tblFieldValues
  (Field INT);

INSERT INTO tblFieldValues
SELECT DISTINCT Field
FROM table;

-- SELECT * FROM table WHERE field | number = number;
-- now becomes
SELECT *
FROM table t
WHERE t.field IN
  (SELECT Field
   FROM tblFieldValues
   WHERE Field | number = number);
The full benefits of an approach like this need to be evaluated with different use cases (all of them with a sizeable number of rows in table, since otherwise the direct "WHERE field | number = number" approach is efficient enough), but I suspect this could be significantly faster. Further gains can be achieved if tblFieldValues doesn't need to be recreated each time. Efficient creation of this table of course implies an index on Field in the original table.
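For instance (the index name is hypothetical):
CREATE INDEX table_field_idx ON table (field);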
I've tried this myself, and the bitwise operations are not enough to prevent MySQL from using an index on the field column. It is likely, though, that a full scan of the index is taking place.