MySQL optimized query and index for exclusion

On a high-volume table, what are the options for a SELECT whose criteria exclude some of the results?
For example, with the following table:
+----+---+---+----+----+
| id | A | B | C | D |
+----+---+---+----+----+
| 1 | a | b | c | d |
| 2 | a | b | c | d |
| 3 | a | b | c1 | d1 |
| 4 | a | b | c2 | d |
| 5 | a | b | c | d2 |
| 6 | a | b | c | d2 |
+----+---+---+----+----+
I would like to select all the tuples (C,D) where A=a and B=b and (C!=c or D!=d)
SELECT C,D FROM my_table WHERE A=a AND B=b AND (C!=c OR D!=d) GROUP BY C,D;
expected result:
(c1,d1)
(c2,d)
(c,d2)
I tried adding an index with CREATE INDEX idx_my_index ON my_table(A, B, C, D); but response times are still very long.
NB: I'm using MariaDB 10.3.
The EXPLAIN output:
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
| 1 | SIMPLE | my_table | ref | idx_my_index | idx_my_index | 6 | const,const | 12055772 | Using where; Using index |
+----+-------------+-----------+-------+----------------+---------------+---------+-------------+-----------+--------------------------+
Is there any improvement I could make to the index or to the MariaDB configuration, or is there another way to write the SELECT?
Specific solution: if this query is used as a subquery, the FirstMatch strategy can avoid the full scan of the table. This is described at https://mariadb.com/kb/en/firstmatch-strategy/
SELECT * FROM my__second_table tbis
WHERE (tbis.C, tbis.D)
IN (SELECT C,D FROM my_table WHERE A=a AND B=b AND (C!=c OR D!=d));
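FirstMatch is a semi-join strategy controlled through optimizer_switch, so it may be worth confirming that it is enabled before relying on it. A minimal sketch, assuming the quoted literals stand in for the question's placeholder values:

```sql
-- Inspect the current optimizer flags; look for semijoin=on and firstmatch=on.
SELECT @@optimizer_switch;

-- Enable both flags for this session if they were switched off.
SET optimizer_switch = 'semijoin=on,firstmatch=on';

-- When the strategy is chosen, the Extra column of EXPLAIN mentions FirstMatch.
-- The literals 'a', 'b', 'c', 'd' are placeholders, not real data.
EXPLAIN
SELECT * FROM my__second_table tbis
WHERE (tbis.C, tbis.D)
IN (SELECT C, D FROM my_table
    WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd'));
```

Whether the optimizer actually picks FirstMatch depends on its cost estimates, so the EXPLAIN output is the thing to check.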

Your index is optimal. Discussion:
INDEX(A, B, -- see Note 1
C, D) -- see Note 2
Note 1: A,B can be in either order. These will be used for filtering on = to find possible rows. Confirmed by "const,const".
Note 2: C,D can be in either order. != does not work well for filtering, hence these come after A and B. They are included here to make the index "covering". Confirmed by "Using index".
"response times are still very long" -- 12M rows in the table? How many rows before the GROUP BY? How many rows in the result? (These might give us clues into where to go next.)
"Alternative". SELECT DISTINCT ... instead of SELECT ... GROUP BY ... would probably run at the same speed. (But you could try it. The EXPLAIN might also be the same; the result should be the same.)
Please provide SHOW CREATE TABLE; it might give some more clues, such as NULL/NOT NULL and Engine type. (I don't hold out much hope here.)
Please provide EXPLAIN FORMAT=JSON SELECT ... -- this might give some more insight. Also: turn on the Optimizer Trace.
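The suggestions in this answer could be tried roughly as follows. This is only a sketch: the quoted literals are placeholders for the question's values, ANALYZE FORMAT=JSON is MariaDB-specific, and the optimizer trace exists only in MariaDB 10.4 and later (and MySQL 5.6 and later), so on MariaDB 10.3 that last part will not be available.

```sql
-- Table definition: shows NULL/NOT NULL, data types and the storage engine.
SHOW CREATE TABLE my_table;

-- DISTINCT form of the query; it should return the same rows as GROUP BY C, D.
SELECT DISTINCT C, D
FROM my_table
WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd');

-- Estimated plan in JSON form.
EXPLAIN FORMAT=JSON
SELECT C, D
FROM my_table
WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd')
GROUP BY C, D;

-- MariaDB only: runs the query and reports actual row counts per step,
-- which answers "how many rows before the GROUP BY".
ANALYZE FORMAT=JSON
SELECT C, D
FROM my_table
WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd')
GROUP BY C, D;

-- Optimizer trace (MariaDB 10.4+ / MySQL 5.6+ only).
SET optimizer_trace = 'enabled=on';
SELECT C, D
FROM my_table
WHERE A = 'a' AND B = 'b' AND (C != 'c' OR D != 'd')
GROUP BY C, D;
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';
```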

Related

Why is my MySQL query so slow?

I'm trying to figure out why this query is so slow (it takes about 6 seconds to return a result):
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
WHERE
c.id NOT IN (... big list of ids which should be excluded)
This is the execution plan:
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | z1 | index | PRIMARY | PRIMARY | 4 | NULL | 318563 | 99.85 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c | eq_ref | PRIMARY,member_id | PRIMARY | 4 | z1.id | 1 | 100.00 | |
| 1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | c.member_id | 1 | 100.00 | Using index |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
Is it because MySQL has to read almost the whole first table? Can it be adjusted?
You can try to replace c with a subquery.
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
(select c.id, c.member_id
from c
WHERE
c.id NOT IN (... big list of ids which should be excluded)) c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
to leave only the necessary ids.
It is impossible to say from the information you've provided whether there is a faster way to obtain the same data (we would need to know about data distributions and which foreign keys are obligatory). However, assuming that this is a hierarchical data set, the plan is probably not optimal: the only predicate that reduces the number of rows is c.id NOT IN (...).
The first question to ask yourself when optimizing any query is: do I need all the rows? How many rows is this returning?
I'm struggling to see any utility in a query which returns a list of 'id' values (implying a set of autoincrement integers).
MySQL generally can't make effective use of an index for a NOT IN (or <>) predicate, hence the most efficient solution is probably to start with a full table scan on 'c', which should be the outcome of StanislavL's query.
Since you don't use the values from i and z1, the joins could be replaced with EXISTS, which may help performance.
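A rough sketch of that EXISTS rewrite is shown here. It assumes c.id is unique (the EXPLAIN output suggests it is the primary key, but the table definition was not posted), in which case DISTINCT is no longer needed. The list (1, 2, 3) is a placeholder for the real exclusion list.

```sql
-- Each row of c is returned at most once, so no DISTINCT is required.
SELECT c.id
FROM c
WHERE c.id NOT IN (1, 2, 3)  -- placeholder for the big list of excluded ids
  AND EXISTS (SELECT 1 FROM z1 WHERE z1.id = c.id)
  AND EXISTS (SELECT 1 FROM i WHERE i.member_id = c.member_id);
```

Whether this is faster than the join form depends on the optimizer version and the data, so comparing the two with EXPLAIN is advisable.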
I would consider creating a compound index for c(id, member_id). This way the query should work at index level only without scanning any rows in tables.

joining table in mysql not using index properly?

I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indices. For table a, (dte, permno) identifies a unique record; for table b, dte identifies a unique record; for table c, (yr, permno) identifies a unique record; and for table d, (dte, permno) identifies a unique record. The EXPLAIN output for the SELECT part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| 1 | SIMPLE | d | ALL | idx1 | NULL | NULL | NULL | 264129 | |
| 1 | SIMPLE | c | ref | idx2 | idx2 | 4 | achernya.d.permno | 16 | |
| 1 | SIMPLE | b | ALL | PRIMARY,idx2 | NULL | NULL | NULL | 12336 | Using join buffer |
| 1 | SIMPLE | a | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7 | achernya.b.dte,achernya.d.permno | 1 | Using where |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does MySQL have to read so many rows to process this? If I am reading this correctly, it has to read (264129*16*12336) rows, which should take a good month.
Could someone please explain what's going on here?
MySQL has to read the rows because you're using functions as your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, then put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
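One way to follow that advice is sketched below. The column name yr on tables a and d and the index names are hypothetical; table c already has a yr column according to the question. The unqualified columns in the SELECT list are copied from the question as-is.

```sql
-- Store the year of dte in its own column on a and d (hypothetical name: yr).
ALTER TABLE a ADD COLUMN yr SMALLINT;
UPDATE a SET yr = YEAR(dte);
ALTER TABLE a ADD INDEX idx_a_yr_permno (yr, permno);

ALTER TABLE d ADD COLUMN yr SMALLINT;
UPDATE d SET yr = YEAR(dte);
ALTER TABLE d ADD INDEX idx_d_permno_yr (permno, yr);

-- The join conditions now compare plain columns, so indexes can be used.
SELECT a.dte, a.permno,
       (ret - rf) AS f0_xs_ret,
       (xs_ret - (betav * xs_mkt)) AS f0_resid,
       mkt_cap AS last_year_mkt_cap,
       betav AS beta_value
FROM a
INNER JOIN b USING (dte)
INNER JOIN c ON a.yr = c.yr AND a.permno = c.permno
INNER JOIN d ON d.permno = a.permno AND d.yr = a.yr - 1;
```

The extra column has to be kept in sync when rows are inserted or dte changes; on server versions that support stored generated columns, defining yr as one would take care of that automatically.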
As for the other columns in your index that you don't apply functions to, they may not be used if the index won't provide much benefit, or they aren't the leftmost column in the index and you don't use the leftmost prefix of that index in your join condition.
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

MySQL Query Optimization; SELECT multiple fields vs. JOIN

We've got a relatively straightforward query that does LEFT JOINs across 4 tables. A is the "main" table or the top-most table in the hierarchy. B links to A, C links to B. Furthermore, X links to A. So the hierarchy is basically
A
C => B => A
X => A
The query is essentially:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
WHERE
b.flag = true
ORDER BY
x.date DESC
LIMIT 25
Via EXPLAIN, I've confirmed that the correct indexes are in place, and that the built-in MySQL query optimizer is using those indexes correctly and properly.
So here's the strange part...
When we run the query as is, it takes about 1.1 seconds to run.
However, after doing some checking, it seems that if I removed most of the SELECT fields, I get a significant speed boost.
So if instead we made this into a two-step query process:
First query: the same as above, except that the SELECT clause selects only a.id instead of *.
Second query: also the same as above, except that the WHERE clause only does an a.id IN against the result of the first query instead of what we had before.
The result is drastically different. It's .03 seconds for the first query and .02 for the second query.
Doing this two-step query in code essentially gives us a 20x boost in performance.
So here's my question:
Shouldn't this type of optimization already be done within the DB engine? Why does the difference in which fields that are actually SELECTed make a difference on the overall performance of the query?
At the end of the day, it's merely selecting the exact same 25 rows and returning the exact same full contents of those 25 rows. So, why the wide disparity in performance?
ADDED 2012-08-24 13:02 PM PDT
Thanks eggyal and invertedSpear for the feedback. First off, it's not a caching issue -- I've run tests running both queries multiple times (about 10 times) alternating between each approach. The result averages at 1.1 seconds for the first (single query) approach and .03+.02 seconds for the second (2 query) approach.
In terms of indexes, I thought I had done an EXPLAIN to ensure that we're going thru the keys, and for the most part we are. However, I just did a quick check again and one interesting thing to note:
The slower "single query" approach doesn't show the Extra note of "Using index" for the third line:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
While it does show "Using index" for when we query for just the IDs:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | Using index |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
The strange thing is that both do list the correct index being used... but I guess it begs the questions:
Why are they different (considering all the other clauses are the exact same)? And is this an indication of why it's slower?
Unfortunately, the MySQL docs do not give much information for when the "Extra" column is blank/null in the EXPLAIN results.
More important than speed, you have a flaw in your query logic. When you test a LEFT JOINed column in the WHERE clause (other than testing for NULL), you force that join to behave as if it were an INNER JOIN. Instead, you'd want:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
AND b.flag = true
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
ORDER BY
x.date DESC
LIMIT 25
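To see the difference between filtering in WHERE and filtering in ON, here is a small self-contained illustration. The tables demo_a and demo_b and their rows are made up for this sketch and are not the asker's schema.

```sql
CREATE TABLE demo_a (id INT PRIMARY KEY);
CREATE TABLE demo_b (id INT PRIMARY KEY, a_id INT, flag BOOLEAN);

INSERT INTO demo_a VALUES (1), (2);
INSERT INTO demo_b VALUES (10, 1, TRUE), (20, 2, FALSE);

-- Filter in WHERE: rows where b.flag is FALSE or NULL are discarded after
-- the join, so demo_a row 2 disappears, exactly as with an INNER JOIN.
SELECT a.id, b.id AS b_id
FROM demo_a a
LEFT JOIN demo_b b ON b.a_id = a.id
WHERE b.flag = TRUE;
-- returns one row: (1, 10)

-- Filter in ON: the condition only limits which demo_b rows can match,
-- so every demo_a row is kept and unmatched ones get NULLs.
SELECT a.id, b.id AS b_id
FROM demo_a a
LEFT JOIN demo_b b ON b.a_id = a.id AND b.flag = TRUE
ORDER BY a.id;
-- returns two rows: (1, 10) and (2, NULL)
```

Which of the two behaviours is wanted depends on whether rows of a without a flagged b should appear in the result.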
My next suggestion would be to examine all of those .*'s in your SELECT. Do you really need all the columns from all the tables?

Is column order important in mysql?

I read somewhere that column order in mysql is important. I believe they were referring to the indexed columns.
QUESTION: If column order is important, when and why is it important?
The reason I ask is because I have a table in mysql similar to the one below.
The primary index is on the left and I have an index on the far right. Is this bad?
It is a MyISAM table and will be used predominantly for selects (no inserts, deletes or updates).
-----------------------------------------------
| Primary index | data1| data2 | d3| Index |
-----------------------------------------------
| 1 | A | cat | 1 | A |
| 2 | B | toads | 3 | A |
| 3 | A | yabby | 7 | B |
| 4 | B | rabbits | 1 | B |
-----------------------------------------------
Column order is only important when defining indexes, as this affects whether an index is suitable to use in executing a query. (This is true of all RDBMSs, not just MySQL.)
e.g.
Index defined on columns MyIndex(a, b, c) in that order.
A query such as
select a from mytable
where c = somevalue
probably won't use that index to execute the query (depends on several factors such as row count, column selectivity etc)
Whereas, it will most likely choose to use an index defined as MyIndex2(c,a,b)
Update: see use-the-index-luke.com (thanks Greg).
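The index example from this answer can be tried out with a throwaway table. This is a sketch: the column types and the value 42 are assumptions, and the plan EXPLAIN reports will depend on the server version and on how much data the table holds.

```sql
CREATE TABLE mytable (
    a INT,
    b INT,
    c INT,
    INDEX MyIndex (a, b, c)
);

-- c is the last column of MyIndex, so the server cannot seek directly to
-- c = 42; at best it scans the whole index (or the whole table).
EXPLAIN SELECT a FROM mytable WHERE c = 42;

-- With c as the leading column, a lookup on c can use the index directly.
ALTER TABLE mytable ADD INDEX MyIndex2 (c, a, b);
EXPLAIN SELECT a FROM mytable WHERE c = 42;
```

Comparing the type, key and rows columns of the two EXPLAIN outputs on a populated table shows the effect of the column order.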

MySQL query optimization is driving me nuts! Almost same, but horribly different

I have the following two queries (*), which only differ in the field being restricted in the WHERE clause (name1 vs name2):
SELECT A.third_id, COUNT(DISTINCT B.fourth_id) AS num
FROM first A
JOIN second B ON A.third_id = B.third_id
WHERE A.name1 LIKE 'term%'
SELECT A.third_id, COUNT(DISTINCT B.fourth_id) AS num
FROM first A
JOIN second B ON A.third_id = B.third_id
WHERE A.name2 LIKE 'term%'
Both of the name fields have a single-column index on them. There is also an index on both third_id columns as well as fourth_id (which are all foreign keys into other tables, but it is not relevant here).
According to EXPLAIN, the first one behaves like this - which is what I want:
+----+-------------+-------+-------+---------------+----------+---------+---------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+---------------+------+-------------+
| 1 | SIMPLE | A | range | third_id,name | name | 767 | NULL | 3491 | Using where |
| 1 | SIMPLE | B | ref | third_id | third_id | 4 | db.A.third_id | 16 | |
+----+-------------+-------+-------+---------------+----------+---------+---------------+------+-------------+
The second one does this, which I definitely do not want:
+----+-------------+-------+------+----------------+----------+---------+---------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+----------+---------+---------------+--------+-------------+
| 1 | SIMPLE | B | ALL | third_id | NULL | NULL | NULL | 507539 | |
| 1 | SIMPLE | A | ref | third_id,name2 | third_id | 4 | db.B.third_id | 1 | Using where |
+----+-------------+-------+------+----------------+----------+---------+---------------+--------+-------------+
What the heck is happening here? How do I make the second one behave properly (i.e. like the first one)?
(*) Actually, I don't. I have a bit more complex queries; I have eliminated extras for this post, and distilled them to the minimal queries that still exhibit the problematic behaviour. Also, names were changed to protect the guilty.
Add CREATE TABLE statements to your post.
A real SELECT statement would be helpful too.
One possible reason is that a much higher percentage of name2 values start with "term".
Try enforcing order of tables in query by using STRAIGHT_JOIN.
SELECT A.third_id, COUNT(DISTINCT B.fourth_id) AS num
FROM first A
STRAIGHT_JOIN second B ON A.third_id = B.third_id
WHERE A.name2 LIKE 'term%'
How many records are in those tables? Check the cardinality/selectivity of the name2 column.
If selectivity is low, try Naktibalda's STRAIGHT_JOIN suggestion or index hints: http://dev.mysql.com/doc/refman/5.0/en/index-hints.html
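The checks suggested in these answers might look like the following. This is a sketch under a few assumptions: the index on name2 is itself called name2 (as the possible_keys column of the EXPLAIN output suggests), the table names are backtick-quoted only as a precaution because FIRST and SECOND are SQL keywords, and the GROUP BY is added here so that the statement is valid under ONLY_FULL_GROUP_BY (the question's distilled query omits it).

```sql
-- How selective is each prefix search? Compare matches against total rows.
SELECT COUNT(*) AS total_rows,
       SUM(name1 LIKE 'term%') AS name1_matches,
       SUM(name2 LIKE 'term%') AS name2_matches
FROM `first`;

-- Refresh the index statistics, then inspect the Cardinality column.
ANALYZE TABLE `first`, `second`;
SHOW INDEX FROM `first`;

-- Index hint: restrict the optimizer to the name2 index for table A.
SELECT A.third_id, COUNT(DISTINCT B.fourth_id) AS num
FROM `first` A FORCE INDEX (name2)
JOIN `second` B ON A.third_id = B.third_id
WHERE A.name2 LIKE 'term%'
GROUP BY A.third_id;
```

If name2_matches turns out to be a large share of total_rows, the optimizer's choice to start from B may actually be reasonable, and forcing the index could make the query slower rather than faster.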