MySQL query optimization - mysql

Could you please help me optimize this query. I've spent lots of time and still cannot rephrase it to be fast enough (say running in the matters of seconds, not minutes as it is now).
The query:
SELECT m.my_id, m.my_value, m.my_timestamp
FROM (
SELECT my_id, MAX(my_timestamp) AS most_recent_timestamp
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
) as tmp
LEFT OUTER JOIN my_table m
ON tmp.my_id = m.my_id AND tmp.most_recent_timestamp = m.my_timestamp
ORDER BY m.my_timestamp;
my_table is defined as follows:
CREATE TABLE my_table (
my_id INTEGER NOT NULL,
my_value VARCHAR(4000),
my_timestamp TIMESTAMP default CURRENT_TIMESTAMP NOT NULL,
INDEX MY_ID_IDX (my_id),
INDEX MY_TIMESTAMP_IDX (my_timestamp),
INDEX MY_ID_MY_TIMESTAMP_IDX (my_id, my_timestamp)
);
The goal of this query is to select the most recent my_value for each my_idbefore some timestamp. my_table contains ~100 million entries and it takes ~8 minutes to perform it.
explain:
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 90721 | Using temporary; Using filesort |
| 1 | PRIMARY | m | ref | MY_ID_IDX,MY_TIMESTAMP_IDX,MY_ID_TIMESTAMP_IDX | MY_TIMESTAMP_IDX | 4 | tmp.most_recent_timestamp | 1 | Using where |
| 2 | DERIVED | my_table | range | MY_TIMESTAMP_IDX | MY_ID_MY_TIMESTAMP_IDX | 8 | NULL | 61337 | Using where; Using index for group-by |
+----+-------------+-------------+-------+------------------------------------------------+-----------------------+---------+---------------------------+------+---------------------------------------+

If I understand correctly, you should be able to drop the nested select completely, and move the where clause to the main query, order by my_timestamp descending and limit 1.
SELECT my_id, my_value, max(my_timestamp)
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
*edit - added max and group by

a trick to get a most recent record can be to use order by together with 'limit 1' instead of max aggregation together with "self" join
somthing like this (not tested):
SELECT m.my_id, m.my_value, m.my_timestamp
FROM my_table m
WHERE my_timestamp < '2011-03-01 08:00:00'
ORDER BY m.my_timestamp DESC
LIMIT 1
;
update above doesn't work because a grouping is required...
other solution that has WHERE-IN-SubSelect instead of the JOIN you've used.
could be faster. please test with your data.
SELECT m.my_id, m.my_value, m.my_timestamp
FROM my_table m
WHERE ( m.my_id, m.my_timestamp ) IN (
SELECT i.my_id, MAX(i.my_timestamp)
FROM my_table i
WHERE i.my_timestamp < '2011-03-01 08:00:00'
GROUP BY i.my_id
)
ORDER BY m.my_timestamp;

I notice in the explain plan that the optimizer is using the MY_ID_MY_TIMESTAMP_IDX index for the sub-query, but not the outer query.
You may be able to speed it up using an index hint. I also updated the ON clause to refer to tmp.most_recent_timestamp using its alias.
SELECT m.my_id, m.my_value, m.my_timestamp
FROM (
SELECT my_id, MAX(my_timestamp) AS most_recent_timestamp
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
) as tmp
LEFT OUTER JOIN my_table m use index (MY_ID_MY_TIMESTAMP_IDX)
ON tmp.my_id = m.my_id AND tmp.most_recent_timestamp = m.my_timestamp
ORDER BY m.my_timestamp;

Related

How to optimize select query with case statements?

I have 3 tables over 1,000,000+ records. My select query is running for hours.
How to optimize it? I'm newbie.
I tried to add index for name, still it taking hours to load.
Like this,
ALTER TABLE table2 ADD INDEX(name);
and like this also,
CREATE INDEX INDEX1 table2(name);
SELECT MS.*, P.Counts FROM
(SELECT M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name
WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name
END col1
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id) AS MS
LEFT JOIN
(select E.id, count(E.id) Counts
from table3 E
where E.field2 = 'value1'
group by E.id) AS P
ON MS.id=P.id;
Explain <above query>;
output:
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| 1 | PRIMARY | M | NULL | ALL | NULL | NULL | NULL | NULL | 344763 | 100.00 | NULL |
| 1 | PRIMARY | <derived3> | NULL | ref | <auto_key0> | <auto_key0> | 8 | CP.M.id | 10 | 100.00 | NULL |
| 1 | PRIMARY | V | NULL | index | NULL | INDEX1 | 411 | NULL | 1411083 | 100.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DERIVED | E | NULL | ref | PRIMARY,f2,f3 | f2| 43 | const | 966442 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
I expect to get result in less than 1 min.
The query indented for clarity.
SELECT MS.*, P.Counts
FROM (
SELECT M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name
WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name
END col1
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id
) AS MS
LEFT JOIN (
select E.id, count(E.id) Counts
from table3 E
where E.field2 = 'value1'
group by E.id
) AS P ON MS.id=P.id;
Your query has no filtering predicate, so it's essentially retrieving all the rows. That is a 1,000,000+ rows from table1. Then it's joining it with table2, and then with another table expression/derived table.
Why do you expect this query to be fast? A massive query like this one will normally run as a batch process at night. I assume this query is not for an online process, right?
Maybe you need to rethink the process. Do you really need to process millions of rows at once interactively? Will the user read a million rows in the web page?
Subqueries are not always well-optimized.
I think you can flatten it out something like:
SELECT M.*, V.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name END col1,
( SELECT COUNT(*) FROM table3 WHERE field2 = 'value1' AND id = x.id
) AS Counts
FROM table1 AS M
LEFT JOIN table2 AS V ON M.id = V.id
I may have some parts not quite right; see if you can make this formulation work.
For starters, you are returning the same result for 'col1' in case v.name is null or v.name != 'text'. That said, you can include that extra condition on you join with table2 and use IFNULL function.
Has you are filtering table3 by field2, you could probably create an index over table 3 that includes field2.
You should also check if you can include any additional filter for any of those tables, and if you do you can consider using a stored procedure to get the results.
Also, I don´t see why you need to the aggregate the first join into 'MS' you can easy do all the joins in one go like this:
SELECT
M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
IFNULL(V.name, M.name) as col1,
P.Counts
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id AND V.name <> 'text'
LEFT JOIN
(SELECT
E.id,
COUNT(E.id) Counts
FROM table3 E
WHERE E.field2 = 'value1'
GROUP BY E.id) AS P ON M.id=P.id;
I'm also assuming that you do have clustered indexes for all id fields in all this three tables, but with no filter, if you are dealing with millions off records, this will always be an big heavy query. To say the least your are doing a table scan for table1.
I've included this additional information after you comment.
I've mentioned clustered index, but according to the official documentation about indexes here
When you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index. So if you already have a primary key defined you don't need to do anything else.
Has the documentation also point's out you should define a primary key for each table that you create.
If you don't have a primary key. Here is the code snippet you requested.
ALTER TABLE table1 ADD CONSTRAINT pk_table1
PRIMARY KEY CLUSTERED (id);
ATTENTION: Keep in mind that creating a clustered index is a big operation, for tables like yours with tones of data.
This isn’t something you want to do without planning, on a production server. This operation will also take a long time and table will be locked during the process.

select 1 random row with complex filtering

I've 2 tables:
first table users:
+-------------------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------+---------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| first_name | text | NO | | NULL | |
| age | int(11) | YES | | NULL | |
| settings | text | YES | | NULL | |
+-------------------------+---------+------+-----+---------+-------+
second table proposals:
+---------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| from_id | int(11) | NO | | NULL | |
| to_id | int(11) | NO | | NULL | |
| status | int(11) | NO | | NULL | |
+---------+---------+------+-----+---------+----------------+
I need to get 1 random row from users which id is not in to_id in proposals
I'm doing it (without rand) with this sql:
SELECT DISTINCT *
FROM profiles
WHERE
profiles.first_name IS NOT NULL
AND
NOT EXISTS (
SELECT *
FROM proposal
WHERE
proposal.to_id = profiles.id
)
LIMIT 0 , 1
performance is fine: 1 row in set (0.00 sec)
but perfomance is very bad: 1 row in set (1.78 sec) when I add ORDER BY RAND() to the end
I've big holes in users.id and I can't use something like MAX(id)
I'he try set random limit, example:
...
LIMIT 1234 , 1;
Empty set (2.71 sec)
But it takes much time too :(
How to get random 1 user which users.id isn't exists in proposals.to_id with good perfomance?
I think that I need to first get all profiles with a rand() and then filter them, but I do not know how to do it.
I've two problem solutions.
1) With random id, from https://stackoverflow.com/a/4329447/2051938
SELECT *
FROM profiles AS r1
JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM profiles)) AS id)
AS r2
WHERE
r1.id >= r2.id
AND
r1.first_name IS NOT NULL
AND
NOT EXISTS (
SELECT *
FROM proposal
WHERE
proposal.to_id = r1.id
)
LIMIT 0 , 1
2) With ORDER BY RAND()
SELECT *
FROM
(
SELECT *
FROM profiles
WHERE
profiles.first_name IS NOT NULL
ORDER BY RAND()
) AS users
WHERE
NOT EXISTS (
SELECT *
FROM proposal
WHERE
proposal.to_id = users.id
)
LIMIT 0 , 1
First solution is faster but it've problem with "holes in id" and when you got id from the end (users may end earlier than there will be a match)
Second solution is slower but without flaws!
Have you tried switching not exists to left join?
SELECT DISTINCT *
FROM profiles t1
LEFT JOIN
proposal t2
ON t1.id = t2.to_id
WHERE t1.first_name IS NOT NULL AND
t2.to_id IS NULL
ORDER BY RAND()
LIMIT 0 , 1
This will return you all rows of profiles, and to those that are not matched by a row in proposal it will assign NULL values, on which you can filter.
The result should be the same, but the performance may be better.
As RAND() function assigns a random number to every row present in result, performance will be directly proportional to number of records.
If you want to select only one (random) record, you can apply LIMIT <random number from 0 to record count>, 1
e.g.:
SELECT u.id, count(u.id) as `count`
FROM users u
WHERE
first_name IS NOT NULL
AND
NOT EXISTS (
SELECT *
FROM proposal
WHERE
proposal.to_id = u.id
)
LIMIT RAND(0, count-1) , 1
I haven't tried executing this query, however, it MySQL complains about using count in RAND, you can calculate count separately and substitute the value in this query.
First, I don't think the select distinct is necessary. So, try removing that:
SELECT p.*
FROM profiles p
WHERE p.first_name IS NOT NULL AND
NOT EXISTS (SELECT 1
FROM proposal pr
WHERE pr.to_id = p.id
)
ORDER BY rand()
LIMIT 0 , 1;
That might help a bit. Then, a relatively easy way to reduce the time spent is to reduce the data volume. If you know you will always have thousands of rows that meet the conditions, then you can do:
SELECT p.*
FROM profiles
WHERE p.first_name IS NOT NULL AND
NOT EXISTS (SELECT 1
FROM proposal pr
WHERE pr.to_id = p.id
) AND
rand() < 0.01
ORDER BY rand()
LIMIT 0, 1;
The trick is to find the comparison value that ensures that you get at least one row. This is tricky because you have another set of data. Here is one method that uses a subquery:
SELECT p.*
FROM (SELECT p.*, (#rn := #rn + 1) as rn
FROM profiles p CROSS JOIN
(SELECT #rn := 0) params
WHERE p.first_name IS NOT NULL AND
NOT EXISTS (SELECT 1
FROM proposal pr
WHERE pr.to_id = p.id
)
) p
WHERE rand() < 100 / #rn
ORDER BY rand()
LIMIT 0, 1;
This uses a variable to calculate the number of rows and then randomly selects 100 of them for processing. When choosing 100 rows randomly, there is a very, very, very high likelihood that at least one will be chosen.
The downside to this approach is that the subquery needs to be materialized, which adds to the cost of the query. It is, however, cheaper than a sort on the full data.

SQL improvement in MySQL

I have these tables in MySQL.
CREATE TABLE `tableA` (
`id_a` int(11) NOT NULL,
`itemCode` varchar(50) NOT NULL,
`qtyOrdered` decimal(15,4) DEFAULT NULL,
:
PRIMARY KEY (`id_a`),
KEY `INDEX_A1` (`itemCode`)
) ENGINE=InnoDB
CREATE TABLE `tableB` (
`id_b` int(11) NOT NULL AUTO_INCREMENT,
`qtyDelivered` decimal(15,4) NOT NULL,
`id_a` int(11) DEFAULT NULL,
`opType` int(11) NOT NULL, -- '0' delivered to customer, '1' returned from customer
:
PRIMARY KEY (`id_b`),
KEY `INDEX_B1` (`id_a`)
KEY `INDEX_B2` (`opType`)
) ENGINE=InnoDB
tableA shows how many quantity we received order from customer, tableB shows how many quantity we delivered to customer for each order.
I want to make a SQL which counts how many quantity remaining for delivery on each itemCode.
The SQL is as below. This SQL works, but slow.
SELECT T1.itemCode,
SUM(IFNULL(T1.qtyOrdered,'0')-IFNULL(T2.qtyDelivered,'0')+IFNULL(T3.qtyReturned,'0')) as qty
FROM tableA AS T1
LEFT JOIN (SELECT id_a,SUM(qtyDelivered) as qtyDelivered FROM tableB WHERE opType = '0' GROUP BY id_a)
AS T2 on T1.id_a = T2.id_a
LEFT JOIN (SELECT id_a,SUM(qtyDelivered) as qtyReturned FROM tableB WHERE opType = '1' GROUP BY id_a)
AS T3 on T1.id_a = T3.id_a
WHERE T1.itemCode = '?'
GROUP BY T1.itemCode
I tried explain on this SQL, and the result is as below.
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | T1 | ref | INDEX_A1 | INDEX_A1 | 152 | const | 1 | Using where |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 21211 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 10 | |
| 3 | DERIVED | tableB | ref | INDEX_B2 | INDEX_B2 | 4 | | 96 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | tableB | ref | INDEX_B2 | INDEX_B2 | 4 | | 55614 | Using where; Using temporary; Using filesort |
+----+-------------+-------------------+----------------+----------+---------+-------+-------+----------------------------------------------+
I want to improve my query. How can I do that?
First, your table B has int for opType, but you are comparing to string via '0' and '1'. Leave as numeric 0 and 1. To optimize your pre-aggregates, you should not have individual column indexes, but a composite, and in this case a covering index. INDEX table B ON (OpType, ID_A, QtyDelivered) as a single index. The OpType to optimize the WHERE, ID_A to optimize the group by, and QtyDelivered for the aggregate in the index without going to the raw data pages.
Since you are looking for the two types, you can roll them up into a single subquery testing for either in a single pass result. THEN, Join to your tableA results.
SELECT
T1.itemCode,
SUM( IFNULL(T1.qtyOrdered, 0 )
- IFNULL(T2.qtyDelivered, 0)
+ IFNULL(T2.qtyReturned, 0)) as qty
FROM
tableA AS T1
LEFT JOIN ( SELECT
id_a,
SUM( IF( opType=0,qtyDelivered, 0)) as qtyDelivered,
SUM( IF( opType=1,qtyDelivered, 0)) as qtyReturned
FROM
tableB
WHERE
opType IN ( 0, 1 )
GROUP BY
id_a) AS T2
on T1.id_a = T2.id_a
WHERE
T1.itemCode = '?'
GROUP BY
T1.itemCode
Now, depending on the size of your tables, you might be better doing a JOIN on your inner table to table A so you only get those of the item code you are expectin. If you have 50k items and you are only looking for items that qualify = 120 items, then your inner query is STILL qualifying based on the 50k. In that case would be overkill. In this case, I would suggest an index on table A by ( ItemCode, ID_A ) and adjust the inner query to
LEFT JOIN ( SELECT
b.id_a,
SUM( IF( b.opType = 0, b.qtyDelivered, 0)) as qtyDelivered,
SUM( IF( b.opType = 1, b.qtyDelivered, 0)) as qtyReturned
FROM
( select distinct id_a
from tableA
where itemCode = '?' ) pqA
JOIN tableB b
on PQA.id_A = b.id_a
AND b.opType IN ( 0, 1 )
GROUP BY
id_a) AS T2
My Query against your SQLFiddle

How to get combined result in MySQL with query optimization?

I have table like,
id | OpenDate | CloseDate
------------------------------------------------
1 | 2013-01-16 07:30:48 | 2013-01-16 10:49:48
2 | 2013-01-16 08:30:00 | NULL
I needed to get combined result as below
id | date | type
---------------------------------
1 | 2013-01-16 07:30:48 | Open
1 | 2013-01-16 10:49:48 | Close
2 | 2013-01-16 08:30:00 | Open
I used UNION to get above output (can we achieve without UNION?)
SELECT id,date,type FROM(
SELECT id,`OpenDate` as date, 'Open' as 'type' FROM my_table
UNION ALL
SELECT id,`CloseDate` as date, 'Close' as 'type' FROM my_table
)AS `tab` LIMIT 0,15
I am getting the desired output, but now in performance side--> i have 4000 records in my table and by doing UNION its combining and giving around 8000 records, which is making very slow to load the site(more than 13 sec). How can i optimize this query to fasten the output?
I tried LIMIT in sub-query also, pagination offset is not working properly as it should if i use LIMIT in sub-query. Please help me to resolve this.
Update
EXPLAIN result
id select_type table type key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL 8858
2 DERIVED orders index OpenDate 4 NULL 4588 Using index
3 UNION orders index CloseDate 4 NULL 4588 Using index
NULL UNION RESULT<union2,3> ALL NULL NULL NULL NULL
I would do something like the following:
SELECT
id,
IF(act, t1.OpenDate, t1.CloseDate) as `date`,
IF(act, 'Open', 'Close') as `type`
FROM my_table t1
JOIN (SELECT 1 as act UNION SELECT 0) as _
JOIN my_table t2 USING (id);

Optimize Query with huge NOT IN Statement

I am trying to find the sourcesites that ONLY exist before a certain timestamp. This query seems very poor for the job. Any idea how to optimize or an index that might improve?
select distinct sourcesite
from contentmeta
where timestamp <= '2011-03-15'
and sourcesite not in (
select distinct sourcesite
from contentmeta
where timestamp>'2011-03-15'
);
There is an index on sourcesite and timestamp, but query still takes a long time
mysql> EXPLAIN select distinct sourcesite from contentmeta where timestamp <= '2011-03-15' and sourcesite not in (select distinct sourcesite from contentmeta where timestamp>'2011-03-15');
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
| 1 | PRIMARY | contentmeta | index | NULL | sitetime | 14 | NULL | 725697 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | contentmeta | index_subquery | sitetime | sitetime | 5 | func | 48 | Using index; Using where; Full scan on NULL key |
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
The subquery doesn't need the DISTINCT, and the WHERE clause on the outer query is not needed either, since you are already filtering by the NOT IN.
Try:
select distinct sourcesite
from contentmeta
where sourcesite not in (
select sourcesite
from contentmeta
where timestamp > '2011-03-15'
);
This should work:
SELECT DISTINCT c1.sourcesite
FROM contentmeta c1
LEFT JOIN contentmeta c2
ON c2.sourcesite = c1.sourcesite
AND c2.timestamp > '2011-03-15'
WHERE c1.timestamp <= '2011-03-15'
AND c2.sourcesite IS NULL
For optimum performance, have a multi-column index on contentmeta (sourcesite, timestamp).
Generally, joins perform better than subqueries because derived tables cannot utilize indexes.
I find that "not in" just doesn't optimize well across many databases. Use a left outer join instead:
select distinct sourcesite
from contentmeta cm
left outer join
(
select distinct sourcesite
from contentmeta
where timestamp>'2011-03-15'
) t
on cm.sourcesite = t.sourcesite
where timestamp <= '2011-03-15' and t.sourcesite is null
This assumes that sourcesite is never null.