Optimize Query with huge NOT IN Statement - mysql

I am trying to find the sourcesites that ONLY exist before a certain timestamp. This query seems poorly suited for the job. Any idea how to optimize it, or an index that might help?
select distinct sourcesite
from contentmeta
where timestamp <= '2011-03-15'
and sourcesite not in (
select distinct sourcesite
from contentmeta
where timestamp>'2011-03-15'
);
There is an index on sourcesite and timestamp, but the query still takes a long time:
mysql> EXPLAIN select distinct sourcesite from contentmeta where timestamp <= '2011-03-15' and sourcesite not in (select distinct sourcesite from contentmeta where timestamp>'2011-03-15');
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+
| 1 | PRIMARY | contentmeta | index | NULL | sitetime | 14 | NULL | 725697 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | contentmeta | index_subquery | sitetime | sitetime | 5 | func | 48 | Using index; Using where; Full scan on NULL key |
+----+--------------------+-------------+----------------+---------------+----------+---------+------+--------+-------------------------------------------------+

The subquery doesn't need the DISTINCT, and the WHERE clause on the outer query is not needed either: any sourcesite that passes the NOT IN has no rows after the date, so all of its rows are already on or before it.
Try:
select distinct sourcesite
from contentmeta
where sourcesite not in (
select sourcesite
from contentmeta
where timestamp > '2011-03-15'
);

This should work:
SELECT DISTINCT c1.sourcesite
FROM contentmeta c1
LEFT JOIN contentmeta c2
ON c2.sourcesite = c1.sourcesite
AND c2.timestamp > '2011-03-15'
WHERE c1.timestamp <= '2011-03-15'
AND c2.sourcesite IS NULL
For optimum performance, have a multi-column index on contentmeta (sourcesite, timestamp).
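If it doesn't already exist, a sketch of that index as DDL (the index name is illustrative; the sitetime key in your EXPLAIN may already be this index):
ALTER TABLE contentmeta ADD INDEX idx_site_time (sourcesite, timestamp);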
Generally, joins perform better than subqueries because derived tables cannot utilize indexes.

I find that "not in" just doesn't optimize well across many databases. Use a left outer join instead:
select distinct cm.sourcesite
from contentmeta cm
left outer join
(
    select distinct sourcesite
    from contentmeta
    where timestamp > '2011-03-15'
) t
on cm.sourcesite = t.sourcesite
where cm.timestamp <= '2011-03-15' and t.sourcesite is null
This assumes that sourcesite is never null.
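A correlated NOT EXISTS is another formulation worth testing (a sketch against the same schema; like the join, it sidesteps the NULL pitfalls of NOT IN):
SELECT DISTINCT cm.sourcesite
FROM contentmeta cm
WHERE cm.timestamp <= '2011-03-15'
  AND NOT EXISTS (
    SELECT 1
    FROM contentmeta later
    WHERE later.sourcesite = cm.sourcesite
      AND later.timestamp > '2011-03-15'
  );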

Related

How to optimize select query with case statements?

I have 3 tables with 1,000,000+ records. My select query runs for hours.
How can I optimize it? I'm a newbie.
I tried adding an index on name, but it still takes hours to load.
Like this,
ALTER TABLE table2 ADD INDEX(name);
and like this also,
CREATE INDEX INDEX1 ON table2(name);
SELECT MS.*, P.Counts FROM
    (SELECT M.*,
            TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
            CASE V.name
                WHEN 'text' THEN M.name
                WHEN V.name IS NULL THEN M.name
                ELSE V.name
            END col1
     FROM table1 M
     LEFT JOIN table2 V ON M.id=V.id) AS MS
LEFT JOIN
    (select E.id, count(E.id) Counts
     from table3 E
     where E.field2 = 'value1'
     group by E.id) AS P
ON MS.id=P.id;
Explain <above query>;
output:
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| 1 | PRIMARY | M | NULL | ALL | NULL | NULL | NULL | NULL | 344763 | 100.00 | NULL |
| 1 | PRIMARY | <derived3> | NULL | ref | <auto_key0> | <auto_key0> | 8 | CP.M.id | 10 | 100.00 | NULL |
| 1 | PRIMARY | V | NULL | index | NULL | INDEX1 | 411 | NULL | 1411083 | 100.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DERIVED | E | NULL | ref | PRIMARY,f2,f3 | f2 | 43 | const | 966442 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
I expect to get the result in less than 1 min.
Your query has no filtering predicate, so it's essentially retrieving all the rows. That is 1,000,000+ rows from table1. Then it joins them with table2, and then with another derived table.
Why do you expect this query to be fast? A massive query like this one will normally run as a batch process at night. I assume this query is not for an online process, right?
Maybe you need to rethink the process. Do you really need to process millions of rows at once interactively? Will the user read a million rows in the web page?
Subqueries are not always well-optimized.
I think you can flatten it out to something like:
SELECT M.*, V.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name END col1,
( SELECT COUNT(*) FROM table3 WHERE field2 = 'value1' AND id = M.id
) AS Counts
FROM table1 AS M
LEFT JOIN table2 AS V ON M.id = V.id
I may have some parts not quite right; see if you can make this formulation work.
For starters, you are returning the same result for 'col1' when v.name is null as when v.name = 'text' (M.name in both cases). That said, you can include that extra condition in your join with table2 and use the IFNULL function.
As you are filtering table3 by field2, you could probably create an index on table3 that includes field2.
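A sketch of such an index (the name is illustrative); adding id as a second column also covers the GROUP BY:
ALTER TABLE table3 ADD INDEX idx_field2_id (field2, id);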
You should also check whether you can add any additional filter for any of those tables, and if you can, consider using a stored procedure to get the results.
Also, I don't see why you need to aggregate the first join into 'MS'; you can easily do all the joins in one go, like this:
SELECT
M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
IFNULL(V.name, M.name) as col1,
P.Counts
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id AND V.name <> 'text'
LEFT JOIN
(SELECT
E.id,
COUNT(E.id) Counts
FROM table3 E
WHERE E.field2 = 'value1'
GROUP BY E.id) AS P ON M.id=P.id;
I'm also assuming that you have clustered indexes on the id fields of all three tables. But with no filter, if you are dealing with millions of records, this will always be a big, heavy query. To say the least, you are doing a table scan on table1.
I've included this additional information after your comment.
I've mentioned a clustered index, but according to the official documentation about indexes here:
When you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index. So if you already have a primary key defined you don't need to do anything else.
As the documentation also points out, you should define a primary key for each table that you create.
If you don't have a primary key, here is the code snippet you requested.
ALTER TABLE table1 ADD CONSTRAINT pk_table1
PRIMARY KEY (id);
ATTENTION: Keep in mind that creating a clustered index is a big operation for tables like yours with tons of data.
This isn't something you want to do without planning on a production server. The operation will also take a long time, and the table will be locked during the process.
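To check whether a primary key already exists, you can list the table's indexes:
SHOW INDEX FROM table1 WHERE Key_name = 'PRIMARY';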

SQL improvement in MySQL

I have these tables in MySQL.
CREATE TABLE `tableA` (
`id_a` int(11) NOT NULL,
`itemCode` varchar(50) NOT NULL,
`qtyOrdered` decimal(15,4) DEFAULT NULL,
:
PRIMARY KEY (`id_a`),
KEY `INDEX_A1` (`itemCode`)
) ENGINE=InnoDB
CREATE TABLE `tableB` (
`id_b` int(11) NOT NULL AUTO_INCREMENT,
`qtyDelivered` decimal(15,4) NOT NULL,
`id_a` int(11) DEFAULT NULL,
`opType` int(11) NOT NULL, -- '0' delivered to customer, '1' returned from customer
:
PRIMARY KEY (`id_b`),
PRIMARY KEY (`id_b`),
KEY `INDEX_B1` (`id_a`),
KEY `INDEX_B2` (`opType`)
) ENGINE=InnoDB
tableA shows the quantity ordered by the customer for each order; tableB shows the quantity delivered to the customer for each order.
I want to write a query that counts the quantity remaining to be delivered for each itemCode.
The SQL is below. It works, but it is slow.
SELECT T1.itemCode,
SUM(IFNULL(T1.qtyOrdered,'0')-IFNULL(T2.qtyDelivered,'0')+IFNULL(T3.qtyReturned,'0')) as qty
FROM tableA AS T1
LEFT JOIN (SELECT id_a,SUM(qtyDelivered) as qtyDelivered FROM tableB WHERE opType = '0' GROUP BY id_a)
AS T2 on T1.id_a = T2.id_a
LEFT JOIN (SELECT id_a,SUM(qtyDelivered) as qtyReturned FROM tableB WHERE opType = '1' GROUP BY id_a)
AS T3 on T1.id_a = T3.id_a
WHERE T1.itemCode = '?'
GROUP BY T1.itemCode
I tried explain on this SQL, and the result is as below.
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | T1 | ref | INDEX_A1 | INDEX_A1 | 152 | const | 1 | Using where |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 21211 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 10 | |
| 3 | DERIVED | tableB | ref | INDEX_B2 | INDEX_B2 | 4 | | 96 | Using where; Using temporary; Using filesort |
| 2 | DERIVED | tableB | ref | INDEX_B2 | INDEX_B2 | 4 | | 55614 | Using where; Using temporary; Using filesort |
+----+-------------+------------+------+----------------+----------+---------+-------+-------+----------------------------------------------+
I want to improve my query. How can I do that?
First, your tableB has int for opType, but you are comparing it to strings via '0' and '1'. Leave it as numeric 0 and 1. To optimize your pre-aggregates, you should not have individual single-column indexes but one composite, and in this case covering, index on tableB (opType, id_a, qtyDelivered). The opType optimizes the WHERE, id_a optimizes the GROUP BY, and qtyDelivered serves the aggregate straight from the index without going to the raw data pages.
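A sketch of that covering index (the index name is illustrative):
ALTER TABLE tableB ADD INDEX idx_optype_ida_qty (opType, id_a, qtyDelivered);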
Since you are looking for the two types, you can roll them up into a single subquery that tests for both in a single pass, THEN join to your tableA results.
SELECT
T1.itemCode,
SUM( IFNULL(T1.qtyOrdered, 0 )
- IFNULL(T2.qtyDelivered, 0)
+ IFNULL(T2.qtyReturned, 0)) as qty
FROM
tableA AS T1
LEFT JOIN ( SELECT
id_a,
SUM( IF( opType=0,qtyDelivered, 0)) as qtyDelivered,
SUM( IF( opType=1,qtyDelivered, 0)) as qtyReturned
FROM
tableB
WHERE
opType IN ( 0, 1 )
GROUP BY
id_a) AS T2
on T1.id_a = T2.id_a
WHERE
T1.itemCode = '?'
GROUP BY
T1.itemCode
Now, depending on the size of your tables, you might be better off joining the inner query to tableA so you only aggregate rows for the itemCode you are expecting. If you have 50k items and only 120 of them qualify, the inner query above is STILL aggregating based on the 50k, which would be overkill. In that case, I would suggest an index on tableA (itemCode, id_a) and adjusting the inner query to:
LEFT JOIN ( SELECT
b.id_a,
SUM( IF( b.opType = 0, b.qtyDelivered, 0)) as qtyDelivered,
SUM( IF( b.opType = 1, b.qtyDelivered, 0)) as qtyReturned
FROM
( select distinct id_a
from tableA
where itemCode = '?' ) pqA
JOIN tableB b
on pqA.id_a = b.id_a
AND b.opType IN ( 0, 1 )
GROUP BY
id_a) AS T2
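The suggested tableA index could be created like this (the name is illustrative):
ALTER TABLE tableA ADD INDEX idx_itemcode_ida (itemCode, id_a);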
My Query against your SQLFiddle

MySQL grouping query optimization

I have three tables: categories, articles, and article_events, with the following structure
categories: id, name (100,000 rows)
articles: id, category_id (6000 rows)
article_events: id, article_id, status_id (20,000 rows)
The highest article_events.id for each article row describes the current status of each article.
I'm returning a table of categories and how many articles are in them with a most-recent-event status_id of '1'.
What I have so far works, but is fairly slow (10 seconds) with the size of my tables. Wondering if there's a way to make this faster. All the tables have proper indexes as far as I know.
SELECT c.id,
c.name,
SUM(CASE WHEN e.status_id = 1 THEN 1 ELSE 0 END) article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN (
SELECT article_id, MAX(id) event_id
FROM article_events
GROUP BY article_id
) most_recent ON most_recent.article_id = a.id
LEFT JOIN article_events e ON most_recent.event_id = e.id
GROUP BY c.id
Basically I have to join to the events table twice, since asking for the status_id along with the MAX(id) just returns the first status_id it finds, and not the one associated with the MAX(id) row.
Any way to make this better? or do I just have to live with 10 seconds? Thanks!
Edit:
Here's my EXPLAIN for the query:
ID | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | c | index | NULL | PRIMARY | 4 | NULL | 124044 | Using index; Using temporary; Using filesort
1 | PRIMARY | a | ref | category_id | category_id | 4 | c.id | 3 |
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 6351 |
1 | PRIMARY | e | eq_ref | PRIMARY | PRIMARY | 4 | most_recent.event_id | 1 |
2 | DERIVED | article_events | ALL | NULL | NULL | NULL | NULL | 19743 | Using temporary; Using filesort
If you can eliminate subqueries with JOINs, it often performs better because derived tables can't use indexes. Here's your query without subqueries:
SELECT c.id,
c.name,
SUM(CASE WHEN ae1.status_id = 1 THEN 1 ELSE 0 END) AS article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN article_events ae1
ON ae1.article_id = a.id
LEFT JOIN article_events ae2
ON ae2.article_id = a.id
AND ae2.id > ae1.id
WHERE ae2.id IS NULL
GROUP BY c.id
You'll want to experiment with the indexes and use EXPLAIN to test, but here's my guess (I'm assuming id fields are primary keys and you are using InnoDB):
categories: `name`
articles: `category_id`
article_events: (`article_id`, `id`)
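Expressed as DDL, those guesses might look like this (index names are illustrative):
ALTER TABLE categories ADD INDEX idx_categories_name (name);
ALTER TABLE articles ADD INDEX idx_articles_category (category_id);
ALTER TABLE article_events ADD INDEX idx_events_article_id (article_id, id);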
Didn't try it, but I'm thinking this will save a bit of work for the database:
SELECT ae.article_id AS ref_article_id,
MAX(ae.id) AS event_id,
ae.status_id,
(select a.category_id from articles a where a.id = ae.article_id) AS cat_id,
(select c.name from categories c join articles a2 on a2.category_id = c.id
 where a2.id = ae.article_id) AS cat_name
FROM article_events ae
GROUP BY ae.article_id
Hope that helps
EDIT:
By the way, keep in mind that joins have to go through each row, so you should start your selection from the small end and work your way up if you can help it. In this case, the query has to run through 100,000 records and join each one, then join those 100,000 again, and again, and again; even when values are null, it still has to go through those rows.
Hope this all helps...
I don't like that the index on categories.id is used when you're selecting the whole table.
Try running:
ANALYZE TABLE categories;
ANALYZE TABLE article_events;
and re-run the query.

Select the fields that are duplicated in MySQL

Assuming that I have the below customer_offer table.
My question is:
How to select all the rows where the key(s) are duplicated in that table?
+---------+-------------+------------+----------+--------+---------------------+
| link_id | customer_id | partner_id | offer_id | key | date_updated |
+---------+-------------+------------+----------+--------+---------------------+
| 1 | 99 | 11 | 14 | mmmmmq | 2011-09-21 12:40:46 |
| 2 | 100 | 11 | 14 | qmmmmq | 2011-09-21 12:40:46 |
| 3 | 101 | 11 | 14 | 8mmmmq | 2011-09-21 12:40:46 |
| 4 | 99 | 11 | 14 | Dmmmmq | 2011-09-21 12:59:28 |
| 5 | 100 | 11 | 14 | Nmmmmq | 2011-09-21 12:59:28 |
+---------+-------------+------------+----------+--------+---------------------+
UPDATE:
Thanks so much for all your answers. Many of them are good. Here is the solution I went with:
select *
from customer_offer
where `key` in
(select `key` from customer_offer group by `key` having count(*) > 1)
Update:
As mentioned by @Scorpi0, with a big table it is better to use a join. And from MySQL 6.0, the new optimizer will convert this kind of subquery into joins.
Self join:
SELECT c1.* FROM customer_offer c1 inner join customer_offer c2
on c1.`key` = c2.`key` and c1.link_id <> c2.link_id
or group by the field, then take the groups where count > 1:
SELECT `key`, COUNT(`key`) FROM customer_offer c1
group by `key`
having COUNT(`key`) > 1
SELECT DISTINCT c1.*
FROM customer_offer c1
INNER JOIN customer_offer c2
ON c1.`key` = c2.`key`
AND c1.link_id != c2.link_id
Assuming link_id is a primary key.
Use a sub-query to do the count check, and the main query to select the rows. The count check query is simply:
SELECT `key` FROM `customer_offer` GROUP BY `key` HAVING COUNT(`key`) > 1
Then the outer query will use this by joining into it:
SELECT customer_offer.* FROM customer_offer
INNER JOIN (SELECT `key` FROM `customer_offer` GROUP BY `key` HAVING COUNT(`key`) > 1) AS count_check
ON customer_offer.`key` = count_check.`key`
There are many threads on the MySQL website which explain how to do this. This link explains how to do it using MySQL: http://forums.mysql.com/read.php?10,180556,180567#msg-180567
As a brief example, the code below is from the link, with a slight modification to better suit your example.
SELECT `key`, COUNT(`key`)
FROM tbl
GROUP BY `key`
HAVING COUNT(`key`) > 1;
You can also use a join, which is my preferred method, as this avoids the slower count:
SELECT t.*
FROM this_table t
inner join this_table t1 on t.`key` = t1.`key` and t.link_id <> t1.link_id
SELECT `key`, COUNT(`key`) as Occurrences
FROM customer_offer
GROUP BY `key`
HAVING COUNT(`key`) > 1;
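On MySQL 8.0 or later, a window function can also return every duplicated row in one pass (a sketch, not from the answers above):
SELECT t.*
FROM (
    SELECT co.*,
           COUNT(*) OVER (PARTITION BY `key`) AS key_count
    FROM customer_offer co
) t
WHERE t.key_count > 1;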

MySQL query optimization

Could you please help me optimize this query? I've spent a lot of time on it and still cannot rephrase it to be fast enough (say, running in a matter of seconds, not minutes as it does now).
The query:
SELECT m.my_id, m.my_value, m.my_timestamp
FROM (
SELECT my_id, MAX(my_timestamp) AS most_recent_timestamp
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
) as tmp
LEFT OUTER JOIN my_table m
ON tmp.my_id = m.my_id AND tmp.most_recent_timestamp = m.my_timestamp
ORDER BY m.my_timestamp;
my_table is defined as follows:
CREATE TABLE my_table (
my_id INTEGER NOT NULL,
my_value VARCHAR(4000),
my_timestamp TIMESTAMP default CURRENT_TIMESTAMP NOT NULL,
INDEX MY_ID_IDX (my_id),
INDEX MY_TIMESTAMP_IDX (my_timestamp),
INDEX MY_ID_MY_TIMESTAMP_IDX (my_id, my_timestamp)
);
The goal of this query is to select the most recent my_value for each my_id before some timestamp. my_table contains ~100 million entries, and the query takes ~8 minutes to run.
explain:
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 90721 | Using temporary; Using filesort |
| 1 | PRIMARY | m | ref | MY_ID_IDX,MY_TIMESTAMP_IDX,MY_ID_MY_TIMESTAMP_IDX | MY_TIMESTAMP_IDX | 4 | tmp.most_recent_timestamp | 1 | Using where |
| 2 | DERIVED | my_table | range | MY_TIMESTAMP_IDX | MY_ID_MY_TIMESTAMP_IDX | 8 | NULL | 61337 | Using where; Using index for group-by |
+----+-------------+-------------+-------+------------------------------------------------+-------------------------+---------+---------------------------+-------+---------------------------------------+
If I understand correctly, you should be able to drop the nested select completely, and move the where clause to the main query, order by my_timestamp descending and limit 1.
SELECT my_id, my_value, max(my_timestamp)
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
*edit - added max and group by
A trick to get the most recent record can be to use ORDER BY together with LIMIT 1 instead of MAX aggregation together with a "self" join.
Something like this (not tested):
SELECT m.my_id, m.my_value, m.my_timestamp
FROM my_table m
WHERE my_timestamp < '2011-03-01 08:00:00'
ORDER BY m.my_timestamp DESC
LIMIT 1
;
Update: the above doesn't work, because a grouping is required...
Another solution uses a WHERE ... IN subselect instead of the JOIN you've used.
It could be faster; please test with your data.
SELECT m.my_id, m.my_value, m.my_timestamp
FROM my_table m
WHERE ( m.my_id, m.my_timestamp ) IN (
SELECT i.my_id, MAX(i.my_timestamp)
FROM my_table i
WHERE i.my_timestamp < '2011-03-01 08:00:00'
GROUP BY i.my_id
)
ORDER BY m.my_timestamp;
I notice in the explain plan that the optimizer is using the MY_ID_MY_TIMESTAMP_IDX index for the sub-query, but not the outer query.
You may be able to speed it up using an index hint. I also updated the ON clause to refer to tmp.most_recent_timestamp using its alias.
SELECT m.my_id, m.my_value, m.my_timestamp
FROM (
SELECT my_id, MAX(my_timestamp) AS most_recent_timestamp
FROM my_table
WHERE my_timestamp < '2011-03-01 08:00:00'
GROUP BY my_id
) as tmp
LEFT OUTER JOIN my_table m use index (MY_ID_MY_TIMESTAMP_IDX)
ON tmp.my_id = m.my_id AND tmp.most_recent_timestamp = m.my_timestamp
ORDER BY m.my_timestamp;