I have, in a project, a database with two big tables: "terminosnoticia" has 400 million rows and "noticia" has 3 million. I have one query I want to make lighter (it takes anywhere from 10s to 400s):
SELECT noticia_id, termino_id
FROM noticia
LEFT JOIN terminosnoticia on terminosnoticia.noticia_id=noticia.id AND termino_id IN (7818,12345)
WHERE noticia.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
AND noticia_id is not null AND termino_id is not null;
The only viable solution I have left to explore is denormalizing the database to include the 'fecha' field in the big table, but this will multiply the index sizes.
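For reference, a minimal sketch of what that denormalization might look like, assuming fecha is a DATETIME (the index name is illustrative, and the backfill would need batching at 400M rows):

ALTER TABLE terminosnoticia ADD COLUMN fecha DATETIME;

UPDATE terminosnoticia t
JOIN noticia n ON n.id = t.noticia_id
SET t.fecha = n.fecha;

CREATE INDEX terminosnoticia_tid_fecha ON terminosnoticia (termino_id, fecha);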
Explain plan:
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| 1 | SIMPLE | terminosnoticia | ref | noticia_id,termino_id | termino_id | 4 | const | 58480 | Using where |
| 1 | SIMPLE | noticia | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.terminosnoticia.noticia_id | 1 | Using where |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
Changing the query and creating the index as suggested, the explain plan is now:
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| 1 | SIMPLE | T | ref | noticia_id,termino_id,terminosnoticia_cpx | terminosnoticia_cpx | 4 | const | 60600 | Using index |
| 1 | SIMPLE | N | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.T.noticia_id | 1 | Using where |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
But the execution time does not vary too much...
Any idea?
As Strawberry pointed out, having the NOT NULL conditions in your WHERE clause makes this the same as a regular INNER JOIN, so it can be reduced to:
SELECT
N.id as noticia_id,
T.termino_id
FROM
noticia N USE INDEX (fecha)
JOIN terminosnoticia T
on N.id = T.noticia_id
AND T.termino_id IN (7818,12345)
WHERE
N.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
Now, that said and aliases applied, I would suggest the following covering indexes:
table               index
noticia             ( fecha, id )
terminosnoticia     ( noticia_id, termino_id )
This way the query can get all the results directly from the indexes and not have to go to the raw data pages to qualify the other fields.
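Expressed as DDL, these might look like the following (index names are illustrative):

CREATE INDEX noticia_fecha_id ON noticia (fecha, id);
CREATE INDEX terminosnoticia_nid_tid ON terminosnoticia (noticia_id, termino_id);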
Assuming noticia_id is noticia's primary key, I would add the following indexes:
create index noticia_fecha_idx on noticia(fecha);
create index terminosnoticia_id_noticia_idx on terminosnoticia(noticia_id);
And try your queries again.
Do include the current execution plan of your query. It might help in figuring this one out.
Try this:
SELECT tbl1.noticia_id, tbl1.termino_id FROM
( SELECT noticia_id, termino_id FROM terminosnoticia WHERE
terminosnoticia.termino_id IN (7818,12345)
AND terminosnoticia.noticia_id is not null
) tbl1 INNER JOIN
( SELECT id FROM noticia
WHERE noticia.fecha
BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
) tbl2 ON tbl1.noticia_id=tbl2.id
We're assuming that noticia_id and termino_id are columns in the terminosnoticia table. (We wouldn't have to guess if all of the column references were qualified with the table name or a short table alias.)
Why is this an outer join? The predicates in the WHERE clause are going to exclude rows with NULL values for columns from terminosnoticia. That's going to negate the "outerness" of the join.
And if we write this as an inner join, those predicates in the WHERE clause are redundant. We already know that noticia_id won't be NULL (if it satisfies the equality predicate in the ON clause). Same for termino_id, that won't be NULL if it's equal to a value in the IN list.
I believe this query will return an equivalent result:
SELECT t.noticia_id
, t.termino_id
FROM noticia n
JOIN terminosnoticia t
ON t.noticia_id = n.id
AND t.termino_id IN (7818,12345)
WHERE n.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
What's left now is figuring out if there's any implicit datatype conversions.
We don't see the datatype of termino_id. So we don't know if that's defined as numeric. It's bad news if it's not, since MySQL will have to perform a conversion to numeric, for every row in the table, so it can do the comparison to the numeric literals.
We don't see the datatype of noticia_id, or whether it matches the datatype of the column it's being compared to, the id column from the noticia table.
We also don't see the datatype of fecha. Based on the string literals in the between predicate, it looks like it's probably a DATETIME or TIMESTAMP. But that's just a guess. We don't know, since we don't have a table definition available to us.
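One quick way to verify all of these, assuming you have access to the schema, is to inspect the table definitions directly:

SHOW CREATE TABLE noticia;
SHOW CREATE TABLE terminosnoticia;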
Once we have verified that there aren't any implicit datatype conversions that are going to bite us...
For the query with the inner join (as above), the best shot at reasonable performance will likely be with MySQL making effective use of covering indexes. (A covering index allows MySQL to satisfy the query directly from the index blocks, without needing to look up pages in the underlying table.)
As DRApp's answer already states, the best candidates for covering indexes, for this particular query, would be:
... ON noticia (fecha, id)
... ON terminosnoticia (noticia_id, termino_id)
An index that has those same leading columns in that same order would also be suitable, and would render these indexes redundant.
The addition of these indexes will render other indexes redundant.
The first index would be redundant with ... ON noticia (fecha). Assuming the index isn't enforcing a UNIQUE constraint, it could be dropped. Any query making effective use of that index could use the new index, since fecha is the leading column in the new index.
Similarly, an index ... ON terminosnoticia (noticia_id) would be redundant. Again, assuming it's not a unique index, enforcing a UNIQUE constraint, that index could be dropped as well.
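Assuming those existing single-column indexes are named fecha and noticia_id, as the EXPLAIN output suggests, the cleanup might look like:

DROP INDEX fecha ON noticia;
DROP INDEX noticia_id ON terminosnoticia;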
Related
I have the following query. I picked it from the MySQL slow query log:
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step ON item_step.item_id = item.id
WHERE
item_step.number = '2' AND
(IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2")) AND
item.time >= '2015-03-01 07:00:00' AND
item.time < '2015-05-01 07:00:00';
As usual, I tried to inspect it using EXPLAIN:
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
| 1 | SIMPLE | item | ALL | PRIMARY,time | NULL | NULL | NULL | 790464 | 38.74 | Using where |
| 1 | SIMPLE | item_step | ref | number,item_id,result2_idx | item_id | 4 | debug_db.item.id | 1 | 100.00 | Using where |
+----+-------------+-----------+------+----------------------------+---------+---------+------------------+--------+----------+-------------+
Adding an index on id and time to the item table gave nothing.
Actually, the time column already has an index; the tables are connected using foreign keys and have indexes.
I have no idea what to do here. Is it really impossible to optimize this query to avoid join_type = ALL?
Since you already seem to have a FK from item_step.item_id to item.id, the only option you have for improvement is focusing on the parts used to filter out records.
Slightly reformatting your query we have :
SELECT AVG(item.duration) AS dur
FROM `item`
INNER JOIN item_step
ON item_step.item_id = item.id
AND item_step.number = '2'
AND (IS_OK(item_step.result) OR item_step.result2 IN ("R1", "R2"))
WHERE item.time >= '2015-03-01 07:00:00'
AND item.time < '2015-05-01 07:00:00';
First thing to notice is IS_OK(item_step.result). I have no clue what's behind this function, but I'm pretty sure it blocks the optimizer from using any index on this field efficiently. If the formula is something that can be written directly in the query, I would suggest doing so (e.g. IN (1, 4, 9), or IN (SELECT OK FROM result_values), etc...)
Going by the field names, I'm going to assume that we first want to reduce the item_id list to a minimum and then use that reduced list to work on the item_step table. To do so you'll need an index on the time field. I'm assuming that the item_id field is automatically included in the index, as it's the PK field, but I'm no MySQL specialist and it might also depend on your storage engine. Anyway, in MSSQL that's how it would work; YMMV.
The second thing to do then is to go with this list of item_ids to the item_step table and reduce the number of records there. For this you'll want a compound index on item_id, number, result2, result. If you manage to write the IS_OK() function 'inline' into the query you might want to try swapping the last two fields around... something you'll need to test.
From what I read here and there, MySQL does not support something like INCLUDE on indexes in the same way as MSSQL does. A way around that would be to create a 'covering' index on time, duration on item. That way, everything can be done from the index directly, at the cost of more disk-space and CPU requirements when adding data to the item table.
In short (see the DDL sketch after this list):
add index on item on time, duration
add index on item_step on item_id, number, result2, result
see if you can inline the IS_OK() function.
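Expressed as DDL, those suggestions might look like this (index names are illustrative):

CREATE INDEX idx_item_time_duration ON item (time, duration);
CREATE INDEX idx_item_step_covering ON item_step (item_id, number, result2, result);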
TL;DR:
I have a query on 2 huge tables. There are no indexes. It is slow. Therefore, I built indexes. It is slower. Why does this make sense? What is the correct way to optimize it?
The background:
I have 2 tables
person, a table containing information about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.
They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.
Those 2 tables are huge, and don't contain any indexes except a PRIMARY on their respective id.
I'm trying to get the age of the youngest person in each department, and here is the query I've come up with:
SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
FROM person
INNER JOIN works_in
ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department
The query works, but I'm dissatisfied with its performance, as it takes ~17s to run. This is expected, as the data is huge, the intermediate results need to be written to disk, and there are no indexes on the tables.
EXPLAIN for this query gives
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------|
| 1 | SIMPLE | works_in| ALL | NULL | NULL | NULL | NULL | 22496409 | Using temporary; Using filesort |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | dbtest.works_in.person_id| 1 | Using where |
I built a bunch of indexes for the 2 tables,
/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);
EXPLAIN shows an improvement, at least that's how I understand it, seeing that it now uses an index and scans fewer rows.
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------|
| 1 | SIMPLE | person | range | PRIMARY,birthdate | birthdate | 4 | NULL | 267818 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | works_in| ref | person,department_id_person,person_department_id | person_department_id | 4 | dbtest.person.id | 3 | Using index |
However, the execution time of the query has doubled (from ~17s to ~35s).
Why does this make sense, and what is the correct way to optimize this?
EDIT
Using Gordon Linoff's answer (the first one), the execution time is ~9s (half of the initial). Choosing good indexes does indeed seem to help, but the execution time is still pretty high. Any other ideas on how to improve on this?
More information concerning the dataset:
There are about 5'000'000 records in the person table.
Of which only 130'000 have a valid (not NULL) birthdate
I indeed have a department table, which contains about 3'000'000 records (they are actually projects rather than departments)
For this query:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query and save the extra cost of reading data pages.
By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
GROUP BY wi.department;
For this, the best indexes are person(id, birthdate) and works_in(person_id, department).
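As DDL, the first pair might look like this, assuming the column is named department as in the query above (index names are illustrative; the second pair just swaps the column order on person):

CREATE INDEX person_birthdate_id ON person (birthdate, id);
CREATE INDEX works_in_person_dept ON works_in (person_id, department);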
EDIT:
I cannot think of an easy way to solve the problem. One solution is more powerful hardware.
If you really need this information quickly, then additional work is needed.
One approach is to add a maximum birth date to the department table, and add triggers. For works_in, you need triggers for update, insert, and delete. For person, only update (presumably the insert and delete would be handled by works_in). This saves the final GROUP BY, which should be a big savings.
A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.
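A minimal sketch of the first approach for the insert case, assuming a department table and a hypothetical max_birthdate column (the update and delete triggers would follow the same pattern):

ALTER TABLE department ADD COLUMN max_birthdate DATE;

DELIMITER //
CREATE TRIGGER works_in_after_insert AFTER INSERT ON works_in
FOR EACH ROW
BEGIN
  -- Keep the per-department maximum current as relations are added
  UPDATE department d
  JOIN person p ON p.id = NEW.person_id
  SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate), p.birthdate)
  WHERE d.id = NEW.department_id
    AND p.birthdate IS NOT NULL;
END//
DELIMITER ;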
Indexing often improves performance for MyISAM tables, but it can degrade performance on InnoDB tables, where every secondary-index lookup also has to traverse the clustered primary key.
Add indexes on columns that you expect to query the most. The more complex the data relationships grow, especially when those relationships are with / to itself (such as inner joins), the worse each query's performance gets.
With an index, the engine has to use the index to get matching values, which is fast. Then it has to use the matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows, it can be faster to just look up all the rows in the table.
When to add an index on a SQL table field (MySQL)?
When to use MyISAM and InnoDB?
https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam
My query took 28.39 seconds to run. How can I optimize it?
EXPLAIN SELECT DISTINCT UNIX_TIMESTAMP(timestamp)*1000 AS timestamp, COUNT(a.sig_name) AS counter
FROM event a, network n
WHERE n.fsi = 'pays'
  AND n.net = INET_NTOA(a.ip_src)
GROUP BY DATE(timestamp)
ORDER BY timestamp ASC;
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 8177074 | Using temporary; Using filesort |
| 1 | SIMPLE | n | eq_ref | PRIMARY,fsi | PRIMARY | 77 | func | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
So generally looking at your query, we find that table event a is examining 8,177,074 rows. That is likely the "root" of the slowness, so we want to look at how to reduce the search space using indexes.
The main condition on event a is
n.net=inet_ntoa(a.ip_src)
The problem here is that we need to perform a calculation (inet_ntoa) on every row of a.ip_src, so there is no alternative but to scan the entire table. A potentially better solution would be to invert the comparison and ensure that a.ip_src is indexed.
a.ip_src=inet_aton(n.net)
This will only be better if we are matching fewer rows in n than in a. If that is not the case, you should seriously consider caching the result of this function in the table and creating an index on that.
Lastly, I am guessing the timestamp column is in event a, in which case an index will potentially help with the ordering and grouping, though it may not. You could try a multi-column index on (ip_src, timestamp).
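Putting the earlier suggestion together, here is a sketch of the inverted comparison with ip_src indexed, assuming ip_src stores the numeric form of the address (the index name is illustrative):

CREATE INDEX idx_event_ipsrc ON event (ip_src);

SELECT DISTINCT UNIX_TIMESTAMP(timestamp)*1000 AS timestamp, COUNT(a.sig_name) AS counter
FROM event a
JOIN network n ON a.ip_src = INET_ATON(n.net)
WHERE n.fsi = 'pays'
GROUP BY DATE(timestamp)
ORDER BY timestamp ASC;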
Make it a practice to introduce at least an index on columns which can be used in WHERE/JOIN clauses. I say 'at least' because in many cases one should try to use PRIMARY/FOREIGN KEY relations. So if something is already a primary/foreign key, there is no need to index it further.
The above query can be improved simply by introducing an index via the following statement:
ALTER TABLE events ADD INDEX idx_ev_ipsrc (ip_src);
Here idx_ev_ipsrc is the name of the index, and ip_src is the column to be indexed.
Even further enhancement:
Introduce a multi-column index on the network table using the following statement:
ALTER TABLE network ADD INDEX idx_net_fsi_net (fsi,net);
The above will result in an even lower number of rows being examined.
Note: The above statements are for MySQL and can be tailored for other DBs easily.
I have been trying to create an index in MySQL, but keep getting temporary and filesort whenever I run an explain on my query.
A simplified version of my tables looks like:
ordered_products
op_id INT UNSIGNED NOT NULL AUTO_INCREMENT
op_orderid INT UNSIGNED NOT NULL
op_orderdate TIMESTAMP NOT NULL
op_productid INT UNSIGNED NOT NULL
products
p_id INT UNSIGNED NOT NULL AUTO_INCREMENT
p_productname VARCHAR(128) NOT NULL
p_enabled TINYINT NOT NULL
The 'ordered_products' table currently has more than 1,000,000 rows and is a record of all products that have been ordered, as well as the orders that they belong to. This table grows rapidly.
The 'products' table currently has around 3,000 rows and contains a list of products that are for sale.
The site displays a list of the top products for a given period (normally the last 3 days) and my query looks like:
SELECT COUNT(op.op_productid) AS ProductCount, op.op_productid
FROM ordered_products op
LEFT JOIN products p ON op.op_productid=p.p_id
WHERE op.op_orderdate>='2014-03-08 00:00:00'
AND p.p_enabled=1
GROUP BY op.op_productid
ORDER BY ProductCount DESC, p.p_productname ASC
When I run that query, it normally takes around 800 milliseconds (0.8 seconds) to execute, which is ridiculous. We've remedied this with caching; however, whenever the cache expires, we have a slowdown. I need to fix this.
I have tried to index the tables, but no matter what I try, I can't avoid temporary and filesort. The output from EXPLAIN is:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE p index PRIMARY,idx_enabled_id_name idx_enabled_id_name 782 \N 1477 Using where; Using index; Using temporary; Using filesort
1 SIMPLE op ref idx_pid_oid_date idx_pid_oid_date 4 test_store.p.p_id 9 Using where; Using index
If I remove the GROUP BY, the filesort disappears, however I need it to ensure the ProductCount value shows me every product count rather than a total sum of all products.
If I remove the GROUP BY and the ORDER BY ProductCount, both temporary and filesort disappear, but now I am left with a very bad result set.
Can anyone please help me solve this? I have tried a multitude of different indexes, and have tried rewriting the SQL numerous times, but can never succeed.
Any help would be greatly appreciated.
You can't get rid of the temp table and filesort while you are using ORDER BY on a calculated column like ProductCount. There's no index for the calculated column, so it has to do the sorting at the time of the query.
I tried experimentally to reproduce your results. I can put an index on op_productid and then the optimizer might use it to perform the GROUP BY.
mysql> EXPLAIN SELECT COUNT(op.op_productid) AS ProductCount, op.op_productid
FROM ordered_products op FORCE INDEX (op_productid) STRAIGHT_JOIN products p
ON op.op_productid=p.p_id
WHERE op.op_orderdate>='2014-03-08 00:00:00' AND p.p_enabled=1
GROUP BY op.op_productid ORDER BY null;
In my case, I had to use STRAIGHT_JOIN and FORCE INDEX to override the optimizer. But that might be due to my test environment, where I have only 1 or 2 rows per table for testing, and it throws off the optimizer's choices. In your real data, it might make a more sensible choice.
Also, don't use LEFT JOIN if you have conditions in the WHERE clause that make the join implicitly an inner join. Learn the types of joins and how they work -- don't always use LEFT JOIN by default.
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| 1 | SIMPLE | op | index | op_productid | op_productid | 4 | NULL | 5 | Using where |
| 1 | SIMPLE | p | ALL | PRIMARY | NULL | NULL | NULL | 1 | Using where |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
Your only alternative is to store a denormalized table, where the counts are persisted. Then if your cache fails, it isn't an expensive query to refresh the cache.
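A hypothetical sketch of such a denormalized table, refreshed periodically (the table and column names, and the 3-day window, are illustrative):

CREATE TABLE product_count_cache (
  op_productid INT UNSIGNED NOT NULL PRIMARY KEY,
  product_count INT UNSIGNED NOT NULL
);

REPLACE INTO product_count_cache (op_productid, product_count)
SELECT op.op_productid, COUNT(*)
FROM ordered_products op
JOIN products p ON op.op_productid = p.p_id AND p.p_enabled = 1
WHERE op.op_orderdate >= NOW() - INTERVAL 3 DAY
GROUP BY op.op_productid;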
I was trying to optimize a NOT IN clause in MySQL. Somehow I ended up with the following query:
SELECT @i:=(SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc');
SELECT * FROM word WHERE @i IS NULL OR word_id NOT IN (@i);
There is no relationship between the sent_question table and the word table. Also, I cannot place an index on correct_option_word_id.
Can somebody please explain, will this method even optimize the query or not?
UPDATE: As mentioned here, both methods (NOT IN and LEFT JOIN/IS NULL) are almost equally efficient. That's why I don't want to use the LEFT JOIN/IS NULL method.
UPDATE 2:
Explain results for original query:
EXPLAIN SELECT * FROM word WHERE word_id NOT IN (SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc');
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
| 1 | PRIMARY | word | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | sent_question | ref | fk_question_subscriber1 | fk_question_subscriber1 | 48 | const | 1 | Using where |
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
You're right in that the NOT IN and LEFT JOIN/IS NULL methods are equally efficient; however, unfortunately, there is no faster option, only slower ones (NOT EXISTS).
Here's your query, simplified:
SELECT *
FROM word
WHERE
word_id NOT IN (SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc')
As you know, MySQL will do the subquery first and use the returned result set for the NOT IN clause. Then, it will scan through all of the rows in word to see if word_id is in the list for each row.
Unfortunately for this case, indexes are inclusive, not exclusive. They don't help with NOT queries. A covering index on word could potentially still be used to avoid accessing the actual table, and provide some IO benefits, but it won't be used in the traditional "lookup" sense. However, since you are returning all columns on the word table, it may not be viable to have such a large index.
The most important index that will be used here is an index on sent_question.msisdn for the subquery. Ensure that you have that index defined. A multi-column "covering" index on (msisdn, correct_option_word_id) would be best.
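As DDL, that covering index might look like this (the name is illustrative):

CREATE INDEX idx_sq_msisdn_word ON sent_question (msisdn, correct_option_word_id);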
If you share your design, we can probably offer some design solutions for optimization.
I doubt it'll work at all.
Try
SELECT *
FROM word AS w
LEFT JOIN sent_question AS sq
ON w.word_id = sq.correct_option_word_id AND sq.msisdn='abc'
WHERE sq.correct_option_word_id IS NULL
Give this simple query a try
SELECT
sent_question.*,
word.word_id AS foundWord
FROM sent_question
LEFT JOIN word
ON word.word_id = sent_question.correct_option_word_id
WHERE sent_question.msisdn='abc'
-- GROUP BY sent_question.correct_option_word_id -- this shouldn't be needed, but included for completeness
HAVING foundWord IS NULL