MySQL index performance on huge tables

TL;DR:
I have a query on 2 huge tables. There are no indexes. It is slow. Therefore, I built indexes. Now it is slower. Why does this make sense? What is the correct way to optimize it?
The background:
I have 2 tables:
person, a table containing information about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.
They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.
Those 2 tables are huge, and don't contain any indexes except a PRIMARY KEY on their respective id columns.
I'm trying to get the age of the youngest person in each department, and here is the query I came up with:
SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
FROM person
INNER JOIN works_in
ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department
The query works, but I'm dissatisfied with its performance: it takes ~17s to run. This is expected, as the tables are huge, there are no indexes, and the data has to be read from disk.
EXPLAIN for this query gives
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------|
| 1 | SIMPLE | works_in| ALL | NULL | NULL | NULL | NULL | 22496409 | Using temporary; Using filesort |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | dbtest.works_in.person_id| 1 | Using where |
I built a bunch of indexes for the 2 tables,
/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);
EXPLAIN shows an improvement, at least as I understand it, seeing that it now uses an index and scans fewer rows.
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------|
| 1 | SIMPLE | person | range | PRIMARY,birthdate | birthdate | 4 | NULL | 267818 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | works_in| ref | person,department_id_person,person_department_id | person_department_id | 4 | dbtest.person.id | 3 | Using index |
However, the execution time of the query has doubled (from ~17s to ~35s).
Why does this make sense, and what is the correct way to optimize this?
EDIT
Using Gordon Linoff's answer (the first one below), the execution time is ~9s (about half the initial time). Choosing good indexes does seem to help, but the execution time is still pretty high. Any other idea on how to improve on this?
More information concerning the dataset:
There are about 5'000'000 records in the person table,
of which only 130'000 have a valid (non-NULL) birthdate.
I do indeed have a department table; it contains about 3'000'000 records (they are actually projects, not departments).

For this query:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query and save the extra cost of reading data pages.
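Spelled out as DDL, that would be something like the following (a sketch; the index names are arbitrary, and the column names follow the query above):
CREATE INDEX idx_person_birthdate_id ON person (birthdate, id);
-- covers the WHERE on birthdate and supplies id for the join
CREATE INDEX idx_works_in_person_department ON works_in (person_id, department);
-- covers the join on person_id and supplies department for the GROUP BY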
By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
GROUP BY wi.department;
For this, the best indexes are person(id, birthdate) and works_in(person_id, department).
EDIT:
I cannot think of an easy way to solve the problem. One solution is more powerful hardware.
If you really need this information quickly, then additional work is needed.
One approach is to add a maximum birth date to the departments table, and add triggers. For works_in, you need triggers for update, insert, and delete. For persons, only update (presumably the insert and delete would be handled by works_in). This saves the final group by, which should be a big savings.
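As an illustration only, the insert path of that trigger scheme could look roughly like this. It is a sketch, not tested against the real schema: it assumes a department table with an added max_birthdate column; the update and delete triggers are analogous but more involved, since a delete may force recomputing the maximum.
ALTER TABLE department ADD COLUMN max_birthdate DATE NULL;

DELIMITER $$
CREATE TRIGGER works_in_after_insert
AFTER INSERT ON works_in
FOR EACH ROW
BEGIN
    -- keep the cached maximum in sync when a person joins a department
    UPDATE department d
    JOIN person p ON p.id = NEW.person_id
    SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate), p.birthdate)
    WHERE d.id = NEW.department_id
      AND p.birthdate IS NOT NULL;
END$$
DELIMITER ;
With that column maintained, the report becomes a plain scan of department, with no join and no GROUP BY.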
A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.

Indexing is not free: an index speeds up the reads it matches but adds overhead to every write, and a badly chosen index can make a query slower rather than faster, on InnoDB and MyISAM alike.
Add indexes on the columns you expect to query the most. The more complex the data relationships grow, especially joins of a table back onto itself, the worse each query's performance gets.
With an index, the engine first uses the index to find matching values, which is fast, and then uses those matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows much, it can be faster to just scan all the rows in the table.
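One way to test that explanation is to run the query once while telling the optimizer to skip the index, using MySQL's index-hint syntax (shown here with the birthdate index created in the question):
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department AS department
FROM person p IGNORE INDEX (birthdate)
INNER JOIN works_in wi ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
If the hinted version is faster, the optimizer's cost estimate for the index was too optimistic.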
When to add an index on a SQL table field (MySQL)?
When to use MyISAM and InnoDB?
https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam

Related

Optimize mysql query involving millions of rows

I have, in a project, a database with two big tables: "terminosnoticia", with 400 million rows, and "noticia", with 3 million. I have one query I want to make lighter (it takes from 10s to 400s):
SELECT noticia_id, termino_id
FROM noticia
LEFT JOIN terminosnoticia on terminosnoticia.noticia_id=noticia.id AND termino_id IN (7818,12345)
WHERE noticia.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
AND noticia_id is not null AND termino_id is not null;
The only viable solution I have left to explore is to denormalize the database to include the 'fecha' field in the big table, but this will multiply the index sizes.
Explain plan:
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
| 1 | SIMPLE | terminosnoticia | ref | noticia_id,termino_id | termino_id | 4 | const | 58480 | Using where |
| 1 | SIMPLE | noticia | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.terminosnoticia.noticia_id | 1 | Using where |
+----+-------------+-----------------+--------+-----------------------+------------+---------+-----------------------------------------+-------+-------------+
Changing the query and creating the index as suggested, the explain plan is now:
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
| 1 | SIMPLE | T | ref | noticia_id,termino_id,terminosnoticia_cpx | terminosnoticia_cpx | 4 | const | 60600 | Using index |
| 1 | SIMPLE | N | eq_ref | PRIMARY,fecha | PRIMARY | 4 | db_resumenes.T.noticia_id | 1 | Using where |
+----+-------------+-------+--------+-------------------------------------------+---------------------+---------+---------------------------+-------+-------------+
But the execution time does not vary too much...
Any idea?
As Strawberry pointed out, having the NOT NULL conditions in your WHERE clause makes this the same as a regular INNER JOIN, so it can be reduced to:
SELECT
    N.id AS noticia_id,
    T.termino_id
FROM
    noticia N USE INDEX (fecha)
JOIN terminosnoticia T
    ON N.id = T.noticia_id
    AND T.termino_id IN (7818, 12345)
WHERE
    N.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
Now, with that said and the aliases applied, I would suggest the following covering indexes:
noticia (fecha, id)
terminosnoticia (noticia_id, termino_id)
This way the query can get all the results directly from the indexes and not have to go to the raw data pages to qualify the other fields.
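In DDL form (the first index name is made up; the second matches the terminosnoticia_cpx index visible in the second EXPLAIN above):
CREATE INDEX noticia_fecha_id ON noticia (fecha, id);
CREATE INDEX terminosnoticia_cpx ON terminosnoticia (noticia_id, termino_id);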
Assuming noticia_id is noticia's primary key, I would add the following indexes:
create index noticia_fecha_idx on noticia(fecha);
create index terminosnoticia_id_noticia_idx on terminosnoticia(noticia_id);
And try your queries again.
Do include the current execution plan of your query. It might help on helping you figuring this one out.
Try this:
SELECT tbl1.noticia_id, tbl1.termino_id FROM
( SELECT noticia_id, termino_id
  FROM terminosnoticia
  WHERE termino_id IN (7818, 12345)
    AND noticia_id IS NOT NULL
) tbl1 INNER JOIN
( SELECT id FROM noticia
  WHERE fecha
  BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
) tbl2 ON tbl1.noticia_id = tbl2.id
We're assuming that the noticia_id and termino_id are columns in terminosnoticia table. (We wouldn't have to guess, if all of the column references were qualified with the table name or a short table alias.)
Why is this an outer join? The predicates in the WHERE clause are going to exclude rows with NULL values for columns from terminosnoticia. That's going to negate the "outerness" of the join.
And if we write this as an inner join, those predicates in the WHERE clause are redundant. We already know that noticia_id won't be NULL (if it satisfies the equality predicate in the ON clause). Same for termino_id, that won't be NULL if it's equal to a value in the IN list.
I believe this query will return an equivalent result:
SELECT t.noticia_id
, t.termino_id
FROM noticia n
JOIN terminosnoticia t
ON t.noticia_id = n.id
AND t.termino_id IN (7818,12345)
WHERE n.fecha BETWEEN '2016-09-16 00:00' AND '2016-09-16 10:00'
What's left now is figuring out if there's any implicit datatype conversions.
We don't see the datatype of termino_id. So we don't know if that's defined as numeric. It's bad news if it's not, since MySQL will have to perform a conversion to numeric, for every row in the table, so it can do the comparison to the numeric literals.
We don't see the datatypes of the noticia_id, and whether that matches the datatype of the column it's being compared to, the id column from noticia table.
We also don't see the datatype of fecha. Based on the string literals in the between predicate, it looks like it's probably a DATETIME or TIMESTAMP. But that's just a guess. We don't know, since we don't have a table definition available to us.
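An easy way to check all three datatypes at once is to dump the table definitions:
SHOW CREATE TABLE noticia;
SHOW CREATE TABLE terminosnoticia;
The output shows each column's declared type, so you can confirm that termino_id is numeric, that terminosnoticia.noticia_id matches noticia.id, and what fecha actually is.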
Once we have verified that there aren't any implicit datatype conversions that are going to bite us...
For the query with the inner join (as above), the best shot at reasonable performance will likely be with MySQL making effective use of covering indexes. (A covering index allows MySQL to satisfy the query directly from the index blocks, without needing to look up pages in the underlying table.)
As DRApp's answer already states, the best candidates for covering indexes, for this particular query, would be:
... ON noticia (fecha, id)
... ON terminosnoticia (noticia_id, termino_id)
An existing index that has those same leading columns in that same order would also be suitable, and would make the new indexes redundant.
The addition of these indexes will render other indexes redundant.
The first index would be redundant with ... ON noticia (fecha). Assuming the index isn't enforcing a UNIQUE constraint, it could be dropped. Any query making effective use of that index could use the new index, since fecha is the leading column in the new index.
Similarly, an index ... ON terminosnoticia (noticia_id) would be redundant. Again, assuming it's not a unique index, enforcing a UNIQUE constraint, that index could be dropped as well.
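If you do decide to drop them, the statements would look like this (index names taken from the possible_keys column of the EXPLAIN output; verify them with SHOW INDEX first):
SHOW INDEX FROM noticia;
SHOW INDEX FROM terminosnoticia;
DROP INDEX fecha ON noticia;               -- redundant with (fecha, id)
DROP INDEX noticia_id ON terminosnoticia;  -- redundant with (noticia_id, termino_id)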

MySQL sorting on joined table column extremely slow (temp table)

I have some tables:
object
person
project
[...] (some more tables)
type
The object table has foreign keys to all other tables.
Now I do a query like:
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY object.type_id ASC
LIMIT 25
This works perfectly well and fast, even for big result sets. For example, I have 90000 objects and the query takes about 3 seconds. The result is quite big because the tables have a lot of columns and all of them are fetched. For info: I'm using Symfony with Propel, InnoDB, and the doSelectJoinAll function.
But if I do a query like this (sorting by type.name):
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
The query takes about 200 seconds!
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | object | ref | object_FI_2 | object_FI_2 | 4 | const | 164966 | Using where; Using temporary; Using filesort
1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | db.object.person_id | 1
1 | SIMPLE | ... | eq_ref | PRIMARY | PRIMARY | 4 | db.object...._id | 1
1 | SIMPLE | type | eq_ref | PRIMARY | PRIMARY | 4 | db.object.type_id | 1
I saw in the process list that MySQL was creating a temporary table for such a sort on a joined table's column.
Adding an index to type.name didn't improve the performance. There are only about 800 type rows.
I found out that the many joins and the big result set are the problem, because if I do a query with just one join, like:
SELECT * FROM object
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
it works as fast as expected.
Is there a way to improve such sorting queries on a big result set with many joined tables? Or is it just bad practice to sort on a joined table's column, something that shouldn't be done anyway?
Thank you
LEFT gets in the way of rearranging the order of the tables. How fast is it without any LEFT? Do you get the same answer?
LEFT may be a red herring... Here's what the optimizer is likely to be doing:
Decide what order to do the tables in. Take into consideration any WHERE filtering and any LEFTs. Because of WHERE object.customer_id = XXX, object is likely to be the best table to start with.
Get the rows from object that satisfy the WHERE.
Get the columns needed from the other tables (do the JOINs).
Sort according to the ORDER BY ** see below
Deliver the first 25 rows.
** Let's dig deeper into these two:
WHERE object.customer_id = XXX ORDER BY object.id
WHERE object.customer_id = XXX ORDER BY virtually-anything-else
You have INDEX(customer_id), correct? And the table is InnoDB, correct? Well, each secondary index implicitly includes the PRIMARY KEY, as if you had said INDEX(customer_id, id). The optimal index for the first WHERE + ORDER BY is precisely that. It will locate XXX and scan 25 rows, then stop. You might say that steps 2,4,5 are blended together.
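To make that concrete, here is the fast case spelled out (the index name is hypothetical; XXX stands for the customer id, as in the question):
ALTER TABLE object ADD INDEX idx_customer (customer_id);
-- InnoDB implicitly stores this as (customer_id, id), so
SELECT * FROM object
WHERE object.customer_id = XXX
ORDER BY object.id
LIMIT 25;
-- can walk the index in order and stop after 25 entries.
-- ORDER BY type.name has no such index to walk, so all matching rows must be fetched and sorted.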
The second case just gathers all the rows through step 4. This could be thousands of rows. Hence it is likely to be a lot slower.
See also article on building optimal indexes.

Optimize query?

My query took 28.39 seconds to run. How can I optimize it?
explain SELECT distinct UNIX_TIMESTAMP(timestamp)*1000 as timestamp,count(a.sig_name) as counter from event a,network n where n.fsi='pays' and n.net=inet_ntoa(a.ip_src) group by date(timestamp) order by timestamp asc;
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 8177074 | Using temporary; Using filesort |
| 1 | SIMPLE | n | eq_ref | PRIMARY,fsi | PRIMARY | 77 | func | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
So, looking at your query generally, we find that table event a is examining 8,177,074 rows. That is likely the root of the slowness, so we want to look at how to reduce the search space using indexes.
The main condition on event a is
n.net=inet_ntoa(a.ip_src)
The problem here is that we need to perform a calculation (inet_ntoa) on every row of a.ip_src, so there is no alternative but to scan the entire table. A potentially better solution would be to invert the comparison and ensure that a.ip_src is indexed:
a.ip_src=inet_aton(n.net)
This will only be better if we are matching fewer rows in n than in a. If that is not the case, you should seriously consider caching the result of this function in the table and creating an index on that.
Lastly, I am guessing the timestamp column is in event a, in which case an index will potentially help with the ordering and grouping, though it may not. You could try a multi-column index on (ip_src, timestamp).
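Putting those suggestions together, a sketch of the rewrite (the same query as posted, with the comparison inverted so the function runs on the small table's column, plus the suggested multi-column index; the index name is made up):
ALTER TABLE event ADD INDEX idx_ev_ipsrc_ts (ip_src, timestamp);

SELECT UNIX_TIMESTAMP(timestamp)*1000 AS timestamp, COUNT(a.sig_name) AS counter
FROM event a
JOIN network n ON a.ip_src = INET_ATON(n.net)
WHERE n.fsi = 'pays'
GROUP BY DATE(timestamp)
ORDER BY timestamp ASC;
The DISTINCT from the original is dropped here, since the GROUP BY already produces one row per date.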
Make it a practice to introduce at least an index on columns that can be used in WHERE/JOIN clauses. I say "at least" because in many cases one should try to use PRIMARY/FOREIGN KEY relations, so if something is already a primary/foreign key there is no need to index it further.
The above query can be simply improved by introducing the INDEX through the following query:
ALTER TABLE events ADD INDEX idx_ev_ipsrc (ip_src);
Here idx_ev_ipsrc is the name of the index, and ip_src is the column to be indexed.
Even further enhancement:
Introduce a multi-column index on the network table using the following query:
ALTER TABLE network ADD INDEX idx_net_fsi_net (fsi,net);
The above will result in an even lower number of examined rows.
Note: The above statements are for MySQL and can be tailored for other DBs easily.

query for what customers have bought together with the listed product

I'm trying to optimize a very old query that I can't wrap my head around. The result that I want to achieve is to recommend to the visitor on a web shop what other customers have shown interest in, i.e. what else they have bought together with the product that the visitor is looking at.
I have a subquery, but it's very slow; it takes ~15s on ~8,000,000 rows.
The layout is that all products that are put in a user's basket are kept in a table wsBasket and separated by a basketid (which in another table is associated with a member).
In this example I want to list the most popular products that users have bought together with product 427, but not list product 427 itself.
SELECT productid, SUM(quantity) AS qty
FROM wsBasket
WHERE basketid IN
      (SELECT basketid
       FROM wsBasket
       WHERE productid=427)
  AND productid!=427
GROUP BY productid
ORDER BY qty DESC
LIMIT 0,4;
Any help is much appreciated! Hope this makes at least some sense to someone :)
UPDATE 1:
Thanks for your comments, guys; here are my answers (they didn't fit in the comments field).
Using EXPLAIN on the above query, I got the following. Please note that I do not have any indexes on the table (except for a primary key on the id field); I want to modify the query to benefit from indexes and to place indexes on the right keys.
+----+--------------------+----------+------+---------------+------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+------+---------------+------+---------+------+------+----------------------------------------------+
| 1 | PRIMARY | wsBasket | ALL | NULL | NULL | NULL | NULL | 2821 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | wsBasket | ALL | NULL | NULL | NULL | NULL | 2821 | Using where |
+----+--------------------+----------+------+---------------+------+---------+------+------+----------------------------------------------+
Two obvious indexes to add: one on basketid and a second on productid. Then rerun the query and EXPLAIN again to confirm that the indexes are being used.
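For example (the index names are arbitrary):
CREATE INDEX idx_wsbasket_basketid ON wsBasket (basketid);   -- for the IN-subquery lookup
CREATE INDEX idx_wsbasket_productid ON wsBasket (productid); -- for the productid filters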
As well as ensuring that suitable indexes exist on productid and basketid, you will often benefit from structuring your query as a simple join rather than a subquery, especially in MySQL.
SELECT b1.productid, SUM(b1.quantity) AS qty
FROM wsBasket AS b0
JOIN wsBasket AS b1 ON b1.basketid=b0.basketid
WHERE b0.productid=427 AND b1.productid<>427
GROUP BY b1.productid
ORDER BY qty DESC
LIMIT 4
For me, on a possibly-similar dataset, the join resulted in two select_type: SIMPLE rows in the EXPLAIN output, whereas the subquery method spat out a horrible-for-performance DEPENDENT SUBQUERY. Consequently the join was well over an order of magnitude faster.
The two fields you mainly use for searching in this query are productid and basketid.
When you search for records having productid equal to 427, the database has no clue where to find them. It doesn't even know that, having found one match, there won't be another, so it has to look through the entire table, potentially thousands of records.
An index is a separate, sorted structure that contains only the field(s) you're interested in, so creating an index saves an immense amount of time!

How can I speed up this SQL query on MySQL 4.1?

I have a SQL query that takes a very long time to run on MySQL (it takes several minutes). The query is run against a table that has over 100 million rows, so I'm not surprised it's slow. In theory, though, it should be possible to speed it up as I really only want to get back the rows from the large table (let's call it A) that have a reference in another table, B.
So my query is:
SELECT id FROM A, B where A.ref = B.ref;
(A has over 100 million rows; B has just a few thousand).
I've added INDEXes:
alter table A add index(ref);
alter table B add index(ref);
But it's still very slow (several minutes -- I'd be happy with one minute).
Unfortunately, I'm stuck with MySQL 4.1.22, so I can't use views.
I'd rather not copy all of the relevant rows from A into a separate, smaller table, as the rows that I need will change from time to time. On the other hand, at the moment that's the only solution I can think of.
Any suggestions welcome!
EDIT: Here's the output of running EXPLAIN on my query:
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
| 1 | SIMPLE | B | ALL | B_ref,ref | NULL | NULL | NULL | 16718 | Using where |
| 1 | SIMPLE | A | ref | A_REF,ref | A_ref | 4 | DATABASE.B.ref | 5655 | |
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
(In redacting my original query example, I chose to use "ref" as my column name, which happens to be the same as one of the types, but hopefully that's not too confusing...)
The query optimizer is probably already doing the best that it can, but in the unlikely event that it's reading the giant table (A) first, you can explicitly tell it to read B first using the STRAIGHT_JOIN syntax:
SELECT STRAIGHT_JOIN id FROM B, A where B.ref = A.ref;
From the answers, it seems like you're doing the most efficient thing you can with the SQL. The A table seems to be the big problem; how about splitting it into three individual tables, kind of like a local version of sharding? Alternatively, is it worth denormalising the B table into the A table, assuming B doesn't have too many columns?
Finally, you could just have to buy a faster box to run it on - there's no substitute for horsepower!
Good luck.
SELECT id FROM A JOIN B ON A.ref = B.ref
You may be able to optimize further by using an appropriate type of join e.g. LEFT JOIN
http://en.wikipedia.org/wiki/Join_(SQL)