How can I optimize a query with multiple joins (already have indexes)? - mysql

SELECT citing.article_id as citing, lac_a.year, r.id_when_cited, cited_issue.country, citing.num_citations
FROM isi_lac_authored_articles as lac_a
JOIN isi_articles citing ON (lac_a.article_id = citing.article_id)
JOIN isi_citation_references r ON (citing.article_id = r.article_id)
JOIN isi_articles cited ON (cited.id_when_cited = r.id_when_cited)
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id);
I have indexes on all the fields being JOINED on.
Is there anything I can do? My tables are large: some have 1 million records, the references table has 500 million records, and the articles table has 25 million.
This is what EXPLAIN has to say:
+----+-------------+-------------+--------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
| 1 | SIMPLE | cited_issue | ALL | NULL | NULL | NULL | NULL | 1156856 | |
| 1 | SIMPLE | cited | ref | isi_articles_id_when_cited,isi_articles_issue_id | isi_articles_issue_id | 49 | func | 19 | Using where |
| 1 | SIMPLE | r | ref | isi_citation_references_article_id,isi_citation_references_id_when_cited | isi_citation_references_id_when_cited | 17 | mimir_dev.cited.id_when_cited | 4 | Using where |
| 1 | SIMPLE | lac_a | eq_ref | PRIMARY | PRIMARY | 16 | mimir_dev.r.article_id | 1 | |
| 1 | SIMPLE | citing | eq_ref | PRIMARY | PRIMARY | 16 | mimir_dev.r.article_id | 1 | |
+----+-------------+-------------+--------+--------------------------------------------------------------------------+---------------------------------------+---------+-------------------------------+---------+-------------+
5 rows in set (0.07 sec)

If you really need all the returned data, I would suggest two things:
You probably know the data better than MySQL does, and you can take advantage of that when MySQL's assumptions are wrong. Currently, MySQL thinks it is cheapest to full scan the whole isi_issues table first, and if the result really is going to include all issues, then that assumption is correct. But if many issues should not be in the result, you may want to force a different join order that you consider better. You are the one who knows which table applies the strongest restrictions and which tables are the smallest to full scan (you will need to full scan something anyway, since there is no WHERE clause).
You can benefit from covering indexes (that is, indexes that contain all the needed columns themselves, so the row data never has to be touched). For example, having an index (article_id, num_citations) on isi_articles, (article_id, year) on isi_lac_authored_articles and even (country) on isi_issues will significantly speed up that query as long as the indexes fit in memory. On the other hand, it will make your indexes larger and slightly slow down inserts into the tables.
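Both suggestions can be sketched in SQL roughly as follows. The index names are made up for illustration, and the join order shown (starting from isi_lac_authored_articles) is only an assumption about which table is the most restrictive; only the person who knows the data can confirm that.

```sql
-- Covering indexes as suggested above (index names are hypothetical).
ALTER TABLE isi_articles
  ADD INDEX idx_articles_id_citations (article_id, num_citations);
ALTER TABLE isi_lac_authored_articles
  ADD INDEX idx_lac_id_year (article_id, year);
ALTER TABLE isi_issues
  ADD INDEX idx_issues_country (country);

-- Forcing the join order: STRAIGHT_JOIN right after SELECT tells MySQL
-- to join the tables in the order they are listed in the FROM clause.
-- Assumption: starting from lac_a is the more selective order.
SELECT STRAIGHT_JOIN
       citing.article_id AS citing, lac_a.year, r.id_when_cited,
       cited_issue.country, citing.num_citations
FROM isi_lac_authored_articles AS lac_a
JOIN isi_articles citing ON (lac_a.article_id = citing.article_id)
JOIN isi_citation_references r ON (citing.article_id = r.article_id)
JOIN isi_articles cited ON (cited.id_when_cited = r.id_when_cited)
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id);
```

Running EXPLAIN on the STRAIGHT_JOIN version before executing it shows whether the forced order really avoids the full scan of isi_issues.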

I think this is about the best you can do; at least it's not using nested or multiple queries. You should benchmark the SQL a little, and you could limit your result set to as few rows as possible. 15-30 rows per page is pretty fine for a return set (this depends on the app, but 15-30 is my tolerance range).
I believe MySQL clients (phpMyAdmin, the console, a GUI, whatever) report some sort of "execution time", which is how long the query took to process. Compare that with a benchmark of the query run from your server-side code, then compare that with the query run from the server-side code with the output rendered through your app interface.
That way you can see where your bottleneck is, and that is where you optimize.

Unless the result of your query is input to some other query or system, it is useless to return that many (3M) rows. It would be smarter to return just an acceptable number of rows per query (like 1000), enough for visualizing.

Looking at your SQL - the lack of a WHERE clause means it is pulling all rows from:
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id)
You could look at partitioning the large isi_issues table; that would let MySQL work a bit quicker, since smaller files are easier to handle.
Alternatively, you can loop the statement and use a LIMIT clause. The first number is the offset and the second is the number of rows to return:
LIMIT 0, 100000
then
LIMIT 100000, 100000
This lets each statement run quicker, and you can deal with the data in batches.
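A minimal sketch of the batching idea applied to the query from the question. The ORDER BY is an addition: without it MySQL does not guarantee the same row order between runs, so batches could overlap or skip rows. It assumes that (citing.article_id, r.id_when_cited) identifies a result row uniquely, which the question does not confirm.

```sql
-- Batch 1: the first 100000 rows (offset 0, row count 100000).
SELECT citing.article_id AS citing, lac_a.year, r.id_when_cited,
       cited_issue.country, citing.num_citations
FROM isi_lac_authored_articles AS lac_a
JOIN isi_articles citing ON (lac_a.article_id = citing.article_id)
JOIN isi_citation_references r ON (citing.article_id = r.article_id)
JOIN isi_articles cited ON (cited.id_when_cited = r.id_when_cited)
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id)
ORDER BY citing.article_id, r.id_when_cited  -- assumed to be a stable, unique order
LIMIT 0, 100000;

-- Batch 2: same query, only the LIMIT changes.
-- The offset grows by the batch size; the row count stays the same.
SELECT citing.article_id AS citing, lac_a.year, r.id_when_cited,
       cited_issue.country, citing.num_citations
FROM isi_lac_authored_articles AS lac_a
JOIN isi_articles citing ON (lac_a.article_id = citing.article_id)
JOIN isi_citation_references r ON (citing.article_id = r.article_id)
JOIN isi_articles cited ON (cited.id_when_cited = r.id_when_cited)
JOIN isi_issues cited_issue ON (cited.issue_id = cited_issue.issue_id)
ORDER BY citing.article_id, r.id_when_cited
LIMIT 100000, 100000;
```

Keep in mind that MySQL still has to read and discard the skipped rows, so later batches with large offsets get progressively slower.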

Related

MySQL returning only part of results from FULL TEXT SEARCH after version 5.6 to 5.7 upgrade

I have a query that consists of two full text searches in boolean mode (combined with the OR operator) that worked just fine on MySQL 5.6 and that fails after bumping MySQL to version 5.7. Both DBs have the exact same set of records, and both are hosted on AWS (InnoDB, Aurora).
Query below (don't pay too much attention to the table/column names as I tried to anonymise them):
SELECT
cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE (
(MATCH(cars.name) AGAINST ('mark*' IN BOOLEAN MODE))
OR (MATCH(drivers.first_name, drivers.last_name, drivers.email) AGAINST ('mark*' IN BOOLEAN MODE))
);
Of course I have a fulltext index on the [first_name, last_name, email] columns, as well as a btree index on noobie_driver. There are two indices on cars.name: one btree and one fulltext.
Before the upgrade, the query returned proper results (counted in hundreds, out of a few million records in total).
After the upgrade, it seems that the query/optimizer focuses only on the first condition and completely disregards the second full text search (by driver's names and email), so it returns only a few records, those related directly to the search result of cars.name.
When the queries are run separately (first for cars.name and then for the driver details) and then combined, they return the same results as before the upgrade.
Also, when I force MySQL to ignore the noobie_driver index (or remove the noobie_driver condition), both full text search conditions are taken into consideration.
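For reference, the two workarounds described above could be written roughly like this. This is a sketch built from the anonymised query, not the exact statements that were run; the index name is the one on noobie_driver, index_drivers_on_noobie_driver.

```sql
-- Workaround 1: tell MySQL not to use the btree index on noobie_driver.
SELECT cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers IGNORE INDEX (index_drivers_on_noobie_driver)
        ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE (
  (MATCH(cars.name) AGAINST ('mark*' IN BOOLEAN MODE))
  OR (MATCH(drivers.first_name, drivers.last_name, drivers.email) AGAINST ('mark*' IN BOOLEAN MODE))
);

-- Workaround 2: run each full text search on its own and combine the results.
-- Note: UNION removes duplicate ids, which the single OR query does not do.
SELECT cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE MATCH(cars.name) AGAINST ('mark*' IN BOOLEAN MODE)
UNION
SELECT cars.id
FROM cars
INNER JOIN driver_licenses ON driver_licenses.car_id = cars.id
INNER JOIN drivers ON drivers.id = driver_licenses.driver_id AND drivers.noobie_driver = 0
WHERE MATCH(drivers.first_name, drivers.last_name, drivers.email) AGAINST ('mark*' IN BOOLEAN MODE);
```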
Running EXPLAIN in both DBs returns the same results:
+----+-------------+---------------------------+------------+--------+------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+-----------------------------------------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------------------+------------+--------+------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+-----------------------------------------------------+------+----------+-------------+
| 1 | SIMPLE | drivers | NULL | ref | PRIMARY,index_drivers_on_noobie_driver | index_drivers_on_noobie_driver | 1 | const | 6798 | 100.00 | NULL |
| 1 | SIMPLE | driver_licenses | NULL | ref | index_driver_licenses_on_car_id,index_driver_licenses_on_driver_id | index_driver_licenses_on_driver_id | 5 | Rental.drivers.id | 1 | 100.00 | Using where |
| 1 | SIMPLE | cars | NULL | eq_ref | PRIMARY | PRIMARY | 4 | Rental.driver_licenses.car_id | 1 | 100.00 | Using where |
+----+-------------+---------------------------+------------+--------+------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+-----------------------------------------------------+------+----------+-------------+
Tomorrow I'll be working on rebuilding the index/table(s) to see if that brings any changes to the behaviour on 5.7; once it's done I'll come back with more details. Running OPTIMIZE TABLE on all 3 tables hasn't fixed anything here.
I'm wondering:
Have I missed something, and is it now a feature in 5.7 that it behaves this way?
How can I overcome the issue and keep the exact same query (so without ignoring the index or performing two separate queries and combining the results afterwards)?
OK, dropping and recreating the index on the noobie_driver column seems to do the trick on the smoke-environment database, which contains just a few thousand records in the drivers table:
DROP INDEX index_drivers_on_noobie_driver ON drivers;
CREATE INDEX index_drivers_on_noobie_driver USING BTREE ON drivers(noobie_driver);
BUT with production data, which holds ~2 million records in the drivers table, dropping and recreating the index did not help. I'm starting to believe this could be a bug tied to the MySQL version.
I will update the question once I learn something new.

Mysql: explain returns more rows than the actual number

I have a table that contains about 40M rows according to a count.
select count(*) from xxxs;
returns 38000389
but the explain:
mysql> explain select * from xxxs where s_uuid = "21eaef";
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | xxxs | NULL | ALL | NULL | NULL | NULL | NULL | 56511776 | 10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.06 sec)
Why is rows 56M, which is much larger than 40M?
Thanks
UPDATE
1. The above query may take several minutes. Is that normal? How can I tune the performance?
2. I plan to create an index on s_uuid. I guess it will improve the performance. Am I right?
The "rows" in EXPLAIN is an estimate based on statistics that were gathered in the recent past. The value is rarely exact; sometimes it is even off by more than a factor of two.
Still, the estimate is usually "good enough" for the Optimizer to decide how to perform the query.
Another place to see this estimate of row count is via
SHOW TABLE STATUS LIKE 'xxxs';
(As mentioned in a Comment) Adding this is likely to speed up select * from xxxs where s_uuid = "21eaef";:
INDEX(s_uuid)
I say "likely to" because, if a lot of rows have s_uuid = "21eaef", the Optimizer will shun the index and simply scan the entire table rather than bouncing back and forth between the index's BTree and the data's BTree. You can see the "shun" in EXPLAIN: possible_keys lists the index, but key = NULL.
There will be cases where the Optimizer makes the 'wrong' choice. But we can discuss that in another Q&A.
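The steps discussed in this answer can be sketched as follows. The index name idx_uuid is just a label chosen for the example; ANALYZE TABLE refreshes the statistics behind the estimate, but the "rows" value remains an estimate afterwards.

```sql
-- Refresh the index statistics that the Optimizer's row estimates are based on.
ANALYZE TABLE xxxs;

-- Another view of the estimated row count (see the Rows column).
SHOW TABLE STATUS LIKE 'xxxs';

-- Add the index on the filtered column (the index name is hypothetical).
ALTER TABLE xxxs ADD INDEX idx_uuid (s_uuid);

-- Check whether the Optimizer picks the new index:
-- "key" should show idx_uuid instead of NULL.
EXPLAIN SELECT * FROM xxxs WHERE s_uuid = '21eaef';
```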

MySQL InnoDB indexes slowing down sorts

I am using MySQL 5.6 on FreeBSD and have just recently switched from using MyISAM tables to InnoDB to gain the advantages of foreign key constraints and transactions.
After the switch, I discovered that a query on a table with 100,000 rows that was previously taking .003 seconds, was now taking 3.6 seconds. The query looked like this:
SELECT *
FROM USERS u
JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
I noticed that if I removed the ORDER BY clause, the execution time dropped back down to .003 seconds, so the problem is obviously in the sorting.
I then discovered that if I added back the ORDER BY but removed indexes on the columns referred to in the query (STATUS and ACCESS_ID), the query execution time would take the normal .003 seconds.
Then I discovered that if I added back the indexes on the STATUS and ACCESS_ID columns, but used IGNORE INDEX (STATUS,ACCESS_ID), the query would still execute in the normal .003 seconds.
Is there something about InnoDB and sorting results when referencing an indexed column in a WHERE clause that I don't understand?
Or am I doing something wrong?
EXPLAIN for the slow query returns the following results:
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | u | ref | PRIMARY,STATUS,ACCESS_ID | STATUS | 2 | const | 53902 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | mf | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.u.USER_ID | 1 | NULL |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
EXPLAIN for the fast query returns the following results:
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| 1 | SIMPLE | mf | index | PRIMARY | STREAK | 2 | NULL | 100 | NULL |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.mf.USER_ID | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
Any help would be greatly appreciated.
In the slow case, MySQL is making an assumption that the index on STATUS will greatly limit the number of users it has to sort through. MySQL is wrong. Presumably most of your users are ACTIVE. MySQL is picking up 50k user rows, checking their ACCESS_ID, joining to MIGHT_FLOCK, sorting the results and taking the first 100 (out of 50k).
In the fast case, you have told MySQL it can't use either index on USERS. MySQL is using its next-best index, it is taking the first 100 rows from MIGHT_FLOCK using the STREAK index (which is already sorted), then joining to USERS and picking up the user rows, then checking that your users are ACTIVE and have an ACCESS_ID at or above 8. This is much faster because only 100 rows are read from disk (x2 for the two tables).
I would recommend:
Drop the index on STATUS unless you frequently need to retrieve INACTIVE users (not ACTIVE users). This index is not helping you.
Read this question to understand why your sorts are so slow. You can probably tune InnoDB for better sort performance to prevent this kind of problem.
If you have very few users with ACCESS_ID at or above 8, you should see a dramatic improvement already. If not, you might have to use STRAIGHT_JOIN in your select clause.
Example below:
SELECT *
FROM MIGHT_FLOCK mf
STRAIGHT_JOIN USERS u ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
STRAIGHT_JOIN forces MySQL to access the MIGHT_FLOCK table before the USERS table based on the order in which you specify those two tables in the query.
To answer the question "Why did the behaviour change" you should start by understanding the statistics that MySQL keeps on each index: http://dev.mysql.com/doc/refman/5.6/en/myisam-index-statistics.html. If statistics are not up to date or if InnoDB is not providing sufficient information to MySQL, the query optimiser can (and does) make stupid decisions about how to join tables.
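One way to look at those statistics, as a rough sketch: refresh them and then inspect the Cardinality column that MySQL reports for each index. A low cardinality on the STATUS index compared with the table's row count would be consistent with the explanation above that most users are ACTIVE.

```sql
-- Recompute index statistics for both tables.
ANALYZE TABLE USERS, MIGHT_FLOCK;

-- Inspect the per-index statistics; the Cardinality column is the
-- estimated number of distinct values in each index.
SHOW INDEX FROM USERS;
SHOW INDEX FROM MIGHT_FLOCK;

-- Re-check the plan afterwards to see whether the join order changed.
EXPLAIN
SELECT *
FROM USERS u
JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8
ORDER BY mf.STREAK DESC LIMIT 0,100;
```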

How to speed up mysql select in database with highly redundant key values

I have a very simple MySQL database with only 3 columns but several million rows.
Two of the columns (hid1, hid2) describe study objects (about 50,000 of them) and the third column (score) is the result of a comparison of hid1 with hid2. Thus, the number of rows is max(hid1)*max(hid2), which is quite a big number. Because the table has to be written only once and read many million times, I selected a MyISAM table (I hope this was a good idea). Initially, it was planned that I would retrieve 'score' for a given pair of hid1,hid2 but it turned out to be more convenient to retrieve all scores (and hid2) for a given hid1.
My table ("result") looks like this:
+-------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+-------+
| hid1 | mediumint(8) unsigned | YES | MUL | NULL | |
| hid2 | mediumint(8) unsigned | YES | | NULL | |
| score | float | YES | | NULL | |
+-------+-----------------------+------+-----+---------+-------+
and a typical query would be
select hid1,hid2,score from result where hid1=13531 into outfile "/tmp/ttt"
Here is the problem: the query just takes too long, at least sometimes. For some hid1 values, I get the result back in under a second. For other hid1 values (particularly big numbers), I have to wait up to 40 sec. As I said, I have to run thousands of these queries, so I am interested in speeding things up.
Let me reiterate: there are about 50,000 hits to the query, and I don't need them in any particular order. Am I doing something wrong here, or is a relational database like MySQL not up to this task?
What I already tried is to increase the key_buffer in /etc/mysql/my.conf
this appeared to help, but not much. The index on hid1 is a few GB, does the key_buffer have to be bigger than the index size to be effective?
Any hint would be appreciated.
Edit: here is an example run with the corresponding 'explain' output:
select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt"
Query OK, 16465 rows affected (31.88 sec)
As you can see below, the index hid1_index is actually being used:
mysql> explain select hid1,hid2,score from result where hid1=132885 into outfile "/tmp/ttt";
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
| 1 | SIMPLE | result | ref | hid1_index | hid1_index | 4 | const | 15456 | Using where |
+----+-------------+--------+------+---------------+------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
What I do find puzzling is that queries with low numbers for hid1 are always much faster than those with high numbers. This is not what I would expect from using an index.
Two random suggestions, based on a query pattern that always involves an equality filter on hid1:
Use an InnoDB table instead and take advantage of a clustered index on (hid1, hid2). That way all rows belonging to the same hid1 will be physically located together, and this will speed up retrieval.
Hash-partition the table on hid1, with a suitable number of partitions.
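A minimal sketch of both suggestions, under the assumptions that every (hid1, hid2) pair is unique and that neither column actually contains NULLs (the table definition allows them, and a primary key does not). The partition count of 64 is an arbitrary placeholder.

```sql
-- Suggestion 1: convert to InnoDB with a clustered primary key on (hid1, hid2).
-- The old secondary index on hid1 becomes redundant, because hid1 is the
-- leftmost column of the primary key.
ALTER TABLE result
  MODIFY hid1 MEDIUMINT UNSIGNED NOT NULL,
  MODIFY hid2 MEDIUMINT UNSIGNED NOT NULL,
  DROP INDEX hid1_index,
  ADD PRIMARY KEY (hid1, hid2),
  ENGINE = InnoDB;

-- Suggestion 2: hash-partition on hid1 so that a query for one hid1
-- only touches one partition (64 is a hypothetical value).
ALTER TABLE result
  PARTITION BY HASH (hid1)
  PARTITIONS 64;
```

On a table this size each ALTER rebuilds the whole table, so it is worth trying on a copy first.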
The simplest way to optimize a query like that would be to use an index. A simple thing like
alter table result add index(hid1);
would improve the query you sent. Moreover, if you want to search by both fields at once, you can use both fields in the index:
alter table result add index(hid1, hid2);
That way, MySQL can access the rows in a very organized way and find the information you want.
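If you run an explain on the first query, you might see something like
+-------------+--------+------+---------------+---------+-------------+
| select_type | table | type | possible_keys | rows | Extra |
+-------------+--------+------+---------------+---------+-------------+
| SIMPLE | result | ALL | | 7765605 | Using where |
+-------------+--------+------+---------------+---------+-------------+
After adding the index, you should see
+-------------+--------+------+---------------+---------+-------+
| select_type | table | type | possible_keys | rows | Extra |
+-------------+--------+------+---------------+---------+-------+
| SIMPLE | result | ref | hid1 | 2816304 | |
+-------------+--------+------+---------------+---------+-------+
This tells you that in the first case MySQL needs to check ALL the rows, and in the second case it can find the information using a ref lookup on the index.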
If you know the combination of hid1 and hid2 is unique, you should consider making that your primary key. Because hid1 is the leftmost column, the primary key can also serve lookups on hid1 alone. See: http://dev.mysql.com/doc/refman/5.5/en/multiple-column-indexes.html
Also, check the output of EXPLAIN. See: http://dev.mysql.com/doc/refman/5.5/en/select-optimization.html and related links.

MySQL has indexed tables and EXPLAIN looks good, but still not using index

I am trying to optimize a query, and all looks well when I go to "EXPLAIN" it, but it's still coming up in the "log_queries_not_using_indexes" log.
Here is the query:
SELECT t1.id_id,t1.change_id,t1.like_id,t1.dislike_id,t1.user_id,t1.from_id,t1.date_id,t1.type_id,t1.photo_id,t1.mobile_id,t1.mobiletype_id,t1.linked_id
FROM recent AS t1
LEFT JOIN users AS t2 ON t1.user_id = t2.id_id
WHERE t2.active_id=1 AND t1.postedacommenton_id='0' AND t1.type_id!='Friends'
ORDER BY t1.id_id DESC LIMIT 35;
So it grabs 'wallpost'-like data, and I joined in the users table to make sure the user is still an active user (the number 1), plus two other small "ANDs".
When I run this with the EXPLAIN in phpmyadmin it shows
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | t1 | index | user_id | PRIMARY | 4 | NULL | 35 | Using where
1 | SIMPLE | t2 | eq_ref | PRIMARY,active_id | PRIMARY | 4 | hnet_user_info.t1.user_id | 1 | Using where
It shows the t1 query found 35 rows using "WHERE", and the t2 query found 1 row (the user), using "WHERE"
So I can't figure out why it's showing up in the log_queries_not_using_indexes report.
Any tips? I can post more info if you need it.
tldr; ignore the "not using index warning". A query execution time of 1.3 milliseconds is not a problem; there is nothing to optimize here - look at the entire performance profile to find bottlenecks.
Trust the database engine. The database query planner will use indices when it determines that doing so is beneficial. In this case, due to the low cardinality estimates (35x1), the query planner decided that there was no reason to use indices for the actual execution plan. If indices were used in a trivial case like this it could actually increase the query execution time.
As always, use the 97/3 rule.
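If the goal is simply to keep trivial queries like this one out of the log, one option is the min_examined_row_limit system variable, which keeps queries that examine fewer rows than the limit out of the slow query log. This is a sketch; the threshold of 1000 is an arbitrary example value, not a recommendation.

```sql
-- Check the current logging settings.
SHOW VARIABLES LIKE 'log_queries_not_using_indexes';
SHOW VARIABLES LIKE 'min_examined_row_limit';

-- Only log queries that examined at least this many rows
-- (1000 is a hypothetical threshold; applies to new connections).
SET GLOBAL min_examined_row_limit = 1000;
```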