retrieving top-ranking rows from large tables using FULLTEXT is very slow - mysql

When we log into our database with mysql-client and launch these queries:
first test query:
select a.*
from ads a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"bmw serie 3"' in boolean mode)
order by a.ranking asc limit 0, 10;
The result is:
10 rows in set (1 min 5.37 sec)
second test query:
select a.*
from ads a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"ford mondeo"' in boolean mode)
order by a.ranking asc limit 0, 10;
The result is:
10 rows in set (2 min 13.88 sec)
These queries are too slow. Is there a way to improve this?
The 'ads' table contains 2 million rows; triggers duplicate the data into 'searchs_titles', which contains the id, title and label of each row in ads.
Table 'ads' uses InnoDB and 'searchs_titles' uses MyISAM, with a fulltext index on the label field.
Do we have too many columns? Too many indexes? Too many rows?
Is it a bad query?
Thanks a lot for the time you will spend helping us!
Edit: add explain
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | s | fulltext | id_ad,label | label | 0 | | 1 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY,id,id_2,id_3 | PRIMARY | 4 | XXXXXX.s.id_ad | 1 | |

Pro tip: Never use * in a SELECT statement in production software (unless you have a very good reason). By asking for all columns, you are denying the optimizer access to information about how best to exploit your indexes.
Observation: you're ordering by ads.ranking and taking ten results. But ads.ranking has very low cardinality -- according to that image in your question, it has 26 distinct values. Is your query working correctly?
Observation: You've said that the fulltext part of your search takes .77 seconds. I mean this part:
select s.id
from searchs_titles AS s
where match(s.label) against ('"ford mondeo"' in boolean mode)
That is good. It means we can focus on the rest of the query.
You also said you've been testing with the insertions to the table turned off. That's good because it rules out contention as a cause for the slow queries.
Suggestion: Create a suitable compound index for ads. For your present query, try an index on (id, ranking). This may allow your ORDER BY operation to avoid a full table scan.
Then, try this query to extract the set of ten a.id values you need, and then retrieve the data rows. This will exploit your compound index.
select z.*
from ads AS z
join ( select a.id, a.ranking
from ads AS a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"ford mondeo"' in boolean mode)
order by a.ranking asc
limit 0, 10
) AS b ON z.id = b.id
order by z.ranking
This uses a subquery to do the order by ... limit ... data-shuffling operation on a small subset of the columns. This should make the retrieval of the appropriate id values much faster. Then the outer query fetches the full rows for just those ids.
The bottom line is this: ORDER BY ... LIMIT ... can be a very expensive operation if it's done on lots of data. But if you can arrange for it to be done on a minimal choice of columns, and those columns are indexed correctly, it can be very fast.


How does Index Scope work in Mysql?

In the MySQL manual there is a page on index hinting that mentions that you can specify the index hinting for specific parts of the query.
You can specify the scope of an index hint by adding a FOR clause to the hint. This provides more fine-grained control over the optimizer's selection of an execution plan for various phases of query processing. To affect only the indexes used when MySQL decides how to find rows in the table and how to process joins, use FOR JOIN. To influence index usage for sorting or grouping rows, use FOR ORDER BY or FOR GROUP BY.
However, there is little to no further information about how this works or what it actually does in the MySQL optimizer. In practice, it also appears to make a negligible difference.
Here is a test query and what explain says about the query:
SELECT
`property`.`primary_id` AS `id`
FROM `California` `property`
USE INDEX FOR JOIN (`Zipcode Bedrooms`)
USE INDEX FOR ORDER BY (`Zipcode Bathrooms`)
INNER JOIN `application_zipcodes` `az`
ON `az`.`application_id` = '18'
AND `az`.`zipcode` = `property`.`zipcode`
WHERE `property`.`city` = 'San Jose'
AND `property`.`zipcode` = '95133'
AND `property`.`property_type` = 'Residential'
AND `property`.`style` = 'Condominium'
AND `property`.`bedrooms` = '3'
ORDER BY `property`.`bathrooms` ASC
LIMIT 15
;
Explain:
EXPLAIN SELECT `property`.`primary_id` AS `id` FROM `California` `property` USE INDEX FOR JOIN (`Zipcode Bedrooms`) USE INDEX FOR ORDER BY (`Zipcode Bathrooms`) INNER JOIN `application_zipcodes` `az` ON `az`.`application_id` = '18' AND `az`.`zipcode` = `property`.`zipcode` WHERE `property`.`city` = 'San Jose' AND `property`.`zipcode` = '95133' AND `property`.`property_type` = 'Residential' AND `property`.`style` = 'Condominium' AND `property`.`bedrooms` = '3' ORDER BY `property`.`bathrooms` ASC LIMIT 15\g
+------+-------------+----------+--------+---------------+---------+---------+------------------------------------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+--------+---------------+---------+---------+------------------------------------+------+----------------------------------------------------+
| 1 | SIMPLE | Property | ref | Zip Bed | Zip Bed | 17 | const,const | 2364 | Using index condition; Using where; Using filesort |
| 1 | SIMPLE | az | eq_ref | PRIMARY | PRIMARY | 7 | const,Property.zipcode | 1 | Using where; Using index |
+------+-------------+----------+--------+---------------+---------+---------+------------------------------------+------+----------------------------------------------------+
2 rows in set (0.01 sec)
So to summarize I am basically wondering how the index scope was meant to be used, as this doesn't seem to do anything when I add or remove the line USE INDEX FOR ORDER BY (Zipcode Bathrooms).
I have yet to figure out how multiple hints can be used. MySQL will almost never use more than one index per SELECT. The only exception I know of is with "index merge", which is not relevant in your example.
The Optimizer usually focuses on finding a good index for the WHERE clause. If it entirely covers the WHERE, without any "ranges", then it checks to see if there are GROUP BY and ORDER BY fields, in the right order, to use. If it can handle all of WHERE, GROUP BY, and ORDER BY, then it can actually optimize the LIMIT (but not OFFSET).
If the Optimizer can't consume all the WHERE, it may reach into the ORDER BY in hopes of avoiding the "filesort" that ORDER BY otherwise requires.
None of this allows for different indexes for different clauses. A single hint may encourage the use of one of the cases above in preference to the other; I don't know.
Don't use utf8 for zipcode; it makes things bulkier than necessary (3 bytes per character). In general, shrinking the size of the table will help performance some. Or, if you have a huge dataset, it may help perf a lot. (Avoiding I/O is very important.)
Bathrooms is not very selective; there is not much to gain even if it would be possible.
az.application_id is the big monkey wrench in the query; what is it?

Why does the query execute so much slower when all the columns involved are the same and only the where condition changes?

I have this query:
SELECT 1 AS InputIndex,
IF(TRIM(DeviceInput1Name) = '', 0, IF(INSTR(DeviceInput1Name, '|') > 0, 2, 1)) AS InputType,
(SELECT Value1_1 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueLeft,
(SELECT Value1_2 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueRight
FROM devices
WHERE DeviceIMEI = 'Some_Search_Value';
This completes fairly quickly (in up to 0.01 seconds). However, running the same query with WHERE clause as such
WHERE DeviceIMEI = 'Some_Other_Search_Value';
makes it run for upwards of 14 seconds! Some search values finish very quickly, while others run way too long.
If I run EXPLAIN on either query, I get the following:
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | PRIMARY | devices | ref | DeviceIMEI | DeviceIMEI | 28 | const | 1 | Using where |
| 3 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
Also, here's the actual number of records, just so it's clear:
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Search_Value';
+----------+
| count(*) |
+----------+
| 1017946 |
+----------+
1 row in set (0.17 sec)
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Other_Search_Value';
+----------+
| count(*) |
+----------+
| 306100 |
+----------+
1 row in set (0.04 sec)
Any ideas why changing a search value in the WHERE clause would cause the query to execute so slowly, even when the number of physical records to search through is lower?
Note there is no need for you to rewrite the query, just explain why the above happens.
UPDATE: I have tried running two separate queries instead of one with dependent subqueries to get the information I need (first I select DeviceID from devices by DeviceIMEI, then select from devicevalues by DeviceID I got from the previous query) and all queries return instantly. I suppose the only solution is to run these queries in a transaction, so I'll be making a stored procedure to do this. This, however, still doesn't answer my question which puzzles me.
I don't think that 1017946 is equivalent to the number of rows returned by your very first query. Your first query returns all rows from devices, with some correlated subqueries; your count query returns all rows common to the two tables. If so, the problem might be a cardinality issue: Some_Other_Search_Value may make up a much larger proportion of the rows in your first query than Some_Search_Value, so MySQL chooses a table scan.
If I understand correctly, the query is the same, and only the searched value changes.
There are three real possibilities as far as I can see, the first much likelier than the others:
The fast query only appears to be fast, because its result is already in the MySQL query cache. Try disabling the cache, running with SQL_NO_CACHE, or running the slow query twice. If the second run takes 0.01s instead of 14s, you'll know this is the case.
One query has to look at far more records than the other. One IMEI may have lots of rows in devicevalues, while another might have next to none. You do appear to be in that situation, but what makes this explanation unlikely (apart from the times involved) is that the slower IMEI is the one with fewer matches.
The slow query is indeed slow. This means that a particular subset of data is hard to locate or hard to retrieve. The first may be due to an overdue reindexing or to filesystem fragmentation of very large indexes. The second can also be due to fragmentation of the tablespace, or to other condition which splits up records (for example the database is partitioned). A search in a small partition is wont to be faster than a search in a large partition.
But the time differences involved aren't equal in the three cases, and a 1400x difference seems to me an unlikely consequence of (2) or (3). The first possibility seems way more appealing.
Update: you do not seem to be using indexes on your subqueries. Do you have an index such as
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime);
If you can, you can try a covering index:
CREATE INDEX dv_cover_ndx ON devicevalues(DeviceID, ValueTime, Value1_1, Value1_2);
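One way to check whether a (DeviceID, ValueTime) index is picked up by the "latest value per device" subquery is to ask the engine for its plan. The sketch below uses Python's sqlite3 module purely as a stand-in for MySQL, which is an assumption: the two optimizers and their plan output differ, so on MySQL you would read EXPLAIN instead. Table and column names follow the question; the rows are made up.

```python
import sqlite3

# SQLite stands in for MySQL here; the data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE devicevalues (
        DeviceID  INTEGER,
        ValueTime INTEGER,
        Value1_1  REAL,
        Value1_2  REAL
    );
    CREATE INDEX dv_ndx ON devicevalues (DeviceID, ValueTime);
""")
conn.executemany(
    "INSERT INTO devicevalues VALUES (?, ?, ?, ?)",
    [(d, t, d * 1000 + t, 0.0) for d in range(1, 4) for t in range(100)],
)

# Same shape as the correlated subquery in the question, with the
# device id passed as a parameter.
LATEST = """
    SELECT Value1_1 FROM devicevalues
    WHERE DeviceID = ?
    ORDER BY ValueTime DESC
    LIMIT 1
"""

def query_plan(sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]

plan = query_plan(LATEST, (2,))
latest_value = conn.execute(LATEST, (2,)).fetchone()[0]
print(plan)
print(latest_value)
```

If the plan mentions dv_ndx, the lookup is driven by DeviceID rather than by walking the whole ValueTime index, which is the behavior the EXPLAIN in the question suggests MySQL was not getting.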

Efficiently fetching 25 rows with highest sum of two columns (MySQL)

I have a top list page on my website, which fetches 25 rows with highest values in a particular column. I have no issues fetching the top list if it is based on one column (score for instance), but when more columns are involved, I faced some performance issues.
In the problematic case I want to select 25 rows, ordered by a sum of two columns in a descending order.
SELECT username, rank1 + rank2 AS rank FROM users ORDER BY rank DESC LIMIT 25
The query works, but takes approximately 0.25 seconds to finish, in contrast to queries on single column which take about 0.0003. Below is the result for explain query:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | accounts | ALL | NULL | NULL | NULL | NULL | 517874 | Using filesort
Both rank1 and rank2 are indexed, but clearly the indexes are not used for this query. Is there a way to improve the performance by somehow editing the query or the indexes?
MySQL does not handle this situation very well. Other databases (Oracle, Postgres, SQL Server, for example) offer some form of function-based indexes which can directly solve this problem. To do this in MySQL requires adding a new column to the table, then adding a trigger to keep it up-to-date. And finally an index on the new column. Perhaps a lot of work.
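The "extra column, trigger, index" approach can be sketched as follows. This is a minimal illustration in Python with sqlite3 standing in for MySQL (an assumption: MySQL's trigger syntax differs, for example it would typically set NEW.rank_sum in a BEFORE trigger). The rank_sum column, the trigger names and the index name are hypothetical; the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        username TEXT PRIMARY KEY,
        rank1    INTEGER NOT NULL,
        rank2    INTEGER NOT NULL,
        rank_sum INTEGER NOT NULL DEFAULT 0   -- hypothetical extra column
    );
    CREATE INDEX users_rank_sum ON users (rank_sum);

    -- Keep rank_sum in step with rank1 + rank2 on every write.
    CREATE TRIGGER users_sum_ins AFTER INSERT ON users
    BEGIN
        UPDATE users SET rank_sum = NEW.rank1 + NEW.rank2
        WHERE username = NEW.username;
    END;
    CREATE TRIGGER users_sum_upd AFTER UPDATE OF rank1, rank2 ON users
    BEGIN
        UPDATE users SET rank_sum = NEW.rank1 + NEW.rank2
        WHERE username = NEW.username;
    END;
""")
conn.executemany(
    "INSERT INTO users (username, rank1, rank2) VALUES (?, ?, ?)",
    [(f"user{i}", (i * 37) % 101, (i * 53) % 97) for i in range(500)],
)
# An update must also refresh the stored sum.
conn.execute("UPDATE users SET rank1 = 1000 WHERE username = 'user7'")

def top_by_sum(limit=25):
    """Top list read from the stored, indexed sum."""
    return conn.execute(
        "SELECT username, rank_sum FROM users "
        "ORDER BY rank_sum DESC, username LIMIT ?",
        (limit,),
    ).fetchall()

def top_by_expression(limit=25):
    """Top list computed from the expression, as in the original query."""
    return conn.execute(
        "SELECT username, rank1 + rank2 FROM users "
        "ORDER BY rank1 + rank2 DESC, username LIMIT ?",
        (limit,),
    ).fetchall()

print(top_by_sum(3))
```

Both functions should return the same rows; the difference is that the first can be served in index order, while the second has to compute the sum for every row before sorting.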
In some situations, you might be able to assume that the top XXX by the sum is going to be in the top YYY for each ranking. If this is true, then a query such as this will improve performance:
select ur1.*
from (select u.*
from users u
order by rank1 desc
limit 1000
) ur1 join
(select u.*
from users u
order by rank2 desc
limit 1000
) ur2
on ur1.username = ur2.username
order by ur1.rank1 + ur1.rank2 desc
limit 25;
This extracts the top 1000 (or whatever values) by each ranking and then identifies users common to the two lists. Hopefully there are 25 such users (for your application). At the very least, this should perform better than the overall query. You can first try this. If it returns 25 rows, then great. Otherwise, go for your original query.
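The shortlist idea, including the fallback to the full query, can be sketched in plain Python. The function names and the in-memory tuples of (username, rank1, rank2) are assumptions made for illustration. Note that the shortcut is only exact when the stated assumption holds: getting 25 rows back shows that enough users appear in both lists, not that they are the true top 25.

```python
import heapq

def top_by_sum_exact(users, k):
    """Full scan: rank every user by rank1 + rank2 (ties broken by name)."""
    return heapq.nlargest(k, users, key=lambda u: (u[1] + u[2], u[0]))

def top_by_sum_shortlist(users, k, shortlist=1000):
    """Intersect the top `shortlist` users by rank1 and by rank2, then rank
    the common users by sum. Falls back to the exact scan when the
    intersection holds fewer than k users."""
    top1 = {u[0] for u in heapq.nlargest(shortlist, users, key=lambda u: u[1])}
    top2 = {u[0] for u in heapq.nlargest(shortlist, users, key=lambda u: u[2])}
    common = [u for u in users if u[0] in top1 and u[0] in top2]
    if len(common) < k:
        return top_by_sum_exact(users, k)
    return heapq.nlargest(k, common, key=lambda u: (u[1] + u[2], u[0]))

# Correlated ranks: the shortlist contains the true top 25.
correlated = [(f"u{i}", i, i) for i in range(5000)]
# Anti-correlated ranks: the two shortlists do not overlap, so the
# function falls back to the exact scan.
anti = [(f"u{i}", i, 5000 - i) for i in range(5000)]

print(top_by_sum_shortlist(correlated, 3))
print(top_by_sum_shortlist(anti, 3))
```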

Why is MySQL slow when using LIMIT in my query?

I'm trying to figure out why one of my queries is slow and how I can fix it, but I'm a bit puzzled by my results.
I have an orders table with around 80 columns and 775179 rows and I'm running the following query:
SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC LIMIT 200
which returns 38 rows in 4.5s
When removing the ORDER BY I'm getting a nice improvement :
SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL LIMIT 200
38 rows in 0.30s
But when removing the LIMIT without touching the ORDER BY I'm getting an even better result :
SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC
38 rows in 0.10s (??)
Why is my LIMIT so hungry?
GOING FURTHER
I was trying a few things before posting and, after noticing that I had an index on creation_date (which is a datetime), I removed it and the first query now runs in 0.10s. Why is that?
EDIT
Good guess, I have indexes on the other columns that are part of the WHERE clause.
mysql> explain SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC LIMIT 200;
+----+-------------+--------+-------+------------------------+---------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+------------------------+---------------+---------+------+------+-------------+
| 1 | SIMPLE | orders | index | id_state_idx,id_mp_idx | creation_date | 5 | NULL | 1719 | Using where |
+----+-------------+--------+-------+------------------------+---------------+---------+------+------+-------------+
1 row in set (0.00 sec)
mysql> explain SELECT * FROM orders WHERE id_state = 2 AND id_mp IS NOT NULL ORDER BY creation_date DESC;
+----+-------------+--------+-------+------------------------+-----------+---------+------+-------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+------------------------+-----------+---------+------+-------+----------------------------------------------------+
| 1 | SIMPLE | orders | range | id_state_idx,id_mp_idx | id_mp_idx | 3 | NULL | 87502 | Using index condition; Using where; Using filesort |
+----+-------------+--------+-------+------------------------+-----------+---------+------+-------+----------------------------------------------------+
Indexes do not necessarily improve performance. To better understand what is happening, it would help if you included the explain for the different queries.
My best guess would be that you have an index on id_state or even on (id_state, id_mp) that can be used to satisfy the where clause. If so, the first query without the order by would use this index. It should be pretty fast. Even without an index, this requires a sequential scan of the pages in the orders table, which can still be pretty fast.
Then when you add the index on creation_date, MySQL decides to use that index instead for the order by. This requires reading each row in the index, then fetching the corresponding data page to check the where conditions and return the columns (if there is a match). This reading is highly inefficient, because it is not in "page" order but rather as specified by the index. Random reads can be quite inefficient.
Worse, even though you have a limit, you still have to read the entire table: only 38 rows match, so the scan never reaches the 200-row limit and cannot stop early. Although you have saved a sort on 38 records, you have created a massively inefficient query.
By the way, this situation gets significantly worse if the orders table does not fit in available memory. Then you have a condition called "thrashing", where each new record tends to generate a new I/O read. So, if a page has 100 records on it, the page might have to be read 100 times.
You can make all these queries run faster by having an index on orders(id_state, id_mp, creation_date). The where clause will use the first two columns and the order by will use the last.
The same problem happened in my project.
I did some tests and found that LIMIT is slow because of row lookups.
See:
MySQL ORDER BY / LIMIT performance: late row lookups
So, the solution is:
(A) When using LIMIT, select only the PK columns rather than all columns.
(B) Select all the columns you need, then join with the result set of (A).
The SQL should look like this:
SELECT
*
FROM
orders O1 -- the rows you actually want
JOIN
(
SELECT
ID -- fetch the PK column only; this should be fast
FROM
orders
WHERE
[your query condition] -- filter records by condition
ORDER BY
[your order by condition] -- control the record order
LIMIT 2000, 50 -- filter records by paging condition
) as O2
ON
O1.ID = O2.ID
ORDER BY
[your order by condition] -- control the record order
In my DB,
the old SQL, which selects all columns using "LIMIT 21560, 20", takes about 4.484s.
The new SQL takes only 0.063s, which is about 71 times faster.
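To check that the two-step form returns exactly the same page as the straightforward query, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL (an assumption: it demonstrates equivalence of results, not MySQL's timings). The narrow orders table, its payload column, the index name and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id            INTEGER PRIMARY KEY,
        id_state      INTEGER,
        creation_date INTEGER,
        payload       TEXT      -- stands in for the many wide columns
    );
    CREATE INDEX orders_state_date ON orders (id_state, creation_date);
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 3, (i * 7919) % 100003, "x" * 50) for i in range(1, 5001)],
)

# Straightforward form: every skipped row is fetched in full.
PLAIN = """
    SELECT * FROM orders
    WHERE id_state = 2
    ORDER BY creation_date DESC, id
    LIMIT 20 OFFSET 1000
"""

# Deferred form: page through ids only, then fetch the 20 full rows.
DEFERRED = """
    SELECT o1.*
    FROM orders AS o1
    JOIN (
        SELECT id FROM orders
        WHERE id_state = 2
        ORDER BY creation_date DESC, id
        LIMIT 20 OFFSET 1000
    ) AS o2 ON o1.id = o2.id
    ORDER BY o1.creation_date DESC, o1.id
"""

plain_rows = conn.execute(PLAIN).fetchall()
deferred_rows = conn.execute(DEFERRED).fetchall()
print(len(deferred_rows), plain_rows == deferred_rows)
```

The id is added to the ORDER BY as a tie-breaker so that both forms are deterministic; without it, rows sharing a creation_date could legitimately come back in a different order.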
I had a similar issue on a table of 2.5 million records. Without the LIMIT part, the query took a few seconds. With the LIMIT part, it hung forever.
I solved it with a subquery. In your case it would become:
SELECT *
FROM
(SELECT *
FROM orders
WHERE id_state = 2
AND id_mp IS NOT NULL
ORDER BY creation_date DESC) tmp
LIMIT 200
I noted that the original query was fast when the number of selected rows was greater than the limit parameter. So the query became extremely slow when the limit parameter was useless.
Another solution is to try forcing an index. In your case you can try:
SELECT *
FROM orders force index (id_mp_idx)
WHERE id_state = 2
AND id_mp IS NOT NULL
ORDER BY creation_date DESC
LIMIT 200
The problem is that MySQL is forced to sort data on the fly. My query with a deep offset like:
ORDER BY somecol LIMIT 99990, 10
took 2.5s.
I fixed it by creating a new table that holds the data presorted by somecol and contains only ids; there, the deep offset (without needing ORDER BY) takes 0.09s.
0.1s is still not fast enough, though. 0.01s would be better.
I will end up creating a table that holds the page number in a dedicated indexed column, so instead of doing LIMIT x, y I will query WHERE page = Z.
I just tried it and it is as fast as 0.0013s. The only problem is that the offsetting is based on static numbers (presorted in pages of 10 items, for example). It's not that big a problem, though: you can still get any data from any page.
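A minimal sketch of the precomputed page table, again using Python's sqlite3 module as a stand-in for MySQL. The items and items_pages tables, their columns, the index name and the fixed page size are all assumptions made for illustration; the lookup table has to be rebuilt (or maintained) whenever the sort order changes.

```python
import sqlite3

PAGE_SIZE = 10  # pages are fixed at build time, as noted above

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, somecol INTEGER);
    -- Hypothetical lookup table: ids presorted by somecol, with a page number.
    CREATE TABLE items_pages (page INTEGER, pos INTEGER, id INTEGER);
    CREATE INDEX items_pages_page ON items_pages (page, pos);
""")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, (i * 7919) % 100003) for i in range(1, 2001)],
)

def rebuild_pages():
    """Recompute the presorted id list and assign each id a page number."""
    conn.execute("DELETE FROM items_pages")
    ids = [r[0] for r in conn.execute("SELECT id FROM items ORDER BY somecol, id")]
    conn.executemany(
        "INSERT INTO items_pages VALUES (?, ?, ?)",
        [(pos // PAGE_SIZE, pos, item_id) for pos, item_id in enumerate(ids)],
    )

def page_by_offset(page):
    """Classic paging: sort, skip page * PAGE_SIZE rows, take one page."""
    return [r[0] for r in conn.execute(
        "SELECT id FROM items ORDER BY somecol, id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE))]

def page_by_lookup(page):
    """Paging through the precomputed table: a plain indexed equality lookup."""
    return [r[0] for r in conn.execute(
        "SELECT id FROM items_pages WHERE page = ? ORDER BY pos", (page,))]

rebuild_pages()
print(page_by_lookup(150) == page_by_offset(150))
```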

MySQL: nested select speed problem

I have the following tables:
|ELEMENTS|
------------
|id_element|
|id_catalog|
|value|
|CATALOG|
------------
|id_catalog|
|catalog_name|
|show|
|status|
I tried adding different indexes (several variants):
1) ELEMENT: pair(id_element, id_catalog) and id_element and id_catalog
2) ELEMENT: pair(id_element, id_catalog) and id_element
3) ELEMENT: pair(id_element, id_catalog) and id_catalog
4) ELEMENT: id_element and id_catalog
1) CATALOG: pair(show, status) and id_catalog
2) CATALOG: id_catalog and show and status
I execute the following select:
SELECT DISTINCT `id_element` FROM `ELEMENTS`
WHERE (id_catalog IN (SELECT `id_catalog` FROM `CATALOG` WHERE status=1 AND show = 1)) limit 10
If there are some rows, it works very fast. But if the result is empty, it takes more than 4 seconds.
At the same time, "SELECT `id_catalog` FROM `CATALOG` WHERE status=1 AND show = 1" works fast both when there are some rows and when it is empty.
In the table ELEMENTS there are 100.000 records
In the table CATALOG there are 15.000 records
I also tried a join, but it takes more time than before.
Why does the query with an empty result take so long, and what should I do to speed it up?
Here is the EXPLAIN output:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 'PRIMARY', |'ELEMENTS' | 'index' | '' | null | null | null | 270044 | 'Using where; Using temporary'
2 | 'DEPENDENT SUBQUERY' | 'CATALOG' | 'unique_subquery' | 'PRIMARY,pair,id_catalog' | 'PRIMARY' | '4' | 'func' | 1 | 'Using where'
I guess indexing CATALOG(status,show) would allow a quick answer to the sub-select.
And then some index on ELEMENTS(id_catalog) would speed up the answer to the main question.
Maybe it depends on the statistics on these columns: if they are not selective enough, you'll end up with many rows anyway.
Could you show the output of EXPLAIN when using the two indexes above?
Why not simply write a join to help the optimizer do its job?
SELECT DISTINCT id_element
FROM ELEMENTS JOIN CATALOG ON ELEMENTS.id_catalog = CATALOG.id_catalog
WHERE status = 1 AND `show` = 1
LIMIT 10
(untested)
Well, the reason you're having the problem is that you're pulling up the entire catalog database for each request and finding every match between the element and the catalog. If MySQL finds 10 entries, it bails out, but if it never finds them it will continue to check your entire database. I would use an EXISTS query to try and get some performance increase.
SELECT DISTINCT(e.id_element)
FROM ELEMENTS e
WHERE EXISTS (
SELECT *
FROM CATALOG c
WHERE c.id_catalog = e.id_catalog
AND c.status = 1
AND c.show = 1)
LIMIT 10;
This will decrease the amount of time MySQL spends looking for the catalog for each element by imposing a LIMIT 1 on the inner query, but you always run the risk of a long search time when there are possibly no matches.
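The IN, JOIN and EXISTS formulations discussed in this thread should all return the same set of elements; they differ only in how the engine may choose to execute them. The sketch below checks that equivalence with Python's sqlite3 module as a stand-in for MySQL (an assumption: it says nothing about MySQL's timings). The data is made up, and the LIMIT is dropped in favor of an ORDER BY so the comparison is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CATALOG (
        id_catalog INTEGER PRIMARY KEY,
        status     INTEGER,
        "show"     INTEGER
    );
    CREATE TABLE ELEMENTS (
        id_element INTEGER,
        id_catalog INTEGER,
        value      TEXT
    );
    CREATE INDEX idx_element_1 ON ELEMENTS (id_catalog);
    CREATE INDEX idx_catalog_1 ON CATALOG (status, "show");
""")
conn.executemany(
    "INSERT INTO CATALOG VALUES (?, ?, ?)",
    [(i, i % 2, 1 if i % 3 == 0 else 0) for i in range(1, 51)],
)
conn.executemany(
    "INSERT INTO ELEMENTS VALUES (?, ?, ?)",
    [(i % 200, i % 50 + 1, "v") for i in range(500)],
)

IN_QUERY = """
    SELECT DISTINCT id_element FROM ELEMENTS
    WHERE id_catalog IN (
        SELECT id_catalog FROM CATALOG WHERE status = ? AND "show" = ?)
    ORDER BY id_element
"""
JOIN_QUERY = """
    SELECT DISTINCT e.id_element
    FROM ELEMENTS AS e
    JOIN CATALOG AS c ON c.id_catalog = e.id_catalog
    WHERE c.status = ? AND c."show" = ?
    ORDER BY e.id_element
"""
EXISTS_QUERY = """
    SELECT DISTINCT e.id_element FROM ELEMENTS AS e
    WHERE EXISTS (
        SELECT 1 FROM CATALOG AS c
        WHERE c.id_catalog = e.id_catalog AND c.status = ? AND c."show" = ?)
    ORDER BY e.id_element
"""

def run(sql, status, show):
    """Run one of the three formulations and return the element ids."""
    return [r[0] for r in conn.execute(sql, (status, show))]

print(len(run(IN_QUERY, 1, 1)), len(run(IN_QUERY, 9, 9)))
```

The second call uses a status that matches no catalog row, which is the "empty result" case from the question: every formulation has to prove that no element qualifies before it can return.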
I would put these indices there:
CREATE INDEX idx_element_1 ON ELEMENTS (id_catalog);
CREATE INDEX idx_catalog_1 ON CATALOG (status, `show`);
Also these, although they might not be needed for your query (these should probably be primary keys, unless you have duplicates):
CREATE INDEX idx_element_2 ON ELEMENTS (id_element);
CREATE INDEX idx_catalog_2 ON CATALOG (id_catalog);
Could you drop other indices and create these and check back with the query results?
Thanks to all. I solved it by denormalizing the tables, because there is too much data in these separate tables.
I decided to combine them into one table, and now it works perfectly. The query now always takes 0.03 seconds.