I have a aggregate query with two levels deep subqueries. What is strange is that the two subqueries run acceptably fast but the outside query unacceptably slow.
The basic idea behind the query is to use a table to find all elements linked to a key, selected by one of the elements queries. This resultant set should then be provided to the outside query that will match it according to its own keys/indexes.
Here with all outputs and statements:
We start with the two table definitions
CREATE TABLE `table1` (
`id1` int(11) NOT NULL DEFAULT '0',
`id2` int(11) NOT NULL,
`value` int(11) DEFAULT '0',
PRIMARY KEY (`id1`,`id2`),
KEY `k_id1` (`id1`),
KEY `k_id2` (`id2`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `lookuptable1` (
`id3` int(11) NOT NULL,
`id4` int(11) NOT NULL,
PRIMARY KEY (`id3`,`id4`),
UNIQUE KEY `id4_idx` (`id4`),
KEY `id3_idx` (`id3`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The inside subquery with it's own subquery
SELECT lt1.id4
FROM lookuptable1 lt1
WHERE lt1.id3 = (SELECT pt1.id3
FROM lookuptable1 pt1
WHERE pt1.id4 = 5960)
+-----------+
| id4 |
+-----------+
| 5960 |
| 17215 |
| 3625734 |
| 9312798 |
+-----------+
4 rows in set (0.00 sec)
As you can see: Fast enough.
But the outside query is where the bad bottleneck lies.
Complete query
SELECT
t1.id1,
sum(t1.value)
FROM table1 t1
WHERE t1.id2 = 3 AND t1.id1 IN
(
SELECT lt1.id4
FROM lookuptable1 lt1
WHERE lt1.id3 = (SELECT pt1.id3
FROM lookuptable1 pt1
WHERE pt1.id4 = 5960)
);
+-----------+-----------------------+
| id 1. | sum(t1.value) |
+-----------+-----------------------+
| 9312798 | 0 |
+-----------+-----------------------+
1 row in set (8.01 sec)
That is 8 seconds too slow
herewith the Explain extended for this query:
+----+--------------------+-------+--------+-------------------+-------------+---------+------------+---------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+--------+-------------------+-------------+---------+------------+---------+----------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL. | PRIMARY | 8 | NULL. | 1454343 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | lt1 | eq_ref | PRIMARY,id3,id4 | PRIMARY | 8 | const,func | 1 | 100.00 | Using where; Using index |
| 3 | SUBQUERY | pt1 | const | id4 | id4_idx | 4 | | 1 | 100.00 | Using index |
+----+--------------------+-------+--------+-------------------+-------------+---------+------------+---------+----------+--------------------------+
As I understand from this, the outside query doesn't actually use the index that it could.
What could we possibly be doing wrong in this query. Surely it should be running much much faster.
I tried running the outside query with the subqueries' result copy-pasted inside the IN clause (in other words the subqueries aren't run. It runs normally fast. Here's the explain extended then:
+----+-------------+-------+-------+----------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+-------+----------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | t1 | range | PRIMARY,k_id1 | PRIMARY | 4 | NULL | 5 | 100.00 | Using where |
+----+-------------+-------+-------+----------------+---------+---------+------+------+----------+-------------+
Oh yeah. This is running on MySQL 5.5
you could avoid the IN clause using an inner join
SELECT
t1.id1,
sum(t1.value)
FROM table1 t1
INNER JOIN (
SELECT lt1.id4
FROM lookuptable1 lt1
WHERE lt1.id3 = (SELECT pt1.id3
FROM lookuptable1 pt1
WHERE pt1.id4 = 5960)
) t on t.id4 = t1.id1 and t1.id2 = 3
and this could improve your query ..
be sure you have a proper index on table1 (id1, id2)
Related
I just learning MYSQL, I have MySql subquery like this:
EXPLAIN EXTENDED SELECT brand_name, stars, hh_stock, hh_stock_value, sales_monthly_1, sales_monthly_2, sales_monthly_3, sold_monthly_1, sold_monthly_2,
sold_monthly_3, price_uvp, price_ecp, price_default, price_margin AS margin, vc_percent as vc, cogs, products_length, products_id, material_expenses,
MAX(price) AS products_price, SUM(total_sales) AS total_sales,
IFNULL(MAX(active_age), DATEDIFF(NOW(), products_date_added)) AS products_age, DATEDIFF(NOW(), products_date_added) AS jng_products_age,
AVG(sales_weekly) AS sales_weekly, AVG(sales_monthly) AS sales_monthly, SUM(total_sold) AS total_sold, SUM(total_returned) AS total_returned,
((SUM(total_returned)/SUM(total_sold)) * 100) AS returned_rate
FROM
(
SELECT p.products_id, jc.price, jc.price_end_customer AS price_ecp, jc.total_sales, jc.active_age, jc.sales_weekly,
jc.sales_monthly, jc.total_sold, jc.total_returned, jc.price_uvp, p.price_margin, p.vc_percent, p.material_expenses,
p.products_date_added, p.stars , pb.brand_name, p.family_id, p.products_price_default AS price_default, pl.sales_monthly_1,
pl.sales_monthly_2, pl.sales_monthly_3, pl.sold_monthly_1, pl.sold_monthly_2, pl.sold_monthly_3, pst.stock AS hh_stock,
(pst.stock * p.average_stock_value) AS hh_stock_value, pnc.products_length,
IF(ploc.cogs IS NULL OR ploc.cogs=0,
(CASE p.complexity
WHEN 'F' THEN ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+1.7+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+1.7+0.25+2.2),2)
WHEN 'E' THEN ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+1.7+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+1.7+0.25+2.2),2)
WHEN 'N' THEN ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+2.4+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+2.4+0.25+2.2),2)
WHEN 'M' THEN ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+2.4+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+2.4+0.25+2.2),2)
WHEN 'I' THEN ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+3.5+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+3.5+0.25+2.2),2)
WHEN 'H' THEN ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+3.5+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+3.5+0.25+2.2),2)
ELSE ROUND(5*(p.material_expenses+(7.5/100*p.material_expenses)+5+0.25+2.2)/100+(p.material_expenses+(7.5/100*p.material_expenses)+5+0.25+2.2),2) END), ploc.cogs) AS cogs
FROM products p
LEFT JOIN jng_sp_catalog jc ON jc.products_id=p.products_id
LEFT JOIN products_description pd ON pd.products_id = p.products_id AND pd.language_id = 2
LEFT JOIN products_description2 pd2 ON pd2.products_id = p.products_id
LEFT JOIN products_brand pb ON pb.products_brand_id = p.products_brand_id
LEFT JOIN products_log pl ON pl.products_id = p.products_id
LEFT JOIN products_log_static pls ON pls.products_id=p.products_id
LEFT JOIN products_local ploc ON ploc.products_id = p.products_id
LEFT JOIN products_non_configurator pnc ON pnc.products_id = p.products_id
INNER JOIN
(
SELECT shp.products_id, CONCAT(',', GROUP_CONCAT(shp.styles_id), ',') AS styles_id
FROM styles_has_products shp GROUP BY shp.products_id HAVING styles_id NOT LIKE '%,1967,%') subquery_styles ON subquery_styles.products_id = p.products_id
LEFT JOIN products_stock_temp pst ON pst.products_id=p.products_id WHERE p.active_status='1' AND p.categories_top_id = '1') dt GROUP BY products_id ORDER BY products_id;
The result of explain is like this:
+----+-------------+------------+------------+--------+---------------------+-------------+---------+------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+--------+---------------------+-------------+---------+------------------------------------+--------+----------+----------------------------------------------+
| 1 | PRIMARY | p | NULL | ALL | PRIMARY | NULL | NULL | NULL | 40458 | 1.00 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | pb | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_brand_id | 1 | 100.00 | NULL |
| 1 | PRIMARY | ploc | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_id | 1 | 100.00 | NULL |
| 1 | PRIMARY | pl | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_id | 1 | 100.00 | Using where |
| 1 | PRIMARY | pls | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_id | 1 | 100.00 | Using index |
| 1 | PRIMARY | pst | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_id | 1 | 100.00 | NULL |
| 1 | PRIMARY | pd2 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_id | 1 | 100.00 | Using index |
| 1 | PRIMARY | pnc | NULL | eq_ref | PRIMARY | PRIMARY | 4 | manobo_central.p.products_id | 1 | 100.00 | Using where |
| 1 | PRIMARY | pd | NULL | eq_ref | PRIMARY | PRIMARY | 8 | manobo_central.p.products_id,const | 1 | 100.00 | Using index |
| 1 | PRIMARY | jc | NULL | ref | products_id | products_id | 4 | manobo_central.p.products_id | 4 | 100.00 | Using where |
| 1 | PRIMARY | <derived3> | NULL | ref | <auto_key0> | <auto_key0> | 4 | manobo_central.p.products_id | 10 | 100.00 | Using where |
| 3 | DERIVED | shp | NULL | index | PRIMARY,products_id | PRIMARY | 8 | NULL | 208226 | 100.00 | Using index; Using filesort |
+----+-------------+------------+------------+--------+---------------------+-------------+---------+------------------------------------+--------+----------+----------------------------------------------+
I have options in mind.
I will drop subquery and use VIEWS to output the data just like using query. Because i have subquery in FROM, so i will use VIEWS from VIEWS. But some said it will affected in performances. How you guys think about this?
I will still using subquery, but will try and search how to optimize the query. For this one, i wanted to ask you guys, for the first result row in EXPLAIN TABLE, It shows table production p which the type 'all', how to avoid 'all' ? I've managed to use type 'eq_ref' for others table, but still have no clue why the product table is 'all'?
Again,
Do you think i need to switch to VIEW? Or just try to optimise again the subquery.
Many Thanks!
EDIT: table products index
create index family_id on products (family_id);
create index idx_products_date_added on products (products_date_added);
create index material_expenses on products (material_expenses);
create index products_brand_id on products (products_brand_id);
create index products_ean on products (products_ean);
create index products_status on products (products_status);
create index tb_status on products (tb_status);
EDIT: table style_has_products
CREATE TABLE `styles_has_products` (
`styles_id` int(10) unsigned NOT NULL DEFAULT '0',
`products_id` int(10) unsigned NOT NULL DEFAULT '0',
`date_added` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`styles_id`,`products_id`),
KEY `products_id` (`products_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
first and foremost never write such a complex query for real time use. i'll suggest do batch process and maintain data warehouse. and use real time query on data warehouse.
still there are many things you should not do to SQL query on real time use to get performance. like never use more join operation, never put more if else conditions , never apply group by especially if table is huge, look for proper index , partition structure in table.
The first thing I notice is your subquery_styles. You don't use its result other than to filter. Criteria, however, belongs in the WHERE clause in my opinion. As it seems you want to exclude products for which exists a style_id 1967, I'd use NOT EXISTS or NOT IN:
WHERE p.active_status = 1
AND p.categories_top_id = 1
AND p.products_id NOT IN
(
SELECT products_id
FROM styles_has_products
WHERE styles_id = 1967
)
The second thing is that there is no appropriate index for your query. You are selecting products with active_status 1 and categories_top_id 1, but there is no index on these columns. With the third condition on product_id not matching style_id 1967, I'd suggest one of the following indexes:
create index idx1 on products (active_status, categories_top_id, products_id);
create index idx2 on products (categories_top_id, active_status, products_id);
Create both, see which is being used, and drop the other.
A last point that can and maybe should be optimized/changed is your aggregation. But in order to help here, I must know the table's unique keys. As soon as you post them, I'll extend this answer :-)
Building on what Thorsten suggests, instead of NOT IN ( SELECT ), use
NOT EXISTS( SELECT * FROM styles_has_products
WHERE products_id = p.products_id
AND styles_id = 1967 )
styles_has_products needs INDEX(products_id, styles_id) in either order.
Please show us SHOW CREATE TABLE styles_has_products. If it is a many:many mapping table, then see the tips here .
Indexes need to be on the table you are going into, not coming from. So the list of indexes for products probably won't be used. This composite index may be useful:
INDEX(categories_top_id, active_status) -- in either order
VIEWs are just syntactic sugar; they do not inherently provide any performance benefit. In some situations they hurt performance.
pd, pd2, pls, and some others, are not used; remove their JOINs.
The SUMs and AVGs will probably be incorrect. This is because of the "explode-implode" happening with JOIN + GROUP BY. Cleanup some of the other stuff, then we can discuss how to rearrange things so the the SUMs and AVGs are done with only one row per product_id.
I have this query:
SELECT DISTINCT
t1.`signature_id` AS id1,
t2.`signature_id` AS id2,
COUNT(DISTINCT t3.serial) AS weight
FROM `gc_con_sig` AS t1
INNER JOIN `gc_con_sig` AS t2
ON ((t1.`signature_id` != t2.`signature_id`)
AND (t1.`petition_id` = t2.`petition_id`))
INNER JOIN `wtp_data_petitions` AS t3
ON (t3.`serial` = t1.`petition_serial`)
GROUP BY t1.`signature_id`, t2.`signature_id`
HAVING weight > 0;
It essentially get the permutations of signature_ids, and the number of petitions they've both signed (weight).
That I'm trying to run against this table (gc_con_sig):
`petition_id` varchar(64) NOT NULL DEFAULT '' COMMENT 'Petition ID defined by API',
`signature_id` varchar(34) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
`petition_serial` int(11) DEFAULT NULL,
KEY `signature_id` (`signature_id`),
KEY `petition_id` (`petition_id`),
KEY `signature_petition_idx` (`signature_id`,`petition_id`),
KEY `pcidx` (`petition_id`,`signature_id`),
KEY `sig_pet_ser_idx` (`petition_serial`)
This is the explain I get:
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | t1 | ALL | petition_id,pcidx,sig_pet_ser_idx | NULL | NULL | NULL | 200659 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | wtp.t1.petition_serial | 1 | Using index |
| 1 | SIMPLE | t2 | ref | petition_id,pcidx | pcidx | 194 | wtp.t1.petition_id | 5016 | Using where; Using index |
+----+-------------+-------+--------+--------------------------------------------------------+---------+---------+------------------------+--------+----------------------------------------------+
I've optimized the various mysql configurations mysqltuner has told me to, but this query doesn't run (at least within an hour) on a machine with 17GB ram (12GB allocated to mysql).
Any ideas?
Can signatures be on multiple petitions? Can serial be NULL?
Assuming the answers are "no" to both questions, you might try:
SELECT t1.`signature_id` AS id1, t2.`signature_id` AS id2,
COUNT(*) AS weight
FROM `gc_con_sig` t1 INNER JOIN
`gc_con_sig` t2
ON (t1.`signature_id` != t2.`signature_id`) AND
(t1.`petition_id` = t2.`petition_id`)
GROUP BY t1.`signature_id`, t2.`signature_id`;
The count(distinct serial) is counting the non-NULL values in the field. If all values are not NULL and there are no duplicates, then this is equivalent to count(*).
The having clause is not needed because the on clause basically guarantees that there is at least one match.
And, finally, select distinct is never needed when you are using a group by correctly.
I have a huge table like
CREATE TABLE IF NOT EXISTS `object_search` (
`keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
with around 39 million rows (using over 1 GB space) containing the indexed data for 1 million records in the object table (where object_id points at).
Now searching through this with a query like
SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2
is already significantly faster than doing a LIKE search on the composed keywords field in the object table but still takes up to 1 minute.
It's explain looks like:
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| 1 | SIMPLE | search | ref | PRIMARY | PRIMARY | 42 | const | 345180 | 100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
The full explain with joined object and object_color and object_locale table, while the above query is run in a subquery to avoid overhead, looks like:
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 182544 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | search.object_id | 1 | 100.00 | |
| 2 | DERIVED | search | ref | PRIMARY | PRIMARY | 42 | | 345180 | 100.00 | Using where; Using index |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
My top goal would be to be able to scan through this within 1 or 2 seconds.
So, are there further techniques to improve search speed for keywords?
Update 2013-08-06:
Applying most of Neville K's suggestion I now have the following setup:
CREATE TABLE `object_search_keyword` (
`keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
PRIMARY KEY (`keyword_id`),
FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
CREATE TABLE `object_search` (
`keyword_id` int(10) unsigned NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword_id`,`media_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The new query's explain looks like this:
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | 100.00 | |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0 | | 1 | 100.00 | Using where; Using temporary; Using filesort |
| 3 | DERIVED | object_search | ref | PRIMARY | PRIMARY | 4 | object_keyword.keyword_id | 2190225 | 100.00 | Using index |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
The many derives are coming from the keyword comparing subquery being nested into another subquery which does nothing but count the amount of rows returned:
SELECT SQL_NO_CACHE object.object_id, ..., #rn AS numrows
FROM (
SELECT *, #rn := #rn + 1
FROM (
SELECT SQL_NO_CACHE search.object_id, COUNT(turbo.object_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
GROUP BY search.object_id HAVING hits = 2
) AS numrowswrapper
CROSS JOIN (SELECT #rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (search.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (search.object_id = locale.object_id)
ORDER BY timestamp_upload DESC
The above query will actually run within ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search goes down.
Any way to further optimize this?
Update 2013-08-07
The blocking thing seems almost certainly to be the appended ORDER BY statement. Without it, the query executes in less than a second.
So, is there any way to sort the result faster? Any suggestions welcome, even hackish ones that would require post processing somewhere else.
Update 2013-08-07 later that day
Alright ladies and gentlemen, nesting the WHERE and ORDER BY statements in another layer of subquery to not let it bother with tables it doesn't need roughly doubled it's performance again:
SELECT wowrapper.*, locale.title
FROM (
SELECT SQL_NO_CACHE object.object_id, ..., #rn AS numrows
FROM (
SELECT *, #rn := #rn + 1
FROM (
SELECT SQL_NO_CACHE search.media_id, COUNT(search.media_id) AS hits
FROM object_keyword AS kwd
INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
GROUP BY search.media_id HAVING hits = 1
) AS numrowswrapper
CROSS JOIN (SELECT #rn := 0) CONST
) AS search
INNER JOIN object AS object ON (search.object_id = object.object_id)
LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
WHERE 1
ORDER BY object.object_id DESC
) AS wowrapper
LEFT JOIN object_locale AS locale ON (jfwrapper.object_id = locale.object_id)
LIMIT 0,48
Searches that took 12 seconds (single keyword, ~200K results) now take 6, and a search for two keywords that took 6 seconds (60K results) now takes around 3.5 secs.
Now this is already a massive improvement, but is there any chance to push this further?
Update 2013-08-08 early that day
Undid that last nested variation of the query, since it actually slowed down other variations of it...
I'm now trying some other things with different table layouts and FULLTEXT indexes using MyISAM for a dedicated search table with a combined keyword field (comma separated in a TEXT field).
Update 2013-08-08
Alright, a plain fulltext index doesnt really help.
Back to the previous setup, the only thing blocking is the ORDER BY (which resorts to using a temporary table and filesort). Without it a search is complete within less than a second!
So basically what's left of all this is:
How do I optimize the ORDER BY statement to run faster, likely by eliminating the use of the temporary table?
Full text search will be much faster than using the standard SQL string comparison features.
Secondly, if you have a high degree of redundancy in the keywords, you could consider a "many to many" implementation:
Keywords
--------
keyword_id
keyword
keyword_object
-------------
keyword_id
object_id
objects
-------
object_id
......
If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of the English dictionary), you may also see a distinct improvement, as the query would only have to perform 100K string comparisons, and joining on an integer keyword_id and object_id field should be much, much faster than doing 39M string comparisons.
The best solution for this will be a FULLTEXT search, but you will probably need a MyISAM table for that. You can setup a mirror table and update it with some events and triggers or if you have a slave replicating from your server you can change its table to MyISAM and use it for searches.
For this query the only thing I can come up with is to rewrite it as:
SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.key_word = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.key_word = 'word3'
....
WHERE s1.key_word = 'word1'
and I'm not sure it will be faster this way.
Also you will need to have an index on object_id (assuming your PK is (key_word, object_id)).
If you have seldom INSERTs and often SELECTs you could optimize your data for the reads i.e. recalculate the number of object_ids per keyword and directly store it in the database. The SELECTs would then be very fast, the INSERTs would take some seconds though,.
Goal of query:
Display race by district.
Query:
SELECT school_data_schools_outer.district_id,
school_data_race_ethnicity_raw_outer.year,
school_data_race_ethnicity_raw_outer.race,
ROUND(
SUM( school_data_race_ethnicity_raw_outer.count) /
(SELECT SUM(count)
FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_inner
INNER JOIN school_data_schools as school_data_schools_inner
USING (school_id)
WHERE school_data_schools_outer.district_id = school_data_schools_inner.district_id
AND school_data_race_ethnicity_raw_outer.year = school_data_race_ethnicity_raw_inner.year) * 100, 2)
FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_outer
INNER JOIN school_data_schools as school_data_schools_outer USING (school_id)
GROUP BY school_data_schools_outer.district_id,
school_data_race_ethnicity_raw_outer.year,
school_data_race_ethnicity_raw_outer.race
mysql> explain SELECT school_data_schools_outer.district_id, school_data_race_ethnicity_raw_outer.year, school_data_race_ethnicity_raw_outer.race,ROUND(SUM(school_data_race_ethnicity_raw_outer.count)/( SELECT SUM(count) FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_inner INNER JOIN school_data_schools as school_data_schools_inner USING (school_id) WHERE school_data_schools_outer.district_id = school_data_schools_inner.district_id and school_data_race_ethnicity_raw_outer.year = school_data_race_ethnicity_raw_inner.year ) * 100,2) FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_outer INNER JOIN school_data_schools as school_data_schools_outer USING (school_id) GROUP BY school_data_schools_outer.district_id, school_data_race_ethnicity_raw_outer.year, school_data_race_ethnicity_raw_outer.race;
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
| 1 | PRIMARY | school_data_race_ethnicity_raw_outer | ALL | school_id,school_id_2 | NULL | NULL | NULL | 84012 | Using temporary; Using filesort |
| 1 | PRIMARY | school_data_schools_outer | eq_ref | PRIMARY | PRIMARY | 257 | rocdocs_main_drupal_7.school_data_race_ethnicity_raw_outer.school_id | 1 | |
| 2 | DEPENDENT SUBQUERY | school_data_race_ethnicity_raw_inner | ref | school_id,year,school_id_2 | year | 4 | func | 8402 | |
| 2 | DEPENDENT SUBQUERY | school_data_schools_inner | eq_ref | PRIMARY | PRIMARY | 257 | rocdocs_main_drupal_7.school_data_race_ethnicity_raw_inner.school_id | 1 | Using where |
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
4 rows in set (0.00 sec)
mysql>
mysql> describe school_data_race_ethnicity_raw;
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| school_id | varchar(255) | NO | MUL | NULL | |
| year | int(11) | NO | MUL | NULL | |
| race | varchar(255) | NO | | NULL | |
| count | int(11) | NO | | NULL | |
+-----------+--------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
mysql> describe school_data_schools;
+-------------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------------+------+-----+---------+-------+
| school_id | varchar(255) | NO | PRI | NULL | |
| grade_level | varchar(255) | NO | | NULL | |
| district_id | varchar(255) | NO | | NULL | |
| school_name | varchar(255) | NO | | NULL | |
| address | varchar(255) | NO | | NULL | |
| city | varchar(255) | NO | | NULL | |
| lat | decimal(20,10) | NO | | NULL | |
| lon | decimal(20,10) | NO | | NULL | |
+-------------+----------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
NOTE: I also have tried:
select sds.school_id,
detail.year,
detail.race,
ROUND((detail.count / summary.total) * 100 ,2) as percent
FROM school_data_race_ethnicity_raw as detail
inner join school_data_schools as sds USING (school_id)
inner join (
select sds2.district_id, year, sum(count) as total
from school_data_race_ethnicity_raw
inner join school_data_schools as sds2 USING (school_id)
group by sds2.district_id, year
) as summary on summary.district_id = sds.district_id
and summary.year = detail.year
This is slow beacuse:
You have no index in use on school_data_race_ethnicity_raw_outer, so it's scanning each of the ~84,000 rows
You are using a correlated subquery which means that your complex calculation has to be run once per row i.e. 84,000 times.
The best approach is not to use a correlated subquery, but if not, then to make it go fast, you need to use covering indexes so that the whole of that inner query (and the other parts via their own indexes) can be run lightning fast using just the index. For a great tutorial on the subject of indexes, check this out. It taught me a lot! Right now, your inner query just uses the year index on school_data_race_ethnicity_raw, so it has to look up the rest of the stuff it needs by reading 8000 rows for every one of the 84000 calculations. Indexes will make this far faster e.g. create a composite index on school_data_race_ethnicity_raw and you will find it helps:
CREATE index inner_composite ON school_data_race_ethnicity_raw (year, district_id, schoolid, count)
This will allow all the fields used in the WHERE to be gotten from the index, then the join field, then the field you want for the select. You should see it show up in the 'key' column of your explain result. Also, if you get it right, you'll see 'using index' in the right-most column, showing that no table access is happening, which is orders of magnitude faster.
You can experiment quick-and-dirty style by adding loads of indexes for the columns that the query mentions and see what gets picked up in the key column. If something appears, read your query to see what other columns from that table are in use, then add a new index with those columns added in too on the right hand side and see if that works better. Remember to delete the unused indexes once you find out what works.
MySQL doesn't allow you to directly index the SUM of a column, which would be the fastest way, so unless you want to move to another DB (good idea if you can), this will always be a little slow.
This should be all you need to aggregate your data to get a count of race by district, not sure why you are doing so much math in your original, as it is unnecessary to achieve your goal, and is forcing some crazy sub queries.
SELECT SUM(students.count) as studentCount, School.district_id, students.race
FROM school_data_schools schools,
school_data_race_ethnicity_raw students
WHERE shools.school_id = students.school_id
GROUP BY district_id, race
You probably also want an index on school_data_race_ethnicity_raw.school_id (alone, not as part of a multiple column key)
EDIT was not aware OP was looking for a percentage breakdown, and not just totals
SELECT ((studentCount / districtTotal) * 100) as percentage, district_id, race
FROM(
SELECT SUM(students.count) as studentCount, Schools.district_id, students.race,
(SELECT SUM(inStudents.count)
FROM school_data_schools inSchools,
school_data_race_ethnicity_raw inStudents
WHERE inSchools.school_id = inStudents.school_id
AND inSchools.district_ID = Schools.district_id
GROUP BY inSchools.district_id) as districtTotal
FROM school_data_schools schools,
school_data_race_ethnicity_raw students
WHERE schools.school_id = students.school_id
GROUP BY district_id, race
) table1
This will run pretty quick, still need to make sure there is an index on school_data_race_ethnicity_raw.school_id that is not part of a multiple column index. you can see it in action here, though my test case is rather small, it does seem to check out.
I've researched this, but I still cannot explain why:
SELECT cl.`cl_boolean`, l.`l_name`
FROM `card_legality` cl
INNER JOIN `legality` l ON l.`legality_id` = cl.`legality_id`
WHERE cl.`card_id` = 23155
Is significantly slower than:
SELECT cl.`cl_boolean`, l.`l_name`
FROM `card_legality` cl
LEFT JOIN `legality` l ON l.`legality_id` = cl.`legality_id`
WHERE cl.`card_id` = 23155
115ms Vs 478ms. They are both using InnoDB and there are relationships defined. The 'card_legality' contains approx 200k rows, while the 'legality' table contains 11 rows. Here is the structure for each:
CREATE TABLE `card_legality` (
`card_id` varchar(8) NOT NULL DEFAULT '',
`legality_id` int(3) NOT NULL,
`cl_boolean` tinyint(1) NOT NULL,
PRIMARY KEY (`card_id`,`legality_id`),
KEY `legality_id` (`legality_id`),
CONSTRAINT `card_legality_ibfk_2` FOREIGN KEY (`legality_id`) REFERENCES `legality` (`legality_id`),
CONSTRAINT `card_legality_ibfk_1` FOREIGN KEY (`card_id`) REFERENCES `card` (`card_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
And:
CREATE TABLE `legality` (
`legality_id` int(3) NOT NULL AUTO_INCREMENT,
`l_name` varchar(16) NOT NULL DEFAULT '',
PRIMARY KEY (`legality_id`)
) ENGINE=InnoDB AUTO_INCREMENT=12 DEFAULT CHARSET=latin1;
I could simply use LEFT-JOIN, but it doesn't seem quite right... any thoughts, please?
UPDATE:
As requested, I've included the results of explain for each. I had run it previously, but I dont pretend to have a thorough understanding of it..
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE cl ALL PRIMARY NULL NULL NULL 199747 Using where
1 SIMPLE l eq_ref PRIMARY PRIMARY 4 hexproof.co.uk.cl.legality_id 1
AND, inner join:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE l ALL PRIMARY NULL NULL NULL 11
1 SIMPLE cl ref PRIMARY,legality_id legality_id 4 hexproof.co.uk.l.legality_id 33799 Using where
It is because of the varchar on card_id. MySQL can't use the index on card_id as card_id as described here mysql type conversion. The important part is
For comparisons of a string column with a number, MySQL cannot use an
index on the column to look up the value quickly. If str_col is an
indexed string column, the index cannot be used when performing the
lookup in the following statement:
SELECT * FROM tbl_name WHERE str_col=1;
The reason for this is that there are many different strings that may
convert to the value 1, such as '1', ' 1', or '1a'.
If you change your queries to
SELECT cl.`cl_boolean`, l.`l_name`
FROM `card_legality` cl
INNER JOIN `legality` l ON l.`legality_id` = cl.`legality_id`
WHERE cl.`card_id` = '23155'
and
SELECT cl.`cl_boolean`, l.`l_name`
FROM `card_legality` cl
LEFT JOIN `legality` l ON l.`legality_id` = cl.`legality_id`
WHERE cl.`card_id` = '23155'
You should see a huge improvement in speed and also see a different EXPLAIN.
Here is a similar (but easier) test to show this:
> desc id_test;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id | varchar(8) | NO | PRI | NULL | |
+-------+------------+------+-----+---------+-------+
1 row in set (0.17 sec)
> select * from id_test;
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+----+
9 rows in set (0.00 sec)
> explain select * from id_test where id = 1;
+----+-------------+---------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | id_test | index | PRIMARY | PRIMARY | 10 | NULL | 9 | Using where; Using index |
+----+-------------+---------+-------+---------------+---------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
> explain select * from id_test where id = '1';
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | SIMPLE | id_test | const | PRIMARY | PRIMARY | 10 | const | 1 | Using index |
+----+-------------+---------+-------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)
In the first case there is Using where; Using index and the second is Using index. Also ref is either NULL or CONST. Needless to say, the second one is better.
L2G has it pretty much summed up, although I suspect it could be because of the varchar type used for card_id.
I actually printed out this informative page for benchmarking / profiling quickies. Here is a quick poor-mans profiling technique:
Time a SQL on MySQL
Enable Profiling
mysql> SET PROFILING = 1
...
RUN your SQLs
...
mysql> SHOW PROFILES;
+----------+------------+-----------------------+
| Query_ID | Duration | Query |
+----------+------------+-----------------------+
| 1 | 0.00014600 | SELECT DATABASE() |
| 2 | 0.00024250 | select user from user |
+----------+------------+-----------------------+
mysql> SHOW PROFILE for QUERY 2;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000034 |
| checking query cache for query | 0.000033 |
| checking permissions | 0.000006 |
| Opening tables | 0.000011 |
| init | 0.000013 |
| optimizing | 0.000004 |
| executing | 0.000011 |
| end | 0.000004 |
| query end | 0.000002 |
| freeing items | 0.000026 |
| logging slow query | 0.000002 |
| cleaning up | 0.000003 |
+--------------------------------+----------+
Good-luck, oh and please post your findings!
I'd try EXPLAIN on both of those queries. Just prefix each SELECT with EXPLAIN and run them. It gives really useful info on how mySQL is optimizing and executing queries.
I'm pretty sure that MySql has better optimization for Left Joins - no evidence to back this up at the moment.
ETA : A quick scout round and I can't find anything concrete to uphold my view so.....