I have a SQL query that looks simple but runs very slowly (~4 s):
SELECT tblbooks.*
FROM tblbooks LEFT JOIN
tblauthorships ON tblbooks.book_id = tblauthorships.book_id
WHERE (tblbooks.added_by=3 OR tblauthorships.author_id=3)
GROUP BY tblbooks.book_id
ORDER BY tblbooks.book_id DESC
LIMIT 10
EXPLAIN result:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
| 1 | SIMPLE | tblbooks | index | fk_books__users_1 | PRIMARY | 62 | NULL | 10 | Using where |
| 1 | SIMPLE | tblauthorships | ref | book_id | book_id | 62 | tblbooks.book_id | 1 | Using where |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
2 rows in set (0.000 sec)
If I run each part of the OR in the WHERE clause as a separate query, both queries return results in less than 0.01 s.
Simplified schema:
tblbooks (~1 million rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------------------+----------------+
| id | int(10) unsigned | NO | MUL | NULL | auto_increment |
| book_id | varchar(20) | NO | PRI | NULL | |
| added_by | int(11) unsigned | NO | MUL | NULL | |
+---------------+-----------------------+------+-----+---------------------+----------------+
tblauthorships (< 100 rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------------------+----------------+
| authorship_id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| book_id | varchar(20) | NO | MUL | NULL | |
| author_id | int(11) unsigned | NO | MUL | NULL | |
+---------------+------------------+------+-----+---------------------+----------------+
Both book_id and author_id columns in tblauthorships have their index created.
Can anyone point me in the right direction?
Note: I'm aware of book_id varchar issue.
My usual analogy for indexing is a telephone book. It's sorted by last name then by first name. If you look up a person by last name, you can find them efficiently. If you look up a person by last name AND first name, it's also efficient. But if you look up a person by first name only, the sort order of the book doesn't help, and you have to search every page the hard way.
Now what happens if you need to search a telephone book for a person by last name OR first name?
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas' OR first_name = 'Thomas';
This is just as bad as searching only by first name. Since all entries matching the first name you searched should be included in the result, you have to find them all.
Conclusion: Using OR across different columns in an SQL search is hard to optimize, because MySQL generally uses only one index per table in a given query (the index_merge optimization is a narrow exception).
Solution: Use two queries and UNION them:
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas'
UNION
SELECT * FROM TelephoneBook WHERE first_name = 'Thomas';
The two individual queries each use an index on the respective column, then the results of both queries are unified (by default UNION eliminates duplicates).
In your case you don't even need to do the join for one of the queries:
(SELECT b.*
FROM tblbooks AS b
WHERE b.added_by=3)
UNION
(SELECT b.*
FROM tblbooks AS b
INNER JOIN tblauthorships AS a USING (book_id)
WHERE a.author_id=3)
ORDER BY book_id DESC
LIMIT 10
The two answers so far are suboptimal. Since they combine UNION and LIMIT, let me optimize them further:
( SELECT ...
ORDER BY ...
LIMIT 10
) UNION DISTINCT
( SELECT ...
ORDER BY ...
LIMIT 10
)
ORDER BY ...
LIMIT 10
This gives each SELECT a chance to optimize the ORDER BY and LIMIT, making them faster. Then the UNION DISTINCT dedups. Finally, the first 10 are peeled off to make the resultset.
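Applied to the tables in question, the pattern might look like this (a sketch, assuming the two indexes recommended below exist):

( SELECT b.*
  FROM tblbooks AS b
  WHERE b.added_by = 3
  ORDER BY b.book_id DESC
  LIMIT 10
) UNION DISTINCT
( SELECT b.*
  FROM tblbooks AS b
  INNER JOIN tblauthorships AS a USING (book_id)
  WHERE a.author_id = 3
  ORDER BY b.book_id DESC
  LIMIT 10
)
ORDER BY book_id DESC
LIMIT 10;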
If there will be pagination via OFFSET, this optimization gets trickier. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
Also... you need two indexes:
INDEX(added_by)    -- on tblbooks
INDEX(author_id)   -- on tblauthorships
(Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.)
Related
For a single-language dictionary with about 10k words, where some words are repeated but with different meanings, would it be OK to use a single-table design?
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| word | varchar(128) | NO | | NULL | |
| definition | varchar(500) | NO | | NULL | |
| example | text | NO | | NULL | |
| date | datetime | NO | | NULL | |
| votes | int(4) | NO | | 0 | |
| name | varchar(30) | NO | | NULL | |
+------------+--------------+------+-----+---------+----------------+
Example queries I'm using:
SELECT * FROM definitions WHERE word = ? ORDER BY votes DESC LIMIT 10
SELECT word, definition FROM definitions ORDER BY date DESC LIMIT 4
SELECT DISTINCT word FROM definitions WHERE word LIKE ? LIMIT 100
Also, the votes column gets updated every time someone votes.
Would it be better to have a one-to-many design instead? My main goal is performance.
Your table looks like it will be stable, with only searches performed on it. The one column that triggers update operations, votes, is what may affect your performance. Move the votes out to a separate table along with the word's id; then recording a vote will not write to your main table, which will improve its performance in the long term. Select data from both tables using a join, as sketched below.
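A minimal sketch of that split (the votes table and its column names are illustrative, and it assumes votes are keyed by definitions.id):

CREATE TABLE definition_votes (
    definition_id INT NOT NULL,            -- references definitions.id
    vote_count    INT NOT NULL DEFAULT 0,
    PRIMARY KEY (definition_id)
) ENGINE=InnoDB;

-- recording a vote touches only the small table
INSERT INTO definition_votes (definition_id, vote_count)
VALUES (?, 1)
ON DUPLICATE KEY UPDATE vote_count = vote_count + 1;

-- reading joins the two tables back together
SELECT d.word, d.definition, COALESCE(v.vote_count, 0) AS votes
FROM definitions AS d
LEFT JOIN definition_votes AS v ON v.definition_id = d.id
WHERE d.word = ?
ORDER BY votes DESC
LIMIT 10;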
For only 10K words (or did you mean rows), and those queries, performance will be 'good enough'. However, these are needed:
INDEX(date)
INDEX(word, votes)
Hint: If new definitions come in often, then ORDER BY votes DESC LIMIT 10 will tend not to show them (once there are more than 10). So you should probably use a formula involving the date at which the definition was added and the number of votes. It might be something like votes / TIMESTAMPDIFF(DAY, date, NOW()), or, to temper it, (votes + 1) / DATEDIFF(NOW() + INTERVAL 2 DAY, date). That would go in the ORDER BY.
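For example (a sketch; the +1 and the 2-day offset are just smoothing constants, and an expression in the ORDER BY forces a sort over the matching rows):

SELECT *
FROM definitions
WHERE word = ?
ORDER BY (votes + 1) / DATEDIFF(NOW() + INTERVAL 2 DAY, date) DESC
LIMIT 10;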
I have this query, which takes more than 117 seconds on a MySQL database.
select users.*, users_oauth.* FROM users LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id WHERE (
(MATCH (user_email) AGAINST ('sometext')) OR
(MATCH (user_firstname) AGAINST ('sometext')) OR
(MATCH (user_lastname) AGAINST ('sometext')) )
ORDER BY user_date_accountcreated DESC LIMIT 1400, 50
How can I use a subquery to optimize it?
The 3 fields have FULLTEXT indexes:
ALTER TABLE `users` ADD FULLTEXT KEY `email_fulltext` (`user_email`);
ALTER TABLE `users` ADD FULLTEXT KEY `firstname_fulltext` (`user_firstname`);
ALTER TABLE `users` ADD FULLTEXT KEY `lastname_fulltext` (`user_lastname`);
There is a single search input on the website, used to search across the different fields of the users table.
If the limit is, for example, LIMIT 0,50, the query runs in less than 3 seconds, but as the offset increases the query becomes very slow.
Thanks.
Use a single FULLTEXT index:
FULLTEXT(user_email, user_firstname, user_lastname)
And change the 3 matches to just one:
MATCH (user_email, user_firstname, user_lastname) AGAINST ('sometext')
Here's another issue: ORDER BY ... DESC LIMIT 1400, 50. Read about the evils of pagination via OFFSET. That has a workaround, but I doubt it would apply to your statement.
Do you really have thousands of users matching the text? Does someone (other than a search engine robot) really page through 29 pages? Think about whether it makes sense to really have such a long-winded UI.
And a 3rd issue. Consider "lazy eval". That is, find the user ids first, then join back to users and users_oauth to get the rest of the columns. It would be a single SELECT with the MATCH in a derived table, then a JOIN to the two tables. If the ORDER BY and LIMIT can be in the derived table, it could be a big win.
Please indicate which table each column belongs to -- my last paragraph is imprecise because of not knowing about the date column.
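In the meantime, a sketch of that lazy-eval shape (it assumes user_date_accountcreated is a column of users, and uses the combined FULLTEXT index suggested above):

SELECT u.*, o.*
FROM (
    SELECT user_id
    FROM users
    WHERE MATCH (user_email, user_firstname, user_lastname) AGAINST ('sometext')
    ORDER BY user_date_accountcreated DESC
    LIMIT 1400, 50                 -- the expensive part stays on one narrow table
) AS ids
JOIN users AS u USING (user_id)
LEFT JOIN users_oauth AS o ON o.oauth_user_id = u.user_id
ORDER BY u.user_date_accountcreated DESC;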
Update
In your second attempt, you added OR, which greatly slows things down. Let's turn that into a UNION to try to avoid the new slowdown. First let's debug the UNION:
( SELECT * -- no mention of oauth columns
FROM users -- No JOIN
WHERE users.user_id LIKE ...
ORDER BY user_id DESC
LIMIT 0, 50
)
UNION ALL
( SELECT * -- no mention of oauth columns
FROM users
WHERE MATCH ...
ORDER BY user_id DESC
LIMIT 0, 50
)
Test it by timing each SELECT separately. If one of them is still slow, let's focus on it. Then test the UNION. (This is a case where using the mysql command-line tool may be more convenient than PHP.)
By splitting, each SELECT can use an optimal index. The UNION has some overhead, but possibly less than the inefficiency of OR.
Now let's fold in users_oauth.
First, you seem to be missing a very important INDEX(oauth_user_id). Add that!
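That is, something like (the index name is arbitrary):

ALTER TABLE users_oauth ADD INDEX oauth_user_id_idx (oauth_user_id);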
Now let's put them together.
SELECT u.*
FROM ( .... the entire union query ... ) AS u
LEFT JOIN users_oauth ON u.user_id = users_oauth.oauth_user_id
ORDER BY user_id DESC -- yes, repeat
LIMIT 0, 50 -- yes, repeat
Yes @Rick,
I changed the FULLTEXT index to:
ALTER TABLE `users`
ADD FULLTEXT KEY `fulltext_adminsearch` (`user_email`,`user_firstname`,`user_lastname`);
And now there are some PHP conditions; $_POST['search'] can be empty:
if (!isset($_POST['search'])) {
    $searchId    = '%';
    $searchMatch = '';   // nothing to MATCH against
} else {
    $searchId    = $_POST['search'];
    $searchMatch = '+' . str_replace(' ', ' +', $_POST['search']);
}
$sqlSearch = $dataBase->prepare(
'SELECT users.*, users_oauth.*
FROM users
LEFT JOIN users_oauth ON users.user_id = users_oauth.oauth_user_id
WHERE ( users.user_id LIKE :id OR
(MATCH (user_email, user_firstname, user_lastname)
AGAINST (:match IN BOOLEAN MODE)) )
ORDER BY user_id DESC LIMIT 0,50') ;
$sqlSearch->execute(array('id' => $searchId,
'match' => $searchMatch )) ;
The users_oauth table has a column with user_id:
Table users:
+--------------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+-----------------+------+-----+---------+----------------+
| user_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| user_activation_key | varchar(40) | YES | | NULL | |
| user_email | varchar(40) | NO | UNI | | |
| user_login | varchar(30) | YES | | NULL | |
| user_password | varchar(40) | YES | | NULL | |
| user_firstname | varchar(30) | YES | | NULL | |
| user_lastname | varchar(50) | YES | | NULL | |
| user_lang                | varchar(2)      | NO   |     | en      |                |
+--------------------------+-----------------+------+-----+---------+----------------+
Table users_oauth:
+----------------------+-----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+-----------------+------+-----+---------+----------------+
| oauth_id | int(8) unsigned | NO | PRI | NULL | auto_increment |
| oauth_user_id | int(8) unsigned | NO | | NULL | |
| oauth_google_id | varchar(30) | YES | UNI | NULL | |
| oauth_facebook_id | varchar(30) | YES | UNI | NULL | |
| oauth_windowslive_id | varchar(30) | YES | UNI | NULL | |
+----------------------+-----------------+------+-----+---------+----------------+
The LEFT JOIN is the slow part: the query takes 3 seconds with it, 0.0158 seconds without.
Would it be faster to make a separate SQL request for each of the 50 rows?
Or would it be faster with a subquery? How would I write it with a subquery?
Thanks
I have a huge table like
CREATE TABLE IF NOT EXISTS `object_search` (
`keyword` varchar(40) COLLATE latin1_german1_ci NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword`,`object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
with around 39 million rows (using over 1 GB of space) containing the indexed data for 1 million records in the object table (which object_id points to).
Now searching through this with a query like
SELECT object_id, COUNT(object_id) AS hits
FROM object_search
WHERE keyword = 'woman' OR keyword = 'house'
GROUP BY object_id
HAVING hits = 2
is already significantly faster than doing a LIKE search on the composed keywords field in the object table but still takes up to 1 minute.
Its EXPLAIN looks like:
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
| 1 | SIMPLE | search | ref | PRIMARY | PRIMARY | 42 | const | 345180 | 100.00 | Using where; Using index |
+----+-------------+--------+------+---------------+---------+---------+-------+--------+----------+--------------------------+
The full EXPLAIN, with the object, object_color and object_locale tables joined (the above query being run as a subquery to avoid overhead), looks like:
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 182544 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | search.object_id | 1 | 100.00 | |
| 2 | DERIVED | search | ref | PRIMARY | PRIMARY | 42 | | 345180 | 100.00 | Using where; Using index |
+----+-------------+-------------------+--------+---------------+-----------+---------+------------------+--------+----------+---------------------------------+
My top goal would be to be able to scan through this within 1 or 2 seconds.
So, are there further techniques to improve search speed for keywords?
Update 2013-08-06:
Applying most of Neville K's suggestion I now have the following setup:
CREATE TABLE `object_search_keyword` (
`keyword_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`keyword` varchar(64) COLLATE latin1_german1_ci NOT NULL,
PRIMARY KEY (`keyword_id`),
FULLTEXT KEY `keyword_ft` (`keyword`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_german1_ci;
CREATE TABLE `object_search` (
`keyword_id` int(10) unsigned NOT NULL,
`object_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`keyword_id`,`object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The new query's EXPLAIN looks like this:
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | Using temporary; Using filesort |
| 1 | PRIMARY | object_color | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | object | eq_ref | PRIMARY | PRIMARY | 4 | object_search.object_id | 1 | 100.00 | |
| 1 | PRIMARY | locale | eq_ref | object_id | object_id | 4 | object_search.object_id | 1 | 100.00 | |
| 2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | 100.00 | |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 24381 | 100.00 | |
| 4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DERIVED | object_keyword | fulltext | PRIMARY,keyword_ft | keyword_ft | 0 | | 1 | 100.00 | Using where; Using temporary; Using filesort |
| 3 | DERIVED | object_search | ref | PRIMARY | PRIMARY | 4 | object_keyword.keyword_id | 2190225 | 100.00 | Using index |
+----+-------------+----------------+----------+--------------------+------------+---------+---------------------------+---------+----------+----------------------------------------------+
The many derived tables come from the keyword-matching subquery being nested inside another subquery, which does nothing but count the number of rows returned:
SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
FROM (
    SELECT *, @rn := @rn + 1
    FROM (
        SELECT SQL_NO_CACHE search.object_id, COUNT(search.object_id) AS hits
        FROM object_keyword AS kwd
        INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
        WHERE MATCH (kwd.keyword) AGAINST ('+(woman) +(house)')
        GROUP BY search.object_id HAVING hits = 2
    ) AS numrowswrapper
    CROSS JOIN (SELECT @rn := 0) CONST
) AS turbo
INNER JOIN object AS object ON (turbo.object_id = object.object_id)
LEFT JOIN object_color AS object_color ON (turbo.object_id = object_color.object_id)
LEFT JOIN object_locale AS locale ON (turbo.object_id = locale.object_id)
ORDER BY timestamp_upload DESC
The above query actually runs within ~6 seconds, since it searches for two keywords. The more keywords I search for, the faster the search gets (the result set shrinks).
Any way to further optimize this?
Update 2013-08-07
The bottleneck is almost certainly the appended ORDER BY clause. Without it, the query executes in less than a second.
So, is there any way to sort the result faster? Any suggestions welcome, even hackish ones that would require post processing somewhere else.
Update 2013-08-07 later that day
Alright ladies and gentlemen: nesting the WHERE and ORDER BY in another layer of subquery, so that they don't touch tables they don't need, roughly doubled its performance again:
SELECT wowrapper.*, locale.title
FROM (
    SELECT SQL_NO_CACHE object.object_id, ..., @rn AS numrows
    FROM (
        SELECT *, @rn := @rn + 1
        FROM (
            SELECT SQL_NO_CACHE search.object_id, COUNT(search.object_id) AS hits
            FROM object_keyword AS kwd
            INNER JOIN object_search AS search ON (kwd.keyword_id = search.keyword_id)
            WHERE MATCH (kwd.keyword) AGAINST ('+(frau)')
            GROUP BY search.object_id HAVING hits = 1
        ) AS numrowswrapper
        CROSS JOIN (SELECT @rn := 0) CONST
    ) AS search
    INNER JOIN object AS object ON (search.object_id = object.object_id)
    LEFT JOIN object_color AS color ON (search.object_id = color.object_id)
    WHERE 1
    ORDER BY object.object_id DESC
) AS wowrapper
LEFT JOIN object_locale AS locale ON (wowrapper.object_id = locale.object_id)
LIMIT 0,48
Searches that took 12 seconds (single keyword, ~200K results) now take 6, and a search for two keywords that took 6 seconds (60K results) now takes around 3.5 secs.
Now this is already a massive improvement, but is there any chance to push this further?
Update 2013-08-08 early that day
Undid that last nested variation of the query, since it actually slowed down other variations of it...
I'm now trying some other things with different table layouts and FULLTEXT indexes using MyISAM for a dedicated search table with a combined keyword field (comma separated in a TEXT field).
Update 2013-08-08
Alright, a plain FULLTEXT index doesn't really help.
Back to the previous setup, the only thing blocking is the ORDER BY (which resorts to a temporary table and filesort). Without it, a search completes in less than a second!
So basically what's left of all this is:
How do I optimize the ORDER BY statement to run faster, likely by eliminating the use of the temporary table?
Full text search will be much faster than using the standard SQL string comparison features.
Secondly, if you have a high degree of redundancy in the keywords, you could consider a "many to many" implementation:
Keywords
--------
keyword_id
keyword
keyword_object
-------------
keyword_id
object_id
objects
-------
object_id
......
If this reduces the string comparison from 39 million rows to 100K rows (roughly the size of the English dictionary), you may also see a distinct improvement, as the query would only have to perform 100K string comparisons, and joining on an integer keyword_id and object_id field should be much, much faster than doing 39M string comparisons.
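A sketch of that many-to-many layout and the two-keyword lookup it enables (names and sizes are illustrative):

CREATE TABLE keywords (
    keyword_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    keyword    VARCHAR(64) NOT NULL,
    PRIMARY KEY (keyword_id),
    UNIQUE KEY keyword_uq (keyword)
) ENGINE=InnoDB;

CREATE TABLE keyword_object (
    keyword_id INT UNSIGNED NOT NULL,
    object_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword_id, object_id)
) ENGINE=InnoDB;

-- two string comparisons against ~100K keywords, then integer joins
SELECT ko.object_id, COUNT(*) AS hits
FROM keywords AS k
JOIN keyword_object AS ko USING (keyword_id)
WHERE k.keyword IN ('woman', 'house')
GROUP BY ko.object_id
HAVING hits = 2;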
The best solution for this will be a FULLTEXT search, but you will probably need a MyISAM table for that. You can set up a mirror table and keep it updated with events and triggers, or, if you have a slave replicating from your server, you can change its copy of the table to MyISAM and use it for searches.
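A sketch of the mirror-table idea (the table name is illustrative; only the INSERT trigger is shown, and UPDATE/DELETE would need their own):

-- MyISAM copy of the search data, with a FULLTEXT index
CREATE TABLE object_search_ft (
    keyword   VARCHAR(40) NOT NULL,
    object_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword, object_id),
    FULLTEXT KEY keyword_ft (keyword)
) ENGINE=MyISAM;

-- keep the mirror in sync on insert
CREATE TRIGGER object_search_ai AFTER INSERT ON object_search
FOR EACH ROW
    INSERT INTO object_search_ft (keyword, object_id)
    VALUES (NEW.keyword, NEW.object_id);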
For this query the only thing I can come up with is to rewrite it as:
SELECT s1.object_id
FROM object_search s1
JOIN object_search s2 ON s2.object_id = s1.object_id AND s2.keyword = 'word2'
JOIN object_search s3 ON s3.object_id = s1.object_id AND s3.keyword = 'word3'
....
WHERE s1.keyword = 'word1'
and I'm not sure it will be faster this way.
Also, you will need an index on object_id (assuming your PK is (keyword, object_id)).
If you have seldom INSERTs and often SELECTs, you could optimize your data for reads, i.e. precompute the number of object_ids per keyword and store it directly in the database. The SELECTs would then be very fast; the INSERTs would take some seconds, though.
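For instance, a sketch of such a read-optimized summary (names are illustrative):

CREATE TABLE keyword_stats (
    keyword      VARCHAR(40) NOT NULL,
    object_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword)
) ENGINE=InnoDB;

-- recomputed after inserts (or maintained incrementally)
INSERT INTO keyword_stats (keyword, object_count)
SELECT keyword, COUNT(*)
FROM object_search
GROUP BY keyword
ON DUPLICATE KEY UPDATE object_count = VALUES(object_count);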
Goal of query:
Display race by district.
Query:
SELECT school_data_schools_outer.district_id,
school_data_race_ethnicity_raw_outer.year,
school_data_race_ethnicity_raw_outer.race,
ROUND(
SUM( school_data_race_ethnicity_raw_outer.count) /
(SELECT SUM(count)
FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_inner
INNER JOIN school_data_schools as school_data_schools_inner
USING (school_id)
WHERE school_data_schools_outer.district_id = school_data_schools_inner.district_id
AND school_data_race_ethnicity_raw_outer.year = school_data_race_ethnicity_raw_inner.year) * 100, 2)
FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_outer
INNER JOIN school_data_schools as school_data_schools_outer USING (school_id)
GROUP BY school_data_schools_outer.district_id,
school_data_race_ethnicity_raw_outer.year,
school_data_race_ethnicity_raw_outer.race
mysql> explain SELECT school_data_schools_outer.district_id, school_data_race_ethnicity_raw_outer.year, school_data_race_ethnicity_raw_outer.race,ROUND(SUM(school_data_race_ethnicity_raw_outer.count)/( SELECT SUM(count) FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_inner INNER JOIN school_data_schools as school_data_schools_inner USING (school_id) WHERE school_data_schools_outer.district_id = school_data_schools_inner.district_id and school_data_race_ethnicity_raw_outer.year = school_data_race_ethnicity_raw_inner.year ) * 100,2) FROM school_data_race_ethnicity_raw as school_data_race_ethnicity_raw_outer INNER JOIN school_data_schools as school_data_schools_outer USING (school_id) GROUP BY school_data_schools_outer.district_id, school_data_race_ethnicity_raw_outer.year, school_data_race_ethnicity_raw_outer.race;
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
| 1 | PRIMARY | school_data_race_ethnicity_raw_outer | ALL | school_id,school_id_2 | NULL | NULL | NULL | 84012 | Using temporary; Using filesort |
| 1 | PRIMARY | school_data_schools_outer | eq_ref | PRIMARY | PRIMARY | 257 | rocdocs_main_drupal_7.school_data_race_ethnicity_raw_outer.school_id | 1 | |
| 2 | DEPENDENT SUBQUERY | school_data_race_ethnicity_raw_inner | ref | school_id,year,school_id_2 | year | 4 | func | 8402 | |
| 2 | DEPENDENT SUBQUERY | school_data_schools_inner | eq_ref | PRIMARY | PRIMARY | 257 | rocdocs_main_drupal_7.school_data_race_ethnicity_raw_inner.school_id | 1 | Using where |
+----+--------------------+--------------------------------------+--------+----------------------------+---------+---------+----------------------------------------------------------------------+-------+---------------------------------+
4 rows in set (0.00 sec)
mysql> describe school_data_race_ethnicity_raw;
+-----------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| school_id | varchar(255) | NO | MUL | NULL | |
| year | int(11) | NO | MUL | NULL | |
| race | varchar(255) | NO | | NULL | |
| count | int(11) | NO | | NULL | |
+-----------+--------------+------+-----+---------+----------------+
5 rows in set (0.00 sec)
mysql> describe school_data_schools;
+-------------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------------+------+-----+---------+-------+
| school_id | varchar(255) | NO | PRI | NULL | |
| grade_level | varchar(255) | NO | | NULL | |
| district_id | varchar(255) | NO | | NULL | |
| school_name | varchar(255) | NO | | NULL | |
| address | varchar(255) | NO | | NULL | |
| city | varchar(255) | NO | | NULL | |
| lat | decimal(20,10) | NO | | NULL | |
| lon | decimal(20,10) | NO | | NULL | |
+-------------+----------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
NOTE: I have also tried:
select sds.school_id,
detail.year,
detail.race,
ROUND((detail.count / summary.total) * 100 ,2) as percent
FROM school_data_race_ethnicity_raw as detail
inner join school_data_schools as sds USING (school_id)
inner join (
select sds2.district_id, year, sum(count) as total
from school_data_race_ethnicity_raw
inner join school_data_schools as sds2 USING (school_id)
group by sds2.district_id, year
) as summary on summary.district_id = sds.district_id
and summary.year = detail.year
This is slow because:
You have no index in use on school_data_race_ethnicity_raw_outer, so it's scanning each of the ~84,000 rows
You are using a correlated subquery, which means that your complex calculation has to be run once per row, i.e. 84,000 times.
The best approach is to avoid the correlated subquery entirely; failing that, to make it fast you need covering indexes, so that the whole of the inner query (and the other parts, via their own indexes) can run lightning fast using just the index. For a great tutorial on the subject of indexes, check this out. It taught me a lot! Right now, your inner query uses only the year index on school_data_race_ethnicity_raw, so it has to look up everything else it needs by reading ~8,000 rows for every one of the 84,000 calculations. Indexes will make this far faster, e.g. create a composite index on school_data_race_ethnicity_raw and you will find it helps:
CREATE INDEX inner_composite ON school_data_race_ethnicity_raw (year, school_id, count);
This will allow all the fields used in the WHERE to be gotten from the index, then the join field, then the field you want for the select. You should see it show up in the 'key' column of your explain result. Also, if you get it right, you'll see 'using index' in the right-most column, showing that no table access is happening, which is orders of magnitude faster.
You can experiment quick-and-dirty style by adding loads of indexes for the columns that the query mentions and see what gets picked up in the key column. If something appears, read your query to see what other columns from that table are in use, then add a new index with those columns added in too on the right hand side and see if that works better. Remember to delete the unused indexes once you find out what works.
MySQL doesn't allow you to directly index the SUM of a column, which would be the fastest way, so unless you want to move to another DB (good idea if you can), this will always be a little slow.
This should be all you need to aggregate your data and get a count of race by district. I'm not sure why you are doing so much math in your original; it is unnecessary for your goal and forces some crazy subqueries.
SELECT SUM(students.count) as studentCount, schools.district_id, students.race
FROM school_data_schools schools,
school_data_race_ethnicity_raw students
WHERE schools.school_id = students.school_id
GROUP BY district_id, race
You probably also want an index on school_data_race_ethnicity_raw.school_id (alone, not as part of a multiple column key)
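For example (the index name is arbitrary):

CREATE INDEX students_school_id ON school_data_race_ethnicity_raw (school_id);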
EDIT: I was not aware the OP was looking for a percentage breakdown, not just totals.
SELECT ((studentCount / districtTotal) * 100) as percentage, district_id, race
FROM(
SELECT SUM(students.count) as studentCount, schools.district_id, students.race,
(SELECT SUM(inStudents.count)
FROM school_data_schools inSchools,
school_data_race_ethnicity_raw inStudents
WHERE inSchools.school_id = inStudents.school_id
AND inSchools.district_id = schools.district_id
GROUP BY inSchools.district_id) as districtTotal
FROM school_data_schools schools,
school_data_race_ethnicity_raw students
WHERE schools.school_id = students.school_id
GROUP BY district_id, race
) table1
This will run pretty quickly. You still need to make sure there is an index on school_data_race_ethnicity_raw.school_id that is not part of a multiple-column index. You can see it in action here; though my test case is rather small, it does seem to check out.
I am trying to avoid the filesort but am not having any luck removing it from the inner query. If I move the condition to the outer query, it returns nothing.
Create table articles (
article_id Int UNSIGNED NOT NULL AUTO_INCREMENT,
editor_id Int UNSIGNED NOT NULL,
published_date Datetime,
Primary Key (article_id)) ENGINE = InnoDB;
Create Index published_date_INX ON articles (published_date);
Create Index editor_id_INX ON articles (editor_id);
EXPLAIN SELECT article_id, published_date FROM articles AA INNER JOIN
(
SELECT article_id
FROM articles
WHERE editor_id=1
ORDER BY published_date DESC
LIMIT 100, 5
) ART USING (article_id);
+----+-------------+------------+--------+---------------+-----------+---------+----------------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+-----------+---------+----------------+--------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5 | |
| 1 | PRIMARY | AA | eq_ref | PRIMARY | PRIMARY | 4 | ART.article_id | 1 | |
| 2 | DERIVED | articles | ALL | editor_id | editor_id | 5 | | 114311 | Using filesort |
+----+-------------+------------+--------+---------------+-----------+---------+----------------+--------+----------------+
3 rows in set (30.31 sec)
Any suggestions on how to remove the filesort from this query?
Maybe you can try adding an index on editor_id, published_date.
create index edpub_INX on articles (editor_id, published_date);
On your inner query:
SELECT article_id
FROM articles
WHERE editor_id=1
ORDER BY published_date DESC
LIMIT 100, 5
MySQL's query planner is probably thinking that filtering by editor_id (using the index) and then ordering by published_date is better than using published_date_INX and then filtering by editor_id. And the query planner is probably right.
So, if you want to "help" on that specific query, create an index on editor_id, published_date and see if it helps your query run faster.
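With that composite index, the inner query's WHERE matches the index's leading column, and the ORDER BY can be satisfied by reading the same index in reverse, so in principle no filesort is needed (confirm with EXPLAIN). A sketch, with the column name fixed:

SELECT AA.article_id, AA.published_date
FROM articles AA
INNER JOIN (
    SELECT article_id
    FROM articles
    WHERE editor_id = 1            -- equality on the first column of (editor_id, published_date)
    ORDER BY published_date DESC   -- sort order supplied by the same index
    LIMIT 100, 5
) ART USING (article_id);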