How can I avoid a full table scan on this mysql query? - mysql

explain
select
*
from
zipcode_distances z
inner join
venues v
on z.zipcode_to=v.zipcode
inner join
events e
on v.id=e.venue_id
where
z.zipcode_from='92108' and
z.distance <= 5
I'm trying to find all "events at venues within 5 miles of zipcode 92108", however, I am having a hard time optimizing this query.
Here is what the explain looks like:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, e, ALL, idx_venue_id, , , , 60024,
1, SIMPLE, v, eq_ref, PRIMARY,idx_zipcode, PRIMARY, 4, comedyworld.e.venue_id, 1,
1, SIMPLE, z, ref, idx_zip_from_distance,idx_zip_to_distance,idx_zip_from_to, idx_zip_from_to, 30, const,comedyworld.v.zipcode, 1, Using where; Using index
I'm getting a full table scan on the "e" table, and I can't figure out what index I need to create to get it to be fast.
Any advice would be appreciated
Thank you

Based on the EXPLAIN output in your question, you already have all the indexes the query should be using, namely:
CREATE INDEX idx_zip_from_distance
ON zipcode_distances (zipcode_from, distance, zipcode_to);
CREATE INDEX idx_zipcode ON venues (zipcode, id);
CREATE INDEX idx_venue_id ON events (venue_id);
(I'm not sure from your index names whether idx_zip_from_distance really includes the zipcode_to column. If not, you should add it to make it a covering index. Also, I've included the venues.id column in idx_zipcode for completeness, but, assuming it's the primary key for the table and that you're using InnoDB, it will be included automatically anyway.)
However, it looks like MySQL is choosing a different, and possibly suboptimal, query plan, where it scans through all events, finds their venues and zip codes, and only then filters the results on distance. This could be the optimal query plan, if the cardinality of the events table was low enough, but from the fact that you're asking this question I assume it's not.
One reason for the suboptimal query plan could be the fact that you have too many indexes which are confusing the planner. For instance, do you really need all three of those indexes on the zipcode table, given that the data it stores is presumably symmetric? Personally, I'd suggest only the index I described above, plus a unique index (which can also be the primary key, if you don't have an artificial one) on (zipcode_to, zipcode_from) (preferably in that order, so that any occasional queries on zipcode_to=? can make use of it).
However, based on some testing I did, I suspect the main issue why MySQL is choosing the wrong query plan comes simply down to the relative cardinalities of your tables. Presumably, your actual zipcode_distances table is huge, and MySQL isn't smart enough to realize quite how much the conditions in the WHERE clause really narrow it down.
If so, the best and simplest fix may be to simply force MySQL to use the indexes you want:
select
*
from
zipcode_distances z
FORCE INDEX (idx_zip_from_distance)
inner join
venues v
FORCE INDEX (idx_zipcode)
on z.zipcode_to=v.zipcode
inner join
events e
FORCE INDEX (idx_venue_id)
on v.id=e.venue_id
where
z.zipcode_from='92108' and
z.distance <= 5
With that query, you should indeed get the desired query plan. (You do need FORCE INDEX here, since with just USE INDEX the query planner could still decide to use a table scan instead of the suggested index, defeating the purpose. I had this happen when I first tested this.)
P.S. Here's a demo on SQLize, both with and without FORCE INDEX, demonstrating the issue.

Have you indexed the join columns in both tables?
v.id and e.venue_id
If not, create indexes on them. If you already have, it could be that one or more tables contain so few rows that the optimizer decides a full scan is more efficient than an indexed read.

You could use a subquery:
select * from zipcode_distances z, venues v, events e
where
z.id in (select zd.id from zipcode_distances zd where zd.zipcode_from='92108' and zd.distance <= 5)
and z.zipcode_to=v.zipcode
and v.id=e.venue_id

You are selecting all columns from all tables (select *) so there is little point in the optimizer using an index when the query engine will then have to do a lookup from the index to the table on every single row.

Related

How can i speed up the left join in my query using indexes?

I am new to SQL. At the moment I am experiencing some slower MySQL queries. I think I need to improve my indexes but not sure how.
drop temporary table if exists temp ;
CREATE TEMPORARY TABLE temp
(index idx_a (EXTRACT_DATE, project_id, SERVICE_NAME) )
select distinct DATE(c.EXTRACT_DATETIME) as EXTRACT_DATE,p.project_id, p.project_name, c.CLUSTER_NAME, c.SERVICE_NAME,
UPPER(CONCAT(SUBSTRING_INDEX(c.ENV_NAME, '-', 1),'-',c.CLUSTER_NAME)) as CLUSTER_ID
from p
left join c
on p.project_id = c.project_id ;
The short answer is that you need indexes, at least to optimize the lookups done by the JOIN. The EXPLAIN shows that both tables you are joining are read with a full table scan and then joined the hard way, using "block nested loop", which indicates that no index is being used.
It would help to at least create an index on c.project_id.
ALTER TABLE c ADD INDEX (project_id);
This would mean there is still a table-scan to read the p table (estimated 5720 rows), but at least when it needs to find the related rows in c, it only reads the rows it needs, without doing a table-scan of 287K rows for each row of p.
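As a rough illustration of the difference, here is a small Python sketch (not MySQL code). The miniature p and c tables, the service_name values, and the "examined" counter are all made up for illustration; a sorted list with binary search stands in for the B-tree index on c.project_id.

```python
import bisect

# Hypothetical miniature stand-ins for the p and c tables.
p_rows = [{"project_id": i} for i in range(5)]
c_rows = [{"project_id": i % 3, "service_name": f"svc{i}"} for i in range(12)]

def left_join_full_scan(p_rows, c_rows):
    """LEFT JOIN where every row of p triggers a scan of all of c."""
    examined, joined = 0, []
    for p in p_rows:
        matches = []
        for c in c_rows:
            examined += 1  # every c row is looked at for every p row
            if c["project_id"] == p["project_id"]:
                matches.append(c["service_name"])
        joined.extend((p["project_id"], name) for name in (matches or [None]))
    return joined, examined

def left_join_indexed(p_rows, c_rows):
    """LEFT JOIN that finds matching c rows through a sorted 'index'."""
    ordered = sorted(c_rows, key=lambda r: r["project_id"])  # stand-in for the index
    keys = [r["project_id"] for r in ordered]
    examined, joined = 0, []
    for p in p_rows:
        lo = bisect.bisect_left(keys, p["project_id"])
        hi = bisect.bisect_right(keys, p["project_id"])
        examined += hi - lo  # only the matching c rows are read
        matches = [r["service_name"] for r in ordered[lo:hi]]
        joined.extend((p["project_id"], name) for name in (matches or [None]))
    return joined, examined

scan_result, scan_examined = left_join_full_scan(p_rows, c_rows)
index_result, index_examined = left_join_indexed(p_rows, c_rows)
print(scan_examined, index_examined)
```

Both functions return the same joined rows; only the number of c rows touched differs, which is the effect the index is meant to have.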
The query you posted in an earlier question had another condition:
where DAYNAME(c.EXTRACT_DATETIME) = 'Friday' ;
I don't know why you haven't included this condition in the new question you posted.
If this is still a condition you need to handle, this could help optimize the query further. MySQL 5.7 (which you said in the other question you are using) supports virtual columns, defined for an expression, and you can index virtual columns.
ALTER TABLE c
ADD COLUMN isFriday TINYINT(1) AS (DAYNAME(EXTRACT_DATETIME) = 'Friday'),
ADD INDEX (isFriday);
Then if you search on the new isFriday column, or even if you search on the same expression used for the virtual column definition, it will use the index.
So what you really need is an index on c that uses both columns, one for the join, and then for the additional condition.
ALTER TABLE c
ADD COLUMN isFriday TINYINT(1) AS (DAYNAME(EXTRACT_DATETIME) = 'Friday'),
ADD INDEX (project_id, isFriday);
You aren’t filtering on anything other than the outer join column. This leads me to expect that most of the rows in both tables are going to need reading. In order to do this only once, you may be best off using a hash join rather than a nested loop and index. A hash join will allow both tables to be read completely once rather than the back and forth approach of a nested loop which will likely mean the same pages read each time a row is looked up.
In order to use hash joins, you need to be running MySQL 8.0.18 or later. Using the latest available stable release is recommended.
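To make the idea concrete, here is a minimal Python sketch of a hash join used for a LEFT JOIN. It is only an illustration of the technique, not how MySQL implements it; the function name hash_left_join and the sample rows are hypothetical.

```python
def hash_left_join(p_rows, c_rows, key):
    """LEFT JOIN p to c on `key`, reading each input exactly once."""
    # Build phase: one pass over c, grouping its rows by join key.
    buckets = {}
    for c in c_rows:
        buckets.setdefault(c[key], []).append(c)

    # Columns of c to fill with None when a p row has no match.
    null_row = {col: None for col in c_rows[0] if col != key} if c_rows else {}

    # Probe phase: one pass over p, looking each key up in the hash table.
    joined = []
    for p in p_rows:
        matches = buckets.get(p[key])
        if matches:
            joined.extend({**p, **c} for c in matches)
        else:
            joined.append({**p, **null_row})
    return joined

# Hypothetical sample data.
p_rows = [
    {"project_id": 1, "project_name": "alpha"},
    {"project_id": 2, "project_name": "beta"},
]
c_rows = [
    {"project_id": 1, "cluster_name": "east"},
    {"project_id": 1, "cluster_name": "west"},
]
joined = hash_left_join(p_rows, c_rows, "project_id")
for row in joined:
    print(row)
```

Each table is read once, whichever side has more rows, which is why this approach can win when most rows of both tables are needed anyway.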

Need some clarification on indexes (WHERE, JOIN)

We are facing some performance issues in some reports that work on millions of rows. I tried optimizing sql queries, but it only reduces the time of execution to half.
The next step is to analyse and modify or add some indexes, so I have some questions:
1- The SQL queries contain a lot of joins: do I have to create an index for each foreign key?
2- Imagine the query SELECT * FROM A LEFT JOIN B on a.b_id = b.id where a.attribute2 = 'someValue', and we have an index on table A on (b_id, attribute2): does my query use this index for the WHERE part? (I know the index would be used if both conditions were in the WHERE clause.)
3- If an index is based on columns C1, C2 and C3, and I decide to add an index based on C2, do I need to remove C2 from the first index?
Thanks for your time
You can use EXPLAIN on a query to see what MySQL will do when executing it. This helps a LOT when trying to figure out why it's slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
1. Only one index can be used per JOIN and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
2. There is only a difference between WHERE and JOIN conditions when there are NULLs (missing rows) for the joined table (there is no difference at all for INNER JOIN). For your example the index on b_id does nothing. If you change it to an INNER JOIN (e.g. by adding b.something = 42 in the WHERE clause), then it might be used if MySQL determines that it should do the query in reverse (first b, then a).
3. No. It is 100% OK to have a column in multiple indexes. If you have an index on (A,B,C) and you add another one on (A), that will be redundant and pointless (because it is a prefix of another index). An index on B is perfectly fine.
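The prefix rule from the last point can be written down as a small check. This is a Python sketch with a hypothetical helper name (is_redundant); it only encodes the leftmost-prefix rule stated above and knows nothing about unique or covering indexes.

```python
def is_redundant(candidate, existing_indexes):
    """True if `candidate` is a leftmost prefix of (or equal to) an existing index."""
    candidate = tuple(candidate)
    return any(
        tuple(index[:len(candidate)]) == candidate
        for index in existing_indexes
    )

existing = [("A", "B", "C")]

print(is_redundant(("A",), existing))      # (A) is a prefix of (A,B,C)
print(is_redundant(("A", "B"), existing))  # (A,B) is a prefix as well
print(is_redundant(("B",), existing))      # (B) is not a prefix, so it adds something
```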

Indexes in MySQL

I've only started using INDEXes in my MySQL database and I'm a little unsure if what I have in mind will work. I have a TEXT field that can store a large body of text and will need to be searched, along with another id INT field. If I have an INDEX on say my id_column field and a FULLTEXT index on my text_column, will MySQL use both in a query such as
SELECT * FROM notes WHERE id_column='123' AND MATCH(text_column) AGAINST(search_text)
??
Secondly, I have a group of columns that are frequently searched in combination. If I create a multi-column INDEX on these columns, the index will still work if the columns used are together left-to-right in the index. But what happens if the user leaves out a particular column, say C, and searches using A, B, D in an index like (A, B, C, D)?
For question 1:
Yes, the query will use both indices. FULLTEXT indices can be kind of tricky, however, so it's a good idea to read the MySQL documentation thoroughly on them and use EXPLAIN on your queries to make sure they are properly utilizing indices.
For question 2:
If you have a multiple column index, the index has to have the same columns in the same order as the query to be used. So in your example, the index wouldn't be utilized.
EXPLAIN is a very powerful tool for understanding how queries use indices, and it's a good idea to use it frequently (especially on queries which are programmatically generated). http://dev.mysql.com/doc/refman/5.0/en/explain.html
There is no guarantee that MySQL will use two indexes for the same table in one query. In general, it won't. But sometimes it performs an "index merge," searching both indexes and combining the results.
Not all queries can do this, however. You should read about this feature here: http://dev.mysql.com/doc/refman/5.6/en/index-merge-optimization.html
Regarding multi-column indexes, if you have an index on columns A, B, C, D, and you do a search on columns A, B, D, then the index may be used, but only so far as it narrows down the search based on your conditions for columns A and B.
You can see evidence of this if you use EXPLAIN and look at the "key_len" field. The key_len will be the total number of bytes in the columns that are used in that multi-column index. For example, if A, B, C, D are four 4-byte integers, the key_len could be as much as 16. But if only A and B are used, the key_len will be 8.
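The arithmetic behind that example can be sketched in a few lines of Python. The helper name used_prefix is hypothetical, and the sketch assumes equality conditions only and four NOT NULL 4-byte integer columns, as in the example above.

```python
def used_prefix(index_columns, equality_columns):
    """Return the leading index columns that equality conditions can use."""
    used = []
    for column in index_columns:
        if column not in equality_columns:
            break  # the first gap ends the usable part of the index
        used.append(column)
    return used

INDEX = ["A", "B", "C", "D"]
BYTES = {column: 4 for column in INDEX}  # assumption: NOT NULL 4-byte INTs

used = used_prefix(INDEX, {"A", "B", "D"})   # the search skips C
key_len = sum(BYTES[column] for column in used)
print(used, key_len)
```

The condition on D still has to be checked, but only as a filter on the rows that the (A, B) prefix has already narrowed down.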
Given this query:
SELECT * FROM notes
WHERE id_column='123'
AND MATCH(text_column) AGAINST(search_text)
the only way the optimizer will perform it (to my knowledge) is to
1. Use FULLTEXT(text_column) to do the second part of the search, then
2. Filter out those without id_column='123'; no index will be used for this step.
That's the general rule when mixing FULLTEXT and non-fulltext indexes -- FULLTEXT first; no other indexes used.
However... Here is a trick that sometimes speeds up complex queries:
SELECT b.*
FROM (
SELECT id -- assuming this is the PRIMARY KEY
FROM notes
WHERE MATCH(text_column) AGAINST(search_text)
) AS a
JOIN notes AS b -- "self join"
ON b.id = a.id -- just the PK
JOIN ((other tables)) ON ...
WHERE ((other messy or bulky stuff)) ...
The idea is to use the subquery to condense down to a short list of small values (the ids), then reach back in (or further JOIN) to get the bulky stuff.
For building optimal composite indexes for some simple queries, see my index cookbook.

(Why) Can't MySQL use index in such cases?

1 - PRIMARY used in a secondary index, e.g. secondary index on (PRIMARY,column1)
2 - I'm aware mysql cannot continue using the rest of an index as soon as one part was used for a range scan, however: IN (...,...,...) is not considered a range, is it? Yes, it is a range, but I've read on mysqlperformanceblog.com that IN behaves differently than BETWEEN according to the use of index.
Could anyone confirm those two points? Or tell me why this is not possible? Or how it could be possible?
UPDATE:
Links:
http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/
http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/comment-page-1/#comment-952521
UPDATE 2: example of nested SELECT:
SELECT * FROM user_d1 uo
WHERE EXISTS (
SELECT 1 FROM `user_d1` ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
AND ui.id=uo.id
)
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
So, the outer SELECT uses timestamp_lastonline for sorting, the inner either PK to connect with the outer or birthdate for filtering.
What other options rather than this query are there if MySQL cannot use index on a range scan and for sorting?
The column(s) of the primary key can certainly be used in a secondary index, but it's not often worthwhile. The primary key guarantees uniqueness, so any columns listed after it cannot be used for range lookups. The only time it will help is when a query can use the index alone.
As for your nested select, the extra complication should not beat the simplest query:
SELECT * FROM user_d1 uo
WHERE uo.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
MySQL will choose between a birthdate index or a timestamp_lastonline index based on which it feels will have the best chance of scanning fewer rows. In either case, the column should be the first one in the index. The birthdate index will also carry a sorting penalty, but might be worthwhile if a large number of recent users will have birth dates outside of that range.
If you wish to control the order, or potentially improve performance, a (timestamp_lastonline, birthdate) or (birthdate, timestamp_lastonline) index might help. If it doesn't, and you really need to select based on the birthdate first, then you should select from the inner query instead of filtering on it:
SELECT * FROM (
SELECT * FROM user_d1 ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
) as uo
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
Even then, MySQL's optimizer might choose to rewrite your query if it finds a timestamp_lastonline index but no birthdate index.
And yes, IN (..., ..., ...) behaves differently than BETWEEN. Only the latter can effectively use a range scan over an index; the former would look up each item individually.
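The distinction drawn here can be pictured with a sorted list standing in for the index. This is a Python sketch of the idea only, not of MySQL's internals; the sample values, the function names, and the "seeks" counter are made up for illustration.

```python
import bisect

# Hypothetical index: birth years kept in sorted order, duplicates allowed.
index_keys = sorted([1985, 1988, 1990, 1990, 1991, 1993, 1995])

def scan_between(keys, low, high):
    """One seek to the start of the range, then a sequential read to its end."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return keys[start:end], 1

def scan_in(keys, values):
    """One separate seek for every distinct value in the IN list."""
    found, seeks = [], 0
    for value in sorted(set(values)):
        seeks += 1
        start = bisect.bisect_left(keys, value)
        end = bisect.bisect_right(keys, value)
        found.extend(keys[start:end])
    return found, seeks

between_rows, between_seeks = scan_between(index_keys, 1990, 1991)
in_rows, in_seeks = scan_in(index_keys, [1990, 1991])
print(between_rows, between_seeks)
print(in_rows, in_seeks)
```

Both calls find the same entries here; the IN version just gets there through one lookup per listed value.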
2. IN will obviously differ from BETWEEN. If you have an index on that column, BETWEEN only needs to find the starting point and then reads on from there. With IN, it looks up each value in the index separately, so it performs as many lookups as there are values, compared to BETWEEN's single lookup.
Yes, @Andrius_Naruševičius is right: the IN statement is merely shorthand for EQUALS OR EQUALS OR EQUALS and has no inherent order whatsoever, whereas BETWEEN is a comparison operator with an implicit greater-than and less-than, and therefore absolutely loves indexes.
I honestly have no idea what you are talking about, but it does seem you are asking a good question; I just have no notion what it is :-). Are you saying that a primary key cannot be part of a second index? Because it absolutely can. The primary key never needs to be indexed separately because it is ALWAYS indexed automatically, so if you are getting an error/warning (I assume you are?) about supplementary indices, then it's not the second or third index causing it; it's the PRIMARY KEY not needing it, and your mentioning it is probably the error. Having said that, I have no idea what question you asked. This is my answer to my best guess at your actual question.

How to make my MySQL SUM() query faster

I have about 1 million rows in the users table, with columns A AA B BB C CC D DD E EE F FF, for example, holding int values 0 and 1 that I want to count.
SELECT
CityCode,SUM(A),SUM(B),SUM(C),SUM(D),SUM(E),SUM(F),SUM(AA),SUM(BB),SUM(CC),SUM(DD),SUM(EE),SUM(FF)
FROM users
GROUP BY CityCode
Result 8 rows in set (24.49 sec).
How can I make my statement faster?
Use EXPLAIN to see the execution plan of your query.
Create at least one index. If possible, make CityCode the primary key.
Try this one
SELECT CityCode,SUM(A),SUM(B),SUM(C),SUM(D), SUM(E),SUM(F),SUM(AA),SUM(BB),SUM(CC),SUM(DD),SUM(EE),SUM(FF)
FROM users
GROUP BY CityCode,A,B,C,D,E,F,AA,BB,CC,DD,EE,FF
Create an index on the CityCode column.
I believe it is not because of SUM(). Try running select CityCode from users group by CityCode; it should take nearly the same time...
Use better hardware
increase caching size - if you use InnoDB engine, then increase the innodb_buffer_pool_size value
refactor your query to limit the number of users (if business logic permits that, of course)
You have no WHERE clause, which means the query has to scan the whole table. This will make it slow on a large table.
You should consider how often you need to do this and what the impact of it being slow is. Some suggestions are:
Don't change anything - if it doesn't really matter
Have a table which contains the same data as "users", but without any of the columns that you aren't interested in querying. It will still be slow, but not as slow, especially if the omitted columns are large
(InnoDB) use CityCode as the first part of the primary key for table "users", that way it can do a PK scan and avoid any sorting (may still be too slow)
Create and maintain some kind of summary table, but you'll need to update it each time a user changes (or tolerate stale data)
But be sure that this optimisation is absolutely necessary.
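The summary-table suggestion can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the city_totals table, the add_user helper, and the reduction to two counter columns (A and B) are assumptions for illustration. In MySQL the same upkeep could be done with INSERT ... ON DUPLICATE KEY UPDATE or with triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, CityCode TEXT, A INTEGER, B INTEGER);
    -- Hypothetical summary table: one row per city instead of one per user.
    CREATE TABLE city_totals (CityCode TEXT PRIMARY KEY, sum_a INTEGER, sum_b INTEGER);
""")

def add_user(conn, city, a, b):
    """Insert a user and keep the per-city totals in step with it."""
    conn.execute("INSERT INTO users (CityCode, A, B) VALUES (?, ?, ?)", (city, a, b))
    updated = conn.execute(
        "UPDATE city_totals SET sum_a = sum_a + ?, sum_b = sum_b + ? WHERE CityCode = ?",
        (a, b, city),
    )
    if updated.rowcount == 0:  # first user seen for this city
        conn.execute("INSERT INTO city_totals VALUES (?, ?, ?)", (city, a, b))

for city, a, b in [("NYC", 1, 0), ("NYC", 1, 1), ("LA", 0, 1)]:
    add_user(conn, city, a, b)

# The original report: scans every row of users.
slow = conn.execute(
    "SELECT CityCode, SUM(A), SUM(B) FROM users GROUP BY CityCode ORDER BY CityCode"
).fetchall()
# The same report from the summary table: reads one row per city.
fast = conn.execute(
    "SELECT CityCode, sum_a, sum_b FROM city_totals ORDER BY CityCode"
).fetchall()
print(slow)
print(fast)
```

The trade-off is the one named above: every change to users now costs an extra write, and updates or deletes of users would need the same bookkeeping, which this sketch leaves out.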