MySQL - SELECT query is very slow when there are only a few results

I have table1 with 1M rows in my db.
columns: {id, name, timestamp, tag, r, g, b}
indexes: {primary: id, index: timestamp, index: (tag,r,g,b)}
Each row has a tag (an integer) and a color, which is stored by its components (r, g, b) in separate columns. My queries look like this:
SELECT * from table1 WHERE tag=... AND (r>... AND r<... AND g>... AND g<... AND b>... AND b<...) ORDER BY timestamp DESC LIMIT 24;
The problem is that when there are only a few records in the db for the selected filters (tag and color), the query is very slow (15 seconds). It is also notable that when I remove ORDER BY timestamp DESC from the query, it runs very fast, even if there are only a few results. How can I solve this and make the query fast?

I'm not sure what you mean by "few", but 15 seconds seems like a long time.
You want an index on this query, on (tag, r, g, b).
That said, this is not a truly optimal index; more precisely, it is about as good as you can get in MySQL. The type of index you really want is an R-Tree, which is optimized for range conditions on multiple dimensions. The primary use case is GIS (geographic information systems).
However, I don't think that MySQL supports R-Trees as a generic index type (it only uses them for spatial data). Hopefully, tag is highly selective and the above index will work well.

INDEX(tag, timestamp)
may help somewhat.
The general problem is that the Optimizer sees two semi-useful indexes but has no adequate clues as to which one to pick, and then it picks the less beneficial one.
Adding these may help when you have a relatively narrow choice for g or b:
INDEX(tag, g)
INDEX(tag, b)
Unfortunately you have 4 "ranges" in the query (r, g, b in the WHERE clause, plus the ORDER BY on timestamp) and the Optimizer can use only one. I stuck tag in front of each (including your existing (tag, r, g, b), which won't get used beyond r).
= tests should go first; the index can end with one range; any subsequent range test (g and b, in your case) will be ignored for index purposes.
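A minimal sketch of those additions (the index names are mine, not from the question):
ALTER TABLE table1
  ADD INDEX idx_tag_ts (tag, timestamp),
  ADD INDEX idx_tag_g (tag, g),
  ADD INDEX idx_tag_b (tag, b);
-- idx_tag_ts lets the optimizer read rows already in timestamp order within one tag,
-- avoiding the filesort; idx_tag_g and idx_tag_b help when the g or b range is narrow.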

Related

MySQL - Poor performance in a select from a simple table

I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are indexed and contain no repetitions. Moreover, both columns have values in ascending order.
The table has nearly 400K records.
I need to select the string when a test value lies between the values of columns 1 and 2; in other words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
select_type: SIMPLE
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
rows: 441
filtered: 33.33
Extra: Using where
If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping (as they would be for, say, IP address ranges)?
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
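Spelled out against the table in the question, that kludge might look like this (a sketch that only makes sense if the ranges don't overlap; Test_Value stands for your search value, and the outer check confirms the candidate row really contains it):
SELECT MyString
FROM (
  SELECT MyString, Col_2
  FROM MyTable
  WHERE Col_1 <= Test_Value
  ORDER BY Col_1 DESC
  LIMIT 1
) candidate
WHERE Test_Value <= Col_2;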
Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
Satisfying this query using indexes would go like this: first, we must search a table or index for all rows matching the first of these criteria:
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
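A sketch of that covering index with the original names (the index name is mine; if MyString is long you may need a prefix length instead of the full column):
CREATE INDEX idx_col1_col2_str ON MyTable (Col_1, Col_2, MyString);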
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
Read this: http://use-the-index-luke.com/
Well... I found a suitable solution for me (not sure you guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ,...).
I then use the following to get the index of the table that is supposed to contain the searched-for range:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved index to select the correct partition table from which to perform the actual search.
(I guess it would be more elegant to use dynamic SQL, but I will take care of that later; for now I just wanted to test the approach.)
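For reference, a rough sketch of the selector table described above (names and types are guesses based on the description):
CREATE TABLE tbl_selector (
  Table_Index INT NOT NULL,
  start_range BIGINT NOT NULL,
  PRIMARY KEY (start_range)
);
-- one row per partition table, holding the minimal Col_1 value stored in that partition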
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the value being tested. This, by the way, was not the case with the previous approach: there, if the value was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete.
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!

MySQL multiple index optimization

I have a question about optimizing sql queries with multiple index.
Imagine I have a table "TEST" with fields "A, B, C, D, E, F".
In my code (php), I use the following "WHERE" query :
Select (..) from TEST WHERE a = 'x' and B = 'y'
Select (..) from TEST WHERE a = 'x' and B = 'y' and F = 'z'
Select (..) from TEST WHERE a = 'x' and B = 'y' and (D = 'w' or F = 'z')
What is the best approach to get the best speed when running these queries?
Three multiple-column indexes like (A, B), (A, B, F) and (A, B, D, F)?
Or a single multiple-column index (A, B, D, F)?
I would tend to say that the three indexes would be best, even if they take up more space in the database.
In my case I am after the best execution time, not the smallest footprint; the database is of a reasonable size.
Multiple-column indexes:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
In other words, it is a waste of space and computing power to define an index that covers the same first N columns as another index, in the same order.
The best way to evaluate an index is in practice. Use EXPLAIN in MySQL: it will show you the query plan and tell you which index will be used. It will also give you an estimate of how many rows the query will examine. Here is an example:
EXPLAIN SELECT * FROM TEST WHERE A = 'x' AND B = 'y';
It is hard to give definitive answers without experiments.
BUT: ordinarily an index like (A,B,D) is considered to be superfluous if you have an index on (A,B,D,F). So, in my opinion you only need the one multicolumn index.
There is one other consideration. If your table has a lot of columns and a lot of rows and your SELECT list has a small subset of those columns, you might consider including those columns in your index. For example, if your query says SELECT D,F,G,H FROM ... you should try creating an index on
(A,B,D,F,G,H)
as it will allow the query to be satisfied from the index without having to refer back to the rows of the table. This can sometimes help performance a great deal.
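For instance, a sketch following the example above (G and H are the hypothetical extra columns from that SELECT list, and idx_cover is an assumed name):
ALTER TABLE TEST ADD INDEX idx_cover (A, B, D, F, G, H);
-- SELECT D, F, G, H FROM TEST WHERE A = 'x' AND B = 'y' AND D = 'w' AND F = 'z'
-- can then be answered from the index alone (EXPLAIN shows "Using index").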
It's hard to explain well, but generally you should use as few indexes as you can get away with, using as many columns of the common queries as you can, with the most commonly queried columns first.
In your example WHERE clauses, A and B are always included. These should thus be part of an index. If A is more commonly used in a search then list that first, if B is more commonly used then list that first. MySQL can partially use the index as long as each column (seen from the left) in the index is used in the WHERE clause. So if you have an index ( A, B, C ) then WHERE ( A = .. AND B = .. AND Z = .. ) can still use that index to narrow down the search. If you have a WHERE ( B = .. AND Z = .. ) clause then A isn't part of the search condition and it can't be used for that index.
You want a single multiple-column index on A, B, D, F or on A, B, F, D (only one of these can be used at a time); which one depends mostly on how often D or F is queried and on the distribution of the data. Say most of the values in D are 0 but one in a hundred is 1: that column then has a poor key distribution, and putting it in the index wouldn't be all that useful.
The optimiser can use a composite index for WHERE conditions that follow the order of the index columns with no gaps:
An index on (A,B,F) will cover the first two queries.
The last query is a bit trickier, because of the OR. I think only the A and B conditions will be covered by (A,B,F) but using a separate index (D) or index (F) may speed up the query depending on the cardinality of the rows.
I think an index on (A,B,D,F) can only be used for the A and B conditions on all three queries: not the F condition on query two, because the D value in the index can be anything, and not the D and F conditions on query three, because of the OR.
You may have to add hints to the query to get the optimiser to use the best index and you can see which indexes are being used by running an EXPLAIN ... on the query.
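As a sketch (idx_abf is an assumed name for an index on (A, B, F)), comparing plans might look like:
EXPLAIN SELECT * FROM TEST WHERE A = 'x' AND B = 'y' AND (D = 'w' OR F = 'z');
SELECT * FROM TEST FORCE INDEX (idx_abf)
WHERE A = 'x' AND B = 'y' AND (D = 'w' OR F = 'z');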
Also, adding indexes slows down DML statements and can cause locking issues, so it's best to avoid over-indexing where possible.

How can I avoid a full table scan on this mysql query?

explain
select
*
from
zipcode_distances z
inner join
venues v
on z.zipcode_to=v.zipcode
inner join
events e
on v.id=e.venue_id
where
z.zipcode_from='92108' and
z.distance <= 5
I'm trying to find all "events at venues within 5 miles of zipcode 92108", however, I am having a hard time optimizing this query.
Here is what the explain looks like:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, e, ALL, idx_venue_id, , , , 60024,
1, SIMPLE, v, eq_ref, PRIMARY,idx_zipcode, PRIMARY, 4, comedyworld.e.venue_id, 1,
1, SIMPLE, z, ref, idx_zip_from_distance,idx_zip_to_distance,idx_zip_from_to, idx_zip_from_to, 30, const,comedyworld.v.zipcode, 1, Using where; Using index
I'm getting a full table scan on the "e" table, and I can't figure out what index I need to create to get it to be fast.
Any advice would be appreciated
Thank you
Based on the EXPLAIN output in your question, you already have all the indexes the query should be using, namely:
CREATE INDEX idx_zip_from_distance
ON zipcode_distances (zipcode_from, distance, zipcode_to);
CREATE INDEX idx_zipcode ON venues (zipcode, id);
CREATE INDEX idx_venue_id ON events (venue_id);
(I'm not sure from your index names whether idx_zip_from_distance really includes the zipcode_to column. If not, you should add it to make it a covering index. Also, I've included the venues.id column in idx_zipcode for completeness, but, assuming it's the primary key for the table and that you're using InnoDB, it will be included automatically anyway.)
However, it looks like MySQL is choosing a different, and possibly suboptimal, query plan, where it scans through all events, finds their venues and zip codes, and only then filters the results on distance. This could be the optimal query plan, if the cardinality of the events table was low enough, but from the fact that you're asking this question I assume it's not.
One reason for the suboptimal query plan could be the fact that you have too many indexes which are confusing the planner. For instance, do you really need all three of those indexes on the zipcode table, given that the data it stores is presumably symmetric? Personally, I'd suggest only the index I described above, plus a unique index (which can also be the primary key, if you don't have an artificial one) on (zipcode_to, zipcode_from) (preferably in that order, so that any occasional queries on zipcode_to=? can make use of it).
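A sketch of that slimmed-down index set, reusing the index names from your EXPLAIN output (verify the actual definitions before dropping anything):
ALTER TABLE zipcode_distances
  DROP INDEX idx_zip_to_distance,
  DROP INDEX idx_zip_from_to,
  ADD UNIQUE INDEX uq_zip_to_from (zipcode_to, zipcode_from);
-- keep idx_zip_from_distance (zipcode_from, distance, zipcode_to) as the covering index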
However, based on some testing I did, I suspect the main issue why MySQL is choosing the wrong query plan comes simply down to the relative cardinalities of your tables. Presumably, your actual zipcode_distances table is huge, and MySQL isn't smart enough to realize quite how much the conditions in the WHERE clause really narrow it down.
If so, the best and simplest fix may be to simply force MySQL to use the indexes you want:
select
*
from
zipcode_distances z
FORCE INDEX (idx_zip_from_distance)
inner join
venues v
FORCE INDEX (idx_zipcode)
on z.zipcode_to=v.zipcode
inner join
events e
FORCE INDEX (idx_venue_id)
on v.id=e.venue_id
where
z.zipcode_from='92108' and
z.distance <= 5
With that query, you should indeed get the desired query plan. (You do need FORCE INDEX here, since with just USE INDEX the query planner could still decide to use a table scan instead of the suggested index, defeating the purpose. I had this happen when I first tested this.)
Ps. Here's a demo on SQLize, both with and without FORCE INDEX, demonstrating the issue.
Have you indexed the join columns in both tables?
e.venue_id and v.id
If not, create indexes on both tables. If you already have, it could be that you have few records in one or more tables and the analyzer detects that it is more efficient to perform a full scan rather than an indexed read.
You could use a subquery:
select * from zipcode_distances z, venues v, events e
where
z.id in (select zd.id from zipcode_distances zd where zd.zipcode_from='92108' and zd.distance <= 5)
and z.zipcode_to=v.zipcode
and v.id=e.venue_id
You are selecting all columns from all tables (select *) so there is little point in the optimizer using an index when the query engine will then have to do a lookup from the index to the table on every single row.

(Why) Can't MySQL use index in such cases?

1 - PRIMARY used in a secondary index, e.g. secondary index on (PRIMARY,column1)
2 - I'm aware MySQL cannot continue using the rest of an index as soon as one part was used for a range scan. However, IN (..., ..., ...) is not considered a range, is it? (Yes, it is a range, but I've read on mysqlperformanceblog.com that IN behaves differently from BETWEEN with respect to index use.)
Could anyone confirm those two points? Or tell me why this is not possible? Or how it could be possible?
UPDATE:
Links:
http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/
http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/comment-page-1/#comment-952521
UPDATE 2: example of nested SELECT:
SELECT * FROM user_d1 uo
WHERE EXISTS (
SELECT 1 FROM `user_d1` ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
AND ui.id=uo.id
)
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
So the outer SELECT uses timestamp_lastonline for sorting, while the inner one uses either the PK to connect with the outer query or birthdate for filtering.
What other options are there besides this query if MySQL cannot use an index both for the range scan and for sorting?
The column(s) of the primary key can certainly be used in a secondary index, but it's not often worthwhile. The primary key guarantees uniqueness, so any columns listed after it cannot be used for range lookups. The only time it will help is when a query can use the index alone (i.e. as a covering index).
As for your nested select, the extra complication should not beat the simplest query:
SELECT * FROM user_d1 uo
WHERE uo.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
MySQL will choose between a birthdate index or a timestamp_lastonline index based on which it feels will have the best chance of scanning fewer rows. In either case, the column should be the first one in the index. The birthdate index will also carry a sorting penalty, but might be worthwhile if a large number of recent users will have birth dates outside of that range.
If you wish to control the order, or potentially improve performance, a (timestamp_lastonline, birthdate) or (birthdate, timestamp_lastonline) index might help. If it doesn't, and you really need to select based on the birthdate first, then you should select from the inner query instead of filtering on it:
SELECT * FROM (
SELECT * FROM user_d1 ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
) as uo
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
Even then, MySQL's optimizer might choose to rewrite your query if it finds a timestamp_lastonline index but no birthdate index.
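A sketch of the two candidate composite indexes mentioned above (the index names are mine); in practice you would keep whichever one EXPLAIN shows being used for your data:
ALTER TABLE user_d1
  ADD INDEX idx_lastonline_birthdate (timestamp_lastonline, birthdate),
  ADD INDEX idx_birthdate_lastonline (birthdate, timestamp_lastonline);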
And yes, IN (..., ..., ...) behaves differently than BETWEEN. Only the latter can effectively use a range scan over an index; the former would look up each item individually.
2. IN will obviously differ from BETWEEN. If you have an index on that column, BETWEEN only needs to find the starting point and it's all done. With IN, it looks for a matching value in the index value by value, so it performs as many lookups as there are values, compared to BETWEEN's single lookup.
Yes, @Andrius_Naruševičius is right: the IN statement is merely shorthand for EQUALS OR EQUALS OR EQUALS and has no inherent order whatsoever, whereas BETWEEN is a comparison operator with an implicit greater-than and less-than, and therefore absolutely loves indexes.
I honestly have no idea what you are talking about, but it does seem you are asking a good question; I just have no notion what it is :-). Are you saying that a primary key cannot appear in a second index? Because it absolutely can. The primary key never needs a separate index because it is ALWAYS indexed automatically, so if you are getting an error/warning (I assume you are?) about supplementary indices, then it's not the second or third index causing it, it's the PRIMARY KEY not needing one, and your mentioning that is probably the error. Having said that, I have no idea what question you asked - this is my answer to my best guess at your actual question.

Is there a better way to index multiple columns than creating an index for each permutation?

Suppose I have a database table with columns a, b, and c. I plan on doing queries on all three columns, but I'm not sure which columns in particular I'll be querying. There are enough rows in the table that an index immensely speeds up the search, but it feels wrong to make all the permutations of possible indexes (like this):
a
b
c
a, b
a, c
b, c
a, b, c
Is there a better way to handle this problem? (It's very possible that I'll be just fine indexing a, b, c alone, since this will cut down on the number of rows quickly, but I'm wondering if there's a better way.)
If you need more concrete examples, in the real-life data, the columns are city, state, and zip code. Also, I'm using a MySQL database.
In MS SQL the index "a, b, c" will cover you for scenarios "a"; "a, b"; and "a, b, c". So you would only need the following indexes:
a, b, c
b, c
c
Not sure if MySQL works the same way, but I would assume so.
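MySQL does apply the same leftmost-prefix rule, so in MySQL syntax that set would look something like this (table and index names are placeholders):
ALTER TABLE mytable
  ADD INDEX idx_abc (a, b, c),
  ADD INDEX idx_bc (b, c),
  ADD INDEX idx_c (c);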
To use indexes for all possible equality conditions on N columns, you will need C(N, floor(N/2)) indexes, that is N! / (floor(N/2)! * (N - floor(N/2))!).
See this article in my blog for detailed explanations:
Creating indexes
You can also read the strict mathematical proof by Russian mathematician Egor Timoshenko (update: now in English).
One can, however, get decent performance with fewer indexes using the following techniques:
Index merging
If the columns col1, col2 and col3 are selective, then this query
SELECT *
FROM mytable
WHERE col1 = :value1
AND col2 = :value2
AND col3 = :value3
can use three separate indexes on col1, col2 and col3, select the ROWID's that match each condition separately and then find their intersection, like in:
SELECT *
FROM (
SELECT rowid
FROM mytable
WHERE col1 = :value1
INTERSECT
SELECT rowid
FROM mytable
WHERE col2 = :value2
INTERSECT
SELECT rowid
FROM mytable
WHERE col3 = :value3
) mo
JOIN mytable mi
ON mi.rowid = mo.rowid
Bitmap indexing
PostgreSQL can build temporary bitmap indexes in memory right during the query.
A bitmap index is quite a compact contiguous bit array.
Each bit set in the array indicates that the corresponding tid should be selected from the table.
Such an index takes only 128M of temporary storage for a table with 1G rows.
The following query:
SELECT *
FROM mytable
WHERE col1 = :value1
AND col2 = :value2
AND col3 = :value3
will first allocate a zero-filled bitmap large enough to cover all possible tid's in the table (that is large enough to take all tid's from (0, 0) to the last tid, not taking missing tid's into account).
Then it will seek the first index, setting the bits to 1 if they satisfy the first condition.
Then it will scan the second index, AND'ing the bits that satisfy the second condition with a 1. This will leave 1 only for those bits that satisfy both conditions.
Same for the third index.
Finally, it will just select rows with the tid's corresponding to the bits set.
The tid's will be fetched sequentially, so it's very efficient.
The more indexes you create, the more your performance will be hit during update and delete operations, because the indexes themselves may need to be updated.
Yes, you can use multiple-column indexes. Something like
CREATE TABLE temp (
id INT NOT NULL,
a INT NULL,
b INT NULL,
c INT NULL,
PRIMARY KEY (id),
INDEX ind1 (a,b,c),
INDEX ind2 (a,b)
);
This type of index i.e. ind1 will surely help you in queries like
SELECT * FROM temp WHERE a=2 AND b=3 AND c=4;
Similarly, ind2 will help you in queries like
SELECT * FROM temp WHERE a=2 AND b=3;
But these indexes won't be used if the query is something like
SELECT * FROM temp WHERE a=2 OR b=3 OR c=4;
Here you will need separate indexes on a, b, and c.
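A sketch of those single-column indexes; with them in place, MySQL can often answer the OR query with its index_merge (union) strategy instead of a full scan:
ALTER TABLE temp
  ADD INDEX idx_a (a),
  ADD INDEX idx_b (b),
  ADD INDEX idx_c (c);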
So instead of having so many indexes, I would agree with what John said, i.e. have indexes on a, b, c, and if you feel that your workload includes more multi-column queries, then you can switch to multi-column indexes.
cheers
Given that your columns are actually City, State and Zip Code, I would suggest just the following indexes:
INDEX(ZipCode)
If I am correct, Zip Codes are not duplicated across the USA, so it's pointless adding City or State information to the index as well because they will be the same value for all Zip Codes. E.g., 90210 is always Los Angeles, CA.
INDEX(City(5)) or INDEX(City(5), State)
This is just an index on the first five letters of the city name. In many cases, this will be specific enough that having the State indexed wouldn't provide any useful filtering. E.g., 'Los A' will almost certainly be records from Los Angeles, CA. Maybe there is another small town in the USA starting with 'Los A', but there will be so few records it's not worth cluttering the index with State data as well. On the other hand, some city names appear in many states (Springfield comes to mind), so in those cases it is better to have the State indexed as well. You will need to figure out for yourself which index is most suited to your set of data. If in doubt, I would go with the second index (City and State).
INDEX(State, sort_field)
State is a pretty broad index (quite possibly NY and CA alone will have 30% of the records). If you plan on displaying this information to the user, say, 30 records at a time, then you would have a query ending in
... WHERE STATE = "NY"
ORDER BY <sort_field>
LIMIT <number>, 30
To make that query efficient, you need to include the sorting column in the State index. So if you're showing pages ordered by Last Name (presuming you have that column), then you would use INDEX(State, LastName(3)), otherwise MySQL has to sort all of the 'NY' records before it can give you the 30 you want.
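Putting the three suggestions together as a sketch (the table name and the sort column are placeholders):
ALTER TABLE addresses
  ADD INDEX idx_zip (ZipCode),
  ADD INDEX idx_city_state (City(5), State),
  ADD INDEX idx_state_sort (State, LastName(3));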
It depends on your SQL query.
An index on (a, b, c) is different from an index on (b, c, a) or (a, c, b).