The query is
SELECT row
FROM `table`
USE INDEX(`indexName`)
WHERE row1 = '0'
AND row2 = '0'
AND row3 >= row4
AND (row5 = '0' OR row5 LIKE 'value')
I created an index for this MySQL query using:
CREATE INDEX indexName ON `table` (row1, row2, row3, row5);
However, the performance is not really good. It's extracting about 17,000+ rows out of a 5.9+ million row table in anywhere from 6-12 seconds.
It seems like the bottleneck is the row3 >= row4 - because without that part in the code it runs in 0.6-0.7 seconds.
(from Comment)
The row (placeholder column name) is actually the id (primary key, index) column in the table, which is the result set I'm outputting later on. I'm outputting an array of IDs that are matching the parameters in my query, and then selecting a random ID from that array to gather data through the final query on a specific row. This was done as a workaround for rand(). Any adjustments needed based on that knowledge?
17K rows is not a tiny result set. Large result sets often take time just because of the overhead of delivering the data from the MySQL server to the program requesting them.
The contents of the 'value' you use in row5 LIKE 'value' matter a great deal to query performance. If 'value' starts with a wildcard character like % your query will be slow.
That being said, you need a so-called covering index. You've tried to create one with the index you created. It's close but not perfect.
Your query filters on equality to constant values on row1, row2, and row5, so those columns should come first in your index. The query planner can random-access your index to the first matching entry, and then sequentially scan the index until it gets to the last matching entry. That's as fast as it gets.
Then you want to examine row3 and row4 (to compare them). Those columns should come next in the index. Finally, if your query's SELECT clause mentions a subset of the columns in your table you should put the rest of those columns in the index. So, based on the query in your question, your index should be
CREATE INDEX indexName ON `table` (row1, row2, row5, row3, row4, row);
The query planner will be able to satisfy the entire query by scanning through a subset of the index, using a so-called index range scan. That should be decently fast.
Pro tip: don't force the query planner's hand with USE INDEX(). Instead, structure your indexes to handle your queries efficiently.
An index can't be used to compare two columns in the same table (at best, it could be used for an index scan rather than a table scan if all output fields are contained in the index), so there basically is no "correct" way to do this.
If you have control over the structure AND the processes that fill the table, you could add a calculated field that holds the difference between the two fields. Then add that field to the index and adjust your query to use that field instead of the other 2.
It ain't pretty and doesn't offer a lot of flexibility (eg. if you want to compare another field, you need to add it as well etc), but it does get the job done.
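A minimal sketch of the calculated-field idea, using Python's built-in sqlite3 as a stand-in for MySQL so it can be run anywhere. The table t, the column diff34, and the index name are hypothetical, and the row5 condition is left out for brevity. The only point is that filtering on the stored difference returns the same ids as comparing the two columns directly.

```python
import sqlite3

def build_demo():
    """Create an in-memory table with a stored difference column (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (id INTEGER PRIMARY KEY, row1 TEXT, row2 TEXT, "
        "row3 INTEGER, row4 INTEGER, diff34 INTEGER)"
    )
    rows = [("0", "0", 5, 3), ("0", "0", 2, 7), ("0", "1", 9, 1), ("0", "0", 4, 4)]
    # The process that fills the table computes diff34 = row3 - row4 on insert.
    conn.executemany(
        "INSERT INTO t (row1, row2, row3, row4, diff34) VALUES (?, ?, ?, ?, ?)",
        [(a, b, c, d, c - d) for a, b, c, d in rows],
    )
    # The difference is now an ordinary column, so it can sit in the index.
    conn.execute("CREATE INDEX idx_t_diff ON t (row1, row2, diff34)")
    return conn

def ids_two_columns(conn):
    """Original form: compares two columns of the same row."""
    sql = "SELECT id FROM t WHERE row1 = '0' AND row2 = '0' AND row3 >= row4 ORDER BY id"
    return [r[0] for r in conn.execute(sql)]

def ids_diff_column(conn):
    """Rewritten form: compares the stored difference against a constant."""
    sql = "SELECT id FROM t WHERE row1 = '0' AND row2 = '0' AND diff34 >= 0 ORDER BY id"
    return [r[0] for r in conn.execute(sql)]
```

Whether the rewritten form is actually faster on MySQL depends on your data, so check it with EXPLAIN on the real table.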
(This is an adaptation of http://mysql.rjweb.org/doc.php/random )
Let's actually fold the randomization into the query. This will eliminate gathering a bunch of ids, processing them, and then reaching back into the table. It will also avoid the need for an extra index.
Find min and max id values.
Pick a random id between min and max.
Scan forward, looking for the first row with col1...col5 matching the criteria.
Something like...
SELECT b.*   -- should replace with actual list of columns
    FROM
        ( SELECT id
            FROM tbl
            WHERE id >= ( SELECT MIN(id) +
                                 ( MAX(id) - MIN(id)
                                   - 22   -- somewhat avoids running off end
                                 ) * RAND()
                            FROM tbl )
              AND col1 = 0 ...   -- your various criteria
            ORDER BY id
            LIMIT 1
        ) AS a
    JOIN tbl AS b USING(id);
Pros/cons:
Probably faster than anything else you can devise.
If the RAND() hits too late in the table, it will return nothing. In this (rare) case, run the query again, but starting at 0.
Big gaps in id will lead to a bias in which id is returned. (The link above discusses some kludges to handle such.)
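The same steps, including the "run it again from the start" fallback, can be sketched from the application side. This is only an illustration under assumptions: it uses Python's sqlite3 instead of MySQL, picks the random id in Python rather than with RAND(), and uses a single made-up criterion (col1 = 0) in place of your real ones. The helper names are hypothetical.

```python
import random
import sqlite3

def make_demo_table():
    """Hypothetical table: ids 1..100, col1 = 0 on every third row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, col1 INTEGER)")
    conn.executemany(
        "INSERT INTO tbl VALUES (?, ?)",
        [(i, 0 if i % 3 == 0 else 1) for i in range(1, 101)],
    )
    return conn

def pick_random_match(conn, rng=random):
    """Return one id with col1 = 0, found by scanning forward from a random id."""
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM tbl").fetchone()
    if lo is None:
        return None  # empty table
    start = rng.randint(lo, hi)
    query = "SELECT id FROM tbl WHERE id >= ? AND col1 = 0 ORDER BY id LIMIT 1"
    row = conn.execute(query, (start,)).fetchone()
    if row is None:
        # The random start landed past the last matching row: retry from the beginning.
        row = conn.execute(query, (lo,)).fetchone()
    return row[0] if row else None
```

The bias caused by gaps in id applies to this sketch as well: a matching row that follows a long run of non-matching ids is picked more often.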
I am using MySQL 8 version.
Suppose I have a table, event, with approximately 15 columns. Suppose the column names run from a, b, c ... up to m, all with the varchar datatype. This is a master table, so it has one-to-one records. I provide a dashboard for this table, and the client can select fields to filter records as needed.
Filters can be applied on 12 fields. The query is built from the selected fields.
If field a is selected, the query will be like select * from event where a = <some value>.
If b and c are selected, the query will be like select * from event where b = <some value> and c = <some value>
So can you suggest how I should create indexes for better optimization?
There is a limit to the number of indexes you can have on a table -- both an absolute limit (64) and a practical limit (much less).
I suggest you start with 12 2-column 'composite' indexes. Have a different starting column for each of the 12. Have a "likely" second column.
Over time, watch what the users typically pick and add/subtract indexes accordingly.
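To make the starting set concrete, here is a small sketch that writes out one two-column CREATE INDEX statement per leading column. The function name, the idx_ naming scheme, and the "likely second column" mapping are all made up for illustration; you would fill the mapping in from what your users actually filter on.

```python
def starter_indexes(table, likely_second):
    """Build one two-column CREATE INDEX statement per filterable column.

    likely_second maps each leading column to the column most often
    filtered together with it (a guess that you refine over time).
    """
    statements = []
    for first, second in likely_second.items():
        if first == second:
            raise ValueError(f"index on {first} needs a different second column")
        statements.append(
            f"CREATE INDEX idx_{first}_{second} ON {table} ({first}, {second});"
        )
    return statements
```

For example, starter_indexes("event", {"a": "b", "b": "c"}) yields indexes on (a, b) and (b, c); extend the mapping to all 12 filterable columns.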
Keep in mind these things:
The Optimizer does not care what order the WHERE clause is.
The Optimizer does care what order the columns of an index are in.
The best index starts with column(s) that are tested for =.
Usually, when a column is tested with a range (eg date, or price), further columns in the index are not useful.
More tips (that you seem to have found): http://mysql.rjweb.org/doc.php/index_cookbook_mysql
INDEX(a,b) will do a pretty good job for WHERE a=1 AND b=2 AND c=3. It won't be as good as INDEX(a,b,c). But I am suggesting that you have to make tradeoffs -- Be happy with "pretty good"; you can't achieve "perfect" in all cases.
I have these queries:
1st query:
SELECT (..) FROM db WHERE A = const AND B > const AND C >= const ORDER BY B DESC LIMIT const
2nd query (different db):
SELECT (...) FROM db' WHERE A' = const ORDER BY X' DESC LIMIT const
Question about 1st query:
Is it sufficient to have a multiple row index (A, B, C) or do I need an additional single row index (B) (or a different one) because of the ORDER BY statement?
Question about 2nd query: Do I need a multiple row index (A', X') or two single row indices (A'), (X') to make use of them in this query?
It is an important thing to know that MySQL will use at most one index (for searching, filtering and ordering) per table and subquery (so basically per row in explain), so you can use only one index here.
For your first query, an index (A,B) will allow MySQL to do a range scan and use the order. If you use (A,B,C), the column C cannot be used in the range condition (because B is already a range), but MySQL will save the time to read the actual table data to get the value for C to check the last condition. So (A,B,C) is in general the fastest choice here.
"In general", because you can of course have a data distribution where another index would be best: If you e.g. have only one or two rows that match C >= const and 10M+ rows with A = const, using an index on just C would be fastest. And if C is a very big column (e.g. varchar(700)), it could blow up the index and slow it down. But to estimate such exceptions would require deeper knowledge of your data.
For your second query, (A', X') will be the best choice. If you have the two indexes (A'), (X'), MySQL will in most cases (unless A' is unique, but then you wouldn't need an order by anyway) use the index on X' and hope it will find matching rows for A' soon. This will sometimes be unexpectedly and painfully slow if you only have some rows that match A' = const (because it has to jump back and forth in the table (that is ordered by the primary key) in the order of X' to find rows that match the condition for A').
You might get the same problem for your first query if you have the indexes (A) and (B) (but not (A,B) or (A,B,C)) there: MySQL will probably use (B) instead of (A) (but check the explain to make sure). Even if you just add one index now, this can e.g. happen when you add the index (B) to optimize a different query next week and forget about this query, so I'd suggest sticking with (at least) (A,B).
I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value is within those of columns 1 and 2, in other words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.
If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
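A rough sketch of that kludge, assuming the ranges do not overlap and that Col_1 is the primary key (which the EXPLAIN output suggests). It uses Python's sqlite3 in place of MySQL, and the sample rows are made up. The one row with the largest Col_1 at or below the test value is the only candidate, so the upper bound is checked afterwards in application code.

```python
import sqlite3

def make_ranges():
    """Hypothetical non-overlapping ranges; note the gap between 29 and 40."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE MyTable (Col_1 INTEGER PRIMARY KEY, Col_2 INTEGER, MyString TEXT)"
    )
    conn.executemany(
        "INSERT INTO MyTable VALUES (?, ?, ?)",
        [(10, 19, "a"), (20, 29, "b"), (40, 49, "c")],
    )
    return conn

def lookup(conn, test_value):
    """Return MyString for the range containing test_value, or None."""
    row = conn.execute(
        "SELECT Col_2, MyString FROM MyTable "
        "WHERE Col_1 <= ? ORDER BY Col_1 DESC LIMIT 1",
        (test_value,),
    ).fetchone()
    # No range starts at or below the value, or the value falls in a gap.
    if row is None or test_value > row[0]:
        return None
    return row[1]
```

If the ranges can overlap, this returns at most one of the matching rows, so it is not a drop-in replacement in that case.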
Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
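Here is a small check of the maxrange rewrite, again with Python's sqlite3 standing in for MySQL and with made-up rows that overlap on purpose. It derives maxrange from the data with MAX(high - low), which is an assumption of this sketch; in practice you would want a bound your application guarantees. The bounded query should return exactly what the plain query returns.

```python
import sqlite3

def make_table():
    """Hypothetical ranges; the first two overlap."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (low INTEGER, high INTEGER, s TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [(10, 19, "a"), (15, 30, "b"), (40, 49, "c")],
    )
    conn.execute("CREATE INDEX idx_t_cover ON t (low, high, s)")
    return conn

def widest_range(conn):
    """Largest high - low in the table, used here as maxrange."""
    return conn.execute("SELECT MAX(high - low) FROM t").fetchone()[0]

def find_naive(conn, v):
    sql = "SELECT s FROM t WHERE low <= ? AND ? <= high ORDER BY low"
    return [r[0] for r in conn.execute(sql, (v, v))]

def find_bounded(conn, v, maxrange):
    # low BETWEEN v - maxrange AND v narrows the index scan to a small window.
    sql = "SELECT s FROM t WHERE low BETWEEN ? AND ? AND ? <= high ORDER BY low"
    return [r[0] for r in conn.execute(sql, (v - maxrange, v, v))]
```

The rewrite is safe because any matching row has high - low <= maxrange and v <= high, which forces low >= v - maxrange.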
Read this. http://use-the-index-luke.com/
Well... I found a suitable solution for me (not sure you guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ...).
I then use the following to get the index of the table that is supposed to contain the searched-for range:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved Index to select the correct partition table on which to perform the actual search.
(I guess that more elegant would be to use Dynamic SQL, but will take care of that later; for now just wanted to test the approach).
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for the test. This, by the way, was not the case with the previous approach: there, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete.
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!
If I'm trying to increase the performance of a query that uses 4 different columns from a specific table, should I create 4 different indexes (one with each column individually) or should I create 1 index with all columns included?
One index with all 4 columns is, in my experience, the fastest. If you use a WHERE, try to put the columns in an order that makes the index useful for it.
An index with all four columns; the columns used in the WHERE should go first, and those for which you do == compare should go first of all.
Sometimes, giving priority to integer columns gives better results; YMMV.
So for example,
SELECT title, count(*) FROM table WHERE class = 'post' AND topic_id = 17
AND date > ##BeginDate and date < ##EndDate;
would have an index on: topic_id, class, date, and title, in this order.
The "title" in the index is only used so that the DB may find the value of "title" for those records matching the query, without the extra access to the data table.
The more balanced the distribution of the records on the first fields, the better the results (in this example, if 10% of the rows have topic_id = 17, you discard the other 90% without ever having to run a string comparison with 'post' -- not that string comparisons are particularly costly). Depending on the data, you might find it better to index date first and class later, or even use date first as a MySQL PARTITION.
A single index is usually more effective than an index merge, so if you have a condition like f1 = 1 AND f2 = 2 AND f3 = 3 AND f4 = 4, a single index would be the right decision.
To achieve best performance enumerate index fields in descending order of cardinality (count of distinct values), this will help to reduce analyzed rows count.
Index of less than 4 fields can be more effective, as it requires less memory.
http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/
This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that is not even an option, you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column both queries will be more or less equal performance. The second query will need 1 less line of code in your program but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although... on second thought, it would have to do that in the first option as well so the code will know how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
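A quick way to try the EXISTS form without a MySQL server is Python's built-in sqlite3, sketched below. The table contents are made up, and "from dual" is dropped because SQLite has no such table; the function name is hypothetical. The result comes back as 1 or 0, which the helper converts to a Python bool.

```python
import sqlite3

def make_table():
    """Hypothetical table with an indexed column x."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, x INTEGER)")
    conn.execute("CREATE INDEX idx_mytable_x ON mytable (x)")
    conn.executemany("INSERT INTO mytable (x) VALUES (?)", [(1,), (5,), (5,)])
    return conn

def has_match(conn, x):
    """True if at least one row has the given x, False otherwise."""
    (found,) = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM mytable WHERE x = ?)", (x,)
    ).fetchone()
    return bool(found)
```

Note that EXISTS answers "at least one row", so like LIMIT 1 it will not reveal an unexpected second row.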
Typically, you use a GROUP BY ... HAVING clause to determine whether there are duplicate rows in a table. Say you have a table with id and name (assuming id is the primary key), and you want to know whether name is unique or repeated. You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the names that are repeated and the number of times each one occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.
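Both queries can be tried out with Python's built-in sqlite3, as in the sketch below. The sample names are made up and the function names are hypothetical. The HAVING clause is written as COUNT(*) > 1 instead of using the alias, which is the more portable spelling.

```python
import sqlite3

def make_table(names):
    """Hypothetical table holding the given names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO mytable (name) VALUES (?)", [(n,) for n in names])
    return conn

def duplicate_names(conn):
    """List (name, occurrences) for every name that appears more than once."""
    sql = (
        "SELECT name, COUNT(*) AS total FROM mytable "
        "GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
    )
    return conn.execute(sql).fetchall()

def has_duplicates(conn):
    """Single true/false answer, mirroring the nested query."""
    sql = (
        "SELECT COUNT(*) >= 1 FROM ("
        "SELECT name FROM mytable GROUP BY name HAVING COUNT(*) > 1) AS a"
    )
    return bool(conn.execute(sql).fetchone()[0])
```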