MySQL - Poor performance in a select from a simple table

I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are indexed and contain no repeated values. Moreover, both columns hold values in increasing order.
The table has nearly 400K records.
I need to select the string when a test value lies between the values of columns 1 and 2; in other words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The EXPLAIN reads:
- Select type: SIMPLE
- Type: range
- Possible keys: PRIMARY
- Key: PRIMARY
- Key length: 8
- Rows: 441
- Filtered: 33.33
- Extra: Using where

If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? (This is the same situation as looking up IP address ranges.)
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
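Spelled out, that kludge might look like the following (a sketch only, reusing the question's names and assuming the ranges never overlap; @Test_Value is a placeholder for the search value):
SELECT MyString
FROM (
    SELECT MyString, Col_2
    FROM MyTable
    WHERE Col_1 <= @Test_Value      -- find the one candidate range starting at or before the value
    ORDER BY Col_1 DESC
    LIMIT 1
) AS candidate
WHERE Col_2 >= @Test_Value;         -- then confirm the value is inside that range
With an index whose leading column is Col_1, the inner query is a single probe down the B-tree rather than a half-scan.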

Your query, rewritten with shorter identifiers, is this:
SELECT s FROM t WHERE t.low <= v AND v <= t.high
Satisfying this query using indexes would go like this: first, we must search a table or index for all rows matching the first of these criteria:
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
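If you want to try that, here is a minimal sketch, assuming MyString is short enough to index in full (a long string would need a prefix length, which would defeat the covering effect):
ALTER TABLE MyTable ADD INDEX idx_low_high_str (Col_1, Col_2, MyString);
The index name idx_low_high_str is made up; only the column order matters.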
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
Read this. http://use-the-index-luke.com/

Well... I found a suitable solution for me (not sure you guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ...).
I then use the following to get the index of the table that is supposed to contain the searched-for range:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved Index to select the correct partition table from which to perform the actual search.
(I guess it would be more elegant to use Dynamic SQL, but I will take care of that later; for now I just wanted to test the approach.)
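For reference, a hedged sketch of that dynamic-SQL dispatch, assuming the partition tables are named MyTable_1, MyTable_2, ... and @Test_Val holds the search value:
SET @tbl := (SELECT Table_Index FROM tbl_selector
             WHERE start_range <= @Test_Val
             ORDER BY start_range DESC LIMIT 1);   -- pick the partition, as in the query above
SET @sql := CONCAT('SELECT MyString FROM MyTable_', @tbl,
                   ' WHERE Col_1 <= ? AND ? <= Col_2');
PREPARE stmt FROM @sql;
EXECUTE stmt USING @Test_Val, @Test_Val;           -- run the range check inside that partition
DEALLOCATE PREPARE stmt;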
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for the test. This, by the way, was not the case with the previous approach: there, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete.
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!

Related

Index not used against MySQL SET column

I have a large data table containing details by date and across 3 independent criteria with around 12 discrete values for each criterion. That is, each criterion field in the table is defined as a 12-value ENUM. Users pull summary data by date with any filtering across the three criteria, including none at all. To make a single-criterion lookup efficient, 3 separate indexes are required: (date,CriteriaA), (date,CriteriaB), (date,CriteriaC). 4 indexes if you want to look up against any of the 3: (date,A,B,C), (date,A,C), (date,B,C), (date,C).
In an attempt to make the lookup more efficient, I built a SET column containing all 36 values from the 3 criteria. All values across the criteria are unique and none is a subset of any other. I added an index to this set: (date, set_col). Queries against this table using a set lookup fail to take advantage of the index, however. Neither FIND_IN_SET('Value',set_col), set_col LIKE '%Value%', nor set_col & [pos. in set] triggers the index (according to EXPLAIN and overall result-set return speed).
Is there a trick to indexing SET columns?
I tried queries like
Select Date, count(*)
FROM tbl
where DATE between [Start] and [End]
and FIND_IN_SET('Value',set_col)
group by Date
I would expect it to run nearly as fast as a lookup against an individual criterion column that has an index on it. But instead it runs only as fast as when just an index against DATE exists. Same number of rows processed according to EXPLAIN.
It's not possible to index SET columns for arbitrary queries.
A SET type is basically a bitfield, with one bit set for each of the values defined for your set. You could search for a specific bit pattern in such a bitfield, or you could search for a range of specific bit patterns, or an inequality, etc. But searching for rows where one specific bit is set in the bitfield is not going to be indexable.
FIND_IN_SET() is really searching for a specific bit set in the bitfield. It will not use an index for this predicate. The best you can hope to do for optimization is to have an index that narrows down the examined rows based on the other search term on date. Then among the rows matching the date range, the FIND_IN_SET() will be applied row-by-row.
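A small illustration of why (table and column values here are hypothetical): a SET is stored as an integer bitmap, so testing for one member is a bit test, and a B-tree index on (date, set_col) can only help with the leading date range:
CREATE TABLE demo (d DATE, set_col SET('A1','B1','C1'), KEY (d, set_col));
-- Equivalent predicates; neither lets the index narrow beyond the date range:
SELECT COUNT(*) FROM demo
WHERE d BETWEEN '2024-01-01' AND '2024-01-31'
  AND FIND_IN_SET('B1', set_col);
SELECT COUNT(*) FROM demo
WHERE d BETWEEN '2024-01-01' AND '2024-01-31'
  AND set_col & 2;   -- bit value 2 corresponds to 'B1', the second member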
It's the same problem as searching for substrings. The following predicates will not use an index on the column:
SELECT ... WHERE SUBSTRING(mytext, 5, 8) = 'word'
SELECT ... WHERE LOCATE(mytext, 'word') > 0
SELECT ... WHERE mytext LIKE '%word%'
A conventional index on the data would be alphabetized from the start of the string, not from some arbitrary point in the middle of the string. This is why fulltext indexing was created as an alternative to a simple B-tree index on the whole string value. But there's no special index type for bitfields.
I don't think the SET data type is helping in your case.
You should use your multi-column indexes with permutations of the columns.
Go back to 3 ENUMs. Then have
INDEX(A, date),
INDEX(B, date),
INDEX(C, date)
Those should significantly help with queries like
WHERE A = 'foo' AND date BETWEEN...
and somewhat help for
WHERE A = 'foo' AND date BETWEEN...
AND B = 'bar'
If you will also have queries without A/B/C, then add
INDEX(date)
Note: INDEX(date, A) is no better than INDEX(date) when using a "range". That is, I recommend against the indexes you mentioned.
FIND_IN_SET(), like virtually all other function calls, is not sargable. However, enum=const is sargable, since it is implemented as a simple integer.
You did not mention
WHERE A IN ('x', 'y') AND ...
That is virtually un-indexable. However, my suggestions are better than nothing.
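As a concrete sketch of the indexes suggested above (table and column names taken from the question; index names are made up):
ALTER TABLE tbl
  ADD INDEX idx_a_date (CriteriaA, date),
  ADD INDEX idx_b_date (CriteriaB, date),
  ADD INDEX idx_c_date (CriteriaC, date),
  ADD INDEX idx_date (date);   -- only if some queries filter on date alone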

Mysql Index On columns of table

I am using MySQL version 8.
Suppose I have a table, e.g. event, and it has approximately 15 columns. Suppose the column names run from a, b, c ... up to m, each with a varchar datatype. This is a master table, so it holds one-to-one records. I provide a dashboard for this table and the client can select fields to filter records as per their need.
There are 12 fields that the filter can apply to. The query is built from the selected fields.
If field a is selected then the query will be like select * from event where a = <some value>.
If b and c are selected then the query will be like select * from event where b = <some value> and c = <some value>
So can you suggest how I can create indexes for better optimization?
There is a limit to the number of indexes you can have on a table -- both an absolute limit (64) and a practical limit (much less).
I suggest you start with 12 2-column 'composite' indexes. Have a different starting column for each of the 12. Have a "likely" second column.
Over time, watch what the users typically pick and add/subtract indexes accordingly.
Keep in mind these things:
- The Optimizer does not care what order the WHERE clause is in.
- The Optimizer does care what order the columns of an index are in.
- The best index starts with column(s) that are tested for =.
- Usually, when a column is tested with a range (eg date, or price), further columns in the index are not useful.
More tips (that you seem to have found): http://mysql.rjweb.org/doc.php/index_cookbook_mysql
INDEX(a,b) will do a pretty good job for WHERE a=1 AND b=2 AND c=3. It won't be as good as INDEX(a,b,c). But I am suggesting that you have to make tradeoffs -- be happy with "pretty good"; you can't achieve "perfect" in all cases.
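A minimal sketch of that starting point, using the question's column names (which second column to pair with each leading column is a guess, to be refined by watching real queries):
ALTER TABLE event
  ADD INDEX idx_a (a, b),
  ADD INDEX idx_b (b, c),
  ADD INDEX idx_c (c, d);
-- ...and so on, one two-column index per filterable column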

MySQL: required indexes for multi-column ordering

Having a table with 10+ million rows and three columns: one, two, three, and an SQL query like SELECT * FROM table ORDER BY one, two, three LIMIT 1 - do I really need to create a multi-column index using all three columns?
I know for sure that if one and two matches, there would be max 10 rows with distinct three.
Is this enough for fast SELECTs?
CREATE INDEX MY_INDEX ON table (one, two);
With INDEX(one, two, three), the query will go straight down the BTree to the one (LIMIT 1) desired row.
With INDEX(one, two), the query will go straight down the BTree to the first such row, then scan forward the up-to-10 rows, save them to a tmp table, sort them (ORDER BY includes three) (probably done in memory), and deliver the first one. Although this sounds more complex it will not (in this example) be much slower.
It will not be a "table scan" ("ALL"), but perhaps a "range" scan. Use EXPLAIN SELECT ... to see.
If three is a bulky string, then the 3-col index will be bulkier; this has some impact on disk space and performance.
If you need only (one, two) for some other queries, then either index works reasonably well (barring the "bulky" comment).
If you do SELECT one, two, three FROM ..., the 3-part index will be better because it will be "covering". SELECT * won't have such a bonus.
Bottom line: Either index is "OK", many other factors factor in, making it hard to say for sure what to do.
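For comparison, the two alternatives discussed above as DDL (table name taken from the question, hence the backticks):
CREATE INDEX idx_two   ON `table` (one, two);        -- needs a small filesort of at most 10 rows
CREATE INDEX idx_three ON `table` (one, two, three); -- satisfies ORDER BY ... LIMIT 1 directly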
You might think MySQL is clever enough to only read at most the first 10 rows using the index and then sort these. Unfortunately, it isn't (because the optimizer doesn't regard the limit at this point). You can verify that by using explain select ..., it will show that MySQL will do a full table scan ("ALL").
The documentation describes conditions to be able to use an index to optimize order by:
The index can also be used even if the ORDER BY does not match the index exactly, as long as all unused portions of the index and all extra ORDER BY columns are constants in the WHERE clause.
Your third column does not satisfy this. So this query will not use this index (which does not mean that it might not be useful for something else).
Since MySQL 5.6 there is, however, the so-called filesort priority-queue optimization to accommodate the LIMIT: MySQL will still read the whole table, but it will not sort the whole table (which would be a time-consuming process); it only keeps track of the top row(s) while scanning, which makes your query acceptably fast.
But you can rewrite your query to do exactly what you are thinking of:
SELECT * FROM
(select * from table ORDER BY one, two LIMIT 10) sub
order by one, two, three limit 1;
This will read the first 10 rows using that index, and then just sort these. It will of course only work correctly if you are absolutely sure you will only have at most 10 rows.
A more general way to optimize your query, independently of knowing the maximum number of possible rows, would be e.g.
SELECT * FROM table
where one = (select min(one) from table)
order by one, two, three limit 1;
This will use the index to reduce the number of rows that have to be read and filesorted by looking up the lowest value for one first (using the index) and only considering these rows. You can similarly include a condition for two.
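Extending that idea to the second column would look something like this (a sketch; it simply pins one and two to their smallest values before sorting):
SELECT * FROM `table`
WHERE one = (SELECT MIN(one) FROM `table`)
  AND two = (SELECT MIN(two) FROM `table`
             WHERE one = (SELECT MIN(one) FROM `table`))
ORDER BY one, two, three
LIMIT 1;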
Or you can simply use all three columns in your index (although depending on the size of your third column, it can make sense not to). These kinds of optimizations tend to catch up with you at some point. If you use the first method, for example, and in 2 years 11 rows become possible, you (or your successor) will have to remember that you have this implied condition in your code.

(Why) Can't MySQL use index in such cases?

1 - Can the PRIMARY key be used in a secondary index, e.g. a secondary index on (PRIMARY, column1)?
2 - I'm aware MySQL cannot continue using the rest of an index once one part has been used for a range scan; however: IN (...,...,...) is not considered a range, is it? Yes, it is a range, but I've read on mysqlperformanceblog.com that IN behaves differently from BETWEEN with respect to index use.
Could anyone confirm those two points? Or tell me why this is not possible? Or how it could be possible?
UPDATE:
Links:
http://www.mysqlperformanceblog.com/2006/08/10/using-union-to-implement-loose-index-scan-to-mysql/
http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/comment-page-1/#comment-952521
UPDATE 2: example of nested SELECT:
SELECT * FROM user_d1 uo
WHERE EXISTS (
SELECT 1 FROM `user_d1` ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
AND ui.id=uo.id
)
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
So, the outer SELECT uses timestamp_lastonline for sorting, while the inner one uses either the PK to connect with the outer query or birthdate for filtering.
What other options rather than this query are there if MySQL cannot use index on a range scan and for sorting?
The column(s) of the primary key can certainly be used in a secondary index, but it's not often worthwhile. The primary key guarantees uniqueness, so any columns listed after it cannot be used for range lookups. The only time it will help is when a query can use the index alone (as a covering index).
As for your nested select, the extra complication should not beat the simplest query:
SELECT * FROM user_d1 uo
WHERE uo.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
MySQL will choose between a birthdate index or a timestamp_lastonline index based on which it feels will have the best chance of scanning fewer rows. In either case, the column should be the first one in the index. The birthdate index will also carry a sorting penalty, but might be worthwhile if a large number of recent users will have birth dates outside of that range.
If you wish to control the order, or potentially improve performance, a (timestamp_lastonline, birthdate) or (birthdate, timestamp_lastonline) index might help. If it doesn't, and you really need to select based on the birthdate first, then you should select from the inner query instead of filtering on it:
SELECT * FROM (
SELECT * FROM user_d1 ui
WHERE ui.birthdate BETWEEN '1990-05-04' AND '1991-05-04'
) as uo
ORDER BY uo.timestamp_lastonline DESC
LIMIT 20
Even then, MySQL's optimizer might choose to rewrite your query if it finds a timestamp_lastonline index but no birthdate index.
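If you want to experiment with the composite indexes mentioned above, a minimal sketch (index names are made up; in practice you would keep only the one EXPLAIN actually prefers):
ALTER TABLE user_d1 ADD INDEX idx_ts_bd (timestamp_lastonline, birthdate);
ALTER TABLE user_d1 ADD INDEX idx_bd_ts (birthdate, timestamp_lastonline);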
And yes, IN (..., ..., ...) behaves differently than BETWEEN. Only the latter can effectively use a range scan over an index; the former would look up each item individually.
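One way to see the difference for yourself (illustrative only; assumes id is the primary key, as in the query above):
EXPLAIN SELECT * FROM user_d1 WHERE id BETWEEN 10 AND 20;   -- a single range scan
EXPLAIN SELECT * FROM user_d1 WHERE id IN (10, 15, 20);     -- one lookup per listed value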
2. IN will obviously differ from BETWEEN. If you have an index on that column, BETWEEN only needs to find the starting point and it's all done. With IN, it looks for a matching value in the index value by value, so it performs as many lookups as there are values, compared with BETWEEN's single lookup.
Yes, #Andrius_Naruševičius is right: the IN statement is merely shorthand for EQUALS OR EQUALS OR EQUALS and has no inherent order whatsoever, whereas BETWEEN is a comparison operator with an implicit greater-than and less-than, and therefore absolutely loves indexes.
I honestly have no idea what you are talking about, but it does seem you are asking a good question; I just have no notion what it is :-). Are you saying that a primary key cannot be part of a second index? Because it absolutely can. The primary key never needs to be indexed explicitly because it is ALWAYS indexed automatically, so if you are getting an error/warning (I assume you are?) about supplementary indices, it's not the second or third index causing it, it's the PRIMARY KEY not needing it, and your mentioning that is probably the reason for the error. Having said that, I have no idea what question you asked - this is my answer to my best guess at your actual question.

MySQL: SELECT(x) WHERE vs COUNT WHERE?

This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that is not even optional; you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column, both queries will have more or less equal performance. The second query will need 1 less line of code in your program, but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although... on second thought, it would have to do that in the first option as well, so the code will know how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
Typically, you use a GROUP BY ... HAVING clause to determine if there are duplicate rows in a table. Say you have a table with id and name (assuming id is the primary key, and you want to know if name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the number of names which are repeated and the number of times.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.