This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that is not even an option, you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column both queries will be more or less equal performance. The second query will need 1 less line of code in your program but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Altough...on second thought, it would have to do that in the first option as well so the code will know how many rows there are as well. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table(or index) scan when finds the first value. But you should retrieve x not id, since if the engine it's using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
Typically, you use group by having clause do determine if there are duplicate rows in a table. If you have a table with id and a name. (Assuming id is the primary key, and you want to know if name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the number of names which are repeated and the number of times.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.
Related
What is the best way to create index when I have a query like this?
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
I know that only one index can be used in a query so I can't create two indexes, one for user_1 and one for user_2.
Also could solution for this type of query be used for this query?
WHERE ((user_1 = '$user_id' AND user_2 = '$friend_id') OR (user_1 = '$friend_id' AND user_2 = '$user_id'))
MySQL has a hard time with OR conditions. In theory, there's an index merge optimization that #duskwuff mentions, but in practice, it doesn't kick in when you think it should. Besides, it doesn't give as performance as a single index when it does.
The solution most people use to work around this is to split up the query:
SELECT ... WHERE user_1 = ?
UNION
SELECT ... WHERE user_2 = ?
That way each query will be able to use its own choice for index, without relying on the unreliable index merge feature.
Your second query is optimizable more simply. It's just a tuple comparison. It can be written this way:
WHERE (user_1, user_2) IN (('$user_id', '$friend_id'), ('$friend_id', '$user_id'))
In old versions of MySQL, tuple comparisons would not use an index, but since 5.7.3, it will (see https://dev.mysql.com/doc/refman/5.7/en/row-constructor-optimization.html).
P.S.: Don't interpolate application code variables directly into your SQL expressions. Use query parameters instead.
I know that only one index can be used in a query…
This is incorrect. Under the right circumstances, MySQL will routinely use multiple indexes in a query. (For example, a query JOINing multiple tables will almost always use at least one index on each table involved.)
In the case of your first query, MySQL will use an index merge union optimization. If both columns are indexed, the EXPLAIN output will give an explanation along the lines of:
Using union(index_on_user_1,index_on_user_2); Using where
The query shown in your second example is covered by an index on (user_1, user_2). Create that index if you plan on running those queries routinely.
The two cases are different.
At the first case both columns needs to be searched for the same value. If you have a two column index (u1,u2) then it may be used at the column u1 as it cannot be used at column u2. If you have two indexes separate for u1 and u2 probably both of them will be used. The choice comes from statistics based on how many rows are expected to be returned. If returned rows expected few an index seek will be selected, if the appropriate index is available. If the number is high a scan is preferable, either table or index.
At the second case again both columns need to be checked again, but within each search there are two sub-searches where the second sub-search will be upon the results of the first one, due to the AND condition. Here it matters more and two indexes u1 and u2 will help as any field chosen to be searched first will have an index. The choice to use an index is like i describe above.
In either case however every OR will force 1 more search or set of searches. So the proposed solution of breaking using union does not hinder more as the table will be searched x times no matter 1 select with OR(s) or x selects with union and no matter index selection and type of search (seek or scan). As a result, since each select at the union get its own execution plan part, it is more likely that (single column) indexes will be used and finally get all row result sets from all parts around the OR(s). If you do not want to copy a large select statement to many unions you may get the primary key values and then select those or use a view to be sure the majority of the statement is in one place.
Finally, if you exclude the union option, there is a way to trick the optimizer to use a single index. Create a double index u1,u2 (or u2,u1 - whatever column has higher cardinality goes first) and modify your statement so all OR parts use all columns:
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
will be converted to:
... WHERE ((user_1 = '$user_id' and user_2=user_2) OR (user_1=user_1 and user_2 = '$user_id')) ...
This way a double index (u1,u2) will be used at all times. Please not that this will work if columns are nullable and bypassing this with isnull or coalesce may cause index not to be selected. It will work with ansi nulls off however.
I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value is within those of column 1 and two, in order words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.
If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
Read this. http://use-the-index-luke.com/
Well... I found a suitable solution for me (not sure your guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ,...).
I then user the following to get the index of the table that is supposed to contain the searched for range like:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved Index to select the correct partition table from perform the actual search.
(I guess that more elegant would be to use Dynamic SQL, but will take care of that later; for now just wanted to test the approach).
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for test. This, by the way, was not the case with the previous approach: There, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete).
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!
I've been looking for ages about how to do a query like this but use an index.
SELECT * FROM aliens_tmp
WHERE creator != 'a'
AND COUNTRY = 'UK'
ORDER BY id DESC LIMIT 0, 10
Whatever index I create will not get a hit because of the != clashing with the inclusive nature of indexes.
So I thought about doing a minus where I do something like
SELECT * FROM aliens_tmp
WHERE COUNTRY = 'UK
MINUS
SELECT * FROM aliens_tmp
WHERE CREATOR = 'a'
ORDER BY id DESC LIMIT 0, 10
...but MINUS doesn't exist. So I looked and figured I could use the old trick mentioned on http://www.bitbybit.dk/carsten/blog/?p=71 which briefly is
SELECT DISTINCT a.member_id, a.name
FROM a LEFT JOIN b USING (member_id, name)
WHERE b.member_id IS NULL
but that doesn't help me with the original goal of doing a MINUS whilst maintaining the use of an index as any join to my own table still has to exclude my creator which causes the miss because of the exclude Any ideas?
If most of your table consists of rows with creator != 'a' then using index will not give you any advantage. However if you have mostly rows with creator='a' then you can consider assigning such values to creator (or using enum) that if you order by creator 'a' would go first. Instead of checking for creator != 'a' you could then use creator > 'a' condition, which is perfectly fine for using btree indexes.
UPDATE after more info about creator:
So your creator != 'a' condition has extremely low selectivity and trying to use index on creator is pointless. You are limiting your results and ordering by id desc, so the key on country should be used to return as few rows as possible by simply checking every single row starting from highest id whether it satisfies where conditions until there are enough for your limit. If you see key: country and number of rows close to 10 in your explain select ... it won't get any faster.
The index will not buy you anything in this case, your original query with the != will be your fastest option.
The reason is that to do the MINUS you are talking about, the database would need to retrieve all rows, then check each row against the index to determine if it should be removed. This requires it to examine every row in the table.
Your original query will also need to examine every row, but will only need to check it for the condition you specified.
This means that your original query would be faster even if there were a MINUS operation like you describe.
In fact, your original query may be faster because it only needs to check the condition and would not have to do an index lookup for every row.
An index on creator would probably not be very useful, even it it could be used. !=a would probably return more than half the rows in the table. A table scan is going to be more efficient than using an index in that case. An index on country MIGHT help, assuming there are many countries, but if half your users are in UK, you'd have the same problem there too.
You could consider partitioning your data in to two tables, 'a' creators, and 'not a' creators, if that makes sense for your application.
I'd like to know which of the followings would execute faster in MySQL database. The table would have 200 - 1000 entries.
SELECT id
from TABLE
order by id desc
limit 1
or
SELECT count(id)
from TABLE
The story is the Table is cached. So this query is to be executed every time before cache retrieval to determine whether the cache data is invalid by comparing the previous value.
So if there exists a even less expensive query, please kindly let me know. Thanks.
If you
start from 1
never have any gaps
use the InnoDB engine
id is not nullable
Then the 2nd could run [ever so marginally] faster due to not having to visit table data at all (count is stored in metadata).
Otherwise,
if the table has NO index on ID (causing a SCAN), the 2nd one is faster
Barring both the above
the first one is faster
And if you actually meant to ask SELECT .. LIMIT 1 vs SELECT MAX(id).. then the answer is actually that they are the same for MySQL and most sane DBMS, whether or not there is an index.
I think, the first query will run faster, as the query is limited to be executed for one row only, 200-1000 may not matter that much in this case.
As already pointed out in the comments, your table is so small it really doesn't what your solution will be. For this reason the select count(id) should be used as it expresses the intent and doesn't need any further processing.
Now select count(id) comes with an alternative select count(*). These two are not synonyms. select count(*) will count the number of rows and use a cached value if possible when select count(id) counts the number of non null values of the column id exists. If the id columns is set as not null then the cached row count may be used.
The selection between count(*) and count(id) depends once again on your intent. In the general case, count(*) describes the intent better.
The there is the possibility of count(1) which is actually a synonym of count(*) when using mysql but the interpretation may vary if end up using a different RDBMS.
The performance of each type of count also varies depending on whether you are using MyISAM or InnoDB. The row counts are cached on the former but not on the latter, if I've understood correctly.
In the end, you should rely on query plans and running tests and measuring their performance rather than these general ramblings.
I have two questions here but i am asking them at once as i think they are inter-related.
I am working with a complex query (Multiple joins + sub queries) and the table is pretty huge as well (around 2,00,000 records in this table).
A part of this query (a LEFT JOIN) is required to find a record which has a second lowest value in a cetain column among all the records associated with the primary key of the first table. For now I have isolated this part and thinking on the lines of -
SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1;
But there is a case where, if there is only 1 record in the table, it must return that record instead of NULL. So my first question is how do write a query for this ?
Secondly, considering the size of the table and the time its already taking to run even after creating indexes, I understand that adding any more complexity to it in order to achieve the above part might affect the querying time dramatically.
I cannot decompose joins because I need to get some of the columns for the ORDER BY clause (the application has an option to sort the result by these columns, the above column "myvalue" being one of them)
What would be the way(s) to approach this problem ?
Thanks
Something like this might work
COALESCE(
(SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1),
(SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 0,1))
It selects the first non null value from the list provided.
As for the complexity of the query, post the whole thing so we can take a look at it.