Why do you need to place columns you create yourself (for example select 1 as "number") after HAVING and not WHERE in MySQL?
And are there any downsides instead of doing WHERE 1 (writing the whole definition instead of a column name)?
All other answers on this question didn't hit upon the key point.
Assume we have a table:
CREATE TABLE `table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`value` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `value` (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
And have 10 rows with both id and value from 1 to 10:
INSERT INTO `table`(`id`, `value`) VALUES (1, 1),(2, 2),(3, 3),(4, 4),(5, 5),(6, 6),(7, 7),(8, 8),(9, 9),(10, 10);
Try the following 2 queries:
SELECT `value` v FROM `table` WHERE `value`>5; -- Get 5 rows
SELECT `value` v FROM `table` HAVING `value`>5; -- Get 5 rows
You will get exactly the same results, you can see the HAVING clause can work without GROUP BY clause.
Here's the difference:
SELECT `value` v FROM `table` WHERE `v`>5;
The above query will raise error: Error #1054 - Unknown column 'v' in 'where clause'
SELECT `value` v FROM `table` HAVING `v`>5; -- Get 5 rows
WHERE clause allows a condition to use any table column, but it cannot use aliases or aggregate functions.
HAVING clause allows a condition to use a selected (!) column, alias or an aggregate function.
This is because WHERE clause filters data before select, but HAVING clause filters resulting data after select.
So put the conditions in WHERE clause will be more efficient if you have many many rows in a table.
Try EXPLAIN to see the key difference:
EXPLAIN SELECT `value` v FROM `table` WHERE `value`>5;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | table | range | value | value | 4 | NULL | 5 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
EXPLAIN SELECT `value` v FROM `table` having `value`>5;
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
| 1 | SIMPLE | table | index | NULL | value | 4 | NULL | 10 | Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
You can see either WHERE or HAVING uses index, but the rows are different.
Why is it that you need to place columns you create yourself (for example "select 1 as number") after HAVING and not WHERE in MySQL?
WHERE is applied before GROUP BY, HAVING is applied after (and can filter on aggregates).
In general, you can reference aliases in neither of these clauses, but MySQL allows referencing SELECT level aliases in GROUP BY, ORDER BY and HAVING.
And are there any downsides instead of doing "WHERE 1" (writing the whole definition instead of a column name)
If your calculated expression does not contain any aggregates, putting it into the WHERE clause will most probably be more efficient.
The main difference is that WHERE cannot be used on grouped item (such as SUM(number)) whereas HAVING can.
The reason is the WHERE is done before the grouping and HAVING is done after the grouping is done.
HAVING is used to filter on aggregations in your GROUP BY.
For example, to check for duplicate names:
SELECT Name FROM Usernames
GROUP BY Name
HAVING COUNT(*) > 1
These 2 will be feel same as first as both are used to say about a condition to filter data. Though we can use ‘having’ in place of ‘where’ in any case, there are instances when we can’t use ‘where’ instead of ‘having’. This is because in a select query, ‘where’ filters data before ‘select’ while ‘having’ filter data after ‘select’. So, when we use alias names that are not actually in the database, ‘where’ can’t identify them but ‘having’ can.
Ex: let the table Student contain student_id,name, birthday,address.Assume birthday is of type date.
SELECT * FROM Student WHERE YEAR(birthday)>1993; /*this will work as birthday is in database.if we use having in place of where too this will work*/
SELECT student_id,(YEAR(CurDate())-YEAR(birthday)) AS Age FROM Student HAVING Age>20;
/*this will not work if we use ‘where’ here, ‘where’ don’t know about age as age is defined in select part.*/
WHERE filters before data is grouped, and HAVING filters after data is grouped. This is an important distinction; rows that are
eliminated by a WHERE clause will not be included in the group. This
could change the calculated values which, in turn(=as a result) could affect which
groups are filtered based on the use of those values in the HAVING
clause.
And continues,
HAVING is so similar to WHERE that most DBMSs treat them as the same
thing if no GROUP BY is specified. Nevertheless, you should make that
distinction yourself. Use HAVING only in conjunction with GROUP BY
clauses. Use WHERE for standard row-level filtering.
Excerpt From:
Forta, Ben. “Sams Teach Yourself SQL in 10 Minutes (5th
Edition) (Sams Teach Yourself...).”.
Having is only used with aggregation but where with non aggregation statements
If you have where word put it before aggregation (group by)
Related
I have a table below:
CREATE TABLE `student` (
`name` varchar(30) NOT NULL DEFAULT '',
`city` varchar(30) NOT NULL DEFAULT '',
`age` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`name`,`city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I want to know, if I execute the following two SQLs, do they have the same performance?
mysql> select * from student where name='John' and city='NewYork';
mysql> select * from student where city='NewYork' and name='John';
Involved question:
If there is a multi-column indexes (name, city), do the two SQLs all use it?
Does the optimizer change the second sql to the first because of the index?
I execute explain on the two of them, the result is below:
mysql> explain select * from student where name='John' and city='NewYork';
+----+-------------+---------+-------+---------------+---------+---------+-------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+---------+---------+-------------+------+-------+
| 1 | SIMPLE | student | const | PRIMARY | PRIMARY | 184 | const,const | 1 | NULL |
+----+-------------+---------+-------+---------------+---------+---------+-------------+------+-------+
mysql> explain select * from student where city='NewYork' and name='John';
+----+-------------+---------+-------+---------------+---------+---------+-------------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+---------+---------+-------------+------+-------+
| 1 | SIMPLE | student | const | PRIMARY | PRIMARY | 184 | const,const | 1 | NULL |
+----+-------------+---------+-------+---------------+---------+---------+-------------+------+-------+
If, given an index on (name,city), I execute the following two SQLs, do they have the same performance?
where name='John' and city='NewYork'
where city='NewYork' and name='John'
Yes.
The query planner doesn't care about the order of WHERE clauses. If both your clauses filter on equality, the planner can use the index. SQL is a declarative language, not procedural. That is, you say what you want, not how to get it. It's a little counterintuitive for many programmers.
It can also use the (name,city) index for WHERE name LIKE 'Raymo%' because name is first in the index. It cannot use that index for WHERE city = 'Kalamazoo', though.
It can use the index for WHERE city LIKE 'Kalam%' AND name = 'Raymond'. In that case it uses the index to find the name value, then scans for matching cities.
If you had an index on (city,name) you could also use that for WHERE city = 'Kalamazoo' AND name = 'Raymond'. If both indexes exist, the query planner will pick one, probably based on some kind of cardinality consideration.
Note. If instead you have two different indexes on city and name, the query planner can't (as of mid-2017) use more than one of them to satisfy WHERE city = 'Kalamazoo' AND name = 'Raymond'.
http://use-the-index-luke.com/ for good info.
The order of columns in a multi-column index matters.
The documentation of the multiple-column indexes reads:
MySQL can use multiple-column indexes for queries that test all the columns in the index, or queries that test just the first column, the first two columns, the first three columns, and so on. If you specify the columns in the right order in the index definition, a single composite index can speed up several kinds of queries on the same table.
This means an index on columns name and city can be used when an index on column name is needed but it cannot be used instead of an index on column city.
The order of conditions in the WHERE clause doesn't matter. The MySQL optimizer does a lot of work on the conditions on the WHERE clause to eliminate as many candidate rows as possible as early as possible and to read as little data as possible from the tables and indexes (because some of the read data is dropped because it doesn't match the entire WHERE clause).
If I compare
explain select * from Foo where find_in_set(id,'2,3');
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | User | ALL | NULL | NULL | NULL | NULL | 4 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
with this one
explain select * from Foo where id in (2,3);
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | User | range | PRIMARY | PRIMARY | 8 | NULL | 2 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
It is apparent that FIND_IN_SET does not exploit the primary key.
I want to put a query such as the above into a stored procedure, with the comma-separated string as an argument.
Is there any way to make the query behave like the second version, in which the index is used, but without knowing the content of the id set at the time the query is written?
In reference to your comment:
#MarcB the database is normalized, the CSV string comes from the UI.
"Get me data for the following people: 101,202,303"
This answer has a narrow focus on just those numbers separated by a comma. Because, as it turns out, you were not even talking about FIND_IN_SET afterall.
Yes, you can achieve what you want. You create a prepared statement that accepts a string as a parameter like in this Recent Answer of mine. In that answer, look at the second block that shows the CREATE PROCEDURE and its 2nd parameter which accepts a string like (1,2,3). I will get back to this point in a moment.
Not that you need to see it #spraff but others might. The mission is to get the type != ALL, and possible_keys and keys of Explain to not show null, as you showed in your second block. For a general reading on the topic, see the article Understanding EXPLAIN’s Output and the MySQL Manual Page entitled EXPLAIN Extra Information.
Now, back to the (1,2,3) reference above. We know from your comment, and your second Explain output in your question that it hits the following desired conditions:
type = range (and in particular not ALL) . See the docs above on this.
key is not null
These are precisely the conditions you have in your second Explain output, and the output that can be seen with the following query:
explain
select * from ratings where id in (2331425, 430364, 4557546, 2696638, 4510549, 362832, 2382514, 1424071, 4672814, 291859, 1540849, 2128670, 1320803, 218006, 1827619, 3784075, 4037520, 4135373, ... use your imagination ..., ..., 4369522, 3312835);
where I have 999 values in that in clause list. That is an sample from this answer of mine in Appendix D than generates such a random string of csv, surrounded by open and close parentheses.
And note the following Explain output for that 999 element in clause below:
Objective achieved. You achieve this with a stored proc similar to the one I mentioned before in this link using a PREPARED STATEMENT (and those things use concat() followed by an EXECUTE).
The index is used, a Tablescan (meaning bad) is not experienced. Further readings are The range Join Type, any reference you can find on MySQL's Cost-Based Optimizer (CBO), this answer from vladr though dated, with a eye on the ANALYZE TABLE part, in particular after significant data changes. Note that ANALYZE can take a significant amount of time to run on ultra-huge datasets. Sometimes many many hours.
Sql Injection Attacks:
Use of strings passed to Stored Procedures are an attack vector for SQL Injection attacks. Precautions must be in place to prevent them when using user-supplied data. If your routine is applied against your own id's generated by your system, then you are safe. Note, however, that 2nd level SQL Injection attacks occur when data was put in place by routines that did not sanitize that data in a prior insert or update. Attacks put in place prior via data and used later (a sort of time bomb).
So this answer is Finished for the most part.
The below is a view of the same table with a minor modification to it to show what a dreaded Tablescan would look like in the prior query (but against a non-indexed column called thing).
Take a look at our current table definition:
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
select min(id), max(id),count(*) as theCount from ratings;
+---------+---------+----------+
| min(id) | max(id) | theCount |
+---------+---------+----------+
| 1 | 5046213 | 4718592 |
+---------+---------+----------+
Note that the column thing was a nullable int column before.
update ratings set thing=id where id<1000000;
update ratings set thing=id where id>=1000000 and id<2000000;
update ratings set thing=id where id>=2000000 and id<3000000;
update ratings set thing=id where id>=3000000 and id<4000000;
update ratings set thing=id where id>=4000000 and id<5100000;
select count(*) from ratings where thing!=id;
-- 0 rows
ALTER TABLE ratings MODIFY COLUMN thing int not null;
-- current table definition (after above ALTER):
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
And then the Explain that is a Tablescan (against column thing):
You can use following technique to use primary index.
Prerequisities:
You know the maximum amount of items in comma separated string and it is not large
Description:
we convert comma separated string into temporary table
inner join to the temporary table
select #ids:='1,2,3,5,11,4', #maxCnt:=15;
SELECT *
FROM foo
INNER JOIN (
SELECT * FROM (SELECT #n:=#n+1 AS n FROM foo INNER JOIN (SELECT #n:=0) AS _a) AS _a WHERE _a.n <= #maxCnt
) AS k ON k.n <= LENGTH(#ids) - LENGTH(replace(#ids, ',','')) + 1
AND id = SUBSTRING_INDEX(SUBSTRING_INDEX(#ids, ',', k.n), ',', -1)
This is a trick to extract nth value in comma separated list:
SUBSTRING_INDEX(SUBSTRING_INDEX(#ids, ',', k.n), ',', -1)
Notes: #ids can be anything including other column from other or the same table.
I have a table that has 4,000,000 records.
The table is created that : (user_id int, partner_id int, PRIMARY_KEY ( user_id )) engine=InnoDB;
I want to test the performance of select 100 records.
Then, I tested following:
mysql> explain select user_id from MY_TABLE use index (PRIMARY) where user_id IN ( 1 );
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
| 1 | PRIMARY | MY_TABLE | const | PRIMARY | PRIMARY | 4 | const | 1 | Using index |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------------+
1 row in set, 1 warning (0.00 sec)
This is OK.
But, this query is buffered by mysql.
So, this test make no after the first test.
Then, I thinked of a sql that select by random value.
I tested following:
mysql> explain select user_id from MY_TABLE use index (PRIMARY) where user_id IN ( select ceil( rand() ) );
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
| 1 | PRIMARY | MY_TABLE | index | NULL | PRIMARY | 4 | NULL | 3998727 | Using where; Using index |
+----+-------------+----------+-------+---------------+---------+---------+------+---------+--------------------------+
But, it's bad.
Explain shows that possible_keys is NULL.
So, full index scanning is planned, and in fact, it's too slow rather than the one before.
Then, I want to ask you to teach me how do I write random value with index looking up.
Thanks
Using rand() in SQL is usually a sure-fire way to make the query slow. A common theme here is people using it in ORDER BY to get a random sequence. It's slow because not only does it throw away the indexes, but it also reads through the whole table.
However in your case, the fact that the function calls are in a sub-query ought to allow the outer query to still use its indexes. The fact that it isn't seems quite odd (so I've given the question a +1 vote).
My theory is that perhaps MySQL's optimiser is getting it wrong -- it's seeing the functions in the inner query, and deciding incorrectly that it can't use an index.
The only thing I can suggest to work around that is using force index to push MySQL into using the index you want.
See the definition of rand().
If i understand right, you are trying to get a random record from the database. If that is the case, again from the rand() definition:
ORDER BY RAND() combined with LIMIT is useful for selecting a random sample from a set of rows:
SELECT * FROM table1, table2 WHERE a=b AND c<d -> ORDER BY RAND() LIMIT 1000;
It's a limitation of the MySQL optimizer, that it can't tell that the subquery returns exactly one value, it has to assume the subquery returns multiple rows with unpredictable values, potentially even all the values of user_id. Therefore it decides it's just going to do an index scan.
Here's a workaround:
mysql> explain select user_id from MY_TABLE use index (PRIMARY)
where user_id = ( select ceil( rand() ) );
Note that MySQL's RAND() function returns a value in the range 0 <= v < 1.0. If you CEIL() it, you'll likely get the value 1. Therefore you'll virtually always get the row where user_id=1. If you don't have such a row in your table, you'll get an empty set result. You certainly won't get a user chosen randomly among all your users.
To fix that problem, you'd have to multiply the rand() by the number of distinct user_id values. And that brings up the problem that you might have gaps, so a randomly chosen value won't match any existing user_id.
Re your comment:
You'll always see possible keys as NULL when you get an index scan (i.e., "type" is "index").
I tried your explain query on a similar table, and it appears that the optimizer can't figure out that the subquery is a constant expression. You can workaround this limitation by calculating the random number in application code and then using the result as a constant value in your query:
select user_id from MY_TABLE use index (PRIMARY)
where user_id = $random;
I am using MySQl database.
I know if I create a index for a column, it will be fast to query data from a table by using that column index. But, I still have the following questions:
(suppose I have a table named cars, there is a column named country, and I have created index for country column)
I know for example the query SELECT * FROM cars WHERE country='japan'will use the index on column country to query data which is fast. How about != operation? will SELECT * FROM cars WHERE country!='japan'; also use index to query data?
Does WHERE ... IN ... operation use index to query data? For example SELECT * FROM cars WHERE country IN ('japan','usa','sweden');
The general answer is: it depends. It depends on what the database optimizer thinks is the best way to retrieve the data, and its decision may need on the distribution of the data.
For example, if 99% of your rows have country = 'japan', maybe the first query (=) will not use the index, but the country with != will use it.
You can use EXPLAIN SELECT to find out if your query uses an index or not.
For example:
EXPLAIN SELECT *
FROM A
WHERE foo NOT IN (1,4,5,6);
Might yield:
+----+-------------+-------+------+---------------
| id | select_type | table | type | possible_keys
+----+-------------+-------+------+---------------
| 1 | SIMPLE | A | ALL | NULL
+----+-------------+-------+------+---------------
+------+---------+------+------+-------------+
| key | key_len | ref | rows | Extra |
+------+---------+------+------+-------------+
| NULL | NULL | NULL | 2 | Using where |
+------+---------+------+------+-------------+
In this case, the query had no possible_keys and therefore used no key to do the query. It's the key column you'd be interested in.
More information here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
http://dev.mysql.com/doc/refman/5.0/en/optimization-indexes.html
Use 'EXPLAIN' to see what happens with your query. You will probably be interested in the 'possible_keys' and 'key' column.
EXPLAIN SELECT * FROM CARS WHERE `country` != 'japan'
Both queries will use indexes (assuming there is index with country as it's first column)
When in doubt use EXPLAIN. You will also want to read (at least parts of) this
I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3-5 seconds, and is being estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
Why not using a concatenated index?
CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
Should go for an Index-Only Scan (mentioning "Using index" in Extras, because COUNT(ID) is NOT NULL anyway).
References:
Concatenated index vs. merging multiple indexes
Index-Only Scan
After reading the latest edit on the question, the problem seems to be that the parameter being used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. This would explain why the index_counters_on_created_at index was not being selected by the optimizer, and instead it would result in a scan to convert the created_at values to a string representation and then do the comparison. I think, this can be prevented by an explicit cast to datetime in the where clause:
where `created_at` >= convert({specific_date}, datetime)
My original comments still apply for the optimization part.
The real performance killer here is the kind column. Because when doing the GROUP BY the database engine first needs to determine all the distinct values in the kind column which results in a table or index scan. That's why the estimated rows is bigger than the total number of rows in the table, in one pass it will determine the distinct values in the kind column, and in a second pass it will determine which rows meet the create_at >= ? condition.
To make matters worse, the kind column is a varchar (255) which is too big to be efficient, add to that that it uses utf8 character set and utf8_unicode_ci collation, which increment the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int. Because integer comparisons are more efficient and simpler than unicode character comparisons. It would also help to have a catalog table for the kind of messages in which you store the kind_id and description. And then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
select k.kind_id, count(*)
from
kind_catalog k
inner join (
select kind_id
from counters
where create_at >= ?
) c on k.kind_id = c.kind_id
group by k.kind_id
This will first filter the counters table by create_at >= ? and can benefit from the index on that column. Then it will join that to the kind_catalog table and if the SQL optimizer is good it will scan the smaller kind_catalog table for doing the grouping, instead of the counters table.