I thought columns in a VIEW simply inherited data types used in the underlying TABLE, but that doesn't seem to be true.
I have a MySQL TABLE like:
"myTable"
FIELD TYPE NULL KEY DEFAULT EXTRA
id int(11) NO PRI NULL auto_increment
dt datetime NO MUL NULL
foo smallint(5) unsigned NO NULL
I can query the table like:
SELECT dt, SUM(foo) FROM myTable WHERE dt>DATE_SUB(Now(), INTERVAL 3 DAY) GROUP BY dt
Now I need the ability to query the same data but using alternate names for some columns (such as "foo").
[I'll skip the long explanation of why!]
I figured a simple solution was a VIEW:
CREATE VIEW myView AS ( SELECT id, dt, foo AS foobar FROM myTable ORDER BY dt )
This creates a view with columns like:
"myView"
FIELD TYPE NULL KEY DEFAULT EXTRA
id int(11) NO 0
dt datetime NO NULL
foobar smallint(5) unsigned NO NULL
The problem arises when I query the view: (almost identical to the previous query)
SELECT dt, SUM(foobar) AS foo FROM myView WHERE dt>DATE_SUB(Now(), INTERVAL 3 DAY) GROUP BY dt
The query runs without producing an error but the response is zero records.
I discovered that if I CAST the WHERE clause like this, then it works properly (although it's painfully slow.)
. . . WHERE CAST(dt AS DATETIME) > DATE_SUB(Now(), INTERVAL 3 DAY)
CASTing all columns would be tedious, plus it's slowing down query execution quite a bit. (There are 5 million records and growing.)
Why is SQL forcing me to re-cast the fields? What can I do about it?
Thanks!
It turns out this is a regression introduced in MariaDB 10.2: an ORDER BY in the view definition interacts badly with a GROUP BY on the same column in queries that use the view.
I created the following bug report for this:
https://jira.mariadb.org/browse/MDEV-23826
You should remove the ORDER BY clause from the view definition. For every call of the view, the whole record set is scanned and sorted before the result is returned, even when the caller doesn't need a sorted result. The ORDER BY clause is therefore redundant and can hurt performance.
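For example, a minimal sketch of the fix (assuming the view should expose foo as foobar, as in the question):
-- Recreate the view without ORDER BY; sort in the outer query if needed.
CREATE OR REPLACE VIEW myView AS
    SELECT id, dt, foo AS foobar
    FROM myTable;
-- The aggregate query then works without any CAST:
SELECT dt, SUM(foobar) AS foo
FROM myView
WHERE dt > DATE_SUB(NOW(), INTERVAL 3 DAY)
GROUP BY dt;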
I'm using MySQL 5.7.10.
I'm checking a new query for an audit report.
I'll execute it in a simple background Unix process, which invokes mysql from the console.
To check the query, I use a worksheet in HeidiSQL.
The table is:
CREATE TABLE `services` (
`assigned_id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`service_id` VARCHAR(10) NOT NULL,
`name` VARCHAR(50) NOT NULL,
...
`audit_insert` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
...
INDEX `idx_audit_insert` (`audit_insert`),
...
);
The simple worksheet is:
SET @numberOfMonths := 6;
SET @today := CURRENT_TIMESTAMP();
SET @todaySubstractnumberOfMonths := TIMESTAMP( DATE_SUB(@today, INTERVAL @numberOfMonths MONTH) );
EXPLAIN SELECT service_id FROM services WHERE audit_insert BETWEEN @todaySubstractnumberOfMonths AND @today;
The explain output for that query is:
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,services,[all partitions],ALL,idx_audit_insert,,,,47319735,21.05,Using where
So, index 'idx_audit_insert' is not used.
If I change the query to:
EXPLAIN SELECT service_id FROM services WHERE audit_insert BETWEEN '2020-01-01 00:00:00' AND '2020-03-10 23:59:59';
The output is:
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,services,[all partitions],range,idx_audit_insert,idx_audit_insert,4,,4257192,100.00,Using index condition
Now, the index is used and the rows value is dramatically reduced.
So, my questions are:
How can I force the variables to be TIMESTAMP? Is there anything wrong in my worksheet?
or maybe
How can I use the index (trying to avoid hints like USE INDEX, FORCE INDEX...)?
Thanks a lot.
(EDIT: I copied the same question to dba.stackexchange.com; maybe it is more appropriate for that forum.)
Well, maybe it's not the answer I thought I'd find, but it works perfectly.
I split the audit_insert field into a second column, audit_insert_datetype, of DATE type. This field has a new index too.
I changed the query to use this field, and I forced the @… variables to be DATE type (with CURRENT_DATE and DATE).
The results: the new index is used and the execution time is dramatically reduced.
Maybe it's bad style, but it works as I need.
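A minimal sketch of what that change could look like (hypothetical DDL; the column and index names are illustrative):
-- Add a DATE copy of audit_insert, index it, and backfill it.
ALTER TABLE services
    ADD COLUMN audit_insert_datetype DATE NULL,
    ADD INDEX idx_audit_insert_datetype (audit_insert_datetype);
UPDATE services SET audit_insert_datetype = DATE(audit_insert);
-- Query with DATE-typed variables so no implicit cast gets in the way:
SET @todayDate := CURRENT_DATE();
SET @todayDateMinusMonths := DATE_SUB(@todayDate, INTERVAL 6 MONTH);
SELECT service_id FROM services
WHERE audit_insert_datetype BETWEEN @todayDateMinusMonths AND @todayDate;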
All that date arithmetic can be done in SQL. If you do that, it will use the index.
"Constant" expressions (such as CURDATE() + INTERVAL 4 MONTH) are evaluated to a DATETIME or TIMESTAMP datatype before starting the query.
I've found a few questions that deal with this problem, and it appears that MySQL doesn't allow it. That's fine, I don't have to have a subquery in the FROM clause. However, I don't know how to get around it. Here's my setup:
I have a metrics table that has 3 columns I want: ControllerID, TimeStamp, and State. Basically, a data gathering engine contacts each controller in the database every 5 minutes and sticks an entry in the metrics table. The table has those three columns, plus a MetricsID that I don't care about. Maybe there is a better way to store those metrics, but I don't know it.
Regardless, I want a view that takes the most recent TimeStamp for each of the different ControllerIDs and grabs the TimeStamp, ControllerID, and State. So if there are 4 controllers, the view should always have 4 rows, each with a different controller, along with its most recent state.
I've been able to create a query that gets what I want, but it relies on a subquery in the FROM clause, something that isn't allowed in a view. Here is what I have so far:
SELECT *
FROM
(SELECT
ControllerID, TimeStamp, State
FROM Metrics
ORDER BY TimeStamp DESC)
AS t
GROUP BY ControllerID;
Like I said, this works great, but I can't use it in a view. I've tried using the MAX() function, but as per SQL: Any straightforward way to order results FIRST, THEN group by another column?, if I want any additional columns besides the GROUP BY and ORDER BY columns, MAX() doesn't work. I've confirmed this limitation; it doesn't work.
I've also tried to alter the metrics table to order by TimeStamp. That doesn't work either; the wrong rows are kept.
Edit: Here is the SHOW CREATE TABLE of the Metrics table I am pulling from:
CREATE TABLE Metrics (
MetricsID int(11) NOT NULL AUTO_INCREMENT,
ControllerID int(11) NOT NULL,
TimeStamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
State tinyint(4) NOT NULL,
PRIMARY KEY (MetricsID),
KEY makeItFast (ControllerID,MetricsID),
KEY fast (ControllerID,TimeStamp),
KEY fast2 (MetricsID),
KEY MetricsID (MetricsID),
KEY TimeStamp (TimeStamp)
) ENGINE=InnoDB AUTO_INCREMENT=8958 DEFAULT CHARSET=latin1
If you want the most recent row for each controller, the following is view friendly:
SELECT ControllerID, TimeStamp, State
FROM Metrics m
WHERE NOT EXISTS (SELECT 1
FROM Metrics m2
WHERE m2.ControllerId = m.ControllerId and m2.Timestamp > m.TimeStamp
);
Your query is not correct anyway, because it uses a MySQL extension whose behavior is not guaranteed. The value for State doesn't necessarily come from the row with the largest timestamp; it comes from an arbitrary row.
EDIT:
For best performance, you want an index on Metrics(ControllerId, Timestamp).
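Wrapped as a view, that could look like this (a sketch; the view name is illustrative):
CREATE VIEW LatestMetrics AS
    SELECT ControllerID, TimeStamp, State
    FROM Metrics m
    WHERE NOT EXISTS (SELECT 1
                      FROM Metrics m2
                      WHERE m2.ControllerID = m.ControllerID
                        AND m2.TimeStamp > m.TimeStamp);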
Edit: Sorry, I misunderstood your question; I thought you were trying to overcome the nested-query limitation in a view.
You're trying to display the most recent row for each distinct ControllerID. Furthermore, you're trying to do it with a view.
First, let's do it. If your MetricsID column (which I know you don't care about) is an autoincrement column, this is really easy.
SELECT ControllerId, TimeStamp, State
FROM Metrics m
WHERE MetricsID IN (
SELECT MAX(MetricsID) MetricsID
FROM Metrics
GROUP BY ControllerID)
ORDER BY ControllerID
This query uses MAX ... GROUP BY to extract the highest-numbered (most recent) row for each controller. It can be made into a view.
A compound index on (ControllerID, MetricsID) will be able to satisfy the subquery with a highly efficient loose index scan.
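As a view, that might look like this (a sketch; the view name is illustrative):
CREATE VIEW ControllerState AS
    SELECT ControllerID, TimeStamp, State
    FROM Metrics
    WHERE MetricsID IN (SELECT MAX(MetricsID)
                        FROM Metrics
                        GROUP BY ControllerID);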
The root cause of my confusion: I didn't read your question carefully enough.
The root cause of your confusion: You're trying to take advantage of a pernicious MySQL extension to GROUP BY. Your idea of ordering the subquery may have worked. But your temporary success is an accidental side-effect of the present implementation. Read this: http://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html
I have a huge table (425+ million rows).
CREATE TABLE `DummyTab` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`Name` varchar(48) NOT NULL,
`BeginDate` datetime DEFAULT NULL,
`EndDate` datetime NOT NULL,
......
......
KEY `BeginDate_index` (`BeginDate`),
KEY `id` (`id`)
) ENGINE=MyISAM
Selects are done based on "BeginDate" and other criteria on this table, e.g.:
select * from DummyTab where Name like "%dummyname%" and BeginDate>= 20141101
Now, in this case only the date part is provided for the datetime field (although it will be treated as 2014-11-01 00:00:00).
The question is: does the optimizer make proper use of the DATETIME index even when just a date is provided? Or should the index be on a DATE field, rather than a DATETIME, to be used more effectively?
Yes, BeginDate_index can still be used when the query is specified with a DATE-only filter (and applying additional criteria on Name won't disqualify the index, either).
If you look at this SqlFiddle of random data, and expand the Execution plan at the bottom, you'll see something like:
ID SELECT_TYPE TABLE TYPE POSSIBLE_KEYS KEY KEY_LEN REF ROWS FILTERED EXTRA
1 SIMPLE DummyTab range BeginDate_index BeginDate_index 6 17190 100 Using index condition; Using where
(Specifically KEY is BeginDate_index).
Note, however, that use of the index is not guaranteed: if you execute the same query against a wider range of date criteria, a different plan may be used (e.g. if you run the same fiddle for > 20140101, the BeginDate_index is no longer used, since it does not offer sufficient selectivity).
Edit, Re: Comment on Exactness
Since BeginDate is a datetime, the literal 20141101 will also be converted to a datetime (once). From the docs:
If one of the arguments is a TIMESTAMP or DATETIME column and the other argument is a constant, the constant is converted to a timestamp before the comparison is performed.
So again, yes, as per your last paragraph, the literal in the filter BeginDate >= 20141101 will be converted to the exact datetime 20141101000000 (2014-11-01 00:00:00), and any eligible indexes will be considered (but again, never guaranteed).
A common case where indexes cannot be used, because the filter predicates are not sargable, is when a function is applied to a column in a filter: the engine would need to evaluate the function on all remaining rows in the query. Some examples here.
So, altering your example a bit, the two queries below do the same thing, but the second one is much slower. This query is sargable:
SELECT * FROM DummyTab
WHERE BeginDate < 20140101; -- Good
Whereas this is NOT:
SELECT * FROM DummyTab
WHERE YEAR(BeginDate) < 2014; -- Bad
Updated SqlFiddle here - again, look at the Execution Plans at the bottom to see the difference.
EDIT: Thank you everyone for your comments. I have tried most of your suggestions, but they did not help. I should add that I am running this query through MATLAB using Connector/J 5.1.26 (sorry for not mentioning this earlier). In the end, I think this is the source of the increase in execution time, since when I run the query "directly" it takes 0.2 seconds. However, I have never come across such a huge performance hit using Connector/J. Given this new information, do you have any suggestions?
I have the following table in MySQL (CREATE code taken from HeidiSQL):
CREATE TABLE `data` (
`PRIMARY` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`ID` VARCHAR(5) NULL DEFAULT NULL,
`DATE` DATE NULL DEFAULT NULL,
`PRICE` DECIMAL(14,4) NULL DEFAULT NULL,
`QUANT` INT(10) NULL DEFAULT NULL,
`TIME` TIME NULL DEFAULT NULL,
INDEX `DATE` (`DATE`),
INDEX `ID` (`ID`),
INDEX `PRICE` (`PRICE`),
INDEX `QUANT` (`QUANT`),
INDEX `TIME` (`TIME`),
PRIMARY KEY (`PRIMARY`)
)
It is populated with approximately 360,000 rows of data.
The following query takes over 10 seconds to execute:
SELECT ID, DATE, PRICE, QUANT, TIME FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
ORDER BY ID, DATE, TIME ASC;
I have other tables with millions of rows in which a similar query would take a fraction of a second. I can't figure out what might be causing this one to be so slow. Any ideas/tips?
EXPLAIN:
id = 1
select_type = SIMPLE
table = data
type = ALL
possible_keys = DATE
key = (NULL)
key_len = (NULL)
ref = (NULL)
rows = 361161
Extra = Using where; Using filesort
You are asking for a wide range of data. The time is probably being spent sorting the results.
Is a query on a smaller date range faster? For instance,
WHERE DATE >= '2007-01-01' AND DATE < '2007-02-01'
One possibility is that the optimizer may be using the index on id for the sort and doing a full table scan to filter out the date range. Using indexes for sorts is often suboptimal. You might try the query as:
select t.*
from (Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
) t
ORDER BY ID, DATE, TIME ASC;
I think this will force the optimizer to use the date index for the selection and then sort using file sort -- but there is the cost of a derived table. If you do not have a large result set, this might significantly improve performance.
I assume you already tried to OPTIMIZE TABLE and got no results.
You can either try to use a covering index (at the expense of more disk space, and a slight slowing down on UPDATEs) by replacing the existing date index with
CREATE INDEX data_date_ndx ON data (DATE, TIME, PRICE, QUANT, ID);
and/or you can try and create an empty table data2 with the same schema. Then just SELECT all the contents of data table into data2 and run the same query against the new table. It could be that the data table needed to be compacted more than OPTIMIZE could - maybe at the filesystem level.
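A minimal sketch of that copy (assuming data2 should mirror data exactly):
-- Rebuild the table from scratch to compact it at the storage level.
CREATE TABLE data2 LIKE data;
INSERT INTO data2 SELECT * FROM data;
-- Then run the same SELECT against data2 and compare timings.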
Also, check out the output of EXPLAIN SELECT... for that query.
I'm not familiar with MySQL, only MSSQL, but maybe: what about providing an index that fully covers all the fields in your SELECT query? Yes, it duplicates data, but then we can move on to the next point of the discussion.
I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT source, COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but my Googling all boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (and even then, no explanation of why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster if the grouped column also has an index on it... Of course, this depends on the number of NULL values actually in that column, which are not indexed.
For event, MySQL can execute a unique scan, which is quite fast, whereas for time a range scan will be applied, which is not so fast... With separate indexes, I'd expect better performance than with multi-column ones.
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
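For illustration only, a sketch of range partitioning such a table by the Unix-timestamp time column (hypothetical DDL; MySQL requires the partitioning column to appear in every unique key, so the primary key is widened here):
CREATE TABLE history_by_year (
    id bigint(20) NOT NULL AUTO_INCREMENT,
    time int(11) NOT NULL,
    event int(11) NOT NULL,
    source int(11) DEFAULT NULL,
    PRIMARY KEY (id, time)
) ENGINE=MyISAM
PARTITION BY RANGE (time) (
    PARTITION p2010 VALUES LESS THAN (1293840000), -- before 2011-01-01 UTC
    PARTITION p2011 VALUES LESS THAN (1325376000), -- before 2012-01-01 UTC
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);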
I suggest trying this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then, if that doesn't help, try forcing the index in the query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want the count for specific sources, you can try this:
SELECT COUNT(source = 'A' OR NULL) AS A, COUNT(source = 'B' OR NULL) AS B FROM history;
and do the ordering in your application code. Also try indexing event and source together.
This should be considerably faster than the original query.