I want to:
select max(dates) as max_date
from some_table t
where dates is a DATETIME column in the form
2014-10-29 23:34:11
and is the primary key, so it is indexed.
What is the retrieval complexity for big databases?
Since your date column is the primary key, it is unique and indexed, so it should be fine.
Per the MySQL documentation, if you use a WHERE clause along with the MAX() function, the query will be optimized and will be fast.
In your case, since you are just trying to get the maximum date, you can also use ORDER BY with LIMIT as below, which will take advantage of the index on the dates column and will be fast:
select `dates`
from some_table
order by `dates` desc
limit 1;
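You can verify that either form is resolved from the index alone with EXPLAIN (a sketch against the table described above):
EXPLAIN select max(dates) as max_date from some_table;
-- The Extra column should read "Select tables optimized away":
-- MAX() is answered with a single lookup at the end of the index,
-- so retrieval cost stays constant no matter how big the table gets.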
I have an Invoice table:
CREATE TABLE Invoice (
id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
countryCode CHAR(2) NOT NULL,
number INT(10) UNSIGNED NOT NULL,
...
PRIMARY KEY(id),
UNIQUE KEY(countryCode ASC, number ASC)
);
Each country must have its own sequential invoice numbering, so before creating a new invoice, I have to get the next sequence number for this particular country using the following query:
SELECT MAX(number) + 1
FROM Invoice
WHERE countryCode = 'XX';
I understand that this query will use the index.
Does the index sort order (number ASC or number DESC) have any impact on the performance of this query?
I think you have pretty much optimal index utilization, since countryCode is the first part of your key and the column you are getting the max value for is the second part. No sort is needed here at all. Check out this excerpt from the MySQL documentation on index utilization:
http://dev.mysql.com/doc/refman/5.5/en/mysql-indexes.html
To find the MIN() or MAX() value for a specific indexed column
key_col. This is optimized by a preprocessor that checks whether you
are using WHERE key_part_N = constant on all key parts that occur
before key_col in the index. In this case, MySQL does a single key
lookup for each MIN() or MAX() expression and replaces it with a
constant. If all expressions are replaced with constants, the query
returns at once. For example:
SELECT MIN(key_part2), MAX(key_part2) FROM tbl_name WHERE key_part1=10;
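Applied to the Invoice table above, this means the sequence query should be answered with a single key lookup (a sketch; 'XX' is the placeholder country code from the question):
EXPLAIN SELECT MAX(number) + 1
FROM Invoice
WHERE countryCode = 'XX';
-- With UNIQUE KEY (countryCode, number), the Extra column should read
-- "Select tables optimized away": one key lookup, regardless of table size.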
From http://dev.mysql.com/doc/refman/5.1/en/create-table.html
An index_col_name specification can end with ASC or DESC. These keywords are permitted for future extensions for specifying ascending or descending index value storage. Currently, they are parsed but ignored; index values are always stored in ascending order.
I believe that with your statement, a key will only be created on countryCode and not number.
There will be a small performance difference for SELECT statements, but a bigger difference for INSERT, UPDATE and DELETE statements.
When you insert into the middle of a full page, the engine has to move some rows to the next page. If the index were stored DESC, every ascending INSERT would shift index rows across pages; the same applies to DELETEs.
UPDATEs are less affected, but the same thing applies if you change indexed values.
EDIT: Thank you everyone for your comments. I have tried most of your suggestions, but they did not help. I should add that I am running this query through Matlab using Connector/J 5.1.26 (sorry for not mentioning this earlier). I now think this is the source of the increase in execution time, since when I run the query "directly" it takes 0.2 seconds. However, I have never come across such a huge performance hit using Connector/J before. Given this new information, do you have any suggestions?
I have the following table in MySQL (CREATE code taken from HeidiSQL):
CREATE TABLE `data` (
`PRIMARY` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`ID` VARCHAR(5) NULL DEFAULT NULL,
`DATE` DATE NULL DEFAULT NULL,
`PRICE` DECIMAL(14,4) NULL DEFAULT NULL,
`QUANT` INT(10) NULL DEFAULT NULL,
`TIME` TIME NULL DEFAULT NULL,
INDEX `DATE` (`DATE`),
INDEX `ID` (`ID`),
INDEX `PRICE` (`PRICE`),
INDEX `QUANT` (`QUANT`),
INDEX `TIME` (`TIME`),
PRIMARY KEY (`PRIMARY`)
);
It is populated with approximately 360,000 rows of data.
The following query takes over 10 seconds to execute:
Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
ORDER BY ID, DATE, TIME ASC;
I have other tables with millions of rows in which a similar query would take a fraction of a second. I can't figure out what might be causing this one to be so slow. Any ideas/tips?
EXPLAIN:
id = 1
select_type = SIMPLE
table = data
type = ALL
possible_keys = DATE
key = (NULL)
key_len = (NULL)
ref = (NULL)
rows = 361161
Extra = Using where; Using filesort
You are asking for a wide range of data. The time is probably being spent sorting the results.
Is a query on a smaller date range faster? For instance,
WHERE DATE >= '2007-01-01' AND DATE < '2007-02-01'
One possibility is that the optimizer may be using the index on id for the sort and doing a full table scan to filter out the date range. Using indexes for sorts is often suboptimal. You might try the query as:
select t.*
from (Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
) t
ORDER BY ID, DATE, TIME ASC;
I think this will force the optimizer to use the date index for the selection and then sort using file sort -- but there is the cost of a derived table. If you do not have a large result set, this might significantly improve performance.
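If the derived table does not change the plan, an index hint is a more direct way to steer the optimizer toward the date index (a sketch using MySQL's FORCE INDEX; the index name DATE comes from the CREATE TABLE above):
Select ID, DATE, PRICE, QUANT, TIME
FROM database.data FORCE INDEX (`DATE`)
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
ORDER BY ID, DATE, TIME ASC;
-- forces the range scan over the DATE index; the ORDER BY is then a filesort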
I assume you already tried to OPTIMIZE TABLE and got no results.
You can either try to use a covering index (at the expense of more disk space and a slight slowdown on UPDATEs) by replacing the existing date index with
CREATE INDEX data_date_ndx ON data (DATE, TIME, PRICE, QUANT, ID);
and/or you can try creating an empty table data2 with the same schema, copying all the contents of the data table into data2, and running the same query against the new table. It could be that the data table needs more compaction than OPTIMIZE can deliver, perhaps at the filesystem level.
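A sketch of that copy-and-swap approach (data2 and data_old are scratch names, not from the question):
CREATE TABLE data2 LIKE data;                  -- same schema, fresh storage
INSERT INTO data2 SELECT * FROM data;          -- rewrites the rows compactly
RENAME TABLE data TO data_old, data2 TO data;  -- atomic swap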
Also, check out the output of EXPLAIN SELECT... for that query.
I'm not familiar with MySQL, only MSSQL, so maybe:
try providing an index that fully covers all the fields in your SELECT query.
Yes, it duplicates data, but it lets us move on to the next point of the discussion.
I have a table with the following structure
CREATE TABLE rel_score (
user_id bigint(20) NOT NULL DEFAULT '0',
score_date date NOT NULL,
rel_score decimal(4,2) DEFAULT NULL,
doc_count int(8) NOT NULL,
total_doc_count int(8) NOT NULL,
PRIMARY KEY (user_id,score_date),
KEY SCORE_DT_IDX (score_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 PACK_KEYS=1
The table will store a rel_score value for every user in the application for every day from 1st Jan 2000 to date. I estimate the total number of records will be over 700 million. I populated the table with 6 months of data (~30 million rows) and the query response time is about 8 minutes. Here is my query,
select
user_id, max(rel_score) as max_rel_score
from
rel_score
where score_date between '2012-01-01' and '2012-06-30'
group by user_id
order by max_rel_score desc;
I tried optimizing the query using the following techniques:
Partitioning on the score_date column
Adding an index on the score_date column
The query response time improved only marginally, to a little less than 8 minutes.
How can I improve the response time? Is the design of the table appropriate?
Also, I cannot move the old data to an archive, as a user is allowed to query the entire date range.
If you partition your table at the granularity of score_date itself, you will not reduce the query response time.
Try creating another attribute that contains only the year of the date, cast to an INTEGER; partition your table on this attribute (you will get 13 partitions) and re-execute your query to see.
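A minimal sketch of that idea; MySQL's RANGE partitioning can also use YEAR(score_date) directly, so a separate year column isn't strictly required (partition names are illustrative):
ALTER TABLE rel_score
PARTITION BY RANGE (YEAR(score_date)) (
    PARTITION p2000 VALUES LESS THAN (2001),
    PARTITION p2001 VALUES LESS THAN (2002),
    -- ... one partition per year ...
    PARTITION pmax VALUES LESS THAN MAXVALUE
);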
Your primary index should do a good job of covering the table. If you didn't have it, I would suggest building an index on rel_score(user_id, score_date, rel_score). For your query, this is a "covering" index, meaning that the index has all the columns in the query, so the engine never has to access the data pages (only the index).
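For reference, that covering index would look like this (a sketch; the index name is illustrative):
CREATE INDEX rel_score_covering ON rel_score (user_id, score_date, rel_score);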
The following version might also make good use of this index (although I much prefer your version of the query):
select u.user_id,
(select max(rel_score)
from rel_score r2
where r2.user_id = u.user_id and
r2.score_date between '2012-01-01' and '2012-06-30'
) as rel_score
from (select distinct user_id
from rel_score
where score_date between '2012-01-01' and '2012-06-30'
) u
order by rel_score desc;
The idea behind this query is to replace the aggregation with a simple index lookup. Aggregation in MySQL is a slow operation -- it works much better in other databases so such tricks shouldn't be necessary.
I have a MySQL table like this:
CREATE TABLE `dates` (
`id` int UNSIGNED NOT NULL AUTO_INCREMENT,
`object_id` int UNSIGNED NOT NULL,
`date_from` date NOT NULL,
`date_to` date NULL,
`time_from` time NULL,
`time_to` time NULL,
PRIMARY KEY (`id`)
);
which is queried mostly this way:
SELECT object_id FROM `dates`
WHERE NOW() BETWEEN date_from AND date_to
How do I index the table best? Should I create two indexes, one for date_from and one for date_to or is a combined index on both columns better?
For the query:
WHERE NOW() >= date_from
AND NOW() <= date_to
A compound index (date_from, date_to) is useless.
Create both indices: (date_from) and (date_to) and let the SQL optimizer decide each time which one to use. Depending on the values and the selectivity, the optimizer may choose one or the other index. Or none of them. There is no easy way to create an index that will take both conditions into consideration.
(A spatial index could be used to optimize such a condition, if you could translate the dates to latitude and longitude).
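A sketch of those two single-column indexes (index names are illustrative):
CREATE INDEX dates_from ON `dates` (date_from);
CREATE INDEX dates_to ON `dates` (date_to);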
Update
My mistake. An index on (date_from, date_to, object_id) can be, and indeed is, used in some situations for this query. If the selectivity of the date_from condition is high enough, the optimizer chooses this index rather than doing a full scan of the table or using another index. This is because it is a covering index, meaning no data needs to be fetched from the table; only reading from the index data is required.
Minor note (not related to performance, only correctness of the query). Your condition is equivalent to:
WHERE CURRENT_DATE() >= date_from
AND ( CURRENT_DATE() + INTERVAL 1 DAY <= date_to
OR ( CURRENT_DATE() = NOW()
AND CURRENT_DATE() = date_to
)
)
Are you sure you want that or do you want this:
WHERE CURRENT_DATE() >= date_from
AND CURRENT_DATE() <= date_to
The NOW() function returns a DATETIME, while CURRENT_DATE() returns a DATE, without the time part.
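A quick way to see the difference (the values shown are illustrative):
SELECT NOW(), CURRENT_DATE();
-- e.g. '2014-10-29 23:34:11' and '2014-10-29' respectively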
You should create an index covering date_from, date_to and object_id, as explained by ypercube. The order of the fields in the index depends on whether you will have more data in the past or the future. As pointed out by Erwin in response to Sanjay's comment, the date_to field will be more selective if you have more dates in the past, and vice versa. For example (the index name is illustrative):
CREATE INDEX dates_to_from ON `dates` (date_to, date_from, object_id);
How many rows does your query return relative to your table size? If it's more than 10 percent, I would not bother to create an index; in such a case you're quite close to a table scan anyway. If it's well below 10 percent, I would use an index containing
(date_from, date_to, object_id)
so that the query result can be constructed entirely from the information in the index, without the database having to go back to the table data to get the value of object_id.
Depending on the size of your table this can use up a lot of space. If you can spare that, give it a try.
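A sketch of that covering index (the index name is illustrative):
CREATE INDEX dates_covering ON `dates` (date_from, date_to, object_id);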
Create an index on (date_from, date_to), as that single index would be usable for the WHERE criteria.
If you create separate indexes, MySQL will have to use one or the other instead of both.
I have this table:
CREATE TABLE `table1` (
`object` varchar(255) NOT NULL,
`score` decimal(10,3) NOT NULL,
`timestamp` datetime NOT NULL,
KEY `ex` (`object`,`score`,`timestamp`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
with 9.1 million rows and I am running the following query:
SELECT `object`, `timestamp`, AVG(score) as avgs
from `table1`
where timestamp >= '2011-12-13'
AND timestamp <= '2011-12-14'
group by `object`
order by `avgs` ASC limit 100;
The dates come from user input. The query takes 6-10 seconds, depending on the range of dates. The run time seems to increase with the number of rows.
What can I do to improve this?
I have tried:
fiddling with indexes (brought query time down from max 13sec to max 10sec)
moving storage to fast SAN (brought query time down by around 0.1sec, regardless of parameters).
The CPU and memory load on the server doesn't appear to be too high when the query is running.
The reason the fast SAN performs so much better is that your query requires a copy to a temporary table and a filesort for a large result set.
You have five nasty factors:
a range query
group-by
sorting
VARCHAR(255) for object
a wrong index
Break the timestamp down into two fields, date and time.
Build a reference table for object, so that you use an integer object_id (instead of a VARCHAR(255)) to represent each object.
Rebuild the index on
(date, object_id)
Change the query to
where date IN ('2011-12-13', '2011-12-14', ...)
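Putting those suggestions together, a sketch of the restructured schema and query (every name beyond the original object, score, date and time columns is illustrative):
-- reference table: one integer id per distinct object string
CREATE TABLE objects (
    object_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    object VARCHAR(255) NOT NULL,
    PRIMARY KEY (object_id),
    UNIQUE KEY (object)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- fact table: timestamp split into date and time, object replaced by object_id
CREATE TABLE scores (
    object_id INT UNSIGNED NOT NULL,
    score DECIMAL(10,3) NOT NULL,
    `date` DATE NOT NULL,
    `time` TIME NOT NULL,
    KEY date_object (`date`, object_id)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

SELECT object_id, AVG(score) AS avgs
FROM scores
WHERE `date` IN ('2011-12-13', '2011-12-14')
GROUP BY object_id
ORDER BY avgs ASC
LIMIT 100;
-- the (date, object_id) key satisfies the IN filter; the group-by and
-- sort now work on integers instead of 255-character strings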