I have a MySQL table like this:
CREATE TABLE `dates` (
`id` int UNSIGNED NOT NULL AUTO_INCREMENT ,
`object_id` int UNSIGNED NOT NULL ,
`date_from` date NOT NULL ,
`date_to` date NULL ,
`time_from` time NULL ,
`time_to` time NULL ,
PRIMARY KEY (`id`)
);
which is queried mostly this way:
SELECT object_id FROM `dates`
WHERE NOW() BETWEEN date_from AND date_to
How do I index the table best? Should I create two indexes, one for date_from and one for date_to or is a combined index on both columns better?
For the query:
WHERE NOW() >= date_from
AND NOW() <= date_to
A compound index (date_from, date_to) is useless.
Create both indices: (date_from) and (date_to) and let the SQL optimizer decide each time which one to use. Depending on the values and the selectivity, the optimizer may choose one or the other index. Or none of them. There is no easy way to create an index that will take both conditions into consideration.
(A spatial index could be used to optimize such a condition, if you could translate the dates to latitude and longitude).
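A minimal sketch of the two single-column indexes suggested above, followed by an EXPLAIN to see which one, if any, the optimizer picks. The index names are illustrative and not part of the original question.

```sql
-- Index names are illustrative assumptions.
CREATE INDEX idx_dates_date_from ON dates (date_from);
CREATE INDEX idx_dates_date_to   ON dates (date_to);

-- The "key" column of the EXPLAIN output names the chosen index;
-- NULL there means the optimizer preferred a full table scan.
EXPLAIN
SELECT object_id
FROM dates
WHERE NOW() >= date_from
  AND NOW() <= date_to;
```

Which index is chosen can change as the data grows, so it is worth re-running the EXPLAIN against realistic data.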
Update
My mistake. An index on (date_from, date_to, object_id) can be, and indeed is, used in some situations for this query. If the selectivity of the condition NOW() >= date_from is high enough, the optimizer chooses this index rather than doing a full table scan or using another index. This is because it is a covering index: no rows need to be fetched from the table, since everything the query needs can be read from the index.
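As a sketch, the covering index described here could be created and checked like this (the index name is an assumption):

```sql
-- Illustrative index name; the column order follows the text above.
CREATE INDEX idx_dates_from_to_object ON dates (date_from, date_to, object_id);

EXPLAIN
SELECT object_id
FROM dates
WHERE NOW() >= date_from
  AND NOW() <= date_to;
```

When the optimizer reads only from the index, the Extra column of the EXPLAIN output contains "Using index".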
Minor note (not related to performance, only correctness of the query). Your condition is equivalent to:
WHERE CURRENT_DATE() >= date_from
AND ( CURRENT_DATE() + INTERVAL 1 DAY <= date_to
OR ( CURRENT_DATE() = NOW()
AND CURRENT_DATE() = date_to
)
)
Are you sure you want that or do you want this:
WHERE CURRENT_DATE() >= date_from
AND CURRENT_DATE() <= date_to
The NOW() function returns a DATETIME, while CURRENT_DATE() returns a DATE, without the time part.
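The difference matters because a DATE compared with a DATETIME is treated as midnight of that day. A small illustration with fixed literal values chosen only for the example:

```sql
SELECT
  -- DATETIME vs DATE: the DATE side becomes '2015-01-01 00:00:00',
  -- so 10:30 on the same day is NOT <= the end date.
  CAST('2015-01-01 10:30:00' AS DATETIME) <= CAST('2015-01-01' AS DATE) AS datetime_le_date,  -- 0
  -- DATE vs DATE: the same day compares as equal, so the row still matches.
  CAST('2015-01-01' AS DATE) <= CAST('2015-01-01' AS DATE) AS date_le_date;                   -- 1
```

In other words, with NOW() a row whose date_to is today stops matching as soon as midnight has passed, while with CURRENT_DATE() it matches for the whole day.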
You should create an index covering date_from, date_to and object_id, as explained by ypercube. The order of the fields in the index depends on whether you will have more data for the past or for the future. As pointed out by Erwin in response to Sanjay's comment, the date_to field will be more selective if you have more dates in the past, and vice versa.
CREATE INDEX idx_dates_to_from_object ON dates (date_to, date_from, object_id); -- index name is arbitrary
How many rows does your query return relative to the size of the table? If it's more than 10 percent, I would not bother creating an index; in that case you are quite close to a table scan anyway. If it's well below 10 percent, I would use an index containing
(date_from, date_to, object_id) so that the query result can be constructed entirely from the index, without the database having to go back to the table data to get the value of object_id.
Depending on the size of your table, this can use up a lot of space. If you can spare that, give it a try.
Create an index with (date_from, date_to), as that single index would be usable for the WHERE criteria.
If you create separate indexes, MySQL will have to use one or the other instead of both.
Related
I have a large table containing over 10 million records, and it will keep growing. I am running an aggregation query (a count of a particular value) over the records of the last 24 hours. The time taken by this query will keep increasing with the number of records in the table.
I could limit the time taken by keeping the last 24 hours of records in a separate table and aggregating on that table. Does MySQL provide any functionality to handle this kind of scenario?
Table schema and query for reference:
CREATE TABLE purchases (
Id int(11) NOT NULL AUTO_INCREMENT,
ProductId int(11) NOT NULL,
CustomerId int(11) NOT NULL,
PurchaseDateTime datetime(3) NOT NULL,
PRIMARY KEY (Id),
KEY ix_purchases_PurchaseDateTime (PurchaseDateTime) USING BTREE,
KEY ix_purchases_ProductId (ProductId) USING BTREE,
KEY ix_purchases_CustomerId (CustomerId) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
select COALESCE(sum(ProductId = v_ProductId), 0),
COALESCE(sum(CustomerId = v_CustomerId), 0)
into v_ProductCount, v_CustomerCount
from purchases
where PurchaseDateTime > NOW() - INTERVAL 1 DAY
and ( ProductId = v_ProductId
or CustomerId = v_CustomerId );
Build and maintain a separate summary table.
With partitioning, you might get a small improvement, or you might get no improvement. With a summary table, you might get a factor of 10 improvement.
The summary table could have a 1-day resolution, or you might need 1-hour. Please provide SHOW CREATE TABLE for what you currently have, so we can discuss more specifics.
(There is no built-in mechanism for what you want.)
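A rough sketch of what an hourly summary table could look like for the per-product count. The table name, column names, and refresh window are assumptions made for illustration; a second table keyed on CustomerId would follow the same pattern.

```sql
-- Hypothetical summary table: one row per product per hour.
CREATE TABLE purchases_hourly_product (
  HourStart     datetime     NOT NULL,
  ProductId     int          NOT NULL,
  PurchaseCount int UNSIGNED NOT NULL,
  PRIMARY KEY (ProductId, HourStart)
) ENGINE=InnoDB;

-- Periodic refresh (for example from a scheduled job): recount whole hourly
-- buckets starting at an hour boundary and overwrite the stored counts.
INSERT INTO purchases_hourly_product (HourStart, ProductId, PurchaseCount)
SELECT DATE_FORMAT(PurchaseDateTime, '%Y-%m-%d %H:00:00') AS Bucket,
       ProductId,
       COUNT(*)
FROM purchases
WHERE PurchaseDateTime >= DATE_FORMAT(NOW() - INTERVAL 2 HOUR, '%Y-%m-%d %H:00:00')
GROUP BY Bucket, ProductId
ON DUPLICATE KEY UPDATE PurchaseCount = VALUES(PurchaseCount);

-- Reading the count then touches at most about 24 rows per product.
SELECT COALESCE(SUM(PurchaseCount), 0)
FROM purchases_hourly_product
WHERE ProductId = 42                       -- example value
  AND HourStart > NOW() - INTERVAL 1 DAY;
```

With hourly buckets the result only approximates "the last 24 hours", because the oldest bucket is either counted in full or skipped. Whether that is acceptable depends on the requirements.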
Plan A
I would leave off
and ( ProductId = v_ProductId
or CustomerId = v_CustomerId )
since the rest of the query will simply deal with it anyway.
Then I would add
INDEX(PurchaseDateTime, ProductId, CustomerId)
which would be "covering" -- that is, the entire SELECT can be performed in the INDEX's BTree. It would also be 'clustered' in the sense that all the data needed would be consecutively stored in the index. Yes, the datetime is deliberately first. (OR is a nuisance to optimize. I don't trust the Optimizer to do "index merge union".)
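Put together, Plan A could look like the following sketch. The index name is an assumption, and the SELECT ... INTO with the v_ variables is assumed to run inside the same stored routine as the original query.

```sql
-- Illustrative index name; the datetime column is deliberately first.
ALTER TABLE purchases
  ADD INDEX ix_purchases_dt_product_customer (PurchaseDateTime, ProductId, CustomerId);

-- The original query without the OR condition: rows that match neither
-- variable simply contribute 0 to both sums.
SELECT COALESCE(SUM(ProductId = v_ProductId), 0),
       COALESCE(SUM(CustomerId = v_CustomerId), 0)
  INTO v_ProductCount, v_CustomerCount
  FROM purchases
 WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY;
```

The new index starts with PurchaseDateTime, so the existing single-column index ix_purchases_PurchaseDateTime would probably become redundant.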
Plan B
If you expect to touch very few rows (because of v_ProductId and v_CustomerId), then the following may be faster, in spite of being more complex:
SELECT COALESCE(sum(ProductId = v_ProductId), 0)
INTO v_ProductCount
FROM purchases
WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
AND ProductId = v_ProductId;
SELECT COALESCE(sum(CustomerId = v_CustomerId), 0)
INTO v_CustomerCount
FROM purchases
WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
AND CustomerId = v_CustomerId;
together with both:
INDEX(ProductId, PurchaseDateTime),
INDEX(CustomerId, PurchaseDateTime)
Yes, the order of the columns is deliberately different.
Original Question
Both of these approaches are better than your original suggestion of a separate table. These isolate the data in one part of an index (or two indexes), thereby having the effect of "separate". And these do the task with less effort on your part.
I would like to know if there is any way to improve and optimize my query. I am importing schedule data for our staff and need to delete the old data if it already exists for a given agent and day. The reason is that agents may not exist for every day (because they left the company), and we may upload a more recent report than a previous one (recent schedule changes).
That's why I currently have this query:
DELETE FROM `agents` WHERE
(`id` = 1 AND `date` >= '01.01.2015 00:00:00' AND `date` <= '01.01.2015 23:59:59') OR
(`id` = 2 AND `date` >= '01.01.2015 00:00:00' AND `date` <= '01.01.2015 23:59:59') OR [...]
This combination is repeated for each agent and each day in the report. One report I uploaded produced 5780 day/agent combinations. On my (currently) small table, this query took about 5 minutes to execute.
I am wondering if anyone has an idea how I could improve this.
What you want to do is going to be rather difficult. As written, it probably requires a full table scan.
One method would be to add an index on agents(id, date) and to do the deletes separately:
DELETE FROM `agents`
WHERE (`id` = 1 AND `date` >= '2015-01-01' AND `date` < '2015-01-02');
DELETE FROM `agents`
WHERE (`id` = 2 AND `date` >= '2015-01-01' AND `date` < '2015-01-02');
Assuming the dates are all the same, you can write the where clause as:
DELETE FROM `agents`
WHERE `id` IN (1, 2, 3, . . . ) AND
`date` >= '2015-01-01' AND `date` < '2015-01-02';
Depending on the distribution of the data (the number of dates per id in the range), either the above index or one on agents(date, id) would be best.
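A sketch of the two candidate indexes and a way to compare them. The index names and the id list are illustrative; EXPLAIN on a DELETE statement is available from MySQL 5.6 onward.

```sql
-- Candidate 1: equality column first, then the range column.
ALTER TABLE `agents` ADD INDEX ix_agents_id_date (`id`, `date`);

-- Candidate 2: date first, which may work better when few rows fall in the
-- date range and the id list is long.
ALTER TABLE `agents` ADD INDEX ix_agents_date_id (`date`, `id`);

-- Shows which index the optimizer would use, without deleting anything.
EXPLAIN
DELETE FROM `agents`
WHERE `id` IN (1, 2, 3)
  AND `date` >= '2015-01-01' AND `date` < '2015-01-02';
```

After comparing the plans on realistic data, the index that is not used can be dropped to avoid the extra write overhead.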
If you had a table like this:
create table t123
( id int not null,
date datetime not null,
myThing varchar(10) not null
);
And you later added an index like this:
ALTER TABLE t123 ADD INDEX (id,date); -- add an index after the fact
Then a DELETE like yours on table t123 would perform about as fast as I could imagine. However, the index has to be maintained along the way, and that is overhead to consider.
All index changes need to be carefully weighed: the benefit is quicker access, at the expense of slowing down inserts, updates, and deletes.
See the MySQL manual pages on Fast Index Creation and ALTER TABLE.
What is the best approach for my scenario?
I have a table with nearly 20,000,000 records, which basically stores what users have done on the site:
id -> primary int 11 auto increment
user_id -> index int 11 not null
create_date -> ( no index yet ) date-time not null
it has other columns but seems irrelevant to name them here
I know I must put an index on create_date, but should it be a single-column index or a two-column index, and which column should come first in the two-column index (given the large number of records)?
By the way, the query I'm using now looks like this:
select max(id) -- in here I'm selecting actions that users have done, after this date, since date is today
from table t
where
t.create_date >= '2014-12-29 00:00:00'
group by t.user_id
Could you edit your question to include the EXPLAIN output of your SELECT? Meanwhile, you can try the following:
Partition the table on your date field create_date.
Build your index with the most restrictive criterion first. I think that in your case an index on (create_date, user_id) will work better:
CREATE INDEX index_name
ON table_name ( create_date , user_id );
I want to:
select
max_date = max( dates)
from some_table t
where dates is datetime in form of
2014-10-29 23:34:11
and is primary key, so is indexed.
What is the retrieval complexity for big databases?
Since your date column is primary key it will be unique and indexed. So, it should be fine.
Per MySQL documentation, if you use WHERE clause along with the MAX() function then the query will be optimized and will be faster.
In your case, since you are just trying to get the maximum date, you can also use ORDER BY with LIMIT as below, which will take advantage of the index on the dates column and will be fast:
select `dates`
from some_table
order by `dates` desc
limit 1;
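One way to check how MySQL resolves the plain MAX() form is to run EXPLAIN on it. This is a sketch that uses the table and column names from the question:

```sql
-- With an index (here the primary key) on `dates` and no WHERE clause,
-- MySQL can usually answer this from one end of the index.
EXPLAIN
SELECT MAX(`dates`) FROM some_table;
```

In that case the Extra column typically reports "Select tables optimized away", meaning the value was read directly from the index without scanning rows, so the cost should stay roughly constant as the table grows.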
EDIT: Thank you everyone for your comments. I have tried most of your suggestions but they did not help. I need to add that I am running this query through Matlab using Connector/J 5.1.26 (Sorry for not mentioning this earlier). In the end, I think this is the source of the increase in execution time since when I run the query "directly" it takes 0.2 seconds. However, I have never come across such a huge performance hit using Connector/J. Given this new information, do you have any suggestions? I apologize for not disclosing this earlier, but again, I've never experienced performance impact with Connector/J.
I have the following table in MySQL (CREATE code taken from HeidiSQL):
CREATE TABLE `data` (
`PRIMARY` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`ID` VARCHAR(5) NULL DEFAULT NULL,
`DATE` DATE NULL DEFAULT NULL,
`PRICE` DECIMAL(14,4) NULL DEFAULT NULL,
`QUANT` INT(10) NULL DEFAULT NULL,
`TIME` TIME NULL DEFAULT NULL,
INDEX `DATE` (`DATE`),
INDEX `ID` (`ID`),
INDEX `PRICE` (`PRICE`),
INDEX `QUANT` (`QUANT`),
INDEX `TIME` (`TIME`),
PRIMARY KEY (`PRIMARY`)
)
It is populated with approximately 360,000 rows of data.
The following query takes over 10 seconds to execute:
Select ID, DATE, PRICE, QUANT, TIME FROM database.data WHERE DATE
>= "2007-01-01" AND DATE <= "2010-12-31" ORDER BY ID, DATE, TIME ASC;
I have other tables with millions of rows in which a similar query would take a fraction of a second. I can't figure out what might be causing this one to be so slow. Any ideas/tips?
EXPLAIN:
id = 1
select_type = SIMPLE
table = data
type = ALL
possible_keys = DATE
key = (NULL)
key_len = (NULL)
ref = (NULL)
rows = 361161
Extra = Using where; Using filesort
You are asking for a wide range of data. The time is probably being spent sorting the results.
Is a query on a smaller date range faster? For instance,
WHERE DATE >= '2007-01-01' AND DATE < '2007-02-01'
One possibility is that the optimizer may be using the index on id for the sort and doing a full table scan to filter out the date range. Using indexes for sorts is often suboptimal. You might try the query as:
select t.*
from (Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
) t
ORDER BY ID, DATE, TIME ASC;
I think this will force the optimizer to use the date index for the selection and then sort using file sort -- but there is the cost of a derived table. If you do not have a large result set, this might significantly improve performance.
I assume you already tried to OPTIMIZE TABLE and got no results.
You can either try to use a covering index (at the expense of more disk space, and a slight slowing down on UPDATEs) by replacing the existing date index with
CREATE INDEX data_date_ndx ON data (DATE, TIME, PRICE, QUANT, ID);
and/or you can try and create an empty table data2 with the same schema. Then just SELECT all the contents of data table into data2 and run the same query against the new table. It could be that the data table needed to be compacted more than OPTIMIZE could - maybe at the filesystem level.
Also, check out the output of EXPLAIN SELECT... for that query.
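For reference, the EXPLAIN for the query in question would be run as below. The database qualifier is dropped and the identifiers are quoted here only for readability.

```sql
EXPLAIN
SELECT `ID`, `DATE`, `PRICE`, `QUANT`, `TIME`
FROM `data`
WHERE `DATE` >= '2007-01-01' AND `DATE` <= '2010-12-31'
ORDER BY `ID`, `DATE`, `TIME`;
```

Things worth comparing before and after adding the covering index: whether type changes from ALL to range, whether key names the new index, and whether Extra shows "Using index". "Using filesort" is likely to remain either way, because the ORDER BY starts with ID while the index starts with DATE.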
I'm more familiar with MSSQL than MySQL, so this is only a suggestion:
what about creating an index that fully covers all the columns in your SELECT query?
Yes, it will duplicate data, but then we can move on to the next point of the discussion.