I would like to know whether there is any way to improve and optimize my query. I am importing schedule data for our staff and need to delete the old data if it already exists for a given agent and day. The reason is that agents may not exist for every day (because they left the company), and we may upload a more recent report than the one we uploaded before (recent schedule changes).
That's why I currently have this query:
DELETE FROM `agents` WHERE
(`id` = 1 AND `date` >= '01.01.2015 00:00:00' AND `date` <= '01.01.2015 23:59:59') OR
(`id` = 2 AND `date` >= '01.01.2015 00:00:00' AND `date` <= '01.01.2015 23:59:59') OR [...]
There is one such condition for each agent and each day in the report. I uploaded one report that produced 5780 day/agent combinations. On my (currently) small table, this query took about 5 minutes to execute.
I am wondering if anyone has an idea how I could improve this thing.
What you want to do is going to be rather difficult. As written, it probably requires a full table scan.
One method would be to add an index on agents(id, date) and to do the deletes separately:
DELETE FROM `agents`
WHERE (`id` = 1 AND `date` >= '2015-01-01' AND `date` < '2015-01-02');
DELETE FROM `agents`
WHERE (`id` = 2 AND `date` >= '2015-01-01' AND `date` < '2015-01-02');
Assuming the dates are all the same, you can write the where clause as:
DELETE FROM `agents`
WHERE `id` IN (1, 2, 3, . . . ) AND
`date` >= '2015-01-01' AND `date` < '2015-01-02';
Depending on the distribution of the data (the number of dates per id in the range), either the above index or one on agents(date, id) would be best.
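The IN-list form above can be generated from the uploaded report by grouping the agent/day pairs by day. The following is only a sketch under stated assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL so that it runs anywhere, the `shift` column and the function name `delete_schedule_rows` are made up for illustration, and a MySQL driver would typically use `%s` placeholders instead of `?`.

```python
import sqlite3
from collections import defaultdict
from datetime import date, timedelta


def delete_schedule_rows(conn, pairs):
    """Delete rows for (agent_id, day) pairs with one DELETE per day.

    pairs: iterable of (int, datetime.date). Returns the number of rows deleted.
    """
    # Group the agent ids by day so each day needs only one statement.
    ids_by_day = defaultdict(set)
    for agent_id, day in pairs:
        ids_by_day[day].add(agent_id)

    deleted = 0
    for day, ids in ids_by_day.items():
        ids = sorted(ids)
        placeholders = ", ".join("?" for _ in ids)
        # Half-open range [day 00:00:00, next day 00:00:00), as in the answer.
        start = day.isoformat() + " 00:00:00"
        end = (day + timedelta(days=1)).isoformat() + " 00:00:00"
        cur = conn.execute(
            f"DELETE FROM agents WHERE id IN ({placeholders}) "
            "AND date >= ? AND date < ?",
            [*ids, start, end],
        )
        deleted += cur.rowcount
    conn.commit()
    return deleted


# Tiny in-memory stand-in for the agents table (the shift column is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (id INTEGER NOT NULL, date TEXT NOT NULL, shift TEXT)")
conn.execute("CREATE INDEX agents_id_date ON agents (id, date)")
conn.executemany(
    "INSERT INTO agents VALUES (?, ?, ?)",
    [
        (1, "2015-01-01 08:00:00", "early"),
        (2, "2015-01-01 23:59:59", "late"),
        (1, "2015-01-02 08:00:00", "early"),
        (3, "2015-01-01 08:00:00", "early"),
    ],
)

deleted = delete_schedule_rows(conn, [(1, date(2015, 1, 1)), (2, date(2015, 1, 1))])
remaining = conn.execute("SELECT id, date FROM agents ORDER BY id, date").fetchall()
print(deleted, remaining)
```

With 5780 combinations spread over a handful of days, this turns one huge OR chain into a few short statements, each of which can use the index. Very long IN lists may need to be split into chunks, depending on the driver and server limits.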
If you had a table like this:
create table t123
( id int not null,
date datetime not null,
myThing varchar(10) not null
);
And you later added an index like this:
ALTER TABLE t123 ADD INDEX (id,date); -- add an index after the fact
Then a delete like yours on table t123 would perform about as fast as I could imagine. However, the index has to be maintained along the way, and that is baggage to consider.
All index changes need to be carefully weighed: the benefit is quicker access, at the expense of slower inserts, updates, and deletes.
Manual pages of Fast Engine Creation and Alter Table
I have a large table containing over 10 million records, and it will keep growing. I am performing an aggregation query (a count of a particular value) on the records of the last 24 hours. The time taken by this query will keep increasing with the number of records in the table.
I can limit the time taken by keeping these 24 hours records in separate table and perform aggregation on that table. Does mysql provide any functionality to handle this kind of scenario?
Table schema and query for reference:
CREATE TABLE purchases (
Id int(11) NOT NULL AUTO_INCREMENT,
ProductId int(11) NOT NULL,
CustomerId int(11) NOT NULL,
PurchaseDateTime datetime(3) NOT NULL,
PRIMARY KEY (Id),
KEY ix_purchases_PurchaseDateTime (PurchaseDateTime) USING BTREE,
KEY ix_purchases_ProductId (ProductId) USING BTREE,
KEY ix_purchases_CustomerId (CustomerId) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
select COALESCE(sum(ProductId = v_ProductId), 0),
COALESCE(sum(CustomerId = v_CustomerId), 0)
into v_ProductCount, v_CustomerCount
from purchases
where PurchaseDateTime > NOW() - INTERVAL 1 DAY
and ( ProductId = v_ProductId
or CustomerId = v_CustomerId );
Build and maintain a separate summary table.
With partitioning, you might get a small improvement, or you might get no improvement. With a summary table, you might get a factor of 10 improvement.
The summary table could have a 1-day resolution, or you might need 1-hour. Please provide SHOW CREATE TABLE for what you currently have, so we can discuss more specifics.
(There is no built-in mechanism for what you want.)
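As a rough illustration of the summary-table idea, the sketch below keeps an hourly counter per product next to the raw purchases table. Everything specific in it is an assumption: the table name `purchases_hourly_product`, its columns, and the helper functions are invented for the example, and SQLite (via Python's sqlite3) is used only so the code is runnable. In MySQL the update-or-insert step could be a single `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (
    Id INTEGER PRIMARY KEY,
    ProductId INTEGER NOT NULL,
    CustomerId INTEGER NOT NULL,
    PurchaseDateTime TEXT NOT NULL
);
-- Hypothetical summary table: one row per (hour, product).
CREATE TABLE purchases_hourly_product (
    HourStart TEXT NOT NULL,
    ProductId INTEGER NOT NULL,
    PurchaseCount INTEGER NOT NULL,
    PRIMARY KEY (HourStart, ProductId)
);
""")


def record_purchase(conn, product_id, customer_id, when):
    """Insert a purchase and bump the matching hourly counter in one transaction."""
    hour_start = when[:13] + ":00:00"  # 'YYYY-MM-DD HH' + ':00:00'
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO purchases (ProductId, CustomerId, PurchaseDateTime) "
            "VALUES (?, ?, ?)",
            (product_id, customer_id, when),
        )
        updated = conn.execute(
            "UPDATE purchases_hourly_product SET PurchaseCount = PurchaseCount + 1 "
            "WHERE HourStart = ? AND ProductId = ?",
            (hour_start, product_id),
        ).rowcount
        if updated == 0:
            conn.execute(
                "INSERT INTO purchases_hourly_product VALUES (?, ?, 1)",
                (hour_start, product_id),
            )


def product_count_since(conn, product_id, since_hour):
    """Sum the hourly counters for one product from since_hour onwards."""
    row = conn.execute(
        "SELECT COALESCE(SUM(PurchaseCount), 0) FROM purchases_hourly_product "
        "WHERE ProductId = ? AND HourStart >= ?",
        (product_id, since_hour),
    ).fetchone()
    return row[0]


record_purchase(conn, 7, 100, "2015-01-01 10:15:00")
record_purchase(conn, 7, 101, "2015-01-01 10:45:00")
record_purchase(conn, 7, 100, "2015-01-01 11:05:00")
record_purchase(conn, 8, 100, "2015-01-01 11:30:00")
record_purchase(conn, 7, 102, "2014-12-30 09:00:00")

count_7 = product_count_since(conn, 7, "2015-01-01 00:00:00")
count_8 = product_count_since(conn, 8, "2015-01-01 00:00:00")
print(count_7, count_8)
```

The trade-off is resolution: with hourly buckets the "last 24 hours" window is rounded to hour boundaries, and the query reads at most a couple of dozen summary rows per product instead of every purchase in the window.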
Plan A
I would leave off
and ( ProductId = v_ProductId
or CustomerId = v_CustomerId )
since the rest of the query will simply deal with it anyway.
Then I would add
INDEX(PurchaseDateTime, ProductId, CustomerId)
which would be "covering" -- that is, the entire SELECT can be performed in the INDEX's BTree. It would also be 'clustered' in the sense that all the data needed would be consecutively stored in the index. Yes, the datetime is deliberately first. (OR is a nuisance to optimize. I don't trust the Optimizer to do "index merge union".)
Plan B
If you expect to touch very few rows (because of v_ProductId and v_CustomerId), then the following may be faster, in spite of being more complex:
SELECT COALESCE(sum(ProductId = v_ProductId), 0)
INTO v_ProductCount
FROM purchases
WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
AND ProductId = v_ProductId;
SELECT COALESCE(sum(CustomerId = v_CustomerId), 0)
INTO v_CustomerCount
FROM purchases
WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
AND CustomerId = v_CustomerId;
together with both:
INDEX(ProductId, PurchaseDateTime),
INDEX(CustomerId, PurchaseDateTime)
Yes, the order of the columns is deliberately different.
Original Question
Both of these approaches are better than your original suggestion of a separate table. These isolate the data in one part of an index (or two indexes), thereby having the effect of "separate". And these do the task with less effort on your part.
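To check that Plan A and Plan B return the same counts, here is a small sketch on made-up data. SQLite (via Python's sqlite3) stands in for MySQL, the index names are illustrative, a fixed `cutoff` string replaces `NOW() - INTERVAL 1 DAY`, and Plan B uses `COUNT(*)`, which gives the same number as the conditional sum once the WHERE clause already filters on the id. It says nothing about which plan is faster on a real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (
    Id INTEGER PRIMARY KEY,
    ProductId INTEGER NOT NULL,
    CustomerId INTEGER NOT NULL,
    PurchaseDateTime TEXT NOT NULL
);
-- Plan A: datetime first, covering both id columns.
CREATE INDEX ix_plan_a ON purchases (PurchaseDateTime, ProductId, CustomerId);
-- Plan B: id first, datetime second, one index per query.
CREATE INDEX ix_plan_b_product ON purchases (ProductId, PurchaseDateTime);
CREATE INDEX ix_plan_b_customer ON purchases (CustomerId, PurchaseDateTime);
""")
conn.executemany(
    "INSERT INTO purchases (ProductId, CustomerId, PurchaseDateTime) VALUES (?, ?, ?)",
    [
        (1, 10, "2015-01-02 09:00:00"),
        (1, 11, "2015-01-02 10:00:00"),
        (2, 10, "2015-01-02 11:00:00"),
        (1, 10, "2014-12-25 09:00:00"),  # older than the cutoff
        (3, 12, "2015-01-02 12:00:00"),
    ],
)

cutoff = "2015-01-01 12:00:00"  # stands in for NOW() - INTERVAL 1 DAY
product_id, customer_id = 1, 10

# Plan A: one pass over the time range, conditional sums, no OR filter.
plan_a = conn.execute(
    "SELECT COALESCE(SUM(ProductId = ?), 0), COALESCE(SUM(CustomerId = ?), 0) "
    "FROM purchases WHERE PurchaseDateTime > ?",
    (product_id, customer_id, cutoff),
).fetchone()

# Plan B: two narrow queries, each filtering on one id plus the time range.
product_count = conn.execute(
    "SELECT COUNT(*) FROM purchases WHERE PurchaseDateTime > ? AND ProductId = ?",
    (cutoff, product_id),
).fetchone()[0]
customer_count = conn.execute(
    "SELECT COUNT(*) FROM purchases WHERE PurchaseDateTime > ? AND CustomerId = ?",
    (cutoff, customer_id),
).fetchone()[0]
plan_b = (product_count, customer_count)

print(plan_a, plan_b)
```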
What is the best thing for my scenario
I have a table with nearly 20,000,000 records, which basically stores what users have done on the site:
id -> primary int 11 auto increment
user_id -> index int 11 not null
create_date -> ( no index yet ) date-time not null
it has other columns but seems irrelevant to name them here
I know I must put an index on create_date, but should it be a single-column index or a two-column index, and which column should come first in the two-column index (given the large number of records)?
By the way, the query that I'm using now looks like this:
select max(id) -- in here I'm selecting actions that users have done, after this date, since date is today
from table t
where
t.create_date >= '2014-12-29 00:00:00'
group by t.user_id
Could you edit your question to include the EXPLAIN output for your SELECT? Meanwhile, you can try this:
Partition the table on your date field, create_date.
Build your index with the most restrictive criterion first. I think that in your case create_date + user_id will work better:
CREATE INDEX index_name
ON table_name ( create_date , user_id );
EDIT: Thank you everyone for your comments. I have tried most of your suggestions but they did not help. I need to add that I am running this query through Matlab using Connector/J 5.1.26 (Sorry for not mentioning this earlier). In the end, I think this is the source of the increase in execution time since when I run the query "directly" it takes 0.2 seconds. However, I have never come across such a huge performance hit using Connector/J. Given this new information, do you have any suggestions? I apologize for not disclosing this earlier, but again, I've never experienced performance impact with Connector/J.
I have the following table in MySQL (CREATE code taken from HeidiSQL):
CREATE TABLE `data` (
`PRIMARY` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`ID` VARCHAR(5) NULL DEFAULT NULL,
`DATE` DATE NULL DEFAULT NULL,
`PRICE` DECIMAL(14,4) NULL DEFAULT NULL,
`QUANT` INT(10) NULL DEFAULT NULL,
`TIME` TIME NULL DEFAULT NULL,
INDEX `DATE` (`DATE`),
INDEX `ID` (`ID`),
INDEX `PRICE` (`PRICE`),
INDEX `QUANT` (`QUANT`),
INDEX `TIME` (`TIME`),
PRIMARY KEY (`PRIMARY`)
)
It is populated with approximately 360,000 rows of data.
The following query takes over 10 seconds to execute:
Select ID, DATE, PRICE, QUANT, TIME FROM database.data WHERE DATE
>= "2007-01-01" AND DATE <= "2010-12-31" ORDER BY ID, DATE, TIME ASC;
I have other tables with millions of rows in which a similar query would take a fraction of a second. I can't figure out what might be causing this one to be so slow. Any ideas/tips?
EXPLAIN:
id = 1
select_type = SIMPLE
table = data
type = ALL
possible_keys = DATE
key = (NULL)
key_len = (NULL)
ref = (NULL)
rows = 361161
Extra = Using where; Using filesort
You are asking for a wide range of data. The time is probably being spent sorting the results.
Is a query on a smaller date range faster? For instance,
WHERE DATE >= '2007-01-01' AND DATE < '2007-02-01'
One possibility is that the optimizer may be using the index on id for the sort and doing a full table scan to filter out the date range. Using indexes for sorts is often suboptimal. You might try the query as:
select t.*
from (Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
) t
ORDER BY ID, DATE, TIME ASC;
I think this will force the optimizer to use the date index for the selection and then sort using file sort -- but there is the cost of a derived table. If you do not have a large result set, this might significantly improve performance.
I assume you already tried to OPTIMIZE TABLE and got no results.
You can either try to use a covering index (at the expense of more disk space, and a slight slowing down on UPDATEs) by replacing the existing date index with
CREATE INDEX data_date_ndx ON data (DATE, TIME, PRICE, QUANT, ID);
and/or you can try and create an empty table data2 with the same schema. Then just SELECT all the contents of data table into data2 and run the same query against the new table. It could be that the data table needed to be compacted more than OPTIMIZE could - maybe at the filesystem level.
Also, check out the output of EXPLAIN SELECT... for that query.
I'm not familiar with MySQL, only with MS SQL, so this is just a suggestion:
What about creating an index that fully covers all the fields in your SELECT query?
Yes, it will duplicate data, but then we can move on to the next point of the discussion.
I have a table with the following structure
CREATE TABLE rel_score (
user_id bigint(20) NOT NULL DEFAULT '0',
score_date date NOT NULL,
rel_score decimal(4,2) DEFAULT NULL,
doc_count int(8) NOT NULL,
total_doc_count int(8) NOT NULL,
PRIMARY KEY (user_id,score_date),
KEY SCORE_DT_IDX (score_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 PACK_KEYS=1
The table will store a rel_score value for every user in the application for every day from 1st Jan 2000 to date. I estimate the total number of records will be over 700 million. I populated the table with 6 months of data (~30 million rows), and the query response time is about 8 minutes. Here is my query:
select
user_id, max(rel_score) as max_rel_score
from
rel_score
where score_date between '2012-01-01' and '2012-06-30'
group by user_id
order by max_rel_score desc;
I tried optimizing the query using the following techniques,
Partitioning on the score_date column
Adding an index on the score_date column
The query response time improved marginally to a little less than 8 mins.
How can I improve the response time? Is the design of the table appropriate?
Also, I cannot move the old data to an archive, as a user is allowed to query the entire date range.
If you partition your table at the same granularity as score_date, you will not reduce the query response time.
Try creating another attribute that contains only the year of the date, cast to an INTEGER, partition your table on this attribute (you will get 13 partitions), and re-execute your query to see whether it helps.
Your primary index should do a good job of covering the table. If you didn't have it, I would suggest building an index on rel_score(user_id, score_date, rel_score). For your query, this is a "covering" index, meaning that the index has all the columns in the query, so the engine never has to access the data pages (only the index).
The following version might also make good use of this index (although I much prefer your version of the query):
select u.user_id,
(select max(rel_score)
from rel_score r2
where r2.user_id = u.user_id and
r2.score_date between '2012-01-01' and '2012-06-30'
) as rel_score
from (select distinct user_id
from rel_score
where score_date between '2012-01-01' and '2012-06-30'
) u
order by rel_score desc;
The idea behind this query is to replace the aggregation with a simple index lookup. Aggregation in MySQL is a slow operation -- it works much better in other databases so such tricks shouldn't be necessary.
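The two query shapes can be compared on a toy data set to confirm that they return the same rows. This is only a sketch: SQLite (via Python's sqlite3) stands in for MySQL, the table is reduced to the three columns the query touches, the sample values are invented, and the result column is named `max_rel_score` in both queries to keep the ORDER BY unambiguous. It checks equivalence of results, not relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE rel_score (
    user_id INTEGER NOT NULL,
    score_date TEXT NOT NULL,
    rel_score REAL,
    PRIMARY KEY (user_id, score_date)
)
""")
conn.executemany(
    "INSERT INTO rel_score VALUES (?, ?, ?)",
    [
        (1, "2012-01-15", 0.5),
        (1, "2012-03-01", 0.9),
        (1, "2012-07-01", 0.99),  # outside the date range
        (2, "2012-02-01", 0.7),
        (3, "2011-12-31", 0.8),   # outside the date range
    ],
)

# Original form: aggregate with GROUP BY.
grouped = conn.execute("""
    SELECT user_id, MAX(rel_score) AS max_rel_score
    FROM rel_score
    WHERE score_date BETWEEN '2012-01-01' AND '2012-06-30'
    GROUP BY user_id
    ORDER BY max_rel_score DESC
""").fetchall()

# Alternative form: distinct users, then a correlated lookup per user.
correlated = conn.execute("""
    SELECT u.user_id,
           (SELECT MAX(r2.rel_score)
            FROM rel_score r2
            WHERE r2.user_id = u.user_id
              AND r2.score_date BETWEEN '2012-01-01' AND '2012-06-30'
           ) AS max_rel_score
    FROM (SELECT DISTINCT user_id
          FROM rel_score
          WHERE score_date BETWEEN '2012-01-01' AND '2012-06-30'
         ) u
    ORDER BY max_rel_score DESC
""").fetchall()

print(grouped)
print(correlated)
```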
I have a MySQL table like this:
CREATE TABLE `dates` (
`id` int UNSIGNED NOT NULL AUTO_INCREMENT ,
`object_id` int UNSIGNED NOT NULL ,
`date_from` date NOT NULL ,
`date_to` date NULL ,
`time_from` time NULL ,
`time_to` time NULL ,
PRIMARY KEY (`id`)
);
which is queried mostly this way:
SELECT object_id FROM `dates`
WHERE NOW() BETWEEN date_from AND date_to
How do I index the table best? Should I create two indexes, one for date_from and one for date_to or is a combined index on both columns better?
For the query:
WHERE NOW() >= date_from
AND NOW() <= date_to
A compound index (date_from, date_to) is useless.
Create both indices: (date_from) and (date_to) and let the SQL optimizer decide each time which one to use. Depending on the values and the selectivity, the optimizer may choose one or the other index. Or none of them. There is no easy way to create an index that will take both conditions into consideration.
(A spatial index could be used to optimize such a condition, if you could translate the dates to latitude and longitude).
Update
My mistake. An index on (date_from, date_to, object_id) can be, and indeed is, used in some situations for this query. If the selectivity of the NOW() >= date_from condition is high enough, the optimizer chooses this index rather than doing a full table scan or using another index. This is because it is a covering index: no data needs to be fetched from the table, only the index is read.
Minor note (not related to performance, only correctness of the query). Your condition is equivalent to:
WHERE CURRENT_DATE() >= date_from
AND ( CURRENT_DATE() + INTERVAL 1 DAY <= date_to
OR ( CURRENT_DATE() = NOW()
AND CURRENT_DATE() = date_to
)
)
Are you sure you want that or do you want this:
WHERE CURRENT_DATE() >= date_from
AND CURRENT_DATE() <= date_to
The NOW() function returns a DATETIME, while CURRENT_DATE() returns a DATE, without the time part.
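The practical effect of that difference is that the last day of a range is excluded for almost the whole day when NOW() is used. The sketch below mimics the two comparisons in plain Python under the assumption, consistent with the equivalence shown above, that a DATE compared with a DATETIME is treated as midnight of that day. The function names and sample values are invented for illustration.

```python
from datetime import date, datetime, time


def between_like_now(now, date_from, date_to):
    """Mimic NOW() BETWEEN date_from AND date_to.

    The DATE bounds are promoted to datetimes at 00:00:00 before comparing.
    """
    lower = datetime.combine(date_from, time.min)
    upper = datetime.combine(date_to, time.min)
    return lower <= now <= upper


def between_like_current_date(now, date_from, date_to):
    """Mimic CURRENT_DATE() BETWEEN date_from AND date_to (dates only)."""
    return date_from <= now.date() <= date_to


# A fixed "now" in the afternoon of the last day of the range.
now = datetime(2015, 1, 10, 14, 30)
date_from, date_to = date(2015, 1, 1), date(2015, 1, 10)

with_now = between_like_now(now, date_from, date_to)
with_current_date = between_like_current_date(now, date_from, date_to)
print(with_now, with_current_date)
```

On the last day of the range the datetime comparison fails as soon as the clock passes midnight, while the date-only comparison still matches.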
You should create an index covering date_from, date_to and object_id, as explained by ypercube. The order of the fields in the index depends on whether you will have more data for the past or for the future. As pointed out by Erwin in response to Sanjay's comment, the date_to field will be more selective if you have more dates in the past, and vice versa.
CREATE INDEX dates_range_idx ON dates (date_to, date_from, object_id); -- index name is illustrative
How many rows, relative to your table size, does your query return? If it is more than 10 percent, I would not bother to create an index; in that case you are quite close to a table scan anyway. If it is well below 10 percent, I would use an index containing (date_from, date_to, object_id), so that the query result can be constructed entirely from the information in the index, without the database having to go back to the table data to get the value for object_id.
Depending on the size of your table, this can use up a lot of space. If you can spare that, give it a try.
Create an index on (date_from, date_to), as that single index would be usable for the WHERE criteria.
If you create separate indexes, MySQL will have to use one or the other instead of both.