I have a table with the following structure
CREATE TABLE rel_score (
user_id bigint(20) NOT NULL DEFAULT '0',
score_date date NOT NULL,
rel_score decimal(4,2) DEFAULT NULL,
doc_count int(8) NOT NULL,
total_doc_count int(8) NOT NULL,
PRIMARY KEY (user_id,score_date),
KEY SCORE_DT_IDX (score_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 PACK_KEYS=1
The table will store a rel_score value for every user in the application for every day from 1st Jan 2000 to date. I estimate the total number of records will be over 700 million. I populated the table with 6 months of data (~ 30 million rows) and the query response time is about 8 minutes. Here is my query:
select
user_id, max(rel_score) as max_rel_score
from
rel_score
where score_date between '2012-01-01' and '2012-06-30'
group by user_id
order by max_rel_score desc;
I tried optimizing the query using the following techniques,
Partitioning on the score_date column
Adding an index on the score_date column
The query response time improved marginally to a little less than 8 mins.
How can I improve response time? Is the design of the table appropriate?
Also, I cannot move the old data to an archive, as a user is allowed to query the entire data range.
If you partition the table at the same granularity as score_date, you will not reduce the query response time.
Try creating another column that holds only the year of the date as an INTEGER, partition the table on that column (you will get 13 partitions), and rerun your query to see whether it helps.
Your primary index should do a good job of covering the table. If you didn't have it, I would suggest building an index on rel_score(user_id, score_date, rel_score). For your query, this is a "covering" index, meaning that the index has all the columns in the query, so the engine never has to access the data pages (only the index).
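Whether an index really covers the query can be checked with EXPLAIN. The following is a minimal sketch, not a tested recommendation: the index name idx_user_date_score is made up for illustration, and the ALTER TABLE is only needed if you decide to add the index at all.

```sql
-- Hypothetical index name; the column list is the one suggested above.
ALTER TABLE rel_score
  ADD INDEX idx_user_date_score (user_id, score_date, rel_score);

-- If the Extra column shows "Using index", the query is answered
-- from the index alone, without touching the data pages.
EXPLAIN
SELECT user_id, MAX(rel_score) AS max_rel_score
FROM rel_score
WHERE score_date BETWEEN '2012-01-01' AND '2012-06-30'
GROUP BY user_id
ORDER BY max_rel_score DESC;
```

The key column of the EXPLAIN output shows which index the optimizer actually picked.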
The following version might also make good use of this index (although I much prefer your version of the query):
select u.user_id,
(select max(rel_score)
from rel_score r2
where r2.user_id = u.user_id and
r2.score_date between '2012-01-01' and '2012-06-30'
) as rel_score
from (select distinct user_id
from rel_score
where score_date between '2012-01-01' and '2012-06-30'
) u
order by rel_score desc;
The idea behind this query is to replace the aggregation with a simple index lookup. Aggregation in MySQL is a slow operation -- it works much better in other databases so such tricks shouldn't be necessary.
I have a large table containing over 10 million records and it will keep growing. I am performing an aggregation query (count of a particular value) on the records of the last 24 hours. The time taken by this query will keep increasing with the number of records in the table.
I can limit the time taken by keeping the last 24 hours of records in a separate table and performing the aggregation on that table. Does MySQL provide any functionality to handle this kind of scenario?
Table schema and query for reference:
CREATE TABLE purchases (
Id int(11) NOT NULL AUTO_INCREMENT,
ProductId int(11) NOT NULL,
CustomerId int(11) NOT NULL,
PurchaseDateTime datetime(3) NOT NULL,
PRIMARY KEY (Id),
KEY ix_purchases_PurchaseDateTime (PurchaseDateTime) USING BTREE,
KEY ix_purchases_ProductId (ProductId) USING BTREE,
KEY ix_purchases_CustomerId (CustomerId) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
select COALESCE(sum(ProductId = v_ProductId), 0),
COALESCE(sum(CustomerId = v_CustomerId), 0)
into v_ProductCount, v_CustomerCount
from purchases
where PurchaseDateTime > NOW() - INTERVAL 1 DAY
and ( ProductId = v_ProductId
or CustomerId = v_CustomerId );
Build and maintain a separate summary table.
With partitioning, you might get a small improvement, or you might get no improvement. With a summary table, you might get a factor of 10 improvement.
The summary table could have a 1-day resolution, or you might need 1-hour. Please provide SHOW CREATE TABLE for what you currently have, so we can discuss more specifics.
(There is no built-in mechanism for what you want.)
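As a rough illustration of the summary-table idea, here is a sketch with 1-hour resolution for the product counts. The table name purchases_by_product_hour, its column names, and the literal 42 are assumptions made for the example; a second table keyed on CustomerId would follow the same pattern. Note that with hourly buckets the "last 24 hours" boundary is only accurate to the hour.

```sql
-- Hypothetical summary table: one row per product per hour.
CREATE TABLE purchases_by_product_hour (
  hr             DATETIME     NOT NULL,  -- start of the hour
  ProductId      INT          NOT NULL,
  purchase_count INT UNSIGNED NOT NULL,
  PRIMARY KEY (ProductId, hr)
) ENGINE=InnoDB;

-- Run once an hour (cron job or scheduled EVENT) to fold in the previous full hour.
INSERT INTO purchases_by_product_hour (hr, ProductId, purchase_count)
SELECT DATE_FORMAT(PurchaseDateTime, '%Y-%m-%d %H:00:00') AS hr,
       ProductId,
       COUNT(*)
FROM purchases
WHERE PurchaseDateTime >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND PurchaseDateTime <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY hr, ProductId
ON DUPLICATE KEY UPDATE purchase_count = VALUES(purchase_count);

-- The count then reads at most 24 small rows per product.
SELECT COALESCE(SUM(purchase_count), 0)
FROM purchases_by_product_hour
WHERE ProductId = 42              -- stands in for v_ProductId
  AND hr > NOW() - INTERVAL 1 DAY;
```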
Plan A
I would leave off
and ( ProductId = v_ProductId
or CustomerId = v_CustomerId )
since the rest of the query will simply deal with it anyway.
Then I would add
INDEX(PurchaseDateTime, ProductId, CustomerId)
which would be "covering" -- that is, the entire SELECT can be performed in the INDEX's BTree. It would also be 'clustered' in the sense that all the data needed would be consecutively stored in the index. Yes, the datetime is deliberately first. (OR is a nuisance to optimize. I don't trust the Optimizer to do "index merge union".)
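Put together, Plan A might look like the sketch below. The index name ix_purchases_dt_product_customer is invented for the example, and the SELECT is assumed to sit inside the same stored routine as the original, where the v_ variables are declared.

```sql
-- Hypothetical index name; column order as described above (datetime first).
ALTER TABLE purchases
  ADD INDEX ix_purchases_dt_product_customer
      (PurchaseDateTime, ProductId, CustomerId);

-- Same query as in the question, minus the OR clause.
-- Rows that match neither id simply contribute 0 to both sums.
SELECT COALESCE(SUM(ProductId = v_ProductId), 0),
       COALESCE(SUM(CustomerId = v_CustomerId), 0)
  INTO v_ProductCount, v_CustomerCount
  FROM purchases
 WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY;
```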
Plan B
If you expect to touch very few rows (because of v_ProductId and v_CustomerId), then the following may be faster, in spite of being more complex:
SELECT COALESCE(sum(ProductId = v_ProductId), 0)
INTO v_ProductCount
FROM purchases
WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
AND ProductId = v_ProductId;
SELECT COALESCE(sum(CustomerId = v_CustomerId), 0)
INTO v_CustomerCount
FROM purchases
WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
AND CustomerId = v_CustomerId;
together with both:
INDEX(ProductId, PurchaseDateTime),
INDEX(CustomerId, PurchaseDateTime)
Yes, the order of the columns is deliberately different.
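For Plan B, the two indexes could be added as sketched here. The index names are made up, and dropping the old single-column indexes is an optional step that the answer above does not ask for; it is included only because a composite index starting with the same column can serve the same lookups.

```sql
-- Hypothetical index names; the equality column comes first, the range column second.
ALTER TABLE purchases
  ADD INDEX ix_purchases_product_dt  (ProductId,  PurchaseDateTime),
  ADD INDEX ix_purchases_customer_dt (CustomerId, PurchaseDateTime);

-- Optional: the single-column indexes from the question are now redundant prefixes.
ALTER TABLE purchases
  DROP INDEX ix_purchases_ProductId,
  DROP INDEX ix_purchases_CustomerId;
```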
Original Question
Both of these approaches are better than your original suggestion of a separate table. These isolate the data in one part of an index (or two indexes), thereby having the effect of "separate". And these do the task with less effort on your part.
I have been reading lots of great answers to different problems over the time on this site but this is the first time I am posting. So in advance thanks for your help.
Here is my question:
I have a MySQL table that tracks visits to different websites we have. This is the table structure:
create table navigation_base (
uid int(11) NOT NULL,
date datetime not null,
dia date not null,
ip int(4) unsigned not null default 0,
session_id int unsigned not null,
cliente smallint unsigned not null default 0,
campaign mediumint unsigned not null default 0,
trackcookie int unsigned not null,
adgroup int unsigned not null default 0,
PRIMARY KEY (uid)
) ENGINE=MyISAM;
This table has approx. 70 million rows (an average of 110,000 per day).
On that table we have created indexes with following commands:
alter table navigation_base add index dia_cliente_campaign_ip (dia,cliente,campaign,ip);
alter table navigation_base add index dia_cliente_campaign_ip_session (dia,cliente,campaign,ip,session_id);
alter table navigation_base add index dia_cliente_campaign_ip_session_trackcookie (dia,cliente,campaign,ip,session_id,trackcookie);
We then use this table to get visitor statistics grouped by clients, days and campaigns with the following query:
select
dia,
navigation_base.campaign,
navigation_base.cliente,
count(distinct ip) as visitas,
count(ip) as paginas_vistas,
count(distinct session_id) as sesiones,
count(distinct trackcookie) as cookies
from navigation_base where
(dia between '2017-01-01' and '2017-01-31')
group by dia,cliente,campaign order by NULL
Even with those indexes in place, the response times for one-month periods are relatively slow: about 3 seconds on our server.
Are there some ways of speeding up these queries?
Thanks in advance.
With this much data, indexing alone may not be all that helpful since there is a lot of similarity in the data. Besides, you have GROUP BY and sorting along with aggregation. All these things combined make optimization very hard. Partitioning is the way forward, because:
Some queries can be greatly optimized in virtue of the fact that data
satisfying a given WHERE clause can be stored only on one or more
partitions, which automatically excludes any remaining partitions from
the search. Because partitions can be altered after a partitioned
table has been created, you can reorganize your data to enhance
frequent queries that may not have been often used when the
partitioning scheme was first set up.
And if this doesn't work for you, it's still possible to select partitions explicitly:
In addition, MySQL 5.7 supports explicit partition selection for
queries. For example, SELECT * FROM t PARTITION (p0,p1) WHERE c < 5
selects only those rows in partitions p0 and p1 that match the WHERE
condition.
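-- Range bounds must be strictly increasing from one partition to the next.
-- Every unique key, including the primary key, must contain the partitioning column (dia).
ALTER TABLE navigation_base
PARTITION BY RANGE( TO_DAYS(dia)) (
PARTITION p0 VALUES LESS THAN (TO_DAYS('2015-12-31')),
PARTITION p1 VALUES LESS THAN (TO_DAYS('2016-12-31')),
PARTITION p2 VALUES LESS THAN (TO_DAYS('2017-12-31')),
PARTITION p3 VALUES LESS THAN (TO_DAYS('2018-12-31')),
..
PARTITION p10 VALUES LESS THAN MAXVALUE);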
Use bigger or smaller partitions as you see fit.
The most important factor to keep in mind is that MySQL generally uses only one index per table in a query, so choose your index wisely.
If you only do COUNT(DISTINCT ...) at the granularity of a day, then build and incrementally maintain a summary table. It would be augmented each night by a query nearly identical to your SELECT, but fetching only yesterday's data.
Then use this Summary Table for the monthly "report".
More on Summary Tables
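A minimal sketch of such a summary table follows. The table name navigation_daily is an assumption; the columns mirror the output of the SELECT in the question. Because the report already groups by day, the monthly report becomes a plain range read with no COUNT(DISTINCT ...) at query time.

```sql
-- Hypothetical summary table: one row per day, client and campaign.
CREATE TABLE navigation_daily (
  dia            DATE               NOT NULL,
  cliente        SMALLINT UNSIGNED  NOT NULL,
  campaign       MEDIUMINT UNSIGNED NOT NULL,
  visitas        INT UNSIGNED       NOT NULL,
  paginas_vistas INT UNSIGNED       NOT NULL,
  sesiones       INT UNSIGNED       NOT NULL,
  cookies        INT UNSIGNED       NOT NULL,
  PRIMARY KEY (dia, cliente, campaign)
);

-- Nightly job: summarize yesterday only.
INSERT INTO navigation_daily
SELECT dia, cliente, campaign,
       COUNT(DISTINCT ip),
       COUNT(ip),
       COUNT(DISTINCT session_id),
       COUNT(DISTINCT trackcookie)
FROM navigation_base
WHERE dia = CURDATE() - INTERVAL 1 DAY
GROUP BY dia, cliente, campaign;

-- Monthly report, read straight from the summary.
SELECT dia, campaign, cliente, visitas, paginas_vistas, sesiones, cookies
FROM navigation_daily
WHERE dia BETWEEN '2017-01-01' AND '2017-01-31';
```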
I'd like to ask a question about how to improve performance in a big MySQL table using the InnoDB engine:
There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:
CREATE TABLE sns_value (
value_id int(11) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
KEY idx_sensor_id (sensor_id),
KEY idx_date (date),
KEY idx_type_id (type_id) );
At first, I thought of partitioning the table in months, but due to the steady addition of new sensors it would reach the current size in about a month.
Another solution that I came up with was partitioning the table by sensors. However, due to the limit of 1024 partitions of MySQL that wasn't an option.
I believe that the right solution would be using a table with the same structure for each of the sensors:
sns_value_XXXXX
This way there would be more than 1,000 tables with an estimated size of 30 million rows per year. These tables could, at the same time, be partitioned by month for faster access to data.
What problems would result from this solution? Is there a more normalized solution?
Editing with additional information
I consider the table to be big in relation to my server:
Cloud 2xCPU and 8GB Memory
LAMP (CentOS 6.5 and MySQL 5.1.73)
Each sensor may have more than one variable type (CO, CO2, etc.).
I mainly have two slow queries:
1) Daily summary for each sensor and type (avg, max, min):
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
This takes more than 5 min.
2) Vertical to Horizontal view and export:
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date LIMIT 4500;
This also takes more than 5 min.
Other considerations
Timestamps may be repeated due to inserts characteristics.
Periodic inserts must coexist with selects.
No updates nor deletes are performed on the table.
Suppositions made to the "one table for each sensor" approach
Tables for each sensor would be much smaller so access would be faster.
Selects will be performed only on one table for each sensor.
Selects mixing data from different sensors are not time-critical.
Update 02/02/2015
We have created a new table for each year of data, which we have also partitioned on a daily basis. Each table has around 250 million rows with 365 partitions. The new index used is as Ollie suggested (sensor_id, date, type_id, value) but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).
In order to be able to partition the table, the primary index had to be removed.
Are we missing something? Is there a way to improve the performance?
Many thanks!
Edited based on changes to the question
One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:
MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many simultaneous file accesses at once.
You'll have to create tables each time you add (or delete) sensors.
Queries that involve data from multiple sensors will be slow and convoluted.
My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
(Avoid the column name date if you can: it's a MySQL keyword (the name of a data type), which makes queries confusing to read and write. Instead I suggest you use ts, meaning timestamp.)
Beware: int(11) values aren't big enough for your value_id column. A signed INT tops out at 2,147,483,647, so you're going to run out of ids. Use bigint(20) for that column.
You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
type_id
FROM sns_value
WHERE sensor_id=1
AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id. Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you--certainly faster than 5 minutes even with a large table.
In your second query, a similar indexing strategy will work.
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date
LIMIT 4500;
Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four-column index I mentioned should work for you.
CREATE TABLE sns_value (
value_id bigint(20) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
INDEX query_opt (sensor_id, ts, type_id, value)
);
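With that schema, the vertical-to-horizontal query can also be written with conditional sums, which some readers find easier to follow than the sign/abs arithmetic. This is only a sketch of an equivalent form: 1 - abs(sign(type_id - 101)) is 1 when type_id is 101 and 0 otherwise, which is what each CASE expresses. It assumes the date column has been renamed to ts as in the table definition above.

```sql
-- Check that the four-column index is used: look for query_opt in the key
-- column and "Using index" in Extra.
EXPLAIN
SELECT ts,
       SUM(CASE WHEN type_id = 101 THEN value ELSE 0 END) AS one,
       SUM(CASE WHEN type_id = 141 THEN value ELSE 0 END) AS two,
       SUM(CASE WHEN type_id = 151 THEN value ELSE 0 END) AS three
FROM sns_value
WHERE sensor_id = 1
  AND ts BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sensor_id, ts
LIMIT 4500;
```

Dropping the EXPLAIN keyword runs the query itself.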
Creating a separate table for a range of sensors would be an idea.
Do not use an auto_increment primary key if you don't have to. The DB engine usually clusters the data by its primary key.
Use a composite key instead; depending on your use case, the order of the columns may be different.
EDIT: Also added the type into the PK. Considering the queries, I would do it like this. The field names are chosen intentionally: they should be descriptive, and you should always keep reserved words in mind.
CREATE TABLE snsXX_readings (
sensor_id int(11) NOT NULL,
reading int(11) NOT NULL,
reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
type_id int(11) NOT NULL,
PRIMARY KEY (reading_time, sensor_id, type_id),
KEY idx_reading_time (reading_time),
KEY idx_type_id (type_id)
);
Also, consider summarizing the readings or grouping them into a single field.
You can try fetching a random sample of the data as a summary.
I have a similar table: MyISAM engine (smallest table size), 10 million records, and no indexes, because they proved useless when I tested them. Sampling over the whole date range, this query takes about 10 seconds.
SELECT * FROM (
SELECT sensor_id, value, date
FROM sns_value l
WHERE l.sensor_id= 123 AND
(l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29')
ORDER BY RAND() LIMIT 2000
) as tmp
ORDER BY tmp.date;
In the first step, this query selects the rows between the two dates, orders them randomly, and keeps the first 2,000. In the second step, it sorts those rows by date. Each run returns a different set of 2,000 rows.
EDIT: Thank you everyone for your comments. I have tried most of your suggestions but they did not help. I need to add that I am running this query through Matlab using Connector/J 5.1.26 (Sorry for not mentioning this earlier). In the end, I think this is the source of the increase in execution time since when I run the query "directly" it takes 0.2 seconds. However, I have never come across such a huge performance hit using Connector/J. Given this new information, do you have any suggestions? I apologize for not disclosing this earlier, but again, I've never experienced performance impact with Connector/J.
I have the following table in MySQL (CREATE code taken from HeidiSQL):
CREATE TABLE `data` (
`PRIMARY` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`ID` VARCHAR(5) NULL DEFAULT NULL,
`DATE` DATE NULL DEFAULT NULL,
`PRICE` DECIMAL(14,4) NULL DEFAULT NULL,
`QUANT` INT(10) NULL DEFAULT NULL,
`TIME` TIME NULL DEFAULT NULL,
INDEX `DATE` (`DATE`),
INDEX `ID` (`ID`),
INDEX `PRICE` (`PRICE`),
INDEX `QUANT` (`QUANT`),
INDEX `TIME` (`TIME`),
PRIMARY KEY (`PRIMARY`)
)
It is populated with approximately 360,000 rows of data.
The following query takes over 10 seconds to execute:
Select ID, DATE, PRICE, QUANT, TIME FROM database.data WHERE DATE
>= "2007-01-01" AND DATE <= "2010-12-31" ORDER BY ID, DATE, TIME ASC;
I have other tables with millions of rows in which a similar query would take a fraction of a second. I can't figure out what might be causing this one to be so slow. Any ideas/tips?
EXPLAIN:
id = 1
select_type = SIMPLE
table = data
type = ALL
possible_keys = DATE
key = (NULL)
key_len = (NULL)
ref = (NULL)
rows = 361161
Extra = Using where; Using filesort
You are asking for a wide range of data. The time is probably being spent sorting the results.
Is a query on a smaller date range faster? For instance,
WHERE DATE >= '2007-01-01' AND DATE < '2007-02-01'
One possibility is that the optimizer may be using the index on id for the sort and doing a full table scan to filter out the date range. Using indexes for sorts is often suboptimal. You might try the query as:
select t.*
from (Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
) t
ORDER BY ID, DATE, TIME ASC;
I think this will force the optimizer to use the date index for the selection and then sort using file sort -- but there is the cost of a derived table. If you do not have a large result set, this might significantly improve performance.
I assume you already tried to OPTIMIZE TABLE and got no results.
You can either try to use a covering index (at the expense of more disk space, and a slight slowing down on UPDATEs) by replacing the existing date index with
CREATE INDEX data_date_ndx ON data (DATE, TIME, PRICE, QUANT, ID);
and/or you can try and create an empty table data2 with the same schema. Then just SELECT all the contents of data table into data2 and run the same query against the new table. It could be that the data table needed to be compacted more than OPTIMIZE could - maybe at the filesystem level.
Also, check out the output of EXPLAIN SELECT... for that query.
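For reference, the EXPLAIN check could look like the sketch below, assuming the covering index data_date_ndx from above has been created. In the EXPLAIN shown in the question, type = ALL and key = NULL indicate a full table scan; with a usable covering index one would hope to see the index name under key and "Using index" under Extra. "Using filesort" is likely to remain, because the ORDER BY starts with ID while the index starts with DATE.

```sql
-- Identifiers are quoted because DATE, TIME and DATA are MySQL keywords.
EXPLAIN
SELECT `ID`, `DATE`, `PRICE`, `QUANT`, `TIME`
FROM `data`
WHERE `DATE` >= '2007-01-01' AND `DATE` <= '2010-12-31'
ORDER BY `ID`, `DATE`, `TIME`;
```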
I'm more familiar with MSSQL than MySQL, so take this as a suggestion:
How about adding an index that fully covers all the fields in your select query?
Yes, it will duplicate data, but it lets us move on to the next point of the discussion.
I have this table:
CREATE TABLE `table1` (
`object` varchar(255) NOT NULL,
`score` decimal(10,3) NOT NULL,
`timestamp` datetime NOT NULL,
KEY `ex` (`object`,`score`,`timestamp`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
with 9.1 million rows and I am running the following query:
SELECT `object`, `timestamp`, AVG(score) as avgs
from `table1`
where timestamp >= '2011-12-13'
AND timestamp <= '2011-12-14'
group by `object`
order by `avgs` ASC limit 100;
The dates come from user input. The query takes 6-10 seconds, depending on the range of dates. The run time seems to increase with the number of rows.
What can I do to improve this?
I have tried:
fiddling with indexes (brought query time down from max 13sec to max 10sec)
moving storage to fast SAN (brought query time down by around 0.1sec, regardless of parameters).
The CPU and memory load on the server doesn't appear to be too high when the query is running.
The reason the fast SAN performs better
is that your query requires copying to a temporary table
and needs a filesort for a large result set.
You have five nasty factors.
range query
group-by
sorting
varchar 255 for object
a wrong index
Break the timestamp down into two fields,
date and time.
Build another reference table for object,
so that you use an integer, such as object_id (instead of varchar 255), to represent the object.
Rebuild the index on
date (date type), object_id
Change the query to
where date IN('2011-12-13', '2011-12-14', ...)
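The steps above could be sketched roughly as follows. All table, column and index names here (objects, table1_v2, score_date, score_time, ix_date_object, uq_object) are invented for illustration, and the sketch has not been benchmarked against the original table.

```sql
-- Hypothetical lookup table: one integer id per distinct object string.
CREATE TABLE objects (
  object_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  object    VARCHAR(255) NOT NULL,
  PRIMARY KEY (object_id),
  UNIQUE KEY uq_object (object)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- Hypothetical replacement for table1: integer object_id, timestamp split in two.
CREATE TABLE table1_v2 (
  object_id  INT UNSIGNED  NOT NULL,
  score      DECIMAL(10,3) NOT NULL,
  score_date DATE          NOT NULL,
  score_time TIME          NOT NULL,
  KEY ix_date_object (score_date, object_id)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

-- The date range becomes an explicit list of days.
SELECT o.object, AVG(t.score) AS avgs
FROM table1_v2 t
JOIN objects o ON o.object_id = t.object_id
WHERE t.score_date IN ('2011-12-13', '2011-12-14')
GROUP BY o.object_id, o.object
ORDER BY avgs ASC
LIMIT 100;
```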