We have a very large table in MySQL with 500,000,000 records in it, serving 100 SELECT requests per second.
This is the schema:
id(int),
user_id (int),
content(text),
date(datetime)
Up to 90% of requests are for records from the last 6 months, so my question is about increasing performance.
Is it a good idea to separate the records from the last 6 months into another table and SELECT from that, or would a partitioning scheme be a better way to fetch all records from the last 6 months quickly?
Or if there's a better way...
For instance, a typical query looks like this:
SELECT content,user_id FROM log
JOIN users ON users.id = log.user_id
WHERE date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
user_id and date are indexed in table log.
There are 2 million users in table Users.
Your edit says you use queries like this at a rate of a third of a million per hour.
SELECT content,user_id
FROM log
JOIN users ON users.id = log.user_id
WHERE date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
I will take the liberty of rewriting this query to fully qualify your column selections.
SELECT log.content,
log.user_id
FROM log /* one half gigarow table */
JOIN users ON users.id = log.user_id /* two megarow table */
WHERE log.date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
(Please consider updating your question if this is not correct.)
Why are you joining the users table in this query? None of your results seem to come from it. Why won't this query do what you need?
SELECT log.content,
log.user_id
FROM log /* one half gigarow table */
WHERE log.date > DATE_SUB(CURDATE(), INTERVAL 180 DAY)
LIMIT 15
If you want to make this query faster, put a compound covering index on (date, user_id, content). This covering index will support a range scan and fast retrieval. If your content column is in fact of TEXT (a LOB) type, you need to put just (date, user_id) into the covering index, and your retrieval will be a little slower.
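A minimal sketch of that two-column variant, assuming content really is TEXT (the index name here is invented):
ALTER TABLE log ADD INDEX log_date_user (`date`, user_id); /* index name is illustrative */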
Are you using the JOIN to ensure that you get log entries returned which have a matching entry in users? If so, please explain your query better.
You definitely can partition your table based on date ranges. But you will need to either alter your table, or recreate and repopulate it, which will incur either downtime or a giant scramble.
http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html
Something like this DDL should then do the trick for you
CREATE TABLE LOG (
id INT NOT NULL AUTO_INCREMENT, /*maybe BIGINT? */
user_id INT NOT NULL,
`date` DATETIME NOT NULL,
content TEXT,
UNIQUE KEY (id, `date`),
KEY covering (`date`,user_id)
)
PARTITION BY RANGE COLUMNS(`date`) (
PARTITION p0 VALUES LESS THAN ('2012-01-01'),
PARTITION p1 VALUES LESS THAN ('2012-07-01'),
PARTITION p2 VALUES LESS THAN ('2013-01-01'),
PARTITION p3 VALUES LESS THAN ('2013-07-01'),
PARTITION p4 VALUES LESS THAN ('2014-01-01'),
PARTITION p5 VALUES LESS THAN ('2014-07-01'),
PARTITION p6 VALUES LESS THAN ('2015-01-01'),
PARTITION p7 VALUES LESS THAN ('2015-07-01')
);
Notice that there's some monkey business about the UNIQUE KEY. The column that goes into your partitioning function also needs to appear in every unique key on the table, including the primary key if there is one.
Later on, when July 2015 (partition p7's cutoff date) draws near, you can run this statement to add a partition for the next six month segment of time.
ALTER TABLE `log`
ADD PARTITION (PARTITION p8 VALUES LESS THAN ('2016-01-01'))
But, seriously, none of this partitioning junk is going to help much if your queries have unnecessary joins or poor index coverage. And it is going to make your database administration more complex.
Related
I have a MySQL table that contains 20,000,000 rows, with columns like user_id, registered_timestamp, etc. I have written the query below to get a count of users registered per day. The query takes a long time to execute. Will adding an index to the registered_timestamp column improve the execution time?
select date(registered_timestamp), count(userid) from table group by 1
Consider using this query to get a list of dates and the number of registrations on each date.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
GROUP BY date(registered_timestamp)
Then an index on table(registered_timestamp) will help a little because it's a covering index.
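For reference, creating it might look like this (the table name `table` is just the placeholder from the question, and the index name is invented):
CREATE INDEX regts ON `table` (registered_timestamp); /* index name is illustrative */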
If you adapt your query to return dates from a limited range, for example:
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
WHERE registered_timestamp >= CURDATE() - INTERVAL 8 DAY
AND registered_timestamp < CURDATE()
GROUP BY date(registered_timestamp)
the index will help. (This query returns results for the week ending yesterday.) However, the index will not help this query.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
WHERE DATE(registered_timestamp) >= CURDATE() - INTERVAL 8 DAY /* slow! */
GROUP BY date(registered_timestamp)
because the function applied to the column makes the query non-sargable: the index cannot be used for the range.
You can probably address this performance issue with a MySQL generated column (available in MySQL 5.7 and later). This command adds one:
ALTER TABLE `table`
ADD registered_date DATE
    GENERATED ALWAYS AS (DATE(registered_timestamp))
    STORED;
Then you can add an index on the generated column:
CREATE INDEX regdate ON `table` ( registered_date );
Then you can use that generated (derived) column in your query, and get a lot of help from that index.
SELECT registered_date, COUNT(*)
FROM table
GROUP BY registered_date;
But beware, creating the generated column and its index will take a while.
select date(registered_timestamp), count(userid) from table group by 1
This would benefit from INDEX(registered_timestamp, userid), but only because such an index is "covering". The query will still need to read every row of the index and do a filesort.
If userid is the PRIMARY KEY, then this would give you the same answers without bothering to check each userid for being NOT NULL.
select date(registered_timestamp), count(*) from table group by 1
And INDEX(registered_timestamp) would be equivalent to the above suggestion. (This is because InnoDB implicitly tacks on the PK.)
If this query is common, then you could build and maintain a "summary table", which collects the count every night for the day's registrations. Then the query would be a much faster fetch from that smaller table.
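A minimal sketch of that idea, with invented table and column names:
CREATE TABLE registration_daily (
    reg_date DATE NOT NULL PRIMARY KEY,
    reg_count INT UNSIGNED NOT NULL
);
/* run once a night (cron job or EVENT) to record yesterday's registrations */
INSERT INTO registration_daily (reg_date, reg_count)
SELECT CURDATE() - INTERVAL 1 DAY, COUNT(*)
FROM `table`
WHERE registered_timestamp >= CURDATE() - INTERVAL 1 DAY
  AND registered_timestamp < CURDATE();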
I have MySQL 5.6.12 Community Server.
I am trying to partition my MySQL InnoDB table, which contains 5M (and always growing) rows of history data. It is getting slower and slower and I figured partitioning would solve it.
I have these columns:
stationID int(4)
valueNumberID int(5)
logTime timestamp
value double
(stationID,valueNumberID,logTime) is my PRIMARY key.
I have 50 different stationID's. From each station comes history data and I need to store it for 5 years. There are only 2-5 different valueNumberID's from each stationID but hundreds of value changes per day. Each query in the system uses stationID,valueNumberID and logTime in that order. In most cases the queries are limited to current year.
I would like to partition on stationID, with each stationID having its own partition so that queries scan a smaller physical table, and to further reduce the scanned size by subpartitioning on logTime. I do not know how to create a separate partition for each of the 50 stationIDs and create subpartitions for them using the timestamp.
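Something along these lines is what I imagine, though I am not sure it is correct (as far as I know, MySQL only allows UNIX_TIMESTAMP() on a TIMESTAMP column in partitioning expressions, so these subpartitions are hash buckets over the timestamp rather than true date ranges, and the list would need one entry per stationID):
CREATE TABLE His (
    stationID INT NOT NULL,
    valueNumberID INT NOT NULL,
    logTime TIMESTAMP NOT NULL,
    value DOUBLE,
    PRIMARY KEY (stationID, valueNumberID, logTime)
)
PARTITION BY LIST (stationID)
SUBPARTITION BY HASH (UNIX_TIMESTAMP(logTime))
SUBPARTITIONS 6 (
    PARTITION st1 VALUES IN (1),
    PARTITION st2 VALUES IN (2),
    PARTITION st3 VALUES IN (3) /* ...one partition per stationID, 50 in total */
);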
Thank you for your replies. SELECT queries are getting slower; to me it seems they slow down linearly with the rate at which the table grows. The issue must be with the GROUP BY. This is an example query:
SELECT DATE_FORMAT(logTime,"%Y%m%d%H%i%s") AS 'logTime', SUM(Value)
FROM His
WHERE stationID=23 AND valueNumberID=4
  AND logTime > '2013-01-01 00:00:00' AND logTime < '2013-11-14 00:00:00'
GROUP BY DATE_FORMAT(logPVM,"%Y%m")
ORDER BY logTime
LIMIT 0,120;
The objective of queries like this is to return AVG, MAX, MIN or SUM over hour, day, week, or month intervals. The results are tied tightly to how they are presented to the user (graphs, Excel files), and it would take a long time to change all of that if I changed the queries. So I was looking for an easy way out with partitioning.
An estimated 1.2-1.4M rows per month are added to this table.
Thank you
I have a large table with about 100 million records, with fields start_date and end_date of DATE type. I need to check the number of overlaps with some date range, say between 2013-08-20 AND 2013-08-30, so I use:
SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-08-20'
AND start_date <= '2013-08-30'
The date columns are indexed.
The important point is that the date ranges I am searching for overlaps with are always in the future, while the bulk of the records in the table are in the past (say about 97-99 million).
So, will this query be faster if I add a column is_future (TINYINT), so that by checking that condition first, like this,
SELECT COUNT(*) FROM myTable WHERE is_future = 1
AND end_date >= '2013-08-20' AND start_date <= '2013-08-30'
it will exclude the other 97 million or so records and check the date condition on only the remaining 1-3 million records?
I use MySQL
Thanks
EDIT
The MySQL engine is InnoDB, but would it matter considerably if it were, say, MyISAM?
Here is the CREATE TABLE:
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`title`
`start_date` date DEFAULT NULL,
`end_date` date DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
EDIT 2, after #Robert Co's answer
Partitioning looks like a good idea for this case, but MySQL does not let me partition on the is_future field unless I make it part of the primary key; otherwise I would have to drop my main primary key, id, which I cannot do. And if I do add that field to the primary key, is there still any point in partitioning? Won't a search on is_future already be fast because it is part of the primary key?
EDIT 3
The actual query where I need to use this selects restaurants that have some free tables for that date range:
SELECT r.id, r.name, r.table_count
FROM restaurants r
LEFT JOIN orders o
ON r.id = o.restaurant_id
WHERE o.id IS NULL
OR (r.table_count > (SELECT COUNT(*)
FROM orders o2
WHERE o2.restaurant_id = r.id AND
end_date >= '2013-08-20' AND start_date <= '2013-08-30'
AND o2.status = 1
)
)
SOLUTION
After a lot more research and testing, the fastest way to count the number of rows in my case was to just add one more condition: that start_date is greater than the current date (because the date ranges being searched are always in the future).
SELECT COUNT(*) FROM myTable WHERE end_date >= '2013-09-01'
AND start_date >= '2013-08-20' AND start_date <= '2013-09-30'
It is also necessary to have one index with the start_date and end_date fields (thank you #symcbean).
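For reference, that index can be created with something like this (column order as mentioned above, index name arbitrary):
ALTER TABLE `orders` ADD INDEX idx_dates (start_date, end_date); /* index name is arbitrary */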
As a result, the execution time on a table with 10M rows went from 7 seconds down to 0.050 seconds.
SOLUTION 2 (#Robert Co)
Partitioning worked in this case as well! Perhaps it is a better solution than indexing, or the two can be applied together.
Thanks
This is a perfect use case for table partitioning. If the Oracle INTERVAL feature makes it to MySQL, it will just add to the awesomeness.
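A hedged sketch of what range partitioning on start_date might look like for your orders table (partition boundaries are invented, non-essential columns omitted; note that the partitioning column has to be added to the primary key):
CREATE TABLE `orders` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `start_date` DATE NOT NULL,
    `end_date` DATE DEFAULT NULL,
    PRIMARY KEY (`id`, `start_date`) /* partitioning column must be in every unique key */
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
PARTITION BY RANGE COLUMNS(`start_date`) (
    PARTITION p_past   VALUES LESS THAN ('2013-08-01'),
    PARTITION p_201308 VALUES LESS THAN ('2013-09-01'),
    PARTITION p_201309 VALUES LESS THAN ('2013-10-01'),
    PARTITION p_future VALUES LESS THAN (MAXVALUE)
);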
date column are indexed
What type of index? A hash-based index is no use for range queries. If it's not a BTREE index, change it now. And you've not shown us how the columns are indexed. Are both columns in the same index? Is there other stuff in there too? In what order (end_date must appear as the first column)?
There are implicit type conversions in the script - this should be handled automatically by the optimizer, but it's worth checking:
SELECT COUNT(*) FROM myTable WHERE end_date >= 20130820000000
AND start_date <= 20130830235959
if I add a column is_future - TINYINT
First, in order to be of any use, this would require that the future dates be a small proportion of the total data stored in the table (less than 10%). And that's just to make it more efficient than a full table scan.
Secondly, it's going to require very frequent updates to the index to maintain it, which, in addition to the overhead of initial population, is likely to lead to fragmentation of the index and degraded performance (depending on how the index is constructed).
Thirdly, if this still has to process 3 million rows of data (and specifically, via an index lookup) then it's going to be very slow even with the data pegged in memory.
Further, the optimizer is never likely to use this index without being forced to (due to the low cardinality).
I have done a simple test, just created an index on the tinyint column. The structures may not be the same, but with an index it seems to work.
http://www.sqlfiddle.com/#!2/514ab/1/0
and for count
http://www.sqlfiddle.com/#!2/514ab/2/0
View the execution plan there to see that the select scans just one row, which means it would process only the smaller number of records in your case.
So the simple answer is yes, with an index it would work.
I have a database of measurements that records a sensor, a reading, and the timestamp the reading was taken. Measurements are only recorded when there's a change. I want to generate a result set that shows how long each sensor held a particular reading.
The timestamps are in milliseconds but I'm outputting the result in seconds.
Here's the table:
CREATE TABLE `raw_metric` (
`row_id` BIGINT NOT NULL AUTO_INCREMENT,
`sensor_id` BINARY(6) NOT NULL,
`timestamp` BIGINT NOT NULL,
`angle` FLOAT NOT NULL,
PRIMARY KEY (`row_id`)
)
Right now I'm getting the results I want using a subquery, but it's fairly slow when there's a lot of datapoints:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT MIN(`timestamp`)
FROM raw_metric AS rm2
WHERE rm2.`timestamp` > rm1.`timestamp`
AND rm2.sensor_id = rm1.sensor_id
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
Essentially, to get the range, I need to get the very next reading (or use the current time if there isn't another reading). The subquery finds the minimum timestamp that is later than the current one but is from the same sensor.
This query isn't going to occur very often so I'd prefer to not have to add an index on the timestamp column and slow down inserts. I was hoping someone might have a suggestion as to an alternate way of doing this.
UPDATE:
The row_ids should increase along with the timestamps, but that can't be guaranteed due to network latency issues. So it's possible that an entry with a lower row_id occurs AFTER one with a higher row_id, though it is unlikely.
This is perhaps more appropriate as a comment than as a solution, but it is too long for a comment.
You are trying to implement the lead() function in MySQL, and MySQL does not, unfortunately, have window functions. You could switch to Oracle, DB2, Postgres, SQL Server 2012 and use the built-in (and optimized) functionality there. Ok, that may not be realistic.
So, given your data structure you need to do either a correlated subquery or a non-equijoin (actually a partial equi-join because there is match on sensor_id). These are going to be expensive operations, unless you add an index. Unless you are adding measurements tens of times per second, the additional overhead on the index should not be a big deal.
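A minimal sketch of such an index, using the column names from your table (the index name is invented):
ALTER TABLE raw_metric ADD INDEX idx_sensor_time (sensor_id, `timestamp`); /* index name is illustrative */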
You could also change your data structure. If you had a "sensor counter" that was a sequential number enumerating the readings, then you could use this as an equijoin (although for good performance you might still want an index). Adding this in to your table would require having a trigger -- and that is likely to perform even worse than an index for when inserting.
If you only have a handful of sensors, you could create a separate table for each one. Oh, I can feel the groans at this suggestion. But, if you did, then an auto-incremented id would perform the same role. To be honest, I would only do this if I could count the number of sensors on each hand.
In the end, I might suggest that you take the hit during insertion and have "effective" and "end' times on each record (as well as an index on sensor id and either timestamp or id). With these additional columns, you will probably find more uses for the table.
If you are doing this for just one sensor, then create a temporary table for the information and use an auto-incremented id column. Then insert the data into it:
insert into temp_rawmetric (orig_row_id, sensor_id, timestamp, angle)
select row_id, sensor_id, timestamp, angle
from raw_metric
order by sensor_id, timestamp;
Be sure your table has a temp_rawmetric_id column that is auto-incremented and the primary key (creates an index automatically). The order by makes sure this is incremented according to the timestamp.
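For clarity, a possible definition of that temporary table (column names follow the INSERT above; this is an assumed sketch, not your exact schema):
CREATE TEMPORARY TABLE temp_rawmetric (
    temp_rawmetric_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    orig_row_id BIGINT NOT NULL,
    sensor_id BINARY(6) NOT NULL,
    `timestamp` BIGINT NOT NULL,
    angle FLOAT NOT NULL
);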
Then you can do your query as:
select trm.sensor_id, trm.angle,
trm.timestamp as startTime, trmnext.timestamp as endTime
from temp_rawmetric trm left outer join
temp_rawmetric trmnext
on trmnext.temp_rawmetric_id = trm.temp_rawmetric_id+1;
This will require a pass through the original data to extract it, and then a primary-key join on the temporary table. The first might take some time; the second should be pretty quick.
SELECT rm1.row_id,
       HEX(rm1.sensor_id),
       rm1.angle,
       (COALESCE(rm2.timestamp, UNIX_TIMESTAMP() * 1000) - rm1.timestamp) AS duration
FROM raw_metric rm1
LEFT OUTER JOIN raw_metric rm2
    ON rm2.sensor_id = rm1.sensor_id
    AND rm2.timestamp = (
        SELECT MIN(timestamp)
        FROM raw_metric rm3
        WHERE rm3.sensor_id = rm1.sensor_id
          AND rm3.timestamp > rm1.timestamp
    )
If you use AUTO_INCREMENT for the primary key, you can replace timestamp with row_id in the query's condition, like this:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT MIN(`timestamp`)
FROM raw_metric AS rm2
WHERE rm2.`row_id` > rm1.`row_id`
AND rm2.sensor_id = rm1.sensor_id
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
It should run somewhat faster.
You can also add one more subquery to quickly select the row_id of the next sensor reading. See:
SELECT row_id,
HEX(sensor_id),
angle,
(
COALESCE((
SELECT timestamp FROM raw_metric AS rm1a
WHERE row_id =
(
SELECT MIN(`row_id`)
FROM raw_metric AS rm2
WHERE rm2.`row_id` > rm1.`row_id`
AND rm2.sensor_id = rm1.sensor_id
)
), UNIX_TIMESTAMP() * 1000) - `timestamp`
) / 1000 AS duration
FROM raw_metric AS rm1
I have a table that grows by tens of millions of rows each day. The rows in the table contain hourly information about page view traffic.
The indices on the table are on url and datetime.
I want to aggregate the information by day, rather than hourly. How should I do this? This is a query that exemplifies what I am trying to do:
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= "2012-08-29 00:00:00" AND datetime <= "2012-08-29 23:00:00"
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10;
The above query never finishes, though. There are millions of rows in the table. Is there a more efficient way that I can get this aggregate data?
Tens of millions of rows per day is quite a lot.
Assuming:
only 10 million new records per day;
your table contains only the columns that you mention in your question;
url is of type TEXT with a mean (Punycode) length of ~77 characters;
pageviews is of type INT;
int_views is of type INT;
ext_views is of type INT; and
datetime is of type DATETIME
then each day's data will occupy around 9.9 × 10⁸ bytes, which is almost 1 GiB/day. In reality it may be considerably more, because the above assumptions were quite conservative.
MySQL's maximum table size is determined, amongst other things, by the underlying filesystem on which its data files reside. If you're using the MyISAM engine (as suggested by your comment beneath) without partitioning on Windows or Linux, then a limit of a few GiB is not uncommon; which implies the table will reach its capacity well within a working week!
As #Gordon Linoff mentioned, you should partition your table; however, each table has a limit of 1024 partitions. With 1 partition/day (which would be eminently sensible in your case), you would be limited to storing a bit under 3 years of data in a single table.
I would therefore advise that you keep each year's data in its own table, each partitioned by day. Furthermore, as #Ben explained, a composite index on (datetime, url) would help (I actually propose creating a date column from DATE(datetime) and indexing that, because it will enable MySQL to prune the partitions when performing your query); and, if row-level locking and transactional integrity are not important to you (for a table of this sort, they may not be), using MyISAM may not be daft:
CREATE TABLE news_2012 (
INDEX (date, url(100))
)
Engine = MyISAM
PARTITION BY HASH(TO_DAYS(date)) PARTITIONS 366
SELECT *, DATE(datetime) AS date FROM news WHERE YEAR(datetime) = 2012;
CREATE TRIGGER news_2012_insert BEFORE INSERT ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
CREATE TRIGGER news_2012_update BEFORE UPDATE ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
If you choose to use MyISAM, you can not only archive completed years (using myisampack) but can also replace your original table with a MERGE one comprising the UNION of all of your underlying year tables (an alternative that would also work in InnoDB would be to create a VIEW, but it would only be useful for SELECT statements as UNION views are neither updatable nor insertable):
DROP TABLE news;
CREATE TABLE news (
date DATE,
INDEX (date, url(100))
)
Engine = MERGE
INSERT_METHOD = FIRST
UNION = (news_2012, news_2011, ...)
SELECT * FROM news_2012 WHERE FALSE;
You can then run your above query (along with any other) on this merge table:
SELECT url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM news
WHERE date = '2012-08-29'
GROUP BY url
ORDER BY SUM(pageviews) DESC
LIMIT 10;
A few points:
As datetime is the only predicate that you're filtering on, you should probably have an index with datetime as the first column.
You're ordering by pageviews. I would have assumed that you want to order by sum(pageviews).
You're querying 23 hours of data, not 24. You probably want to use an explicit less-than, <, against midnight of the next day to avoid missing anything.
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= '2012-08-29 00:00:00'
AND datetime < '2012-08-30 00:00:00'
GROUP BY url
ORDER BY sum(pageviews) DESC
LIMIT 10;
You could index this on datetime, url, pageviews, int_views, ext_views, but I think that would be overkill; so if the index isn't too big, (datetime, url) seems like a good way to go. The only way to be certain is to test it and decide whether any performance improvement in querying is worth the extra time taken in index maintenance.
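If you go with the two-column option, the DDL might be something like this (the 100-character prefix on url is an assumption, needed only if url is a long string type):
ALTER TABLE news ADD INDEX idx_datetime_url (`datetime`, url(100)); /* prefix length is an assumption */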
As Gordon just mentioned in the comments, you may need to look into partitioning. This enables you to query a smaller "table" that is part of the larger one. If all your queries are based at the day level, it sounds like you might need to create a new partition each day.