I have a table
CREATE TABLE `acme`.`partitioned_table` (
  `id` INT NULL,
  `client_id` INT NOT NULL,
  `create_datetime` INT NOT NULL,
  `some_val` VARCHAR(45) NULL
);
I'd like to partition this table in such a way that each client's data is stored in its own partition based on client_id AND each partition contains only one week of data based on create_datetime. This is so we can drop one week's worth of data at a time according to each client's own retention policy.
For example, some clients would like to have 3 months of data while others may have longer data retention policies.
Being new to MySQL, I am having a hard time coming up with a proper partitioning strategy. How can I partition by week based on the INT column? To throw a curveball, this might be hosted on AWS RDS later.
Many thanks in advance,
M
Your clients × weeks level of partitioning would lead to a lot of partitions. That implies a lot of disk space, and queries will be slower.
Your requirement for "separate storage" would be better handled by either separate tables or separate databases.
If you also need to do queries across all clients, we need to discuss things further.
One of the "guidelines" for partitioning is not to partition a table with fewer than a million rows.
If a client's table is big enough to justify partitioning, see http://mysql.rjweb.org/doc.php/partitionmaint for more discussion. If not big enough, then either simply do the DELETE, or see this for more options: http://mysql.rjweb.org/doc.php/deletebig .
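If you go the plain-DELETE route, a hedged sketch of a retention purge, run in chunks to avoid long locks (the client id and cutoff date are hypothetical, and assume create_datetime holds a Unix timestamp):

DELETE FROM `acme`.`partitioned_table`
 WHERE client_id = 42                                    -- hypothetical client
   AND create_datetime < UNIX_TIMESTAMP('2024-01-01')    -- that client's retention cutoff
 LIMIT 10000;                                            -- repeat until 0 rows are deleted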
There are a lot of DATETIME functions that are messy if you use INT:
`create_datetime` INT
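That said, if the table does get big enough to justify it, a minimal sketch of weekly RANGE partitioning on the INT column, partitioned by week only (per-client separation would be handled by separate tables, as above). It assumes create_datetime holds a Unix timestamp; in practice a script would generate the weekly boundaries. Note how every date has to be converted:

ALTER TABLE `acme`.`partitioned_table`
  PARTITION BY RANGE (create_datetime) (
    PARTITION p20240101 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-08')),
    PARTITION p20240108 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-15')),
    PARTITION pfuture   VALUES LESS THAN MAXVALUE
  );

-- Dropping one week of old data is then a quick metadata operation:
ALTER TABLE `acme`.`partitioned_table` DROP PARTITION p20240101;

-- But every date-based query has to convert back and forth:
SELECT some_val
  FROM `acme`.`partitioned_table`
 WHERE create_datetime >= UNIX_TIMESTAMP('2024-01-01')
   AND create_datetime <  UNIX_TIMESTAMP('2024-01-08');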
Related
I have a MySQL database with over 20 tables, but one of them is significantly large because it collects measurement data from different sensors. Its size is around 145 GB on disk and it contains over 1 billion records. All this data is also being replicated to another MySQL server.
I'd like to separate the data into smaller "shards", so my question is which of the below solutions would be better. I'd use the record's "timestamp" for dividing the data by years. Almost all SELECT queries that are executed on this table contain the "timestamp" field in the "where" part of the query.
So below are the solutions that I cannot decide on:
Using the MySQL partitioning and dividing the data by years (e.g. partition1 - 2010, partition2 - 2011 etc.)
Creating separate tables and dividing the data by years (e.g. measuring_2010, measuring_2011 etc. tables)
Are there any other (newer) possible options that I'm not aware of?
I know that in the first case MySQL itself would get the data from the 'shards', and in the second case I'd have to write a kind of wrapper and do it myself. Is there any other way, for the second case, to make all the separate tables be seen as 'one big table' to fetch data from?
I know this question has already been asked in the past, but maybe somebody came up with some new solution (that I'm not aware of) or that the best practice solution changed by now. :)
Thanks a lot for your help.
Edit:
The schema is something similar to this:
device_id (INT)
timestamp (DATETIME)
sensor_1_temp (FLOAT)
sensor_2_temp (FLOAT)
etc. (30 more for instance)
All sensor temperatures are written at the same moment, once a minute. Note that there are around 30 different sensor measurements written in each row. This data is mostly used for displaying graphs and for some other statistical purposes.
Well, if you are hoping for a new answer, that means you have probably read my answers, and I sound like a broken record. See Partitioning blog for the few use cases where partitioning can help performance. Yours does not sound like any of the 4 cases.
Shrink device_id. INT is 4 bytes; do you really have millions of devices? TINYINT UNSIGNED is 1 byte and a range of 0..255. SMALLINT UNSIGNED is 2 bytes and a range of 0..64K. That will shrink the table a little.
If your real question is about how to manage so much data, then let's "think outside the box". Read on.
Graphing... What date ranges are you graphing?
The 'last' hour/day/week/month/year?
An arbitrary hour/day/week/month/year?
An arbitrary range, not tied to day/week/month/year boundaries?
What are you graphing?
Average value over a day?
Max/min over a day?
Candlesticks (etc) for day or week or whatever?
Regardless of the case, you should build (and incrementally maintain) a Summary Table of the data. A row would contain summary info for one device, sensor, and hour. I would suggest:
CREATE TABLE Summary (
    device_id SMALLINT UNSIGNED NOT NULL,
    sensor_id TINYINT UNSIGNED NOT NULL,
    hr TIMESTAMP NOT NULL,
    avg_val FLOAT NOT NULL,
    min_val FLOAT NOT NULL,
    max_val FLOAT NOT NULL,
    PRIMARY KEY (device_id, sensor_id, hr)
) ENGINE=InnoDB;
The one Summary table might be about 9GB (for the current amount of data).
SELECT  hr,
        avg_val,
        min_val,
        max_val
    FROM Summary
    WHERE device_id = ?
      AND sensor_id = ?
      AND hr >= ?
      AND hr < ? + INTERVAL 20 DAY;
That would give you the hi/lo/avg values for 480 hours; enough to graph? Grabbing 480 rows from the summary table is a lot faster than grabbing 60*480 rows from the raw data table.
Getting similar data for a year would probably choke a graphing package, so it may be worth building a summary of the summary -- with resolution of a day. It would be about 0.4GB.
There are a few different ways to build the Summary table(s); we can discuss that after you have pondered its beauty and read the Summary tables blog. It may be that gathering one hour's worth of data and then augmenting the Summary table is the best way. That would be somewhat like the flip-flop discussed in my Staging table blog.
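One possible way to populate it once an hour, as a sketch: it assumes a normalized raw table (here called readings, with device_id, sensor_id, reading_time and temp columns), which differs from the wide 30-column schema above, so with your current layout you would first need a UNION ALL unpivot or the restructuring suggested in the other answer:

INSERT INTO Summary (device_id, sensor_id, hr, avg_val, min_val, max_val)
SELECT device_id,
       sensor_id,
       DATE_FORMAT(reading_time, '%Y-%m-%d %H:00:00') AS hr,   -- truncate to the hour
       AVG(temp), MIN(temp), MAX(temp)
  FROM readings
 WHERE reading_time >= '2024-01-01 00:00:00'                   -- the hour that just finished
   AND reading_time <  '2024-01-01 01:00:00'
 GROUP BY device_id, sensor_id, hr
ON DUPLICATE KEY UPDATE            -- safe to re-run: the whole hour is recomputed
       avg_val = VALUES(avg_val),
       min_val = VALUES(min_val),
       max_val = VALUES(max_val);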
And, if you had the hourly summaries, would you really need the minute-by-minute data? Consider throwing it away, or perhaps only the data older than, say, one month. That leads to using partitioning, but only for its benefit in deleting old data, as discussed in "Case 1" of the Partitioning blog. That is, you would have daily partitions, using DROP and REORGANIZE every night to shift the time window of the "Fact" table. This would shrink your 145GB footprint without losing much data. New footprint: about 12GB (hourly summary + last 30 days' minute-by-minute details).
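A hedged sketch of that nightly rotation, assuming the raw table (again called readings) is partitioned BY RANGE (TO_DAYS(`timestamp`)) into daily partitions with a trailing MAXVALUE partition; the partition names and dates would come from the maintenance script:

-- Drop the oldest day; this is nearly instant compared to a DELETE.
ALTER TABLE readings DROP PARTITION p20240101;

-- Carve tomorrow's partition out of the catch-all partition.
ALTER TABLE readings REORGANIZE PARTITION pfuture INTO (
  PARTITION p20240202 VALUES LESS THAN (TO_DAYS('2024-02-03')),
  PARTITION pfuture   VALUES LESS THAN MAXVALUE
);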
PS: The Summary Table blog shows how to get standard deviation.
You haven't said much about how you use/query the data or what the schema looks like, but I'll try to make something up.
One way you can split your table is by entity (different sensors are different entities). That's useful if different sensors require different columns, so you don't need to force them all into one schema that fits every one of them (a least common multiple of sorts). It's not a good fit if sensors are added or removed dynamically, though, since you would have to add tables at runtime.
Another approach is to split the table based on time. This is the case if, after some time, data can be "historized": it is no longer used for the actual business logic, only for statistical purposes.
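A hedged sketch of such a time-based split (the table names are hypothetical; on a large InnoDB table you would move the rows in bounded chunks rather than in one huge statement):

-- Archive table with the same structure as the live table.
CREATE TABLE measuring_archive LIKE measuring;

-- Copy old rows, then delete them from the live table.
INSERT INTO measuring_archive
SELECT * FROM measuring
 WHERE `timestamp` < '2013-01-01';

DELETE FROM measuring
 WHERE `timestamp` < '2013-01-01';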
Both approaches can also be combined. Furthermore be sure that the table is properly indexed according to your query needs.
I strongly discourage any approach that regularly requires adding a new table over time, or anything similar. As always, I wouldn't split anything before there's an actual performance issue.
Edit:
I would clearly restructure the table to the following and not split it at all:
device_id (INT)
timestamp (DATETIME)
sensor_id (INT) -- could be unique or not. if sensor_id is not unique make a
-- composite key from device_id and sensor_id given that you
-- need it for queries
sensor_temp (FLOAT)
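A runnable sketch of that layout (the table name sensor_measurement is hypothetical, and the composite primary key assumes sensor_id is only unique per device):

CREATE TABLE sensor_measurement (
  device_id   INT UNSIGNED NOT NULL,
  sensor_id   INT UNSIGNED NOT NULL,   -- SMALLINT/TINYINT would save space if the ranges allow
  `timestamp` DATETIME     NOT NULL,
  sensor_temp FLOAT        NOT NULL,
  PRIMARY KEY (device_id, sensor_id, `timestamp`),
  KEY (`timestamp`)
) ENGINE=InnoDB;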
If data grows fast and you're expecting to generate terabytes of data soon you are better off with a NoSQL approach. But that's a different story.
I was asked to optimize (size-wise) the statistics system for a certain site and I noticed that they store 2 sets of stat data in a single table. Those sets are product displays on search lists and visits on product pages. Each row has product id, stat date, stat count and stat flag columns. The flag column indicates whether it's a search-list display or a page-visit stat. Stats are stored per day and product id; stat date (actually combined with product id and stat type) and stat count have indexes.
I was wondering if it's better (size-wise) to store those two sets as separate tables or keep them in a single one. I presume that the part which would make a difference would be the flag column (let's say it's a 1-byte TINYINT) and the indexes. I'm especially interested in how the space taken by indexes would change in the two-table scenario. The table in question already has a few million records.
I'll probably do some tests when I have more time, but I was wondering if someone had already challenged a similar problem.
Ordinarily, if two kinds of observations are conformable, it's best to keep them in a single table. By "conformable," I mean that their basic data is the same.
It seems that your observations are indeed conformable.
Why is this?
First, you can add more conformable observations trivially easily. For example, you could add sales to search-list and product-page views, by adding a new value to the flag column.
Second, you can report quite easily on combinations of the kinds of observations (see the example query after this list). If you separate these things into different tables, you'll be doing UNIONs or JOINs when you want to get them back together.
Third, when indexing is done correctly the access times are basically the same.
Fourth, the difference in disk space usage is small. You need indexes in either case.
Fifth, the difference in disk space cost is trivial. You have several million rows, or in other words, a dozen or so gigabytes. The highest-quality Amazon Web Services storage costs about US$ 1.00 per year per gigabyte. That's less than what heating your office will cost for the day you'd spend refactoring this stuff. Let it be.
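To illustrate the second point, a sketch of a combined report from the single table (column names follow the test table shown further down; the flag values 1 and 2 and the product id are hypothetical):

SELECT stat_date,
       SUM(CASE WHEN stat_type = 1 THEN stat_count ELSE 0 END) AS search_list_displays,
       SUM(CASE WHEN stat_type = 2 THEN stat_count ELSE 0 END) AS page_visits
  FROM stat_test
 WHERE id_off = 12345          -- hypothetical product id
 GROUP BY stat_date;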
Finally I got a moment to conduct a test. It was just a small scale test with 12k and 48k records.
The table that stored both types of data had following structure:
CREATE TABLE IF NOT EXISTS `stat_test` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
`stat_type` tinyint(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`,`stat_type`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
The other two tables had this structure:
CREATE TABLE IF NOT EXISTS `stat_test_other` (
`id_off` int(11) NOT NULL,
`stat_date` date NOT NULL,
`stat_count` int(11) NOT NULL,
PRIMARY KEY (`id_off`,`stat_date`),
KEY `id_off` (`id_off`),
KEY `stat_count` (`stat_count`)
) ENGINE=InnoDB DEFAULT CHARSET=latin2;
In the case of 12k records the two separate tables were actually slightly bigger than the one storing everything, but in the case of 48k records the two tables were smaller, and by a noticeable amount.
In the end I didn't split the data into two tables to solve my initial space problem. I managed to reduce the size of the database considerably by removing the redundant id_off index and adjusting the data types (in most cases unsigned SMALLINT was more than enough to store all the values I needed). Note that originally stat_type was also of type INT, and for this column unsigned TINYINT was enough. All in all this reduced the size of the database from 1.5GB to 600MB (and my limit was just 2GB for the database). Another advantage of this solution was that I didn't have to modify a single line of code to make everything work (since the site was written by someone else, I didn't have to spend hours trying to understand the source code).
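A hedged sketch of the kind of changes described above, against the test table shown earlier (the exact target types depend on the real value ranges):

ALTER TABLE stat_test
  DROP INDEX id_off,                           -- redundant: it duplicates the leading column of the primary key
  MODIFY id_off     SMALLINT UNSIGNED NOT NULL,
  MODIFY stat_count SMALLINT UNSIGNED NOT NULL,
  MODIFY stat_type  TINYINT  UNSIGNED NOT NULL;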
I am considering partitioning a MySQL table that has the potential to grow very big. The table as it stands goes like this:
DROP TABLE IF EXISTS `uidlist`;
CREATE TABLE IF NOT EXISTS `uidlist` (
`uid` varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`chcs` varchar(16) NOT NULL DEFAULT '',
UNIQUE KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii;
where
uid is a 9-character id string starting with a lowercase letter
chcs is a checksum that is used internally.
I suspect that the best way to partition this table would be based on the first letter of the uid field. This would give
Partition 1
abcd1234,acbd1234,adbc1234...
Partition 2
bacd1234,bcad1234,bdac1234...
However, I have never done partitioning before, so I have no idea how to go about it. Is the partitioning scheme I have outlined possible? If so, how do I go about implementing it?
I would much appreciate any help with this.
Check out the manual for a start :)
http://dev.mysql.com/tech-resources/articles/partitioning.html
MySQL is pretty feature-rich when it comes to partitioning, and choosing the correct strategy depends on your use case (can partitioning help your sequential scans?) and on the way your data grows, since you don't want any single partition to become too large to handle.
If your data will tend to grow somewhat steadily over time, you might want a create-date based partitioning scheme so that (for example) all records generated in a single year end up in the last partition and previous partitions are never written to. For this to happen you may have to introduce another column to regulate it; see http://dev.mysql.com/doc/refman/5.1/en/partitioning-hash.html.
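A hedged sketch of what that could look like for the uidlist table (the created column is the extra column mentioned above and is hypothetical; note that MySQL requires every unique key to include the partitioning column, so uniqueness of uid alone would have to be enforced by the application):

CREATE TABLE uidlist (
  uid     VARCHAR(9)  CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  chcs    VARCHAR(16) NOT NULL DEFAULT '',
  created DATE        NOT NULL,
  UNIQUE KEY uid_created (uid, created)   -- must include the partitioning column
) ENGINE=InnoDB DEFAULT CHARSET=ascii
PARTITION BY RANGE (YEAR(created)) (
  PARTITION p2013 VALUES LESS THAN (2014),
  PARTITION p2014 VALUES LESS THAN (2015),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);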
An added optimization benefit of this approach would be that you can keep the most recent partition on a disk with fast writes (a solid state drive, for example) and keep the older partitions on cheaper disks with decent read speed.
Anyway, knowing more about your use case would help people give you more concrete answers (possibly including SQL code).
EDIT, also, check out http://www.tokutek.com/products/tokudb-for-mysql/
The main question you need to ask yourself before partitioning is "why". What is the goal you are trying to achieve by partitioning the table?
Since all the table's data will still exist on a single MySQL server and, I assume, new rows will be arriving in "random" order (with respect to the partition they'll be inserted into), you won't gain much by partitioning. Your point-select queries might be slightly faster, but likely not by much.
The main benefit I've seen using MySQL partitioning is for data that needs to be purged according to a set retention policy. Partitioning data by week or month makes it very easy to delete old data quickly.
It sounds more likely to me that you want to be sharding your data (spreading it across many servers), and since your data design as shown is really just key-value then I'd recommend looking at database solutions that include sharding as a feature.
I have upvoted both of the answers here since they both make useful points. @bbozo - a move to TokuDB is planned, but there are constraints that stop it from being made right now.
I am going off the idea of partitioning the uidlist table as I had originally wanted to do. However, for the benefit of anyone who finds this thread whilst trying to do something similar, here is the "how to":
DROP TABLE IF EXISTS `uidlist`;
CREATE TABLE IF NOT EXISTS `uidlist` (
`uid` varchar(9) CHARACTER SET ascii COLLATE ascii_bin NOT NULL ,
`chcs` varchar(16) NOT NULL DEFAULT '',
UNIQUE KEY `uid` (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii
PARTITION BY RANGE COLUMNS(uid)
(
PARTITION p0 VALUES LESS THAN('f'),
PARTITION p1 VALUES LESS THAN('k'),
PARTITION p2 VALUES LESS THAN('p'),
PARTITION p3 VALUES LESS THAN('u'),
PARTITION p4 VALUES LESS THAN(MAXVALUE)
);
which creates five partitions. The boundary values are compared as plain strings (wildcards such as '%' have no special meaning here), and the final MAXVALUE partition is needed as a catch-all: without it, inserting a uid starting with 'u' or later would fail.
I suspect that the long-term solution here is to use a key-value store as suggested by @tmcallaghan rather than just stuffing everything into a MySQL table. I will probably post back in due course once I have established what would be the right way to accomplish that.
I have to perform scientific experiments using time series.
I intend to use MySQL as the data storage platform.
I'm thinking of using the following set of tables to store the data:
Table1 --> ts_id (stores the time series index; I will have to deal with several time series)
Table2 --> ts_id, obs_date, value (should be indexed by {ts_id, obs_date})
Because there will be many time series (hundreds) each with possibly millions of observations, table 2 may grow very large.
The problem is that I have to replicate this experiment several times, so I'm not sure what would be the best approach:
add an experiment_id to the tables and allow them to grow even more.
create a separate database for each experiment.
If option 2 is better (I personally think so), what would be the best logical way to do this? I have many different experiments to perform, each needing replication. If I create a different database for every replication, I'd get hundreds of databases pretty soon. Is there a way to logically organize them, such as each replication being a "sub-database" of its experiment's master database?
You might want to start out by considering how you will need to analyze your data.
Presumably your analysis will need to know about experiment name, experiment replica number, and internal replicates (e.g. at each timepoint there are 3 "identical" subjects measured for each treatment). So your db schema might be something like this:
experiments
exp_id int unsigned not null auto_increment primary key,
exp_name varchar(45)
other fields that any kind of experiment can have
replicates
rep_id int unsigned not null auto_increment primary key,
exp_id int unsigned not null foreign key to experiments
other fields that any kind of experiment replica can have
subjects
subject_id int unsigned not null auto_increment primary key,
subject_name varchar(45),
other fields that any kind of subject can have
observations
ob_id int unsigned not null auto_increment primary key,
rep_id int unsigned not null foreign key to replicates,
subject_id int unsigned not null foreign key to subjects,
ob_time timestamp
other fields to hold the measurements you make at each timepoint
If you have internal replicates you'll need another table to hold the internal replicate/subject relationship.
Don't worry about your millions of rows. As long as you index sensibly, there won't likely be any problems. But if worse comes to worst you can always partition your observation table (likely to be the largest) by rep_id.
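If you do end up partitioning observations by rep_id, a minimal sketch might look like this (column names follow the schema above; the measurement column and the partition count are hypothetical). Two MySQL constraints matter here: the partitioning column must be part of every unique key, hence the composite primary key, and partitioned InnoDB tables cannot have foreign keys, so the references to replicates and subjects would have to be enforced by the application:

CREATE TABLE observations (
  ob_id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  rep_id     INT UNSIGNED NOT NULL,
  subject_id INT UNSIGNED NOT NULL,
  ob_time    TIMESTAMP NOT NULL,
  value      FLOAT NOT NULL,               -- hypothetical measurement column
  PRIMARY KEY (ob_id, rep_id),             -- must include the partitioning column
  KEY (rep_id, subject_id, ob_time)
) ENGINE=InnoDB
PARTITION BY HASH (rep_id)
PARTITIONS 16;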
Should you have more than one database, one for each experiment?
The answer to your question hinges on your answer to this question: Will you want to do a lot of analysis that compares one experiment to another?
If you will do a lot of experiment-to-experiment comparison, it will be a horrendous pain in the neck to have a separate database for every experiment.
I think your suggestion of an experiment ID column in your observation table is a fine idea. That way you can build an experiment table with an overall description of your experiment. That table can also hold the units of observation in your value column (e.g. temp, voltage, etc).
If you have some kind of complex organization of your multiple experiments, you can store that organization in your experiment table.
Notice that MySQL is quite efficient at slinging around short rows of data. You can buy a nice server for the cost of a few dozen hours of your labor, or rent one on a cloud service for the cost of a few hours of labor.
Notice also that MySQL offers the MERGE storage engine. http://dev.mysql.com/doc/refman/5.5/en/merge-storage-engine.html This allows a bunch of different tables with the same column structure to be accessed as if it were one table. This would allow you to store results from individual experiments or groups of them in their own tables, and then access them together. If you have problems scaling up your data collection system, you may want to consider this. But the good news is you can get your database working and then convert to this.
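A hedged sketch of the MERGE approach (MERGE only works over identical MyISAM tables; the table and column names here are hypothetical):

-- One MyISAM table per experiment, all with the same structure.
CREATE TABLE results_exp1 (
  rep_id  INT UNSIGNED NOT NULL,
  ob_time TIMESTAMP NOT NULL,
  value   FLOAT NOT NULL,
  KEY (ob_time)
) ENGINE=MyISAM;

CREATE TABLE results_exp2 LIKE results_exp1;

-- Queries against results_all see the union of both underlying tables.
CREATE TABLE results_all (
  rep_id  INT UNSIGNED NOT NULL,
  ob_time TIMESTAMP NOT NULL,
  value   FLOAT NOT NULL,
  KEY (ob_time)
) ENGINE=MERGE UNION=(results_exp1, results_exp2) INSERT_METHOD=LAST;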
Another question: why do you have a table with nothing but ts_id values in it? I don't get that.