I want to partition a table in MySQL while preserving the table's structure.
I have a column, 'Year', and I want to split the table into a separate table for each year. The new tables will have names like 'table_2012', 'table_2013', and so on. The resulting tables need to have exactly the same fields as the source table.
I have tried the following two pieces of SQL script with no success:
1.
CREATE TABLE all_data_table (
    column1 int DEFAULT NULL,
    column2 varchar(30) DEFAULT NULL,
    column3 date DEFAULT NULL
) ENGINE=InnoDB
PARTITION BY RANGE ((year))
(
    PARTITION p0 VALUES LESS THAN (2010),
    PARTITION p1 VALUES LESS THAN (2011),
    PARTITION p2 VALUES LESS THAN (2012),
    PARTITION p3 VALUES LESS THAN (2013),
    PARTITION p4 VALUES LESS THAN MAXVALUE
);
2.
ALTER TABLE all_data_table PARTITION BY RANGE COLUMNS (`year`) (
PARTITION p0 VALUES LESS THAN (2011),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
Any assistance would be appreciated!
This is old, but since it comes up highly ranked in partitioning searches, I figured I'd give some additional details for people who might hit this page. What you are describing, having a table_2012 and a table_2013, is not "MySQL Partitioning" but "Manual Partitioning".
Partitioning means that you have one "logical table" with a single table name, which--behind the scenes--is divided among multiple files. When you have millions to billions of rows, over years, but typically you are only searching a single month, partitioning by Year/Month can have a great performance benefit because MySQL only has to search against the file that contains the Year/Month that you are searching for...so long as you include the partition key in your WHERE.
When you create multiple tables like table_2012 and table_2013, you are MANUALLY partitioning the tables, which you don't do with the MySQL PARTITION configuration. To manually partition the tables, during 2012, you put all data into the 2012 table. When you hit 2013, you start putting all the data into the 2013 table. You have to make sure to create the table before you hit 2013 or it won't have any place to go. Then, when you query across the years (e.g. from Nov 2012 - Jan 2013), you have to do a UNION between table_2012 and table_2013.
SELECT * FROM table_2012 WHERE #...
UNION
SELECT * FROM table_2013 WHERE #...
With partitioning, this manual work is not necessary. You do the initial setup of the partitions, then you treat it as a single table. No unions required, no checking the date before you insert, etc. This makes life much easier. MySQL handles figuring out which partitions it needs to query. However, you MUST make sure to query against the Year column or it will have to scan ALL files. E.g. SELECT * FROM all_data_table WHERE Month=12 will scan all partitions for Month=12. To ensure you only scan the partition files you need, include the partition column in every query that you can.
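As a rough illustration of that point, EXPLAIN shows which partitions a query touches (this assumes a yearly RANGE-partitioned table with Year and Month columns, as discussed above; the query shapes are illustrative):
-- prunes to a single partition: the partition key (Year) is in the WHERE clause
EXPLAIN SELECT * FROM all_data_table WHERE `Year` = 2012 AND Month = 12;
-- scans every partition: only Month is filtered, so no pruning is possible
EXPLAIN SELECT * FROM all_data_table WHERE Month = 12;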
Possible negatives to partitioning...if you have billions of rows and you do an ALTER TABLE on the table to--say--add a column...it's going to have to update every row taking a VERY long time. At the company I currently work for, the boss doesn't think it's worth the time it takes to update the billion rows historically when we are adding a new column for going forward...so this is one of the reasons we do manual partitioning instead of letting MySQL do it.
DISCLAIMER: I am not an expert at partitioning...so if I'm wrong in any of this, please let me know and I'll fix the incorrect parts.
From what I see, you want to create many tables from one big table.
I think you should try creating views instead.
From what I've read about partitioning, it splits the physical storage of the table and stores the pieces separately, but viewed from the top you still see them as a single table.
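For example, a minimal sketch of that idea, assuming the source table is the all_data_table from the question and it has a Year column:
-- one view per year; each exposes exactly the same fields as the source table
CREATE VIEW table_2012 AS SELECT * FROM all_data_table WHERE `Year` = 2012;
CREATE VIEW table_2013 AS SELECT * FROM all_data_table WHERE `Year` = 2013;
The data itself still lives in the single underlying table; the views just give you the per-year names you asked for.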
What is a good approach to handle a 3-billion-row table where concurrent read/write is very frequent within a few days?
Linux server, running MySQL v8.0.15.
I have a table that will log device data history. The table needs to retain its data for one year, possibly two. The growth rate is very high: 8,175,000 rows/day (1 month = 245M rows, 1 year = 2.98B rows). If the number of devices grows, the table is expected to handle the increase.
Reads are frequent for the last few days of data; beyond about a week, the read frequency drops significantly.
There are multiple concurrent connections reading and writing this table, and the reads and writes target rows that are quite close to each other, so deadlocks / lock contention happen, but that has been taken care of (retries, small transaction sizes).
I am using daily partitioning now, since a read hardly ever spans more than one partition. However, there would be too many partitions to retain one year of data. Partitions are created and dropped on a schedule with cron.
CREATE TABLE `table1` (
`group_id` tinyint(4) NOT NULL,
`DeviceId` varchar(10) COLLATE utf8mb4_unicode_ci NOT NULL,
`DataTime` datetime NOT NULL,
`first_log` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`first_res` tinyint(1) NOT NULL DEFAULT '0',
`last_log` datetime DEFAULT NULL,
`last_res` tinyint(1) DEFAULT NULL,
PRIMARY KEY (`group_id`,`DeviceId`,`DataTime`),
KEY `group_id` (`group_id`,`DataTime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
/*!50100 PARTITION BY RANGE (to_days(`DataTime`))
(
PARTITION p_20191124 VALUES LESS THAN (737753) ENGINE = InnoDB,
PARTITION p_20191125 VALUES LESS THAN (737754) ENGINE = InnoDB,
PARTITION p_20191126 VALUES LESS THAN (737755) ENGINE = InnoDB,
PARTITION p_20191127 VALUES LESS THAN (737756) ENGINE = InnoDB,
PARTITION p_future VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
Inserts are performed in batches of ~1500 rows:
INSERT INTO table1 (group_id, DeviceId, DataTime, first_res)
VALUES (%s, %s, FROM_UNIXTIME(%s), %s)
ON DUPLICATE KEY UPDATE last_log=NOW(), last_res=VALUES(first_res);
Selects mostly get counts by DataTime or DeviceId, targeting a specific partition:
SELECT DataTime, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DataTime HAVING ct<50;
SELECT DeviceId, COUNT(*) ct FROM table1 partition(p_20191126)
WHERE group_id=1 GROUP BY DeviceId HAVING ct<50;
So the questions:
According to Rick James' blog, it is not a good idea to have more than 50 partitions in a table, but if I partition monthly, there are 245M rows in one partition. What is the best partition range to use here? Does the advice in RJ's blog still apply to current MySQL versions?
Is it a good idea to leave the table unpartitioned? (The indexes are performing well at the moment.)
Note: I have read this Stack question; having multiple tables is a pain, so if it is not necessary I would rather not split the table up. Also, sharding is currently not possible.
First of all, INSERTing 100 records/second is a potential bottleneck. I hope you are using SSDs. Let me see SHOW CREATE TABLE. Explain how the data is arriving (in bulk, one at a time, from multiple sources, etc) because we need to discuss batching the input rows, even if you have SSDs.
Retention for 1 or 2 years? Yes, PARTITIONing will help, but only with the deleting via DROP PARTITION. Use monthly partitions and use PARTITION BY RANGE(TO_DAYS(DataTime)). (See my blog which you have already found.)
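A minimal sketch of that monthly layout against the table1 from the question (partition names and boundary dates are illustrative):
ALTER TABLE table1 PARTITION BY RANGE (TO_DAYS(DataTime)) (
    PARTITION p_201911 VALUES LESS THAN (TO_DAYS('2019-12-01')),
    PARTITION p_201912 VALUES LESS THAN (TO_DAYS('2020-01-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);
-- each month: REORGANIZE PARTITION p_future to add the next month,
-- and DROP PARTITION on the oldest month to enforce retention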
What is the average length of DeviceID? Normally I would not even mention normalizing a VARCHAR(10), but with billions of rows, it is probably worth it.
The PRIMARY KEY you have implies that a device will not provide two values in less than one second?
What do "first" and "last" mean in the column names?
In older versions of MySQL, the number of partitions had impact on performance, hence the recommendation of 50. 8.0's Data Dictionary may have a favorable impact on that, but I have not experimented yet to see if the 50 should be raised.
The size of a partition has very little impact on anything.
In order to judge the indexes, let's see the queries.
Sharding is not possible? Do too many queries need to fetch multiple devices at the same time?
Do you have Summary tables? That is a major way for Data Warehousing to avoid performance problems. (See my blogs on that.) And, if you do some sort of "staging" of the input, the summary tables can be augmented before touching the Fact table. At that point, the Fact table is only an archive; no regular SELECTs need to touch it? (Again, let's see the main queries.)
One table per day (or whatever unit) is a big no-no.
Ingestion via IODKU
For the batch insert via IODKU, consider this:
collect the 1500 rows in a temp table, preferably with a single, 1500-row, INSERT.
massage that data if needed
do one IODKU..SELECT:
INSERT INTO table1 (group_id, DeviceId, DataTime, first_res)
    SELECT group_id, DeviceId, DataTime, first_res
    FROM tmp_table
ON DUPLICATE KEY UPDATE
    last_log = NOW(), last_res = VALUES(first_res);
If necessary, the SELECT can do some de-dupping, etc.
This approach is likely to be significantly faster than 1500 separate IODKUs.
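A sketch of that staging step, with a hypothetical tmp_table that mirrors the columns being loaded:
CREATE TEMPORARY TABLE tmp_table (
    group_id tinyint NOT NULL,
    DeviceId varchar(10) NOT NULL,
    DataTime datetime NOT NULL,
    first_res tinyint NOT NULL DEFAULT 0
);
-- one multi-row INSERT carrying the ~1500 values (rows shown are made up)
INSERT INTO tmp_table (group_id, DeviceId, DataTime, first_res)
VALUES (1, 'DEV0000001', FROM_UNIXTIME(1574700000), 0),
       (1, 'DEV0000002', FROM_UNIXTIME(1574700000), 0);
-- then run the single IODKU..SELECT above and DROP TEMPORARY TABLE tmp_table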
DeviceID
If the DeviceID is always 10 characters and limited to English letters and digits, then make it
CHAR(10) CHARACTER SET ascii
Then pick between COLLATION ascii_general_ci and COLLATION ascii_bin, depending on whether you allow case folding or not.
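If you go that route, the change might look like this (a sketch; verify that the existing values fit the ascii character set first):
ALTER TABLE table1
    MODIFY DeviceId CHAR(10) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL;
-- use COLLATE ascii_bin instead if DeviceId comparisons must be case-sensitive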
Just for your reference:
I have a large table right now, over 30B rows, growing by 11M rows daily.
The table is an InnoDB table and is not partitioned.
Data over 7 years old is archived to file and purged from the table.
So if your performance is acceptable, partitioning is not necessary.
From a management perspective, it is easier to manage the table with partitions. You might partition the data by week; that is 52-104 partitions if you keep the last one or two years of data online.
I've created a view on a partitioned table. When I filter on the partitioning column in a SELECT against the view, EXPLAIN shows the optimizer does not prune down to that particular partition.
Is there any way to make the view access a single partition of its table?
[Edit]: Here is how I created the view on two partitioned tables:
CREATE TABLE Partition1 (ID INT,NAME VARCHAR(100),DOB DATE)
PARTITION BY LIST (YEAR(DOB))
(
PARTITION P_2000 VALUES IN (2000),
PARTITION P_2001 VALUES IN (2001)
);
CREATE TABLE NOPART (ID INT,DOB DATE)
PARTITION BY LIST (YEAR(DOB))
(
PARTITION P_2000 VALUES IN (2000),
PARTITION P_2001 VALUES IN (2001)
);
CREATE OR REPLACE VIEW P_VIEW
AS
SELECT ID,DOB
FROM PARTITION1
UNION
SELECT ID,DOB
FROM NOPART;
EXPLAIN
SELECT * FROM P_VIEW
WHERE DOB = '2001-01-01';
When I run the "Explain" it shows optimizer is going to both partitions "p_2000" and "p_2001".
There are many deficiencies in the implementation of VIEWs. You may have hit one.
There are many uses of PARTITIONing that do not provide any performance. BY RANGE is probably the only variant that helps performance for some use cases. A table with less than a million rows is not worth partitioning.
Without seeing your CREATE TABLE, CREATE VIEW, and SELECT, we can only give you vague answers like I have.
(Responding to added code) Unless there is more to it than that, PARTITIONing in that way provides no benefit over having an index on DOB.
Furthermore, the VIEW + PARTITION approach (without an index) must scan the entire 2001 partition looking for the few rows for '2001-01-01'. The simple index approach, instead, can find them immediately -- 365 times as fast. (OK, not really that much faster, but still.)
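For comparison, the plain-index alternative being suggested is just this (shown against the question's PARTITION1 table; a sketch, not a fix for the view itself):
ALTER TABLE PARTITION1 ADD INDEX idx_dob (DOB);
-- with the index, the lookup seeks straight to the matching rows
SELECT ID, DOB FROM PARTITION1 WHERE DOB = '2001-01-01';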
Can someone tell me the pros and cons of HASH PARTITION vs RANGE PARTITION on a DATETIME column?
Let's consider a POS table with 20 million records, where we want to create partitions based on the transaction date's year, like
PARTITION BY HASH(YEAR(TRANSACTION_DATE)) PARTITIONS 4;
or
PARTITION BY RANGE(YEAR(TRANSACTION_DATE)) (
PARTITION p0 VALUES LESS THAN (2010),
PARTITION p1 VALUES LESS THAN (2012),
PARTITION p2 VALUES LESS THAN (2013),
PARTITION p4 VALUES LESS THAN MAXVALUE
);
to improve performance of queries with TRANSACTION_DATE BETWEEN '2013-03-01' AND '2013-09-29'
Which one is better, and why?
There are some significant differences. If you have a where clause that refers to a range of years, such as:
where year(transaction_date) between 2009 and 2011
then I don't think the hash partitioning will recognize this as hitting just one, two, or three partitions. The range partitioning should recognize this, reducing the I/O for such a query.
The more important difference has to do with managing the data. With range partitioning, once a partition has been created -- and the year has past -- presumably the partition will not be touched again. That means that you only have to back up one partition, the current partition. And, next year, you'll only need to back up one partition.
A similar situation arises if you want to move data offline. Dropping a partition containing the oldest year of data is pretty easy, compared to deleting the rows one-by-one.
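With the RANGE layout from the question, retiring the oldest year is a single metadata operation (pos is a placeholder table name):
ALTER TABLE pos DROP PARTITION p0;  -- drops everything before 2010 in one step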
When the number of partitions is only four, these considerations may not make much of a difference. The key idea is that range partitioning assigns each row to a known partition. Hash partitioning assigns each row to a partition, but you don't know exactly which one.
EDIT:
The particular optimization that reduces the reading of partitions is called "partition pruning". MySQL documents this pretty well here. In particular:
"For tables that are partitioned by HASH or KEY, partition pruning is also possible in cases in which the WHERE clause uses a simple = relation against a column used in the partitioning expression."
It would appear that partition pruning for inequalities (and even in) requires range partitioning.
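You can verify the pruning behaviour yourself with EXPLAIN, whose partitions column lists the partitions actually hit (pos is again a placeholder table name):
EXPLAIN SELECT * FROM pos
WHERE TRANSACTION_DATE BETWEEN '2013-03-01' AND '2013-09-29';
-- with RANGE(YEAR(TRANSACTION_DATE)) this should list only the partition holding 2013 (p4 in the layout above);
-- with HASH(YEAR(TRANSACTION_DATE)) a range predicate like this typically hits all partitions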
I want to keep the last 45 days of log data in a MySQL table for statistical reporting purposes. Each day could be 20-30 million rows. I'm planning on creating a flat file and using LOAD DATA INFILE to get the data in there each day. Ideally I'd like to have each day on its own partition without having to write a script to create a partition every day.
Is there a way in MySQL to just say each day gets its own partition automatically?
thanks
I would strongly suggest using Redis or Cassandra rather than MySQL to store high traffic data such as logs. Then you could stream it all day long rather than doing daily imports.
You can read more on those two (and more) in this comparison of "NoSQL" databases.
If you insist on MySQL, I think the easiest would just be to create a new table per day, like logs_2011_01_13 and then load it all in there. It makes dropping older dates very easy and you could also easily move different tables on different servers.
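A rough sketch of that per-day approach (table name, template table, and file path are all illustrative):
-- hypothetical template table holding the log columns
CREATE TABLE logs_2011_01_13 LIKE logs_template;
LOAD DATA INFILE '/var/log/exports/2011_01_13.csv'
    INTO TABLE logs_2011_01_13
    FIELDS TERMINATED BY ',';
Dropping a whole day later is then just DROP TABLE logs_2011_01_13.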
er.., number them in Mod 45 with a composite key and cycle through them...
Seriously 1 table per day was a valid suggestion, and since it is static data I would create packed MyISAM, depending upon my host's ability to sort.
Building queries to union some or all of them would be only moderately challenging.
1 table per day, and partition those to improve load performance.
Yes, you can partition MySQL tables by date:
CREATE TABLE ExampleTable (
id INT AUTO_INCREMENT,
d DATE,
PRIMARY KEY (id, d)
) PARTITION BY RANGE COLUMNS(d) (
PARTITION p1 VALUES LESS THAN ('2014-01-01'),
PARTITION p2 VALUES LESS THAN ('2014-01-02'),
PARTITION pN VALUES LESS THAN (MAXVALUE)
);
Later, when you get close to overflowing into partition pN, you can split it:
ALTER TABLE ExampleTable REORGANIZE PARTITION pN INTO (
PARTITION p3 VALUES LESS THAN ('2014-01-03'),
PARTITION pN VALUES LESS THAN (MAXVALUE)
);
This doesn't automatically partition by date, but you can reorganize when you need to. Best to reorganize before you fill the last partition, so the operation will be quick.
I have stumbled on this question while looking for something else and wanted to point out the MERGE storage engine (http://dev.mysql.com/doc/refman/5.7/en/merge-storage-engine.html).
The MERGE storage engine is more or less a simple pointer to multiple tables, and can be redone in seconds. For cycling logs, it can be very powerful! Here's what I'd do:
Create one table per day, and use LOAD DATA as the OP mentioned to fill it up. Once that is done, drop the MERGE table and recreate it, including the new table and omitting the oldest one. Once done, I could delete/archive the old table. This would let me rapidly query a specific day, or all days, as both the original tables and the MERGE table are valid.
-- CREATE TABLE ... LIKE cannot take an ENGINE clause; it copies the MyISAM definition from logs_day_45 anyway
CREATE TABLE logs_day_46 LIKE logs_day_45;
DROP TABLE IF EXISTS logs;
-- recreate the MERGE table over the current window of daily tables
CREATE TABLE logs LIKE logs_day_46;
ALTER TABLE logs ENGINE=MERGE UNION=(logs_day_2,[...],logs_day_46);
DROP TABLE logs_day_1;
Note that a MERGE table is not the same as a PARTITIONED one and offers some advantages and drawbacks. But do remember that if you are trying to aggregate across all tables it will be slower than if all the data were in only one table (the same is true for partitions, as they are basically different tables under the hood). If you are going to query mostly specific days, you will need to choose the table yourself, but if partitions are defined on the day values, MySQL will automatically pick the correct table(s), which might be faster and easier to write.
I have read the documentation (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), but I would like, in your own words, what it is and why it is used.
Is it mainly used for multiple servers so it doesn't drag down one server?
So, part of the data will be on server1, and part of the data will be on server2. And server 3 will "point" to server1 or server2...is that how it works?
Why does the MySQL documentation focus on partitioning within the same server...if the purpose is to spread it across servers?
The idea behind partitioning isn't to use multiple servers but to use multiple tables instead of one table. You can divide a table into many subtables so that old data sits in one subtable and new data in another. The database can then optimize queries that ask for new data, knowing that it lives in the second subtable. What's more, you define how the data is partitioned.
Simple example from the MySQL Documentation:
CREATE TABLE employees (
id INT NOT NULL,
fname VARCHAR(30),
lname VARCHAR(30),
hired DATE NOT NULL DEFAULT '1970-01-01',
separated DATE NOT NULL DEFAULT '9999-12-31',
job_code INT,
store_id INT
)
PARTITION BY RANGE ( YEAR(separated) ) (
PARTITION p0 VALUES LESS THAN (1991),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (2001),
PARTITION p3 VALUES LESS THAN MAXVALUE
);
This allows you to speed up things such as:
Dropping old data with a simple:
ALTER TABLE employees DROP PARTITION p0;
The database can also speed up a query like this:
SELECT COUNT(*)
FROM employees
WHERE separated BETWEEN '2000-01-01' AND '2000-12-31'
GROUP BY store_id;
Knowing that all the matching data is stored only in the p2 partition.
A partitioned table is a single logical table that’s composed of multiple physical subtables.
The partitioning code is really just a wrapper around a set of Handler objects
that represent the underlying partitions, and it forwards requests to the storage engine
through the Handler objects. Partitioning is a kind of black box that hides the underlying
partitions from you at the SQL layer, although you can see them quite easily by
looking at the filesystem, where you’ll see the component tables with a hash-delimited
naming convention.
For example,
here’s a simple way to place each year’s worth of sales into a separate partition:
CREATE TABLE sales (
order_date DATETIME NOT NULL,
-- Other columns omitted
) ENGINE=InnoDB PARTITION BY RANGE(YEAR(order_date)) (
PARTITION p_2010 VALUES LESS THAN (2010),
PARTITION p_2011 VALUES LESS THAN (2011),
PARTITION p_2012 VALUES LESS THAN (2012),
PARTITION p_catchall VALUES LESS THAN MAXVALUE );
read more here.
It is not really about using different server instances (although that is sometimes a possibility); it is more about dividing your tables into different physical partitions.
It divides your tables and indexes into smaller pieces, which can themselves be subdivided into even smaller pieces.
Think of it as having several million different magazines of different topics and different years (say 2000-2019) all in one big warehouse (one big table). Partitioning would mean that you would put them organized in different rooms inside that big warehouse. They still belong together inside the one warehouse, but now you group them on a logical level, depending on your database partitioning strategy.
Indexing is actually like keeping a table of which magazine is where in your warehouse, or in the rooms inside your warehouse. As you can see, there is a big difference between database partitioning and indexing, and they can very well be used together.
You can read more about it on my website on this article about Database Partitioning