I can't find info online about this.
What is the best way to alter a table that is already partitioned?
Should I just use the normal
ALTER TABLE `table` MODIFY COLUMN `column_name` TINYINT(1) DEFAULT 1 NOT NULL;
and lock the table for several minutes,
or should I run that command partition by partition? Something like (I am guessing at the syntax):
ALTER TABLE `table` PARTITION (p0) MODIFY COLUMN `column_name` TINYINT(1) DEFAULT 1 NOT NULL;
What are your recommendations?
What happens if not all partitions are exactly equal? Is that even possible?
This is the create statement:
CREATE TABLE `redirects` (
`emailhash` varchar(100) NOT NULL,
`f_email_log` varchar(50) NOT NULL,
`linknum` int(11) NOT NULL DEFAULT '1',
`redirect` varchar(500) NOT NULL,
`clicked` int(11) NOT NULL DEFAULT '0',
`clicktime` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`emailhash`),
KEY `f_email_log` (`f_email_log`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (emailhash)
PARTITIONS 16 */
The table has around 40 million records.
I want to reduce the size of some fields, such as INT to TINYINT, since those values are mostly 1-30 or 0/1. I also want to reduce the varchar lengths, since I've found that those lengths are too large.
Altering a partitioned table requires altering each partition one at a time. Meanwhile, the entire table needs to be locked; otherwise, reads and writes will stumble over a half-finished ALTER.
Please provide SHOW CREATE TABLE, the number of partitions, the rationale for partitioning at all, and indicate which column needs changing. We may be able to suggest a work-around.
More
400M rows would be about 12GB for that schema?
4GB buffer_pool (which could be raised to 11G for that much RAM)
md5 for key
--> 67% of inserts and selects will not find the desired block in RAM (cache), so would have to hit the disk. This leads to sluggish performance. It will only get worse as the table grows. And it won't matter whether it is partitioned or not. (No I cannot explain the difference you report.)
See here for more discussion, but no good solution for your use case.
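The 67% figure above is simple arithmetic if one assumes that MD5 keys spread accesses uniformly over the whole table, so the chance of finding a block in the buffer pool is roughly pool size divided by data size. A rough sketch in Python (the function name is made up for illustration, and the model ignores indexes, hot spots, and anything else sharing the pool):

```python
def estimated_miss_ratio(data_gb: float, buffer_pool_gb: float) -> float:
    """Fraction of uniformly random lookups expected to miss the cache."""
    if buffer_pool_gb >= data_gb:
        return 0.0  # everything fits in RAM
    return 1.0 - buffer_pool_gb / data_gb


current = estimated_miss_ratio(12, 4)   # 12 GB table, 4 GB buffer pool
raised = estimated_miss_ratio(12, 11)   # same table, pool raised to 11 GB

print(f"4 GB pool:  {current:.0%} of lookups miss")
print(f"11 GB pool: {raised:.0%} of lookups miss")
```

Under those assumptions, raising the buffer pool helps far more than any change to the partitioning.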
Shrinking the datatypes (4-byte INT --> 1-byte TINYINT UNSIGNED, etc) will help some. UNHEX(md5) would let you put the hash in 16 bytes: BINARY(16), thereby saving something like 18 bytes over what you have now. Shrinking the max on VARCHAR has little or no effect. Ditto for CHARACTER SET.
The query would then need WHERE emailhash = UNHEX('abcdef1234567890')
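To make the size difference concrete, here is a small Python sketch of what UNHEX() does to an MD5 hash. It assumes emailhash holds the hex text of an MD5; the sample address is made up.

```python
import hashlib

email = "user@example.com"  # made-up sample value

# The 32-character hex text, as a VARCHAR column would store it.
hex_hash = hashlib.md5(email.encode("utf-8")).hexdigest()

# The same hash as raw bytes, which is what UNHEX() produces for BINARY(16).
raw_hash = bytes.fromhex(hex_hash)

print(len(hex_hash), "characters as hex text")
print(len(raw_hash), "bytes as raw binary")
```

The application would have to apply the same conversion (or call UNHEX() in SQL) on every insert and lookup.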
ALTER
Back to the original question of how to do the ALTER "fast". Unless you already have replication set up, you are mostly out of luck. The partitions must always have the same schema, so your idea about altering them one-by-one is not possible.
But... check pt-online-schema-change and gh-ost to see if they will work with partitioned tables.
Related
Looking for some guidance on how to best tackle partitioning on some database tables for the purpose of archiving/deleting data over a certain age. The main reason for this is to resolve some issues in database size.
You can think of the data as akin to telemetry data: it grows over time, but once it enters the database it doesn't change after the first 10-15 minutes. Within that window, conflicting data may require the application to update a recent record (max 15 mins).
Current database size is approaching 500GB and is sitting on NVMe storage across a 3-node Galera cluster in three cities. Backups are becoming increasingly large, and if an SST is needed between nodes it can take a couple of hours to complete, which is less than ideal.
The plan to deal with this is by way of archiving, where we plan to off-board historical data to another server (say once a year) with slower storage that can then be backed up once and won't change for 12 months. The historical data will be rarely accessed, and in the event it is, our application will query the archive server for anything older than a certain date instead of the production servers that are relied on heavily for "recent" data.
We have 3 tables per customer, and they reference each other in a sort of hierarchy. There are no foreign keys in the tables, but they do hold references to one another and are used in JOIN queries. E.g. the summary table is the top of the hierarchy and holds one record per "event". Under this is the details table, and there could be 1-10 detail records sitting under the summary event. Under details is the digits table, which could include 0-10 records per detail record.
CREATE TABLE statements below:
CREATE TABLE `summary_X` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_utc` datetime DEFAULT NULL,
`end_utc` datetime DEFAULT NULL,
`total_duration` smallint(6) DEFAULT NULL,
`legs` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `start_utc` (`start_utc`)
) ENGINE=InnoDB
CREATE TABLE `details_X` (
`xid` bigint(20) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`duration` smallint(6) DEFAULT NULL,
`start_utc` timestamp NULL DEFAULT NULL,
`end_utc` timestamp NULL DEFAULT NULL,
`event` varchar(2) DEFAULT NULL,
`event_time` smallint(6) DEFAULT NULL,
`event_a` varchar(7) DEFAULT NULL,
`event_b` varchar(7) DEFAULT NULL,
`ani` varchar(20) DEFAULT NULL,
`dnis` varchar(10) DEFAULT NULL,
`first_time` varchar(30) DEFAULT NULL,
`final_time` varchar(30) DEFAULT NULL,
`digits_count` int(2) DEFAULT 0,
`sys_a` varchar(3) DEFAULT NULL,
`sys_b` varchar(3) DEFAULT NULL,
`log_id_a` varchar(12) DEFAULT NULL,
`seq_a` varchar(1) DEFAULT NULL,
`log_id_b` varchar(12) DEFAULT NULL,
`seq_b` varchar(1) DEFAULT NULL,
`assoc_log_id_a` varchar(12) DEFAULT NULL,
`assoc_log_id_b` varchar(12) DEFAULT NULL,
PRIMARY KEY (`xid`),
KEY `start_utc` (`start_utc`),
KEY `end_utc` (`end_utc`),
KEY `event_a` (`event_a`),
KEY `event_b` (`event_b`),
KEY `id` (`id`),
KEY `final_digits` (`final_digits`),
KEY `log_id_a` (`log_id_a`),
KEY `log_id_b` (`log_id_b`)
) ENGINE=InnoDB
CREATE TABLE `digits_X` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`leg_id` bigint(20) DEFAULT NULL,
`sequence` int(2) NOT NULL,
`digits` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `digits` (`digits`),
KEY `leg_id` (`leg_id`)
) ENGINE=InnoDB
My first thought was to partition on year. That sounds easy enough, but we don't have a date column on the digits table, so records there could be orphaned from their mapped details record and no longer match in a JOIN on the archive server.
We then can also have a similar issue with summary and the timestamps on the "details" records could span multiple years. Eg. Summary event starts at 2021-12-31 23:55:00. First detail record is same timestamp, and then the next detail under the same event could be 2022-01-01 00:11:00. If 2021 partition was archived off to the other server, the 2022 detail would be orphaned and no longer JOIN to the 2021 summary event.
One alternative could be not to partition at all and do SELECT/INSERT/DELETE which isn't practical with the volume of data. Some tables have 30M-40M rows per year so this would be very resource taxing. There are also 400+ customers each with their own sets of tables.
Another option I thought of was to add a "Year" column to the three tables to partition on, holding the year of the first event across all of them, so that all related records end up on the same partition/server. But this seems like a waste of space, and there should be a better way.
Any thoughts or guidance would be appreciated.
To add PARTITIONing will require copying the entire table over. That will involve downtime and disk space. If you can live with that, then...
PARTITION BY RANGE(...) where the expression involves, say, TO_DAYS(...) or possibly TO_SECONDS(...). Then set up cron jobs to add a new partition periodically. (There is nothing automated for such.) And to detach the oldest partition. See Partition for a discussion of the details. (TO_DAYS avoids the need for a 'year' column.)
Note that Partitioning is implemented as several sub-tables under a table. With "transportable tablespaces", you can detach a partition from the big table, turning it into a table unto itself. At that point, you are free to move it to another server or something.
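As a rough idea of what such a cron job might run, here is a Python sketch that only builds the two statements for a yearly rotation; it does not connect to anything. It assumes the table is already partitioned BY RANGE (TO_DAYS(start_utc)) with partitions named pYYYY and no MAXVALUE partition, that the partitioning column is part of the primary key, and that an empty, non-partitioned table with identical structure exists for the exchange. The function name and the `_archive_YYYY` naming are hypothetical.

```python
from datetime import date


def yearly_partition_statements(table: str, new_year: int, oldest_year: int) -> list[str]:
    """Build the SQL to add next year's partition and detach the oldest one."""
    upper_bound = date(new_year + 1, 1, 1).isoformat()  # exclusive upper limit
    add_partition = (
        f"ALTER TABLE `{table}` ADD PARTITION "
        f"(PARTITION p{new_year} VALUES LESS THAN (TO_DAYS('{upper_bound}')))"
    )
    # Swaps the partition's rows into a standalone table that can be moved away.
    detach_oldest = (
        f"ALTER TABLE `{table}` EXCHANGE PARTITION p{oldest_year} "
        f"WITH TABLE `{table}_archive_{oldest_year}`"
    )
    return [add_partition, detach_oldest]


for statement in yearly_partition_statements("summary_X", 2023, 2021):
    print(statement + ";")
```

After the exchange, the emptied partition can be dropped and the standalone table copied to the archive server.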
In a situation like yours, I might consider the following.
Write the raw data to a file (perhaps one per day) for archiving;
Insert into a table that will live only briefly; this will be purged by some means frequently;
Update "normalization" tables
"Summarize" the data into Summary Tables, where each set of rows covers one hour (or whatever makes sense).
Write "reports" from the summary table(s).
Be aware that each Partition takes an extra 5.5MB (average), so do not make many partitions. Or do you need only 2, each containing 15 minutes' data?
Meanwhile, I would look carefully at the schema. Can an INT (4 bytes) be turned into a SMALLINT (2 bytes)? Can more things be normalized?
digits_count int(2) -- that is a 4-byte INT; the (2) is only a display width with no effect on storage or range, and it is deprecated in MySQL 8. (MariaDB may follow suit someday.) It sounds like you need only a 1-byte TINYINT UNSIGNED (range: 0..255).
Since this is log info, be aware of Daylight Saving Time with respect to DATETIME. (One hour per year is missing; another hour repeats.) This problem does not occur with TIMESTAMP. DATETIME takes 5 bytes and TIMESTAMP takes 4 (more if you include fractional seconds).
(I can't advise on unnecessary indexes without seeing the queries.) SHOW TABLE STATUS will tell you how much space is being consumed by all the indexes.
Are the 3 tables of similar size?
Re "orphaning" -- You need at least 2 partitions -- one being filled (0-100% full) and an older partition (100% full)
"30M-40M rows per year" times 400 customers. Does that add up to 500 rows inserted per second? Are they INSERTed one row at a time? High speed ingestion
Are there more deletes and selects than inserts? And/or do they involve more than single rows? (I'm fishing for more info to help with some other issues you either have or are threatening to have.) Even with deletes and no partitioning, the disk growth will slow down as free space is generated, then reused. ("Rinse and repeat.")
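On the insert rate asked about above, the back-of-envelope arithmetic looks like this in Python. It treats 30M-40M rows per year as applying to every one of the 400 customers and spreads the inserts evenly over the year, so it is an upper-end average, not a peak figure.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
CUSTOMERS = 400


def rows_per_second(rows_per_customer_per_year: int) -> float:
    """Average insert rate across all customers, assuming an even spread."""
    return rows_per_customer_per_year * CUSTOMERS / SECONDS_PER_YEAR


low = rows_per_second(30_000_000)
high = rows_per_second(40_000_000)
print(f"{low:.0f} to {high:.0f} rows per second on average")
```

That lands in the neighborhood of 400-500 rows per second, which is high enough that batching the INSERTs is worth considering.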
Without partitioning, see Huge Deletes. But... DELETEing data from a table does not shrink its disk footprint. However, if each 'customer' has 1/400th of the data, and (of course) you do each customer separately, then there may not be any disk problem.
I've given you a lot to think about. Answer some of my questions; I may have more advice.
In the table of 350 million records, the structure is:
CREATE TABLE `table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`job_id` int(10) unsigned NOT NULL,
`lock` mediumint(6) unsigned DEFAULT '0',
`time` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `job_id` (`job_id`),
KEY `lock` (`lock`),
KEY `time` (`time`)
) ENGINE=MyISAM;
What index should I create to speed up the query:
UPDATE `table` SET `lock` = 1 WHERE `lock` = 0 ORDER BY `time` ASC LIMIT 500;
lock is declared to be NULLable. Does this mean that the value is often NULL? If so, then there is a nasty problem in MyISAM (not InnoDB) that may lead to 500 additional fragmentation hits.
When a MyISAM row is updated and it becomes longer, the row will no longer fit where it is. (Now my detailed knowledge gets fuzzy.) The new row will be put somewhere else and/or it will be broken into two parts, with a link between the parts. That implies writes in two places.
As Gordon pointed out, any change to an indexed column, lock in your case, involves a costly index update -- remove a 'row' from one place in the index's BTree and add a row in another place.
Does lock have only values 0 or 1? Then use TINYINT (1 byte), not MEDIUMINT (3 bytes).
You should check MAX(id). If it is clean, id's max will be about 350M (not too close to the limit of 4B). But if there has been any churn, it may be much closer to the limit.
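A quick way to judge the headroom once you have MAX(id), sketched in Python. The 350M value is just the row count from the question; the real MAX(id) may be higher if rows have been deleted and re-inserted.

```python
INT_UNSIGNED_MAX = 2**32 - 1  # 4,294,967,295, the largest INT UNSIGNED value


def id_space_used(max_id: int) -> float:
    """Fraction of the INT UNSIGNED range already consumed by AUTO_INCREMENT."""
    return max_id / INT_UNSIGNED_MAX


print(f"{id_space_used(350_000_000):.1%} of the id range used")
```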
I, too, advocate switching to InnoDB. However your 10GB (data+indexes) will grow to 20-30GB in the conversion.
Are you "locking the oldest unlocked" thingies? Will you then do a select to see what got locked?
If this is too slow, don't do 500 at once, pick a lower number.
With InnoDB, can you avoid locking? Perhaps transactional locking would suffice?
I think we need to see the rest of the environment -- other tables, job "flow", etc. There may be other things we can suggest.
And I second the motion for INDEX(lock, time). But when doing so, DROP the index on just lock as being redundant.
And when converting to InnoDB, do all the index changes in the same ALTER. This will run faster than separate passes.
For this query:
UPDATE `table`
SET `lock` = 1
WHERE `lock` = 0
ORDER BY `time` ASC
LIMIT 500;
The best index is table(lock, time). Do note, however, that the update also needs to update the index, so you should test how well this works in practice. Do not make this a clustered index. That will just slow down the process.
I have a really simple query to get MIN and MAX values. It looks like:
SELECT MAX(value_avg)
, MIN(value_avg)
FROM value_data
WHERE value_id = 769
AND time_id BETWEEN 214000 AND 219760;
And here is the schema of the value_data table:
CREATE TABLE `value_data` (
`value_id` int(11) NOT NULL,
`time_id` bigint(20) NOT NULL,
`value_min` float DEFAULT NULL,
`value_avg` float DEFAULT NULL,
`value_max` float DEFAULT NULL,
KEY `idx_vdata_vid` (`value_id`),
KEY `idx_vdata_tid` (`time_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
As you can see, the query and the table are simple and I don't see anything wrong here, but the query takes about 9 seconds to return data. I also profiled it, and 99% of the time is spent in "Sending data".
The table is really big and it weighs about 2 GB, but is it a problem? I don't think this table is too big, it must be something else...
MySQL can easily handle a database of that size. However, you should be able to improve the performance of this query and probably the table in general. By changing the time_id column to an UNSIGNED INT NOT NULL, you can significantly decrease the size of the data and indexes on that column. Also, the query you mention could benefit from a composite index on (value_id, time_id). With that index, it would be able to use the index for both parts of the query instead of just one as it is now.
Also, please edit your question with an EXPLAIN of the query. It should confirm what I expect about the indexes, but it's always helpful information to have.
Edit:
You don't have a PRIMARY index defined for the table, which definitely isn't helping your situation. If the values of (value_id, time_id) are unique, you should probably make the new composite index I mention above the PRIMARY index for the table.
We are running a MySQL/MyISAM database with the following table:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`tm_stamp`,`fk_channel`)
);
The tm_stamp-fk_channel combination must be unique, hence the compound primary key. Now, for reasons not relevant here, the database will be migrated to the InnoDB engine. After some googling, I found out that the primary key will dictate the physical ordering of the data on disk. 90% of the queries currently look like this:
SELECT value FROM measurements
WHERE fk_channel=A AND tm_stamp>=B and tm_stamp<=C
ORDER BY tm_stamp ASC
Inserts are 99% in order of tm_stamp; it is storage for a network of data loggers. The table has low millions of rows but is growing steadily. The questions are:
Should the change of storage engine alone result in any significant performance change, better or worse?
Does the order of columns in the index matter with regard to the most popular SELECT? This blog suggests something along those lines.
Thanks to the nature of clustered index, may we perhaps leave out the ORDER BY clause and gain some performance?
Edit 1:
It appears that changing the primary key from
PRIMARY KEY (`tm_stamp`,`fk_channel`)
to
PRIMARY KEY (`fk_channel`,`tm_stamp`)
always makes sense, for both MyISAM and InnoDB. See http://sqlfiddle.com/#!2/0aa08/1 for proof this is so.
Original answer:
To determine if changing
PRIMARY KEY (`tm_stamp`,`fk_channel`)
to
PRIMARY KEY (`fk_channel`,`tm_stamp`)
would improve your query's performance, you need to determine which field has the higher cardinality (that is, which field's values are more varied). Running
SELECT COUNT(DISTINCT tm_stamp), COUNT(DISTINCT fk_channel) FROM measurements;
will give you the cardinality of the columns.
So, to answer your question properly we first need to know: What is the common range of values between B and C? 60? 3,600? 86,400? More?
For example, let's say that
SELECT COUNT(DISTINCT tm_stamp), COUNT(DISTINCT fk_channel) FROM measurements;
returns 32,768 and 256. 32,768 divided by 256 is 128. This tells us that tm_stamp has 128 unique values for every value of fk_channel.
So if the difference between B and C is usually less than 128, then leave tm_stamp as the first field in the primary key. If 128 or greater, then make fk_channel the first field.
Another question: Does fk_channel need to be an INT (4 billion unique values, half of which are negative)? If not, then changing fk_channel to TINYINT UNSIGNED (if you have 256 unique values), or SMALLINT UNSIGNED (65536 unique values) would save a lot of time and space.
For example, let's say you have at most 256 possible fk_channel values and 65,536 possible values for the value column. Then you could change your schema via:
create table measurements_new (
tm_stamp INT UNSIGNED NOT NULL DEFAULT '0',
fk_channel TINYINT UNSIGNED NOT NULL DEFAULT '0', -- remove UNSIGNED if values can be negative
value SMALLINT UNSIGNED DEFAULT NULL, -- remove UNSIGNED if values can be negative
PRIMARY KEY (tm_stamp,fk_channel)
) ENGINE=InnoDB
SELECT
tm_stamp,
fk_channel,
value
FROM
measurements
ORDER BY
tm_stamp,
fk_channel;
RENAME TABLE measurements TO measurements_old, measurements_new TO measurements;
This will store the existing data in the new table in PRIMARY KEY order, which will improve performance somewhat.
Staring at the Query
SELECT value FROM measurements
WHERE fk_channel=A AND tm_stamp>=B and tm_stamp<=C
ORDER BY tm_stamp ASC
Your static value is fk_channel and the moving, ordered value is tm_stamp. This addresses your second question, which seems to be at the heart of the query's needs.
You would be way better off with the PRIMARY KEY columns reversed:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`fk_channel`,`tm_stamp`)
);
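To see why the column order matters, here is a toy model in Python, not MySQL: a sorted list of tuples stands in for the index B-tree, and bisect stands in for the index seek. The data set (10 channels, each with a row at 1000 timestamps) and the constants A, B, C are made up for illustration.

```python
from bisect import bisect_left, bisect_right

CHANNELS = range(10)
TIMESTAMPS = range(1000)
A, B, C = 3, 100, 199  # fk_channel = A AND tm_stamp BETWEEN B AND C

# Two "indexes" over the same rows, differing only in column order.
by_channel_time = sorted((ch, ts) for ch in CHANNELS for ts in TIMESTAMPS)
by_time_channel = sorted((ts, ch) for ch in CHANNELS for ts in TIMESTAMPS)

# (fk_channel, tm_stamp): the wanted rows form one contiguous run.
lo = bisect_left(by_channel_time, (A, B))
hi = bisect_right(by_channel_time, (A, C))
examined_channel_first = hi - lo

# (tm_stamp, fk_channel): every channel in the time range must be walked,
# and the unwanted channels filtered out afterwards.
lo = bisect_left(by_time_channel, (B, min(CHANNELS)))
hi = bisect_right(by_time_channel, (C, max(CHANNELS)))
scanned = by_time_channel[lo:hi]
examined_time_first = len(scanned)
matches_time_first = sum(1 for ts, ch in scanned if ch == A)

print("channel-first entries examined:", examined_channel_first)
print("time-first entries examined:   ", examined_time_first)
print("rows actually wanted:          ", matches_time_first)
```

In this toy data the time-first order touches ten times as many entries for the same result, and the gap grows with the number of channels.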
As for the first question, the storage engine dictates what gets cached.
MyISAM caches index pages only in the Key Cache (sized by key_buffer_size)
InnoDB caches data and indexes in the Buffer Pool (sized by innodb_buffer_pool_size)
I wrote about this in the DBA StackExchange
If you remain with MyISAM, you could change the primary key to include the value column:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`fk_channel`,`tm_stamp`,`value`)
) ENGINE=MyISAM;
That way, your Query's data retrieval is strictly from one file at most, the .MYI of the MyISAM table. The table need not be read at all.
If you switch to InnoDB, fk_channel,tm_stamp gets loaded twice into RAM
Once from InnoDB data page
Once from InnoDB index page
The order of your arguments in the WHERE clause is irrelevant here; the optimizer will pick the best key option (usually a direct equality comparison on an indexed field over a > or < comparison). With your initial example, the best option was the range comparison on tm_stamp, which is not a direct equality check and is therefore sub-par.
However, the order of the columns in the clustered key does matter. If the exact comparison is always on the fk_channel column, I'd change the PK to be:
PRIMARY KEY (`fk_channel`,`tm_stamp`)
Now you've got an index that will benefit from the fk_channel=A in your where clause.
Also, while the storage engine plays some role, I don't think the issue here is between InnoDB and MyISAM.
Finally, I don't think the ORDER BY clause has much bearing on your issue, that's done post query. A group by could affect your performance....
I have the following two tables in my database (the indexing is not complete as it will be based on which engine I use):
Table 1:
CREATE TABLE `primary_images` (
`imgId` smallint(6) unsigned NOT NULL AUTO_INCREMENT,
`imgTitle` varchar(255) DEFAULT NULL,
`view` varchar(45) DEFAULT NULL,
`secondary` enum('true','false') NOT NULL DEFAULT 'false',
`imgURL` varchar(255) DEFAULT NULL,
`imgWidth` smallint(6) DEFAULT NULL,
`imgHeight` smallint(6) DEFAULT NULL,
`imgDate` datetime DEFAULT NULL,
`imgClass` enum('jeans','t-shirts','shoes','dress_shirts') DEFAULT NULL,
`imgFamily` enum('boss','lacoste','tr') DEFAULT NULL,
`imgGender` enum('mens','womens') NOT NULL DEFAULT 'mens',
PRIMARY KEY (`imgId`),
UNIQUE KEY `imgDate` (`imgDate`)
)
Table 2:
CREATE TABLE `secondary_images` (
`imgId` smallint(6) unsigned NOT NULL AUTO_INCREMENT,
`primaryId` smallint(6) unsigned DEFAULT NULL,
`view` varchar(45) DEFAULT NULL,
`imgURL` varchar(255) DEFAULT NULL,
`imgWidth` smallint(6) DEFAULT NULL,
`imgHeight` smallint(6) DEFAULT NULL,
`imgDate` datetime DEFAULT NULL,
PRIMARY KEY (`imgId`),
UNIQUE KEY `imgDate` (`imgDate`)
)
Table 1 will be used to create a thumbnail gallery with links to larger versions of the image. imgClass, imgFamily, and imgGender will refine the thumbnails that are shown.
Table 2 contains images related to those in Table 1. Hence the use of primaryId to relate a single image in Table 1, with one or more images in Table 2. This is where I was thinking of using the Foreign Key ability of InnoDB, but I'm also familiar with the ability of Indexes in MyISAM to do the same.
Without delving too much into the remaining fields, imgDate is used to order the results.
Last, but not least, I should mention that this database is READ ONLY. All data will be entered by me. I have been told that if a database is read only, it should be MyISAM, but I'm hoping you can shed some light on what you would do in my situation.
Always use InnoDB by default.
In MySQL 5.1 or later, you should use InnoDB. In MySQL 5.1, you should enable the InnoDB plugin. In MySQL 5.5, the InnoDB plugin is enabled by default, so just use it.
The advice years ago was that MyISAM was faster in many scenarios. But that is no longer true if you use a current version of MySQL.
There may be some exotic corner cases where MyISAM performs marginally better for certain workloads (e.g. table-scans, or high-volume INSERT-only work), but the default choice should be InnoDB unless you can prove you have a case that MyISAM does better.
Advantages of InnoDB besides the support for transactions and foreign keys that is usually mentioned include:
InnoDB is more resistant to table corruption than MyISAM.
Row-level locking. In MyISAM, readers block writers and vice-versa.
Support for large buffer pool for both data and indexes. MyISAM key buffer is only for indexes.
MyISAM is stagnant; all future development will be in InnoDB.
See also my answer to MyISAM versus InnoDB
MyISAM does not support transactions, so it cannot give you MySQL-level consistency checks. For instance, if you want to update the imgId on both tables as a single transaction, you need InnoDB:
START TRANSACTION;
UPDATE primary_images SET imgId=2 WHERE imgId=1;
UPDATE secondary_images SET imgId=2 WHERE imgId=1;
COMMIT;
Another drawback is integrity checking: using InnoDB you can have the server do some error checking for you, such as preventing duplicate values in the UNIQUE KEY imgDate (imgDate) field. Trust me, this really comes in handy and is far less error-prone. In my opinion, MyISAM is for playing around, while more serious work should rely on InnoDB.
Hope it helps
A few things to consider :
Do you need transaction support?
Will you be using foreign keys?
Will there be a lot of writes on a table?
If the answer to any of these questions is "yes", then you should definitely use InnoDB.
Otherwise, you should answer the following questions :
How big are your tables?
How many rows do they contain?
What is the load on your database engine?
What kind of queries do you expect to run?
Unless your tables are very large and you expect large load on your database, either one works just fine.
I would prefer MyISAM because it scales pretty well for a wide range of data-sizes and loads.
I would like to add something that people may benefit from:
I've just created an InnoDB table (leaving everything at the default, except changing the collation to Unicode) and populated it with about 300,000 records (rows).
Queries like SELECT COUNT(id) FROM table would hang until giving an error message, without returning a result.
I cloned the table with its data into a new MyISAM table,
and that same query, along with other large SELECT queries, returned fast, and everything worked OK.