MySQL 'Partitioning' vs splitting data into different tables

We have a MySQL table called posts_content.
The structure is as follows:
CREATE TABLE IF NOT EXISTS `posts_content` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`post_id` int(11) NOT NULL,
`forum_id` int(11) NOT NULL,
`content` longtext CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=79850 ;
The problem is that the table is getting pretty huge: many gigabytes of data (we have a crawling engine).
We keep inserting data into the table on a daily basis, but we seldom retrieve the data. Now that the table is getting so large, it is becoming difficult to handle.
We discussed two possibilities:
Use MySQL's partitioning feature to partition the table using the forum_id (there are about 50 forum_ids, so there would be about 50 partitions). Note that each partition, even so, will eventually grow to many gigabytes of data and maybe even eventually need its own drive.
Create separate tables for each forum_id and split the data like that.
I hope I have clearly explained the problem. What I need to know is which of the above two would be the better solution in the long run, and what the advantages and disadvantages of each are.
Thank you.

The difference is that in the first case you leave MySQL to do the sharding, and in the second case you are doing it on your own. MySQL won't scan any shards that do not contain the data; however, if you have a query with WHERE forum_id IN (...), it may need to scan several shards. As far as I remember, in that case the operation is synchronous, i.e. MySQL queries one partition at a time, and you may want to implement it asynchronously. Generally, if you do the partitioning on your own, you are more flexible, but for simple partitioning based on the forum_id, if you query only one forum_id at a time, MySQL partitioning is OK.
My advice is to read the MySQL documentation on partitioning, especially the restrictions and limitations section, and then decide.
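For illustration (not part of the original answer), partitioning by forum_id on the table above might look roughly like this. Note that MySQL requires the partitioning column to be part of every unique key, so the primary key has to be widened first:
-- Sketch only: widen the PK so it includes the partitioning column.
ALTER TABLE posts_content
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (id, forum_id);
-- Then partition; with ~50 forum_ids, HASH with 50 partitions is the simplest form.
ALTER TABLE posts_content
  PARTITION BY HASH (forum_id)
  PARTITIONS 50;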

Although this is an old post, a caveat regarding partitioning if your engine is still MyISAM: MySQL 8.0 only supports partitioning on the InnoDB and NDB storage engines. In that case you have to convert your MyISAM table to InnoDB or NDB, but you need to remove partitioning before converting it, else it cannot be used afterwards.
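The order of operations described above would look roughly like this (sketch, assuming a partitioned MyISAM table named posts_content):
ALTER TABLE posts_content REMOVE PARTITIONING;  -- drop the partitioning first
ALTER TABLE posts_content ENGINE=InnoDB;        -- then convert the storage engine
-- The table can be re-partitioned afterwards under InnoDB if still wanted.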

Here you have a good answer to your question: https://dba.stackexchange.com/a/24705/15243
Basically, let your system grow while you get familiar with partitioning, and when your system really needs to be "cropped into pieces", do it with partitioning.

A quick solution for 3x space shrinkage (and probably a speedup) is to compress the content and put it into a MEDIUMBLOB. Do the compression in the client, not the server; this saves on bandwidth and allows you to distribute the computation among the many client servers you have (or will have).
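The schema side of that suggestion is just a column type change (sketch; the compression itself, e.g. gzip/zlib, stays in the client):
ALTER TABLE posts_content
  MODIFY content MEDIUMBLOB NOT NULL;  -- holds the client-compressed bytes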
"Sharding" is separating the data across multiple servers. See MariaDB and Spider. This allows for size growth and possibly performance scaling. If you end up sharding, the forum_id may be the best. But that assumes no forum is too big to fit on one server.
"Partitioning" splits up the data, but only within a single server; it does not appear that there is any advantage for your use case. Partitioning by forum_id will not provide any performance.
Remove the FOREIGN KEYs; debug your application instead.

Related

Benchmark MySQL with batch insert on multiple threads within same table

I want to test high-intensity writes between the InnoDB and MyRocks engines of MySQL. For this purpose, I use sysbench to benchmark. My requirements are:
multiple threads writing concurrently to the same table.
support for batch inserts (each insert transaction inserts a bulk of records)
I checked all the pre-made sysbench tests and I don't see any that satisfy my requirements.
oltp_write_only: supports multiple threads writing to the same table, but this test doesn't have a bulk insert option.
bulk_insert: supports multiple threads, but each thread writes to a different table.
Are there any pre-made sysbench tests that satisfy my requirements? If not, is there somewhere I can find custom Lua scripts that already do this?
(from Comment:)
CREATE TABLE IF NOT EXISTS `tableA` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` VARCHAR(63) NOT NULL DEFAULT '',
`data` JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
(From a MySQL point of view...)
Toss id and the PK -- saves 8 bytes per row.
Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)).
Doing those will
Shrink the disk I/O needed (providing some speedup)
Eliminate one of the indexes (probably a significant part of the post-load processing)
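A sketch of the reworked DDL implied by the two suggestions above (illustrative only; the JSON DEFAULT is dropped because literal defaults on JSON columns are not allowed):
CREATE TABLE IF NOT EXISTS `tableA` (
  `user_id` VARCHAR(63) NOT NULL,
  `data` JSON NOT NULL,
  PRIMARY KEY (`user_id`)   -- replaces both the old PK and the UNIQUE index
) ENGINE = InnoDB;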
Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.
Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.
Another thought...
What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.
If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.
If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.
You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.
Another thing: If you promote user_id to PK, then this is worth considering: Use a file sort to sort the table by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, then this extra sort would be wasted.)

MySQL performance on large, write-only table

Thanks in advance for your answers, and sorry for my bad English; I'm not a native speaker.
We're currently developing a mobile game with a backend. In this mobile game, we've got a money system, and we keep track of each transaction for verification purposes.
In order to read a user's balance, we've got an intermediary table in which the balance is updated on each transaction, so the transaction table is never read directly by users, in order to reduce load under high traffic.
The transaction table is only read from time to time, in the back office.
Here is the schema of the transaction table :
create table money_money_transaction (
`id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
`userID` INT UNSIGNED NOT NULL,
`amount` INT NOT NULL,
`transactionType` TINYINT NOT NULL,
`created` DATETIME NOT NULL,
CONSTRAINT money_money_transaction_userID FOREIGN KEY (`userID`) REFERENCES `user_user` (`id`)
ON DELETE CASCADE
);
We plan to have a lot of users; the transaction table could grow up to 1 billion rows, so my questions are:
Will it affect the performance of other tables?
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Will MySQL be able to scale correctly up to a billion rows? Knowing that we mostly insert, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
You might consider MyRocks (see http://myrocks.io), which is a third-party storage engine that is designed for fast INSERT speed and compressed data storage. I won't make a recommendation that you should switch to MyRocks, because I don't have enough information to make an unequivocal statement about it for your workload. But I will recommend that it's worth your time to evaluate it and see if it works better for your application.
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Yes, MySQL (assuming InnoDB storage engine) stores partial tables in RAM, in the buffer pool. It breaks down tables into pages, and fits pages in the buffer pool as queries request them. It's like a cache. Over time, the most requested pages stay in the buffer pool, and others get evicted. So it more or less balances out to serve most of your queries as quickly as possible. Read https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html for more information.
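As a rough way to check this on your own server (not from the answer), you can compare logical read requests against the reads that actually had to hit disk:
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- A high Innodb_buffer_pool_reads relative to Innodb_buffer_pool_read_requests
-- means many reads are missing the buffer pool and going to disk.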
Will it affect the performance of other tables?
Tables don't have performance — queries have performance.
The buffer pool has fixed size. Suppose you have six tables that need to share it, their pages must fit into the same buffer pool. There's no way to set priorities for each table, or dedicate buffer pool space for certain tables or "lock" them in RAM. All pages of all tables share the same buffer pool. So as your queries request pages from various tables, they do affect each other in the sense that frequently-requested pages from one table may evict pages from another table.
Will MySQL be able to scale correctly up to a billion rows?
MySQL has many features to try to help performance and scalability (those are not the same thing). Again, queries have performance, not tables. A table without queries just sits there. It's the queries that get optimized by different techniques.
Knowing that we mostly insert, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Indexes do add overhead to inserts. You can't eliminate the primary key index, this is a necessary part of every table. But for example, you might find it worthwhile to drop your FOREIGN KEY, which includes an index.
Usually, most tables are read more than they are written to, so it's worth keeping an index to help reads (or even an UPDATE or DELETE that uses a WHERE clause). But if your workload is practically all INSERT, maybe the extra index for the foreign key is purely overhead and gives no benefit for any queries.
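A hedged sketch of dropping that foreign key (constraint name taken from the question's DDL; check SHOW CREATE TABLE for the name of the index MySQL generated for it before dropping that as well):
ALTER TABLE money_money_transaction
  DROP FOREIGN KEY money_money_transaction_userID;
-- Then, if the supporting index is not used by any query:
-- ALTER TABLE money_money_transaction DROP INDEX money_money_transaction_userID;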
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
I worked on benchmarks of Aurora in early 2017, and found that for the application we tested, it was not good for high write traffic. You should always test it for your application, instead of depending on the guess of someone on the internet. But I predict that Aurora in its current form (circa 2017) will completely suck for your all-write workload.

How to optimise a large table in MySQL, when can I benefit from partitioning?

In summary, date range partitioning and memory configuration achieved my goal.
I needed to increase the memory allocated to innodb_buffer_pool_size, as the default 8M was far too low. Rick James recommends 70% of RAM for this setting; his page has a lot of great information.
Edlerd was correct with both suggestions :-)
I split my data into monthly partitions and then ran a 6,000-row response query which originally took between 6 and 12 seconds. It now completes in less than a second (.984/.031). I ran this using the default InnoDB buffer size (innodb_buffer_pool_size = 8M) to make sure it wasn't just the memory increase.
I then set innodb_buffer_pool_size = 4G and ran the query with an even better response of .062/.032.
I’d also like to mention that increasing the memory has also improved the overall speed of my web application and service which receives and writes messages to this table, I am astounded at how much of a difference this configuration setting has made. The Time To First Byte (TTFB) from my web server is now almost on par with MySQL Workbench which at times would reach 20 seconds.
I also found that the slow query log file was an excellent tool to identify issues; it was there that I saw it suggested my innodb_buffer_pool_size was low, and it highlighted all the poorly performing queries. This also identified areas where I needed to index other tables.
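For reference, enabling the slow query log and pulling the worst statements out of performance_schema look roughly like this (sketch; the 1-second threshold is just an example):
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 1;   -- log statements slower than 1 second
-- Worst statements by total latency (timer values are in picoseconds):
SELECT digest_text, count_star, sum_timer_wait / 1e12 AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;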
EDIT 2016-11-12 SOLUTION
I am in the process of refactoring a large table that logs telemetry data, it has been running for about 4-5 months and has generated approx. 54 million records with an average row size approx. 380 bytes.
I have started to see some performance lag on one of my raw data queries that returns all logs for a device over a 24 hour period.
Initially I thought it was indexing, but I think it is the amount of I/O that needs to be processed by MySQL. A typical 24-hour query would contain 3k to 9k records, and I’d actually like to support an export of about 7 days.
I am not experienced in database performance tuning so still just learning the ropes. I am considering a few strategies.
Tweak compound indexes according to query for raw data, although I think my indexes are OK as the explain plan is showing 100% hit rate.
Consider creating a covering index to include all rows needed
Implement ranged partitioning by date:
a) Keep monthly partitions. E.g. last 6 months
b) Move anything older to archive table.
Create a separate table (vertical partitioning) with the raw data and join it with the IDs of the primary query table. Not sure this is my problem as my indexes are working.
Change my queries to pull data in batches with limits: order by created date, LIMIT X, and carry on until no more records are returned.
Review server configuration
1,2 (INDEXES):
I’ll rework my indexes with my queries, but I think I am good here as Explain is showing 100% hit, unless I am reading this wrong.
I’ll try a covering index when they are rebuilt, but how do I determine the knock-on effects of a bad choice, e.g. insert speeds being compromised?
How would I best monitor the performance of my table in a live environment?
EDIT: I've just started using the slow log file which looks like a good tool for finding issues and I suppose a query on the performance_schema might be another option?
3 (PARTITIONING):
I have read a bit about partitions and not sure if the size of my data would make much of a difference.
Rick James suggests >1M records; I’m at 54M and would like to keep around 300M prior to archiving. Is my table complex enough to benefit?
I have to test this out myself, as I do not have experience with any of this stuff and it is all theoretical to me. I just don’t want to go down this path if it isn’t suitable for my needs.
4 (Vertical partitioning via ‘joined’ detail table): I don’t think I am having table scan issues, and I need all rows, so I am not sure this technique would be of benefit.
5 (Use limits and fetch again): Would this free up the server if I used less of its time in a single request? Would I see better I/O throughput at the cost of more commands on the same connection? (See the sketch after point 6 below.)
6 (Review Config): The other piece would be to review the default non developer configuration that is used when you install MySQL, perhaps there are some settings that can be adjusted? :-)
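For point 5, the batched approach could take roughly this shape (sketch; the #LAST_DB_ID placeholder and the batch size are made up, the other placeholders follow the query shown further below):
SELECT *
FROM message_log
WHERE device_id = #IMEI
  AND created BETWEEN #STARTDATE AND DATE_ADD(#STARTDATE, INTERVAL 24 hour)
  AND db_id > #LAST_DB_ID          -- largest db_id from the previous batch (0 to start)
ORDER BY db_id
LIMIT 1000;
-- Repeat, feeding the last db_id returned back in, until no rows come back.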
Thanks for reading, keen to hear any and all suggestions.
The following FYI:
TABLE:
CREATE TABLE `message_log` (
`db_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`db_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`created` datetime DEFAULT NULL,
`device_id` int(10) unsigned NOT NULL,
`display_name` varchar(50) DEFAULT NULL,
`ignition` binary(1) DEFAULT NULL COMMENT 'This is actually IO8 from the falcom device',
`sensor_a` float DEFAULT NULL,
`sensor_b` float DEFAULT NULL,
`lat` double DEFAULT NULL COMMENT 'default GPRMC format ddmm.mmmm \n',
`lon` double DEFAULT NULL COMMENT 'default GPRMC longitude format dddmm.mmmm ',
`heading` float DEFAULT NULL,
`speed` float DEFAULT NULL,
`pos_validity` char(1) DEFAULT NULL,
`device_temp` float DEFAULT NULL,
`device_volts` float DEFAULT NULL,
`satellites` smallint(6) DEFAULT NULL, /* TINYINT will suffice */
`navdist` double DEFAULT NULL,
`navdist2` double DEFAULT NULL,
`IO0` binary(1) DEFAULT NULL COMMENT 'Duress',
`IO1` binary(1) DEFAULT NULL COMMENT 'Fridge On/Off',
`IO2` binary(1) DEFAULT NULL COMMENT 'Not mapped',
`msg_name` varchar(20) DEFAULT NULL, /* Will be removed */
`msg_type` varchar(16) DEFAULT NULL, /* Will be removed */
`msg_id` smallint(6) DEFAULT NULL,
`raw` text, /* Not needed in primary query, considering adding to single table mapped to this ID or a UUID correlation ID to save on #ROWID query */
PRIMARY KEY (`db_id`),
KEY `Name` (`display_name`),
KEY `Created` (`created`),
KEY `DeviceID_AND_Created` (`device_id`,`created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DeviceID_AND_Created is the main index. I need the PK clustered index because I am using the record ID in a summary table that keeps track of the last message for a given device. Created would be the partition column, so I guess that would also be added to the PK cluster?
QUERY:
SELECT
ml.db_id, ml.db_created, ml.created, ml.device_id, ml.display_name, bin(ml.ignition) as `ignition`,
bin(ml.IO0) as `duress`, bin(ml.IO1) as `fridge`,ml.sensor_a, ml.sensor_b, ml.lat, ml.lon, ml.heading,
ml.speed,ml.pos_validity, ml.satellites, ml.navdist2, ml.navdist,ml.device_temp, ml.device_volts,ml.msg_id
FROM message_log ml
WHERE ml.device_id = #IMEI
AND ml.created BETWEEN #STARTDATE AND DATE_ADD(#STARTDATE,INTERVAL 24 hour)
ORDER BY ml.db_id;
This returns all logs for a given 24-hour period, which at the moment is approx. 3k to 9k rows with an average row size of 381 bytes; this will be reduced once I remove one of the TEXT fields (raw).
Implement ranged partitioning by date: a) Keep monthly partitions. E.g. last 6 months b) Move anything older to archive table.
This is a very good idea. I guess all the writes will be in the newest partition and you will query recent data only. You always want a situation where your data and indexes fit in memory, so there is no disk I/O on reads.
Depending on your use case it might even be wise to have one partition per week. Then you only have to keep max two weeks of data in memory for reading the last 7 days.
You might also want to tune your buffer sizes (i.e. innodb_buffer_pool_size) if you are using InnoDB as the engine, or the MyISAM key cache (key_buffer_size) when using the MyISAM engine.
Also, adding RAM to the DB machine usually helps, as the OS can then keep the data files in memory.
If you have heavy writes you can also tune other options (e.g. innodb_log_buffer_size and how often writes are persisted to disk). This is in order to let dirty pages stay in memory longer and avoid writing them back to disk too often.
For those who are curious, the following is what I used to create my partition and configure memory.
Creating the partitions
Updated PK to include the range column used in partition
ALTER TABLE message_log
MODIFY COLUMN created DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
DROP PRIMARY KEY,
ADD PRIMARY KEY (db_id, created);
Added the partitions using ALTER TABLE.
In hindsight, I should have created the initial partitioning with a single ALTER statement and used REORGANIZE PARTITION for subsequent partitions, as doing it all in one hit consumed a lot of resources and time.
ALTER TABLE message_log
PARTITION BY RANGE(to_days(created)) (
partition invalid VALUES LESS THAN (0),
partition from201607 VALUES LESS THAN (to_days('2016-08-01')),
partition from201608 VALUES LESS THAN (to_days('2016-09-01')),
partition from201609 VALUES LESS THAN (to_days('2016-10-01')),
partition from201610 VALUES LESS THAN (to_days('2016-11-01')),
partition from201611 VALUES LESS THAN (to_days('2016-12-01')),
partition from201612 VALUES LESS THAN (to_days('2017-01-01')),
partition from201701 VALUES LESS THAN (to_days('2017-02-01')),
partition from201702 VALUES LESS THAN (to_days('2017-03-01')),
partition from201703 VALUES LESS THAN (to_days('2017-04-01')),
partition from201704 VALUES LESS THAN (to_days('2017-05-01')),
partition future VALUES LESS THAN (MAXVALUE)
);
NOTE: I am not sure if using to_days() or the raw column makes much difference, but I've seen it used in most examples so I've taken it on as an assumed best practice.
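As mentioned above, subsequent months can then be added by splitting the catch-all future partition, roughly like this (sketch, continuing the naming pattern used above):
ALTER TABLE message_log
REORGANIZE PARTITION future INTO (
    partition from201705 VALUES LESS THAN (to_days('2017-06-01')),
    partition future VALUES LESS THAN (MAXVALUE)
);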
Setting the buffer pool size
To change the value of innodb_buffer_pool_size you can find info here:
MySQL InnoDB Buffer Pool Resize and Rick James's page on memory
You can also do it in MySQL Workbench via the options file menu and then the InnoDB tab. Any changes you make here will be written to the config file, but you'll need to stop and start MySQL for the configuration to be read; alternatively, you can set the global value to change it live.
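A sketch of the live route (MySQL 5.7.5 and later can resize the InnoDB buffer pool online; the 4G value mirrors the one used above):
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';  -- check progress of the resize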
Such a deal! I get 4 mentions, even without writing a comment or answer. I'm writing an answer because I may have some further improvements...
Yes, PARTITION BY RANGE(TO_DAYS(...)) is the right way to go. (There may be a small number of alternatives.)
70% of 4GB of RAM is tight. Be sure there is no swapping.
You mentioned one query. If it is the main one of concern, then this would be slightly better:
PRIMARY KEY(device_id, created, db_id), -- desired rows will be clustered
INDEX(db_id) -- to keep AUTO_INCREMENT happy
If you are not purging old data, then the above key suggestion provides just as much efficiency even without partitioning.
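Applying that key layout to the table above could look roughly like this (my sketch, not part of the answer; created stays in the PK because it is the partitioning column, and the old DeviceID_AND_Created index becomes redundant):
ALTER TABLE message_log
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (device_id, created, db_id),
  ADD INDEX (db_id),                -- keeps AUTO_INCREMENT happy
  DROP INDEX DeviceID_AND_Created;  -- covered by the new PK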
For lat/lon representation, DOUBLE is overkill.
Beware of the inefficiency of UUID, especially for huge tables.

What could cause very slow performance of single UPDATEs of a InnoDB table?

I have a table in my web app for storing session data. It's performing badly, and I can't figure out why. Slow query log shows updating a row takes anything from 6 to 60 seconds.
CREATE TABLE `sessions` (
`id` char(40) COLLATE utf8_unicode_ci NOT NULL,
`payload` text COLLATE utf8_unicode_ci NOT NULL,
`last_activity` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `session_id_unique` (`id`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The PK is a char(40) which stores a unique session hash generated by the framework this project uses (Laravel).
(I'm aware of the redundancy of the PK and unique index, but I've tried all combinations and it doesn't have any impact on performance in my testing. This is the current state of it.)
The table is small - fewer than 200 rows.
A typical query from the slow query log looks like this:
INSERT INTO sessions (id, payload, last_activity)
VALUES ('d195825ddefbc606e9087546d1254e9be97147eb',
'YTo1OntzOjY6Il90b2tlbiI7czo0MDoi...around 700 chars...oiMCI7fX0=',
1405679480)
ON DUPLICATE KEY UPDATE
payload=VALUES(payload), last_activity=VALUES(last_activity);
I've done obvious things like checking the table for corruption. I've tried adding a dedicated PK column as an auto increment int, I've tried without a PK, without the unique index, swapping the text column for a very very large varchar, you name it.
I've tried switching the table to use MyISAM, and it's still slow.
Nothing I do seems to make any difference - the table performs very slowly.
My next thought was the query. This is generated by the framework, but I've tested hacking it out into an UPDATE with an INSERT if that fails. The slowness continued on the UPDATE statement.
I've read a lot of questions about slow INSERT and UPDATE statements, but those usually related to bulk transactions. This is just one insert/update per user per request. The site is not remotely busy, and it's on its own VPS with plenty of resources.
What could be causing the slowness?
This is not an answer but SE comment length is too damn short. So.
What happens if you run an identical INSERT ... ON DUPLICATE KEY UPDATE ... statement directly on the command line? Please try with and without actual usage of the application. The application may be artificially slowing down this UPDATE (for example, in InnoDB a transaction might be opened but committed only after a lot of time has passed; you tested with MyISAM too, which does not support transactions, but perhaps in that case an explicit LOCK could account for the same effect. Whether the framework uses this trick, I'm not sure; I don't know Laravel). Try to benchmark to see if there is a concurrency effect.
Another question: is this a single server? Or is it a master that replicates to one or more slaves?
Apart from this question, a few observations:
The values for id are hex strings, but the column is unicode. This means 3*40 bytes are reserved while only 40 are utilized. This is a waste that will make things inefficient in general. It would be much better to use BINARY or ASCII as the character encoding. Better yet, change the id column to the BINARY data type and store the (unhexed) binary value (see the sketch after this list).
A hash as the PK of an InnoDB table will scatter the data across pages. Using an AUTO_INCREMENT PK, or not explicitly declaring a PK at all (which causes InnoDB to create an internal one of its own), is a good idea.
It looks like the payload is base64 encoded. Again, the character encoding is specified to be unicode. ASCII or binary (the character encoding, not the data type) is much more appropriate.
The HASH keyword in the unique index on id is meaningless: InnoDB does not implement HASH indexes. Unfortunately, MySQL is perfectly silent about this (see http://bugs.mysql.com/bug.php?id=73326).
(While this list does offer angles for improvement, it seems unlikely that the extreme slowness can be fixed with this; there must be something else going on.)
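As a sketch of the first two observations combined (table name invented for illustration, not from the answer):
CREATE TABLE `sessions_sketch` (
  `row_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- sequential clustered PK
  `id` BINARY(20) NOT NULL,                       -- UNHEX() of the 40-char hex hash
  `payload` MEDIUMBLOB NOT NULL,                  -- or TEXT with an ascii charset
  `last_activity` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`row_id`),
  UNIQUE KEY `session_id_unique` (`id`)
) ENGINE=InnoDB;
-- Writes would use UNHEX(:hash) for id; lookups compare against UNHEX(:hash) too.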
Frustratingly, the answer in this case was a bad disk. One of the disks in the storage array had gone bad, and so writes were taking forever to complete. Simply that.

mysql create table command add ons question

My team has added these options to our CREATE TABLE statements, after the columns are defined:
ENGINE=MyISAM DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC AUTO_INCREMENT=465
The thing is, the table is a country lookup table, so we know it has a fixed list of about 275 values, and this table will be 99% read-only. Writes will be very rare, only if I need to update some column property.
So do I need all that stuff beyond 'ENGINE=MyISAM DEFAULT CHARSET=utf8'? This is just one table; they have these for almost all tables, and I can't understand why lookup tables would have all these options.
You can look up all those in the CREATE TABLE doc.
You're right though. For the context you describe, they're almost surely completely unnecessary.
Asides
Re: AUTO_INCREMENT as part of your CREATE TABLE -- yeah, that's just because it was part of the SHOW CREATE TABLE of a live table, not because it was part of your team's intentions/ongoing script. No biggie.
Note that CHECKSUM and DELAY_KEY_WRITE are for MyISAM tables only. If that table was InnoDB, the features those two parameters bring are arguably implicitly taken care of (i.e. table integrity and write issues).
Why do we need InnoDB for read-only lookup tables? I thought InnoDB was better for write-intensive tables?
Sorry. I didn't mean to imply that you needed InnoDB. It's just a reflex. :)
Whether or not InnoDB performs better for writing depends on the usage pattern / application. For your context, I wouldn't expect you to see a performance difference whether you use MyISAM or InnoDB. At any rate, as a rule of thumb, since InnoDB is ACID-compliant, more resistant to corruption, and cached in memory (in InnoDB's buffer pool), I always advocate for it. MyISAM fails on all those counts.