MySQL performance on large, write-only table - mysql

Thanks in advance for your answers, and sorry for my English; I'm not a native speaker.
We're currently developing a mobile game with a backend. The game has a money system, and we keep track of every transaction for verification purposes.
To read a user's balance, we have an intermediary table in which the balance is updated on each transaction. That way the transaction table is never read directly by users, which reduces load under high traffic.
The transaction table is only read from time to time in the back office.
Here is the schema of the transaction table:
create table money_money_transaction (
`id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
`userID` INT UNSIGNED NOT NULL,
`amount` INT NOT NULL,
`transactionType` TINYINT NOT NULL,
`created` DATETIME NOT NULL,
CONSTRAINT money_money_transaction_userID FOREIGN KEY (`userID`) REFERENCES `user_user` (`id`)
ON DELETE CASCADE
);
We plan to have a lot of users, and the transaction table could grow to 1 billion rows, so my questions are:
Will it affect the performance of other tables?
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, such as keeping only the most-read tables in RAM?
Will MySQL be able to scale correctly up to a billion rows? Note that we mostly do inserts, the only index is on the id (the id is needed for details), and there is no "bulk insert" (there will not be 1M concurrent inserts on this table).
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?

You might consider MyRocks (see http://myrocks.io), which is a third-party storage engine that is designed for fast INSERT speed and compressed data storage. I won't make a recommendation that you should switch to MyRocks, because I don't have enough information to make an unequivocal statement about it for your workload. But I will recommend that it's worth your time to evaluate it and see if it works better for your application.
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most read table ?
Yes, MySQL (assuming InnoDB storage engine) stores partial tables in RAM, in the buffer pool. It breaks down tables into pages, and fits pages in the buffer pool as queries request them. It's like a cache. Over time, the most requested pages stay in the buffer pool, and others get evicted. So it more or less balances out to serve most of your queries as quickly as possible. Read https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html for more information.
Will it affect the performance of other tables ?
Tables don't have performance — queries have performance.
The buffer pool has fixed size. Suppose you have six tables that need to share it, their pages must fit into the same buffer pool. There's no way to set priorities for each table, or dedicate buffer pool space for certain tables or "lock" them in RAM. All pages of all tables share the same buffer pool. So as your queries request pages from various tables, they do affect each other in the sense that frequently-requested pages from one table may evict pages from another table.
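To make the eviction effect concrete, here is a minimal sketch of a shared page cache. It is a toy: `TinyBufferPool` and the table names are made up for illustration, and it uses a plain LRU policy, whereas InnoDB's real buffer pool uses a more elaborate LRU variant. The point is only that pages from one busy table can push out pages from another.

```python
from collections import OrderedDict


class TinyBufferPool:
    """Toy LRU page cache shared by all tables (hypothetical, not InnoDB's real algorithm)."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # (table, page_no) -> True, least recently used first

    def touch(self, table, page_no):
        """Access a page; return True on a cache hit, False on a miss."""
        key = (table, page_no)
        hit = key in self.pages
        if hit:
            self.pages.move_to_end(key)  # mark as most recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page
            self.pages[key] = True
        return hit

    def cached_tables(self):
        """Names of the tables that currently have at least one cached page."""
        return {table for table, _ in self.pages}


pool = TinyBufferPool(capacity_pages=4)
for page_no in range(4):
    pool.touch("user_balance", page_no)   # balance pages fill the pool
for page_no in range(4):
    pool.touch("transactions", page_no)   # transaction pages push them out
print(pool.cached_tables())               # {'transactions'}
```

In this toy run, the next read of a `user_balance` page is a miss and has to go back to disk, which is the kind of cross-table interference described above.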
Will MySQL be able to scale correctly up to a billion rows?
MySQL has many features to try to help performance and scalability (those are not the same thing). Again, queries have performance, not tables. A table without queries just sits there. It's the queries that get optimized by different techniques.
Knowing we do mostly insert and that the only index is on the id (the id is needed for details) and that there is no "bulk insert" (there will not be 1M insert to do concurrently on this table)
Indexes do add overhead to inserts. You can't eliminate the primary key index, this is a necessary part of every table. But for example, you might find it worthwhile to drop your FOREIGN KEY, which includes an index.
Usually, most tables are read more than they are written to, so it's worth keeping an index to help reads (or even an UPDATE or DELETE that uses a WHERE clause). But if your workload is practically all INSERT, maybe the extra index for the foreign key is purely overhead and gives no benefit for any queries.
Also, we're on a RDS server, so we could switch to Aurora and try a master-master or master-slave replication if needed. Do you think it would help in this case ?
I worked on benchmarks of Aurora in early 2017, and found that for the application we tested, it was not good for high write traffic. You should always test it with your own application, instead of depending on the guess of someone on the internet. But I predict that Aurora in its current form (circa 2017) will completely suck for your all-write workload.

Related

Benchmark MySQL with batch insert on multiple threads within same table

I want to compare the InnoDB and MyRocks engines of MySQL under write-intensive load, and I'm using sysbench for the benchmark. My requirements are:
multiple threads writing concurrently to the same table;
support for batch inserts (each insert transaction inserts a batch of records).
I checked all the pre-made sysbench tests and don't see any that satisfy my requirements:
oltp_write_only: supports multiple threads writing to the same table, but has no bulk insert option.
bulk_insert: supports multiple threads, but each thread writes to a different table.
Are there any pre-made sysbench tests that satisfy my requirements? If not, are there custom Lua scripts somewhere that already do this?
(from Comment:)
CREATE TABLE IF NOT EXISTS `tableA` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` VARCHAR(63) NOT NULL DEFAULT '',
`data` JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
(From a MySQL point of view...)
Toss id and the PK -- saves 8 bytes per row.
Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)).
Doing those will
Shrink the disk I/O needed (providing some speedup)
Eliminate one of the indexes (probably a significant part of the post-load processing)
Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.
Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.
Another thought...
What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.
If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.
If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.
You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.
Another thing: If you promote user_id to PK, then this is worth considering: Use a file sort to sort the table by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, then this extra sort would be wasted.)
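The size estimates above can be checked with a few lines of arithmetic, and the "compress in the client" idea can be tried with the standard library. This is only a sketch: the JSON document below is invented, the 400 bytes/row figure is the rough estimate from the answer rather than a measurement, and the real compression ratio depends entirely on what your JSON looks like.

```python
import json
import zlib

total_bytes = 250 * 10**9             # 250 GB, as quoted above
rows = 250 * 10**6                    # 250M rows
bytes_per_row = total_bytes // rows   # 1000 bytes per row

compressed_row = 400                  # rough estimate from the answer, not measured
projected_gb = compressed_row * rows / 10**9   # 100.0 GB

# Hypothetical JSON payload; substitute a sample of your real data.
doc = {"user": "u-000123", "scores": [10, 20, 30] * 20, "tags": ["alpha", "beta"] * 10}
raw = json.dumps(doc).encode("utf-8")
packed = zlib.compress(raw)           # this is what would be written to the BLOB column

# Compression is lossless: the client can always get the document back.
assert json.loads(zlib.decompress(packed)) == doc
print(bytes_per_row, projected_gb, len(raw), len(packed))
```

Running this against a few thousand real rows would tell you quickly whether the factor-of-3 guess holds for your data.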

Handling huge MyISAM table for optimisation

I have a huge (and growing) MyISAM table (700 million rows = 140 GB).
CREATE TABLE `keypairs` (
`ID` char(60) NOT NULL,
`pair` char(60) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE=MyISAM
The table option was changed to ROW_FORMAT=FIXED because both columns always have the full fixed length (60). And yes, ID is sadly a string and not an INT.
SELECT queries are reasonably fast.
The databases and the MySQL server are all on 127.0.0.1/localhost (nothing remote).
Sadly, INSERT is slow as hell. I won't even talk about trying to LOAD DATA millions of new rows... it takes days.
There won't be any concurrent reads on it. All SELECTs are done one by one, by my local server only (it is not for client use).
(For info: file sizes are .MYD=88 GB, .MYI=53 GB, .TMM=400 MB)
How could I speed up inserts into that table?
Would it help to PARTITION that huge table? (And if so, how?)
I heard MyISAM uses a "structure cache" in the form of .frm files, and that a line in the config file helps MySQL keep all the .frm files in memory (in the partitioned case). Would that help too? (Actually, my .frm file is only 9 KB for 700 million rows.)
A string shortening/compression function for the ID string? (Same idea as rainbow tables.) Even if it lowers the maximum number of unique IDs, I will never reach the maximum for 60 chars anyway, so maybe it's an idea? But before creating a new unique ID I would have to check that the shortened string doesn't already exist in the DB, of course.
Along the same lines as shortening the ID strings, what about using md5() on the ID? Does a shorter string mean faster inserts in that case?
Sort the incoming data before doing the LOAD. This will improve the cacheability of the PRIMARY KEY(id).
PARTITIONing is unlikely to help, unless there is some useful pattern to ID.
PARTITIONing will not help for single-row insert nor for single-row fetch by ID.
If the strings are not a constant width of 60, you are wasting space and speed by saying CHAR instead of VARCHAR. Change that.
MyISAM's FIXED is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
Smaller means more cacheable means less I/O means faster.
The .frm is an encoding of the CREATE TABLE; it is not relevant for this discussion.
A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to BINARY(16).
The PRIMARY KEY is a BTree. If ID is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
Turning the PRIMARY KEY into a secondary key (after adding an AUTO_INCREMENT) will not speed things up -- it still has to update the BTree with ID in it!
How much RAM do you have? For your situation, and for this LOAD, set MyISAM's key_buffer_size to about 70% of available RAM, but not bigger than the .MYI file. I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD is only being appended to (assuming you have never deleted any rows).
We do need to see your SELECTs to make sure these changes are not destroying performance somewhere else.
Make sure you are using CHARACTER SET latin1 or ascii; utf8 would waste a lot more space with CHAR.
Switching to InnoDB will double, maybe triple, the disk space for the table (data+index). Therefore, it will probably slow down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size should be lowered to 10M and innodb_buffer_pool_size should be set to 70% of available RAM.
(My bullet items apply to InnoDB except where MyISAM is specified.)
In using InnoDB, it would be good to try to insert 1000 rows per transaction. Less than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown.
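One way to get roughly 1000 rows per transaction from client code is to chunk the incoming rows and commit once per chunk. The sketch below only builds the batches and a multi-row INSERT string; the helper names are hypothetical, the rows are made up, and the `%s` placeholders assume the parameter style used by the common Python MySQL drivers. Executing the statement and committing is left to whatever driver you use.

```python
from itertools import islice


def batches(rows, size=1000):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_statement(table, columns, n_rows):
    """Build a parameterised multi-row INSERT for n_rows rows."""
    row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES " + ", ".join([row] * n_rows)


# Made-up rows shaped like (ID, pair): two 60-hex-digit strings each.
rows = [(f"{i:060x}", f"{i * 7:060x}") for i in range(2500)]

sizes = [len(chunk) for chunk in batches(rows)]
print(sizes)  # [1000, 1000, 500]
print(insert_statement("keypairs", ["ID", "pair"], 2))
```

For each chunk you would execute the statement with the flattened parameter list and then commit, so that one transaction covers one batch.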
Hex ID
Since ID is always 60 hex digits, declare it to be BINARY(30) and pack them via UNHEX(...) and fetch via HEX(ID). Test via WHERE ID = UNHEX(...). That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
To do just the conversion to BINARY(30):
CREATE TABLE new (
    ID BINARY(30) NOT NULL,
    `pair` char(60) NOT NULL
    -- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
    SELECT UNHEX(ID),
           pair
    FROM keypairs;
-- add the PK to the new table (keypairs already has one)
ALTER TABLE new ADD
    PRIMARY KEY (`ID`); -- For InnoDB, I would do differently
RENAME TABLE keypairs TO old,
             new TO keypairs;
DROP TABLE old;
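If the client prepares the values itself, the packing that UNHEX() and HEX() do can be mirrored in a few lines. This is a sketch with a made-up ID; the size arithmetic ignores per-row overhead and assumes a single-byte character set for the CHAR(60) columns.

```python
hex_id = "ab" * 30                 # hypothetical 60-hex-digit ID
packed = bytes.fromhex(hex_id)     # the 30 bytes that UNHEX(ID) would store

assert len(hex_id) == 60 and len(packed) == 30
assert packed.hex() == hex_id      # round trip; note that MySQL's HEX() returns upper case

old_row = 60 + 60                  # CHAR(60) ID + CHAR(60) pair
new_row = 30 + 60                  # BINARY(30) ID + CHAR(60) pair
saving = 1 - new_row / old_row
print(f"data shrinks by {saving:.0%}")  # data shrinks by 25%
```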
Tiny RAM
With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M and innodb_buffer_pool_size=0. For InnoDB-only: key_buffer_size=10M and innodb_buffer_pool_size=500M. Since ID is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have such.
The other thing to do to speed up the INSERTs is to sort by ID before starting the LOAD. But that gets tricky with the UNHEX. Here's what I recommend.
1) Create a MyISAM table, tmp, with ID BINARY(30) and pair, but no indexes. (Don't worry about key_buffer_size; it won't be used.)
2) LOAD the data into tmp.
3) ALTER TABLE tmp ORDER BY ID; This will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster than "repair by key buffer" for this case.
4) INSERT INTO keypairs SELECT * FROM tmp; This will maximize the caching by feeding rows to keypairs in ID order.
Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs is. I expect step 3 or 4 to take the longest, but I don't know which.
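Why feeding rows in ID order helps can be illustrated with a small simulation. It is a deliberately crude model, not a measurement of MySQL: keys are mapped to fixed "index pages" of 100 keys each, the cache holds only 10 pages with plain LRU eviction, and all of those numbers are arbitrary assumptions.

```python
import random
from collections import OrderedDict


def page_misses(keys, keys_per_page=100, cache_pages=10):
    """Count how often an insert touches an index page that is not in the cache."""
    cache = OrderedDict()            # page number -> True, least recently used first
    misses = 0
    for key in keys:
        page = key // keys_per_page  # toy model: each page covers a fixed key range
        if page in cache:
            cache.move_to_end(page)
        else:
            misses += 1              # in a real server this would be a disk read
            if len(cache) >= cache_pages:
                cache.popitem(last=False)
            cache[page] = True
    return misses


rng = random.Random(42)
keys = list(range(100_000))
rng.shuffle(keys)                    # stands in for digest-like, random IDs

random_misses = page_misses(keys)
sorted_misses = page_misses(sorted(keys))
print(random_misses, sorted_misses)
```

In sorted order each page is loaded exactly once (1000 misses here), while in random order almost every insert lands on an uncached page, which matches the "virtually every insert will involve a disk I/O" estimate above.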
Optimizing a table requires that you optimize for specific queries. You can't determine the best optimization strategy unless you have specific queries in mind. Any optimization improves one type of query at the expense of other types of queries.
For example, if your query is SELECT SUM(pair) FROM keypairs (a query that would have to scan the whole table anyway), partitioning won't help, and just adds overhead.
If we assume your typical query is inserting or selecting one keypair at a time by its primary key, then yes, partitioning can help a lot. It all depends on whether the optimizer can tell that your query will find its data in a narrow subset of partitions (ideally one partition).
Also make sure to tune MyISAM. There aren't many tuning options:
Allocate key_buffer_size as high as you can spare to cache your indexes. Though I haven't ever tried anything higher than about 10GB, and I can't guarantee that MyISAM key buffers are stable at 53GB (the size of your MYI file).
Pre-load the key buffers: https://dev.mysql.com/doc/refman/5.7/en/cache-index.html
Size read_buffer_size and read_rnd_buffer_size appropriately given the queries you run. I can't give a specific value here, you should test different values with your queries.
Size bulk_insert_buffer_size to something large if you want to speed up LOAD DATA INFILE. It's 8MB by default, I'd try at least 256MB. I haven't experimented with that setting, so I can't speak from experience.
I try not to use MyISAM at all. MySQL is definitely trying to deprecate its use.
...is there a mysql command to ALTER TABLE add INT ID increment column automatically?
Yes, see my answer to https://stackoverflow.com/a/251630/20860
First, your primary key is not incremental.
Roughly, this means that on every insert the index has to be rebalanced.
No wonder it crawls with a table of that size.
And with such an engine...
So, second: what's the point of keeping that old MyISAM junk?
For example, don't you mind losing a row or two (or a dozen) in case of an accident? And so on, even setting aside that the current MySQL maintainer (Oracle Corp) explicitly discourages the use of MyISAM.
So, here are possible solutions:
1) Switch to InnoDB;
2) If you can't give up the char ID, then:
Add an auto-increment numerical key and make it the primary key; the index would then be clustered and the cost of inserts would drop significantly;
Turn your current key into a secondary index;
3) If you can give it up, the solution is obvious.

Are random primary keys a pitfall for MySQL Cluster?

I understand that the InnoDB engine relies heavily on primary keys for its storage mechanisms (index layouts, etc.), and that it is consequently a bad idea to use a non-sequential primary key (say, a random 15-digit integer), because it will cause frequent (not to say systematic) rebuilds of the primary key's BTree, thus exponentially slowing insertions on the table.
I was considering setting up a MySQL Cluster to host my application databases, which need to support a write-intensive load (around 40% writes on about 2M operations a day). Given that NDB uses primary key hashes to distribute records between the cluster's nodes, I was wondering whether this limitation also applies to this engine.
My first guess would be that, on the contrary, the randomness would help distribute the data evenly, but I can't find precise information about that. So, does anyone have any insight on this matter?

Offline synchronization (Performance UUID as a primary key)

I'm working on a project where some clients have internet connection issues.
When the internet connection is down, we store information in a database located on the client PC.
When the connection comes back, we synchronise the local DB with the central one.
To avoid conflicts in record IDs between the two databases, we will use UUIDs [char(36)] instead of auto-increments.
The databases are MySQL with the InnoDB engine.
My question is: will this have an impact on performance for selects, joins, etc.?
Should we use varbinary(16) instead of char(36) to improve performance?
Note: we already have an existing database with 4 GB of data.
We are also open to other suggestions for resolving this offline/online issue.
Thanks
Since you didn't say which database engine is being used (MyISAM or InnoDB), it's difficult to say how large the performance impact will be.
However, to cut the story short: yes, there will be performance implications for larger sets of data.
The reason is that you need 36 bytes for each primary key value, as opposed to 4 bytes (8 for BIGINT) for an integer.
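The size difference is easy to see from the client side. The sketch below just compares the two representations of one randomly generated UUID; how much that matters in practice depends on how many indexes repeat the key, since InnoDB secondary indexes also carry the primary key value.

```python
import uuid

u = uuid.uuid4()
as_text = str(u)       # what a CHAR(36) column would hold, e.g. 'xxxxxxxx-xxxx-4xxx-...'
as_bytes = u.bytes     # what a BINARY(16) / VARBINARY(16) column would hold

assert len(as_text) == 36 and len(as_bytes) == 16
assert uuid.UUID(bytes=as_bytes) == u   # the binary form converts back losslessly
print(len(as_text), len(as_bytes))      # 36 16
```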
I'll give you a hint how you can avoid conflicts:
First is to have different autoincrement offset on the databases. If you have 2 databases, you'd have autoincrements to be odd on one and even on another.
Second is to have compound primary key. If you define your primary key as PRIMARY KEY(id, server_id) then you won't get any clashes if you replicate the data into the central DB.
You'll also know where it came from.
The downside is that you need to supply the server_id to every query you do.
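Both ideas can be sketched in a few lines. In MySQL the first one is normally configured through the `auto_increment_increment` and `auto_increment_offset` system variables; the simulation below only mimics the ID sequences they would produce, and the server names are made up.

```python
from itertools import count, islice


def id_stream(offset, increment):
    """IDs a server would hand out with the given auto-increment offset and step."""
    return count(start=offset, step=increment)


# Idea 1: two databases, increment 2, offsets 1 and 2 -> odd IDs on one, even on the other.
local_ids = list(islice(id_stream(1, 2), 5))     # [1, 3, 5, 7, 9]
central_ids = list(islice(id_stream(2, 2), 5))   # [2, 4, 6, 8, 10]
assert not set(local_ids) & set(central_ids)     # the two ranges never overlap

# Idea 2: compound key (id, server_id); plain IDs may repeat, the pairs do not.
local_keys = {(i, "client-pc") for i in range(1, 6)}
central_keys = {(i, "central") for i in range(1, 6)}
assert not local_keys & central_keys
print(local_ids, central_ids)
```

With more than two databases, the increment would be at least the number of databases and each one would get its own offset.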

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30000 rows I get a significant slowdown (to around 200 insertions/sec), followed by a more gradual but still noticeable slowdown.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was the memory taken by the index, but the tests run on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30000 rows taking so much memory, even more so just the indexes (the whole MySQL data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect a configuration issue, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32 bits running on KVM, 760 Mb allocated to it.
Mysql 5.1, out of the box configuration with separate files for tables
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs after fixing the innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B Trees so inserts will always take O(logN) time and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
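The O(log N) part grows very slowly, which is why running out of cache tends to matter far more than the tree depth itself. A rough illustration, assuming a made-up fanout of 100 keys per index page (the real figure depends on page size and key length):

```python
def btree_height(rows, fanout=100):
    """Smallest number of levels whose leaf capacity (fanout ** levels) covers `rows` keys."""
    height = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        height += 1
    return height


for rows in (30_000, 1_000_000, 1_000_000_000):
    print(rows, btree_height(rows))
# 30000 3
# 1000000 3
# 1000000000 5
```

Going from thirty thousand rows to a billion adds only a couple of levels under this assumption; what changes dramatically is whether the pages on the path are still in memory.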
Good luck!
Your indexes may just need to be analyzed and optimized during the insert, they gradually get out of shape as you go along. The other option of course is to disable indexes entirely when you're inserting and rebuild them later which should give more consistent performance.
Great link about insert speed.
ANALYZE. OPTIMIZE
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.
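A bulk load needs the rows in a file first. The sketch below writes a tab-separated file in the default layout LOAD DATA expects and builds a statement for the index_fpid table from the question. The data is invented, the event IDs are written as hex text and converted with UNHEX() on load, and the statement is only printed here; whether LOCAL is allowed depends on your server and client settings.

```python
import csv
import os
import tempfile
import uuid

# Made-up rows: (fpid, event_id as 32 hex digits).
rows = [(f"fp-{i}", uuid.uuid4().hex) for i in range(1000)]

path = os.path.join(tempfile.mkdtemp(), "index_fpid.tsv")
with open(path, "w", newline="") as handle:
    # Tab-separated fields, one row per line (LOAD DATA's default format).
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# The hex text is read into a user variable and packed into BINARY(16) during the load.
load_sql = (
    f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE index_fpid "
    "(fpid, @hex_id) SET event_id = UNHEX(@hex_id)"
)
print(load_sql)
```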