Benchmark MySQL with batch insert on multiple threads within the same table

I want to compare write-intensive performance between the InnoDB and MyRocks engines of MySQL. For this purpose, I use sysbench to benchmark. My requirements are:
multiple threads writing concurrently to the same table;
batch inserts (each insert transaction inserts a bulk of records).
I checked all the pre-made sysbench tests and I don't see any that satisfy both requirements.
oltp_write_only: supports multiple threads writing to the same table, but has no bulk-insert option.
bulk_insert: supports multiple threads, but each thread writes to a different table.
Is there any pre-made sysbench test that satisfies my requirements? If not, is there a custom Lua script somewhere that already does this?
(from Comment:)
CREATE TABLE IF NOT EXISTS `tableA` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` VARCHAR(63) NOT NULL DEFAULT '',
`data` JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
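(For reference, each batched transaction the custom test would need to issue against tableA looks roughly like the following; this is only a sketch with placeholder values, and the user_id values must stay distinct because of the UNIQUE index.)
START TRANSACTION;
INSERT INTO tableA (user_id, data) VALUES
  ('user-000001', '{"k": 1}'),
  ('user-000002', '{"k": 2}'),
  ('user-000003', '{"k": 3}');  -- ... up to the desired batch size
COMMIT;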

(From a MySQL point of view...)
Toss id and the PK -- saves 8 bytes per row.
Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)).
Doing those (a revised DDL is sketched below) will
Shrink the disk I/O needed (providing some speedup)
Eliminate one of the indexes (probably a significant part of the post-load processing)
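A hedged sketch of the revised DDL under those two changes (user_id keeps its VARCHAR(63) definition):
CREATE TABLE IF NOT EXISTS `tableA` (
  `user_id` VARCHAR(63) NOT NULL,
  `data` JSON NOT NULL,
  PRIMARY KEY (`user_id`)
) ENGINE = InnoDB;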
Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.
Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.
Another thought...
What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.
If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.
If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.
You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.
Another thing: if you promote user_id to the PK, then this is worth considering: use a file sort to sort the data by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, then this extra sort would be wasted.)

Related

Test insert query using mysqlslap

First of all, I am new to mysqlslap.
I want to test insert queries using mysqlslap on my existing database. The table I want to test has a primary key and a composite unique key.
So, how do I run a concurrent performance test on this table using mysqlslap?
I should not run into a MySQL duplicate-key error.
Below is the skeleton of my table:
CREATE TABLE data (
id bigint(20) NOT NULL,
column1 bigint(20) DEFAULT NULL,
column2 varchar(255) NOT NULL DEFAULT '0',
datacolumn1 VARCHAR(255) NOT NULL DEFAULT '',
datacolumn2 VARCHAR(2048) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY profiles_UNIQUE (column1,column2),
INDEX id_idx (id),
INDEX unq_id_idx (column1, column2) USING BTREE
) ENGINE=innodb DEFAULT CHARSET=latin1;
Please help me with this.
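(One hedged way to keep concurrent runs from colliding on the PRIMARY and UNIQUE keys is to let the test statement generate its own distinct values, e.g. with UUID_SHORT() and UUID(), and feed that statement to mysqlslap's --query option. A sketch, with placeholder payload values:)
INSERT INTO data (id, column1, column2, datacolumn1, datacolumn2)
VALUES (UUID_SHORT(), UUID_SHORT(), UUID(), 'payload1', 'payload2');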
There are several problems with benchmarking INSERTs. The speed will change as you insert more and more, but not in an easily predictable way.
An Insert performs (roughly) this way:
Check for duplicate key. You have two unique keys (the PK and a UNIQUE). Each BTree will be drilled down to check for a dup. Assuming no dup...
The row will be inserted in the data (a BTree keyed by the PK)
A "row" will be inserted into each Unique's BTree. In your case, there is a BTree effectively ordered by (column1, column2) and containing (id).
Stuff is put into the "Change Buffer" for each non-unique index.
If you had an AUTO_INCREMENT or a UUID or ..., there would be more to discuss.
The Change Buffer is effectively a "delayed write" to non-unique indexes. This delay has to be dealt with eventually. That is, at some time, things will slow down if a background process fails to keep up with the changes. That is, if you insert 1 million rows, you may not hit this slowdown; if you insert 10 million rows, you may hit it.
Another variable: VARCHAR(2048) (and other TEXT and BLOB columns) may or may not be stored "off record". This depends on the size of the row, the size of that column, and "row format". A big string may take an extra disk hit, thereby slowing down the benchmark, probably by a noticeable amount. That is, if you benchmark with only small strings and certain row formats, you will get a faster insert time than otherwise.
And you need to understand how the benchmark program runs -- versus how your application will run:
Insert rows one at a time in a single thread -- each being a transaction.
Insert rows one at a time in a single thread -- lots batched into a transaction.
Insert 100 rows at a time in a single thread in a single transaction.
LOAD DATA.
Multiple threads with each of the above.
Different transaction isolation settings.
Etc.
(I am not a fan of benchmarks because of how many flaws they have.) The 'best' benchmark for comparing hardware or limited schema/app changes: Capture the "general log" from a running application; capture the database at the start of that; time the re-applying of that log.
Designing a table/insert for 50K inserted rows/sec
Minimize indexes. In your case, all you need is PRIMARY KEY(col1, col2); toss the rest; toss id. (A slimmed-down DDL is sketched after this list.) Please explain what col1 and col2 are; there may be more tips here.
Get rid of the table. Seriously, consider summarizing the 50K rows every second and storing only the summarization. If that is practical, it will greatly speed things up. Or maybe summarize a minute's worth.
Batch-insert rows in some way. The details here depend on whether you have one or many clients doing the inserts, whether you need to massage the data as it comes in, etc. More discussion: http://mysql.rjweb.org/doc.php/staging_table
What is in those strings? Can/should they be 'normalized'?
Let's discuss the math. Will you be loading about 10 petabytes per year? Do you have that much disk space? What will you do with the data? How long will it take to read even a small part of that data? Or will it be a "write only" database??
More math. 50K rows * 0.5KB = 25MB writing to disk per second. What device do you have? Can it handle, say, 2x that? (With your original schema, it would be more like 60MB/s because of all the indexes.)
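A hedged sketch of that slimmed-down table, with id and all the secondary indexes removed:
CREATE TABLE data (
  column1     BIGINT(20)    NOT NULL,
  column2     VARCHAR(255)  NOT NULL DEFAULT '0',
  datacolumn1 VARCHAR(255)  NOT NULL DEFAULT '',
  datacolumn2 VARCHAR(2048) NOT NULL DEFAULT '',
  PRIMARY KEY (column1, column2)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;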
After comments
OK, so more like 3TB before you toss the data and start over (in 2 hours)? For that, I would suggest PARTITION BY RANGE and use some time function that gives you 5 minutes in each partition. This will give you a reasonable number of partitions (about 25) and the DROP PARTITION will be dropping only about 100GB, which might not overwhelm the filesystem. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
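A hedged sketch of what that could look like, assuming a created DATETIME column that is included in the primary key (a requirement for partitioning), with one partition per 5 minutes:
ALTER TABLE data
  PARTITION BY RANGE (TO_SECONDS(created)) (
    PARTITION p00  VALUES LESS THAN (TO_SECONDS('2024-01-01 00:05:00')),
    PARTITION p01  VALUES LESS THAN (TO_SECONDS('2024-01-01 00:10:00')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
  );
-- later, the oldest slice is discarded cheaply with:
-- ALTER TABLE data DROP PARTITION p00;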
As for the strings... You suggest 25KB, yet the declarations don't allow for that much???

MySQL performance on large, write-only table

Thanks in advance for your answers, and sorry for my bad English; I'm not a native speaker.
We're currently developing a mobile game with a backend. In this game we have a money system, and we keep track of each transaction for verification purposes.
In order to read a user's balance, we have an intermediary table in which the balance is updated on each transaction, so the transaction table is never read directly by the users; this reduces load under high traffic.
The transaction table is only read from time to time, in the back office.
Here is the schema of the transaction table :
create table money_money_transaction (
`id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
`userID` INT UNSIGNED NOT NULL,
`amount` INT NOT NULL,
`transactionType` TINYINT NOT NULL,
`created` DATETIME NOT NULL,
CONSTRAINT money_money_transaction_userID FOREIGN KEY (`userID`) REFERENCES `user_user` (`id`)
ON DELETE CASCADE
);
We plan to have a lot of users; the transaction table could grow up to 1 billion rows, so my questions are:
Will it affect the performance of other tables?
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Will MySQL be able to scale correctly up to this billion rows? Knowing that we mostly insert, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
You might consider MyRocks (see http://myrocks.io), which is a third-party storage engine that is designed for fast INSERT speed and compressed data storage. I won't make a recommendation that you should switch to MyRocks, because I don't have enough information to make an unequivocal statement about it for your workload. But I will recommend that it's worth your time to evaluate it and see if it works better for your application.
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Yes, MySQL (assuming InnoDB storage engine) stores partial tables in RAM, in the buffer pool. It breaks down tables into pages, and fits pages in the buffer pool as queries request them. It's like a cache. Over time, the most requested pages stay in the buffer pool, and others get evicted. So it more or less balances out to serve most of your queries as quickly as possible. Read https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html for more information.
Will it affect the performance of other tables?
Tables don't have performance — queries have performance.
The buffer pool has a fixed size. Suppose you have six tables that need to share it: their pages must all fit into the same buffer pool. There's no way to set priorities for each table, or to dedicate buffer pool space to certain tables, or to "lock" them in RAM. All pages of all tables share the same buffer pool. So as your queries request pages from various tables, they do affect each other, in the sense that frequently-requested pages from one table may evict pages from another table.
Will MySQL be able to scale correctly up to this billion rows?
MySQL has many features to try to help performance and scalability (those are not the same thing). Again, queries have performance, not tables. A table without queries just sits there. It's the queries that get optimized by different techniques.
Knowing that we mostly insert, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Indexes do add overhead to inserts. You can't eliminate the primary key index; it is a necessary part of every table. But, for example, you might find it worthwhile to drop your FOREIGN KEY, which includes an index.
Usually, most tables are read more than they are written to, so it's worth keeping an index to help reads (or even an UPDATE or DELETE that uses a WHERE clause). But if your workload is practically all INSERT, maybe the extra index for the foreign key is purely overhead and gives no benefit for any queries.
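A hedged sketch of that change, using the constraint name from the DDL above:
ALTER TABLE money_money_transaction
  DROP FOREIGN KEY money_money_transaction_userID;
-- Dropping the constraint usually leaves its backing index behind; if no query
-- filters by userID, that index can be dropped too (its exact name may differ;
-- check SHOW CREATE TABLE money_money_transaction first):
-- ALTER TABLE money_money_transaction DROP INDEX money_money_transaction_userID;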
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
I worked on benchmarks of Aurora in early 2017, and found that for the application we tested, it was not good for high write traffic. You should always test it for your application, instead of depending on the guess of someone on the internet. But I predict that Aurora in its current form (circa 2017) will completely suck for your all-write workload.

MyISAM 'SELECT ... WHERE IN' queries are very slow for ~10M rows table and 100 concurrent connections (each query returns ~50K rows)

I have a MyISAM table with around 10M rows. For a single 'SELECT ... WHERE IN' query (with ~5000 values) it takes ~0.05s to get ~50K rows. However, when performing 100 concurrent similar queries the time rises to ~18s. It makes no sense to me since I have all the indexes in memory and the amount of data returned is not so big in size (~500Kb). Any idea what could be making this so slow? Thank you.
CREATE TABLE data (
A bigint(20) UNSIGNED NOT NULL,
B int(10) UNSIGNED NOT NULL,
C smallint(5) UNSIGNED NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
ALTER TABLE data ADD KEY A_key (A);
The used query:
SELECT * FROM data WHERE A IN (VAL1, VAL2, ...);
You don't say that you're using a stored procedure, so I would start there. A stored proc is compiled, which means that it should leverage the 'caching' of the query's execution plan. Since that plan is against in-memory data, you'll get some more performance out of it.
Even though plan caching differs from server to server, you can still leverage a procedure for performance. E.g., you could make several procedures for your most common queries, although this often requires application/client changes to use those procs. I have never experimented with having a single proc that checks the query parameter range(s) and then uses a CASE to call one of several static queries.
5000 rows in 50ms is not bad -- probably most of that time is shoveling the data onto the network.
Assuming that is really what the schema looks like, let me explain what is happening in MyISAM. (Not all of this applies to InnoDB, which you should migrate to.)
INDEX(A) is manifested in a BTree of pairs: [A, record number]. For each of the 5000 A values, drill down the BTree. (The BTree will be about 4 levels deep for a 10M-row table.) The BTree is in 1KB blocks and cached in the key_buffer. What is the value of key_buffer_size? How much RAM do you have? What does SHOW TABLE STATUS say? (I want to use those to determine whether the size should be adjusted.)
Once the record number is found, MyISAM does a 'seek' into the .MYD file to find the record, reads it (15 bytes), and sends it out. The OS caches these blocks, not MySQL.
That's several thousand potentially cached disk reads. 50ms is enough time to do only about 5 spinning disk reads, so I would say that most, if not all, reads were avoided because of the two caches.
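(The numbers asked about above can be pulled with, for example:)
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW TABLE STATUS LIKE 'data';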
100 concurrent threads... I assume each is reading 50 rows? Let me list the bottlenecks:
100 vs 1 connections to spin up.
100 threads vs 1 contending for the CPU
100 threads vs 1 contending for the key_buffer
100 threads vs 1 contending for the OS cache
100 threads vs 1 contending for the Query Cache (if you have this turned on; here's a reason to turn it off)
The last major improvements to MyISAM were back in the days of single-core machines. Meanwhile, InnoDB has made great strides in properly handling multiple threads. (Still, 100 is beyond the reach of even the latest release.) (MyISAM has been removed from MySQL 8.0.)
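(If you do migrate, the conversion itself is a single statement; a sketch, and remember to move the RAM currently given to key_buffer_size over to innodb_buffer_pool_size:)
ALTER TABLE data ENGINE=InnoDB;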

What could cause very slow performance of single UPDATEs of a InnoDB table?

I have a table in my web app for storing session data. It's performing badly, and I can't figure out why. Slow query log shows updating a row takes anything from 6 to 60 seconds.
CREATE TABLE `sessions` (
`id` char(40) COLLATE utf8_unicode_ci NOT NULL,
`payload` text COLLATE utf8_unicode_ci NOT NULL,
`last_activity` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `session_id_unique` (`id`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The PK is a char(40) which stores a unique session hash generated by the framework this project uses (Laravel).
(I'm aware of the redundancy of the PK and unique index, but I've tried all combinations and it doesn't have any impact on performance in my testing. This is the current state of it.)
The table is small - fewer than 200 rows.
A typical query from the slow query log looks like this:
INSERT INTO sessions (id, payload, last_activity)
VALUES ('d195825ddefbc606e9087546d1254e9be97147eb',
'YTo1OntzOjY6Il90b2tlbiI7czo0MDoi...around 700 chars...oiMCI7fX0=',
1405679480)
ON DUPLICATE KEY UPDATE
payload=VALUES(payload), last_activity=VALUES(last_activity);
I've done obvious things like checking the table for corruption. I've tried adding a dedicated PK column as an auto increment int, I've tried without a PK, without the unique index, swapping the text column for a very very large varchar, you name it.
I've tried switching the table to use MyISAM, and it's still slow.
Nothing I do seems to make any difference - the table performs very slowly.
My next thought was the query. This is generated by the framework, but I've tested hacking it out into an UPDATE with an INSERT if that fails. The slowness continued on the UPDATE statement.
I've read a lot of questions about slow INSERT and UPDATE statements, but those usually related to bulk transactions. This is just one insert/update per user per request. The site is not remotely busy, and it's on its own VPS with plenty of resources.
What could be causing the slowness?
This is not an answer but SE comment length is too damn short. So.
What happens if you run an identical INSERT ... ON DUPLICATE KEY UPDATE ... statement directly on the command line? Please try with and without actual usage of the application. The application may be artificially slowing down this UPDATE (for example, in InnoDB a transaction might be opened but only committed after a lot of time has been consumed; you also tested with MyISAM, which does not support transactions, but perhaps in that case an explicit LOCK could account for the same effect; whether the framework uses such a trick, I'm not sure, I don't know Laravel). Try to benchmark to see if there is a concurrency effect.
Another question: is this a single server? Or is it a master that replicates to one or more slaves?
Apart from this question, a few observations:
The values for id are hex strings, but the column is Unicode. This means 3*40 bytes are reserved while only 40 are utilized. This is a waste that will make things inefficient in general. It would be much better to use BINARY or ASCII as the character set. Better yet, change the id column to the BINARY data type and store the (unhexed) binary value.
A hash as the PK of an InnoDB table will scatter the data across pages. Using an AUTO_INCREMENT PK, or not explicitly declaring a PK at all (which causes InnoDB to create an auto-increment PK of its own internally), is a good idea.
It looks like the payload is base64 encoded, yet again the character set is specified to be Unicode. ASCII or binary (the character set, not the data type) is much more appropriate.
The HASH keyword in the unique index on id is meaningless. InnoDB does not implement HASH indexes. Unfortunately MySQL is perfectly silent about this (see http://bugs.mysql.com/bug.php?id=73326).
(While this list does offer angles for improvement, it seems unlikely that the extreme slowness can be fixed with it; there must be something else going on.)
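A hedged sketch pulling several of the above suggestions together (a BINARY id holding the unhexed SHA-1, a non-Unicode payload, a single PK, no HASH keyword):
CREATE TABLE `sessions` (
  `id` BINARY(20) NOT NULL,              -- UNHEX() of the 40-char hex hash
  `payload` BLOB NOT NULL,               -- base64 text does not need utf8
  `last_activity` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
-- writes then become, e.g.:
-- INSERT INTO sessions (id, payload, last_activity)
-- VALUES (UNHEX('d195825ddefbc606e9087546d1254e9be97147eb'), 'YTo1...', 1405679480)
-- ON DUPLICATE KEY UPDATE payload=VALUES(payload), last_activity=VALUES(last_activity);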
Frustratingly, the answer in this case was a bad disk. One of the disks in the storage array had gone bad, and so writes were taking forever to complete. Simply that.

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30000 rows, I get a significant slowdown (around 200 insertions/sec), and then a slower, but still noticeable, further slowdown.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was memory taken by the indexes, but those are done on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30000 rows taking so much memory, even more so just the indexes (the whole mysql data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect some configuration issue, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM, 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files for tables
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs after setting innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated sever, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
Good luck!
Your indexes may just need to be analyzed and optimized during the insert; they gradually get out of shape as you go along. The other option, of course, is to disable indexes entirely while you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
That is, ANALYZE TABLE and OPTIMIZE TABLE.
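For the two tables above, that would be (a sketch):
ANALYZE TABLE events, index_fpid;
OPTIMIZE TABLE events, index_fpid;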
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat-out performance, using LOAD DATA INFILE will improve your insert speed considerably.
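A hedged sketch of that for the events table, assuming a tab-separated input file with hex-encoded id and body columns (added_id fills itself via AUTO_INCREMENT):
LOAD DATA LOCAL INFILE '/tmp/events.tsv'
  INTO TABLE events
  FIELDS TERMINATED BY '\t'
  (@id_hex, @body_hex)
  SET id = UNHEX(@id_hex), body = UNHEX(@body_hex);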