Thanks in advance for your help!
Question
While executing a large volume of concurrent writes to a simple table using the MySQL MEMORY storage engine, is there an objective performance difference between having all the writes be (A) updates to a very small table (say 100 rows) vs (B) inserts? By performance difference I mean speed/locking, but if there are other significant variables please say so. Since the answer to such a general question is often "it depends", I've written scenarios (A) & (B) below to provide context and pin down the details, in the hope of allowing an objective answer.
Example Scenarios
I've written scenarios (A) & (B) below to help illustrate and provide context. You can assume an excess of RAM and CPU, MySQL 5.7 if it matters, and that the scenarios are simplified; I'm using the MEMORY engine to remove disk I/O from the equation (and I'm aware it uses table-level locking). Thanks again for your help!
~ Scenario A ~
1) I've got a memory table with ~100 rows like this:
CREATE TABLE cache (
campaign_id MEDIUMINT UNSIGNED NOT NULL,
sum_clicks SMALLINT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (campaign_id)
) Engine=MEMORY DEFAULT CHARSET=latin1;
2) And ~1k worker threads populating said table like so:
UPDATE cache SET sum_clicks = sum_clicks + x WHERE campaign_id = y;
3) And finally, a job that runs every ~hour which does:
CREATE TABLE IF NOT EXISTS next_cache LIKE cache;
INSERT INTO next_cache (campaign_id) SELECT id FROM campaigns;
RENAME TABLE cache TO old_cache, next_cache TO cache;
SELECT * FROM old_cache...into somewhere else;
TRUNCATE old_cache;
RENAME TABLE old_cache TO next_cache; -- for next time
~ Scenario B ~
1) I've got a memory table like this:
CREATE TABLE cache (
campaign_id MEDIUMINT UNSIGNED NOT NULL,
sum_clicks SMALLINT UNSIGNED NOT NULL DEFAULT 0
) Engine=MEMORY DEFAULT CHARSET=latin1;
2) And ~1k worker threads populating said table like so:
INSERT INTO cache VALUES (y,x);
3) And finally, a job that runs every ~hour which does:
(~same thing as scenario A's 3rd step)
Post Script
For those searching Stack Overflow for this, I found these Stack Overflow questions & answers helpful, especially if you are open to using storage engines beyond the MEMORY engine: concurrent-insert-with-mysql and
insert-vs-update-mysql-7-million-rows
With 1K worker threads hitting this table, they will seriously stumble over each other. Note that MEMORY uses table locking. You are likely to be better off with an InnoDB table.
Regardless of the Engine, do 'batch' INSERTs/UPDATEs whenever practical. That is, insert/update multiple rows in a single statement and/or in a single transaction.
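For example, a batched version of Scenario B's insert could look like the following sketch (example values only; the transaction wrapper matters for a transactional engine such as InnoDB, while MEMORY simply ignores it):
START TRANSACTION;
INSERT INTO cache (campaign_id, sum_clicks) VALUES
  (1, 3),   -- example rows; one multi-row INSERT replaces many single-row statements
  (2, 7),
  (3, 1);
COMMIT;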
Tips on high-speed ingestion -- My 'staging' table is very similar to your 'cache', though used for a different purpose.
Related
I want to test write-intensive workloads against the InnoDB and MyRocks engines of MySQL. For this purpose, I use sysbench to benchmark. My requirements are:
multiple threads writing concurrently to the same table;
support for batch inserts (each insert transaction inserts a bulk of records).
I checked all the pre-made sysbench tests and I don't see any that satisfy my requirements.
oltp_write_only: supports multiple threads that write to the same table, but this test doesn't have a bulk insert option.
bulk_insert: supports multiple threads, but each thread writes to a different table.
Are there any pre-made sysbench tests that satisfy my requirements? If not, can I find custom Lua scripts somewhere that already do this?
(from Comment:)
CREATE TABLE IF NOT EXISTS `tableA` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`user_id` VARCHAR(63) NOT NULL DEFAULT '',
`data` JSON NOT NULL DEFAULT '{}',
PRIMARY KEY (`id`),
UNIQUE INDEX `user_id_UNIQUE` (`user_id` ASC)
) ENGINE = InnoDB;
(From a MySQL point of view...)
Toss id and the PK -- saves 8 bytes per row.
Promote UNIQUE(user_id) to PRIMARY KEY(user_id) -- might save 40 bytes per row (depends on LENGTH(user_id)).
Doing those will
Shrink the disk I/O needed (providing some speedup)
Eliminate one of the indexes (probably a significant part of the post-load processing)
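A sketch of tableA with those two changes applied (the JSON column's DEFAULT is dropped here, since MySQL only allows expression defaults for JSON columns from 8.0.13 on):
CREATE TABLE IF NOT EXISTS `tableA` (
  `user_id` VARCHAR(63) NOT NULL,
  `data` JSON NOT NULL,
  PRIMARY KEY (`user_id`)   -- the former UNIQUE index, promoted; `id` is gone
) ENGINE = InnoDB;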
Run OS monitoring tools to see what percentage of the I/O is being consumed. That is likely to be the limiting factor.
Benchmarking products are handy for limited situations. For your situation (and many others), it is best to build your product and time it.
Another thought...
What does the JSON look like? If the JSON has a simple structure (a consistent set of key:value pairs), then the disk footprint might be half as much (hence speed doubling) if you made individual columns. The processing to change from JSON to individual columns would be done in the client, which may (or may not) cancel out the savings I predict.
If the JSON is more complex, there still might be savings by pulling out "columns" that are always present.
If the JSON is "big", then compress it in the client, then write to a BLOB. This may shrink the disk footprint and network bandwidth by a factor of 3.
You mentioned 250GB for 250M rows? That's 1000 bytes/row. That means the JSON averages 700 bytes? (Note: there is overhead.) Compressing the JSON column into a BLOB would shrink to maybe 400 bytes/row total, hence only 100GB for 250M rows.
{"b": 100} takes about 10 bytes. If b could be stored in a 2-byte SMALLINT column, that would shrink the record considerably.
Another thing: If you promote user_id to PK, then this is worth considering: Use a file sort to sort the table by user_id before loading it. This is probably faster than INSERTing the rows 'randomly'. (If the data is already sorted, then this extra sort would be wasted.)
Thanks in advance for your answers, and sorry for my bad English; I'm not a native speaker.
We're developing a mobile game with a backend. In this mobile game we've got a money system, and we keep track of each transaction for verification purposes.
In order to read a user's balance, we've got an intermediary table in which the balance is updated on each transaction, so the transaction table is never read directly by users; this reduces load under high traffic.
The transaction table is only read from time to time in the back office.
Here is the schema of the transaction table:
create table money_money_transaction (
`id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
`userID` INT UNSIGNED NOT NULL,
`amount` INT NOT NULL,
`transactionType` TINYINT NOT NULL,
`created` DATETIME NOT NULL,
CONSTRAINT money_money_transaction_userID FOREIGN KEY (`userID`) REFERENCES `user_user` (`id`)
ON DELETE CASCADE
);
We plan to have a lot of users, and the transaction table could grow up to 1 billion rows, so my questions are:
Will it affect the performance of other tables?
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Will MySQL be able to scale correctly up to a billion rows? Knowing that we do mostly inserts, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
You might consider MyRocks (see http://myrocks.io), which is a third-party storage engine that is designed for fast INSERT speed and compressed data storage. I won't make a recommendation that you should switch to MyRocks, because I don't have enough information to make an unequivocal statement about it for your workload. But I will recommend that it's worth your time to evaluate it and see if it works better for your application.
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read tables?
Yes, MySQL (assuming InnoDB storage engine) stores partial tables in RAM, in the buffer pool. It breaks down tables into pages, and fits pages in the buffer pool as queries request them. It's like a cache. Over time, the most requested pages stay in the buffer pool, and others get evicted. So it more or less balances out to serve most of your queries as quickly as possible. Read https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html for more information.
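If you want to check or adjust how much RAM the buffer pool gets, a quick sketch (the 8 GB figure is only an example; online resizing needs MySQL 5.7.5+, otherwise set it in my.cnf and restart):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;  -- 8 GB, example value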
Will it affect the performance of other tables?
Tables don't have performance — queries have performance.
The buffer pool has a fixed size. Suppose you have six tables that need to share it; their pages must all fit into the same buffer pool. There's no way to set priorities for each table, or dedicate buffer pool space to certain tables, or "lock" them in RAM. All pages of all tables share the same buffer pool. So as your queries request pages from various tables, they do affect each other in the sense that frequently-requested pages from one table may evict pages from another table.
Will MySQL be able to scale correctly up to a billion rows?
MySQL has many features to try to help performance and scalability (those are not the same thing). Again, queries have performance, not tables. A table without queries just sits there. It's the queries that get optimized by different techniques.
Knowing that we do mostly inserts, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Indexes do add overhead to inserts. You can't eliminate the primary key index; it is a necessary part of every table. But, for example, you might find it worthwhile to drop your FOREIGN KEY, which includes an index.
Usually, most tables are read more than they are written to, so it's worth keeping an index to help reads (or even an UPDATE or DELETE that uses a WHERE clause). But if your workload is practically all INSERT, maybe the extra index for the foreign key is purely overhead and gives no benefit for any queries.
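A sketch of what that could look like for the schema above (check SHOW CREATE TABLE first, since the name of the index MySQL generated for the constraint may differ):
ALTER TABLE money_money_transaction
  DROP FOREIGN KEY money_money_transaction_userID;
ALTER TABLE money_money_transaction
  DROP INDEX money_money_transaction_userID;  -- name assumed; verify before running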
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
I worked on benchmarks of Aurora in early 2017, and found that for the application we tested, it was not good for high write traffic. You should always test it for your application, instead of depending on the guess of someone on the internet. But I predict that Aurora in its current form (circa 2017) will completely suck for your all-write workload.
I have a MyISAM table with around 10M rows. For a single 'SELECT ... WHERE IN' query (with ~5000 values) it takes ~0.05s to get ~50K rows. However, when performing 100 concurrent similar queries the time rises to ~18s. It makes no sense to me since I have all the indexes in memory and the amount of data returned is not so big in size (~500Kb). Any idea what could be making this so slow? Thank you.
CREATE TABLE data (
A bigint(20) UNSIGNED NOT NULL,
B int(10) UNSIGNED NOT NULL,
C smallint(5) UNSIGNED NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
ALTER TABLE data ADD KEY A_key (A);
The used query:
SELECT * FROM data WHERE A IN (VAL1, VAL2, ...);
You don't say that you're using a stored procedure, so I would start there. A stored proc is compiled, which means that it should leverage the 'caching' of the query's execution plan. Since that plan is against in-memory data, you'll get some more performance out of it.
Even though plan caching differs from server to server, you can still leverage a procedure for performance. For example, you could make several procedures for your most-common queries, although this often requires application/client changes to use those procs. I have never experimented with having a single proc that checks for query param range(s) and then uses a CASE to call one of several static queries.
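For what it's worth, a minimal sketch of one such procedure for a single-value lookup (the procedure name is my own; the IN-list variant would need the list assembled in the client):
DELIMITER //
CREATE PROCEDURE get_rows_for_a (IN a_val BIGINT UNSIGNED)
BEGIN
  -- same lookup as the ad hoc query, for one value of A
  SELECT A, B, C FROM data WHERE A = a_val;
END //
DELIMITER ;
CALL get_rows_for_a(12345);  -- example value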
5000 rows in 50ms is not bad -- probably most of that time is shoveling the data onto the network.
Assuming that is really what the schema looks like, let me explain what is happening in MyISAM. (Not all of this applies to InnoDB, which you should migrate to.)
INDEX(A) is manifested in a BTree of pairs: [A, record number]. For each of the 5000 A values, drill down the BTree. (The BTree will be about 4 levels deep for a 10M-row table.) The BTree is in 1KB blocks and cached in the key_buffer. What is the value of key_buffer_size? How much RAM do you have? What does SHOW TABLE STATUS say? (I want to use those to determine whether the size should be adjusted.)
Once the record number is found, then it does a 'seek' into the .MYD file to find the record, read it (15 bytes), and send it out. The OS caches these blocks, not MySQL.
That's several thousand potentially cached disk reads. 50ms is enough time to do only about 5 spinning disk reads, so I would say that most, if not all, reads were avoided because of the two caches.
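To gather the numbers asked about above, something like this would do (names as in the question's schema):
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW TABLE STATUS LIKE 'data';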
100 concurrent threads... I assume each is reading 50 rows? Let me list the bottlenecks:
100 vs 1 connections to spin up.
100 threads vs 1 contending for the CPU
100 threads vs 1 contending for the key_buffer
100 threads vs 1 contending for the OS cache
100 threads vs 1 contending for the Query Cache (if you have this turned on; here's a reason to turn it off)
The last major improvements to MyISAM were back in the days of single-core machines. Meanwhile, InnoDB has made great strides in properly handling multiple threads. (Still, 100 is beyond the reach of even the latest release.) (MyISAM has been removed from MySQL 8.0.)
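If you do migrate, the conversion itself is a single statement; a sketch (it rebuilds the table, so allow time and disk space):
ALTER TABLE data ENGINE=InnoDB;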
I am trying to get a better understanding about insertion speed and performance patterns in mysql for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30000 rows I get a significant slowdown (around 200 insertions/sec), and then a further slowdown that is slower but still noticeable.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was memory taken by the index, but those are done on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30000 rows taking so much memory, even more so just the indexes (the whole mysql data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect some configuration issue, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM, 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files for tables
[EDIT]
Thank you very much to Eric Holmberg; he nailed it. Here are the graphs after setting innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
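For illustration only, a hypothetical RANGE-partitioned layout might look like the sketch below; note that on the real events table every UNIQUE key (including id) would have to contain the partitioning column, so the schema would need adjusting first:
CREATE TABLE events_partitioned (
  added_id INT NOT NULL AUTO_INCREMENT,
  body MEDIUMBLOB,
  PRIMARY KEY (added_id)
) ENGINE=InnoDB
PARTITION BY RANGE (added_id) (
  PARTITION p0 VALUES LESS THAN (1000000),   -- boundaries are examples only
  PARTITION p1 VALUES LESS THAN (2000000),
  PARTITION p2 VALUES LESS THAN MAXVALUE
);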
Good luck!
Your indexes may just need to be analyzed and optimized during the inserts; they gradually get out of shape as you go along. The other option, of course, is to disable indexes entirely when you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
ANALYZE TABLE / OPTIMIZE TABLE
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.
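A sketch with a hypothetical file path, assuming the BINARY(16) id is exported as hex (adjust the column list and delimiters to match your export):
LOAD DATA INFILE '/tmp/events.csv'
INTO TABLE events
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(@hex_id, body)
SET id = UNHEX(@hex_id);  -- decode the hex-encoded id on the way in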
Does anyone know why I get an overhead of 131.0 MiB on a newly created table (zero rows)?
I'm using phpMyAdmin and the code of my script is:
CREATE TABLE IF NOT EXISTS `mydb`.`mytable` (
`idRol` INT NOT NULL AUTO_INCREMENT ,
`Rol` VARCHAR(45) NOT NULL ,
PRIMARY KEY (`idRol`) )
ENGINE = InnoDB;
Thanks in advance.
InnoDB uses a shared tablespace. That means that by default all the tables, regardless of database, are stored in a single file in the filesystem. This differs from, for example, MyISAM, which stores each table in its own set of files.
The behaviour of InnoDB can be changed, although I don't think it's really necessary in this case. See Using Per-Table Tablespaces.
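If you do want one .ibd file per table, a sketch (on MySQL 5.6+ this is already the default; on older versions it may have to go in my.cnf and take effect after a restart, and only tables created afterwards are affected):
SET GLOBAL innodb_file_per_table = ON;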
The overhead is probably the space left by deleted rows, and InnoDB will reuse it when you insert new data. It's nothing to be concerned about.
It might be because MySQL generated an index on 'idRol'.
Storing an index takes some space, but I am not sure if this is the reason. It's only a guess. I'm not a DBA.