Are random primary keys a pitfall for MySQL Cluster? - mysql

I understand that the InnoDB engine relies heavily on primary keys for its storage mechanisms (index layouts, etc.), and that it is consequently a bad idea to use a non-sequential primary key (say, a random 15-digit integer), because it will cause frequent (not to say systematic) rebuilds of the primary key's B-tree, thus slowing insertions on the table exponentially.
I was considering setting up a MySQL Cluster to host my application databases, which need to support a write-intensive load (around 40% writes on about 2M operations a day). Given that NDB uses primary key hashes to distribute records between the cluster's nodes, I was wondering whether this limitation also applies to this engine.
My first guess would be that, on the contrary, the randomness would help distribute the data evenly, but I can't find precise information about that. So, does anyone have any insight on this matter?
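One rough way to test that guess is to hash both sequential and random 15-digit keys and count how many land in each partition. The sketch below is only a model: the MD5-modulo scheme, the four partitions, and the function names are assumptions made for illustration, and NDB's real partitioning function is internal to the engine and may differ in detail.

```python
import hashlib
import random
from collections import Counter

def partition_for(key: int, partitions: int) -> int:
    """Map a primary key to a partition by hashing it (illustrative scheme)."""
    digest = hashlib.md5(str(key).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % partitions

def spread(keys, partitions=4):
    """Count how many keys land in each partition."""
    return Counter(partition_for(k, partitions) for k in keys)

rng = random.Random(42)  # fixed seed so runs are repeatable
sequential_keys = range(1, 20001)
random_keys = [rng.randrange(10**14, 10**15) for _ in range(20000)]

sequential_spread = spread(sequential_keys)
random_spread = spread(random_keys)
print("sequential:", sorted(sequential_spread.items()))
print("random:    ", sorted(random_spread.items()))
```

In this model the hash scrambles the key before placement, so both key patterns should come out close to even. That would suggest that randomness in the key neither helps nor hurts distribution when placement is hash-based.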

Related

Improving MySQL performance on RDS by partitioning

I am trying to improve the performance of some large tables (they can hold millions of records) in a MySQL 8.0.20 DB on RDS.
Scaling up DB instance and IOPS is not the way to go, as it is very expensive (the DB is live 24/7).
Proper indexes (including composite ones) do already exist to improve the query performance.
The DB is mostly read-heavy, with occasional massive writes - when these writes happen, reads can be just as massive at the same time.
I thought about doing partitioning. Since MySQL doesn't support vertical partitioning, I considered doing horizontal partitioning - which should work very well for these large tables, as they contain activity records from dozens/hundreds of accounts, and storing each account's records in a separate partition makes a lot of sense to me.
But these tables do contain some foreign key constraints, which rules out using MySQL's horizontal partitioning (see Restrictions and Limitations on Partitioning):
Foreign keys not supported for partitioned InnoDB tables. Partitioned tables using the InnoDB storage engine do not support foreign keys. More specifically, this means that the following two statements are true:
No definition of an InnoDB table employing user-defined partitioning may contain foreign key references; no InnoDB table whose definition contains foreign key references may be partitioned.
No InnoDB table definition may contain a foreign key reference to a user-partitioned table; no InnoDB table with user-defined partitioning may contain columns referenced by foreign keys.
What are my options, other than doing "sharding" by using separate tables to store activity records on a per account basis? That would require a big code change to accommodate such tables. Hopefully there is a better way, that would only require changes in MySQL, and not the application code. If the code needs to be changed - the less the better :)
storing each account's records in a separate partition makes a lot of sense to me
Instead, have the PRIMARY KEY start with acct_id. This provides performance at least as good as PARTITION BY acct_id, saves disk space, and "clusters" an account's data together for "locality of reference".
The DB is mostly read-heavy
Replicas allow 'infinite' scaling of reads. But if you are not overloading the single machine now, there may be no need for this.
with occasional massive writes
Let's discuss techniques to help with that. Please explain what those writes entail -- hourly/daily/sporadic? replace random rows / whole table / etc? keyed off what? Etc.
Proper indexes (including composite ones) do already exist to improve the query performance.
Use the slowlog (with long_query_time = 1 or lower) to verify. Use pt-query-digest to find the top one or two queries. Show them to us -- we can help you "think out of the box".
read-heavy
Is the working set size less than innodb_buffer_pool_size? That is, are you CPU-bound and not I/O-bound?
More on PARTITION
PRIMARY KEY(acct_id, ..some other columns..) orders the data primarily on acct_id and makes this efficient: WHERE acct_id=123 AND ....
PARTITION BY .. (acct_id) -- A PARTITION is implemented as a separate "table". "Partition pruning" is the act of deciding which partition(s) are needed for the query. So WHERE acct_id=123 AND ... will first do that pruning, then look for the row(s) in that "table" to handle the AND .... Hopefully, there is a good index (perhaps the PRIMARY KEY) to handle that part of the filtering.
The pruning sort of takes the place of one level of the BTree. It is hard to predict which will be slower or faster.
Note that when partitioning by, say, acct_id, it is usually not efficient to start the index with that column. (However, it would still need to appear later in the PK.)
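To illustrate the "locality of reference" point about PRIMARY KEY(acct_id, ...), here is a small sketch in which a sorted Python list stands in for the order a clustered primary key imposes on rows. The column name activity_id, the account numbers, and the helper function are hypothetical; this models the ordering only, not InnoDB's page layout.

```python
import bisect

# Hypothetical rows keyed by (acct_id, activity_id); the sorted list stands in
# for the order in which a clustered PRIMARY KEY(acct_id, activity_id) keeps rows.
rows = sorted((acct_id, activity_id)
              for acct_id in range(100, 130)
              for activity_id in range(1, 51))

def rows_for_account(sorted_rows, acct_id):
    """Return the bounds and contents of the contiguous slice for one account."""
    lo = bisect.bisect_left(sorted_rows, (acct_id,))
    hi = bisect.bisect_left(sorted_rows, (acct_id + 1,))
    return lo, hi, sorted_rows[lo:hi]

lo, hi, account_rows = rows_for_account(rows, 123)
print(f"account 123 occupies positions {lo}..{hi - 1} ({len(account_rows)} rows)")
```

All rows for one account sit next to each other, so WHERE acct_id=123 AND ... becomes a single range lookup followed by a sequential read, without any partitioning.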
Big Deletes
There are several ways to do a "big delete" while minimizing the impact on the system. Partitioning by date is optimal but does not sound viable for your type of data. Check out the others listed here: http://mysql.rjweb.org/doc.php/deletebig
Since you say that the deletion is usually less than 15%, the "copy over what needs to be kept" technique is not applicable either.
Before sharding or partitioning, first analyze your queries to make sure they are as optimized as you can make them. This usually means designing indexes specifically to support the queries you run. You might like my presentation How to Design Indexes, Really (video).
Partitioning isn't as much a solution as people think. It has many restrictions, including the foreign key issue you found. Besides that, it only improves queries that can take advantage of partition pruning.
Also, I've done a lot of benchmarking of Amazon RDS for my current job and also a previous job. RDS is slow. It's really slow. It uses remote EBS storage, so it's bound to incur overhead for every read from storage or write to storage. RDS is just not suitable for any application that needs high performance.
Amazon Aurora is significantly better on latency and throughput. But it's also very expensive. The more you use it, the more you use I/O requests, and they charge extra for that. For a busy app, you end up spending as much as you did for RDS with high provisioned IOPS.
The only way I found to get high performance in the cloud is to forget about managed databases like RDS and Aurora, and instead install and run your own instance of MySQL on an EC2 instance with locally-attached NVMe storage. This means the i3 family of EC2 instances. But local storage is ephemeral instance storage, so if the instance restarts, you lose your data. So you must add one or more replicas and have a failover plan.
If you need an OLTP database in the cloud, and you also need top-tier performance, you either have to spend $$$ for a managed database, or else you need to hire full-time DevOps and DBA staff to run it.
Sorry to give you the bad news, but the TANSTAAFL adage remains true.

MySQL performance on large, write-only table

Thanks in advance for your answers, and sorry for my bad English; I'm not a native speaker.
We're currently developing a mobile game with a backend. In this mobile game, we've got a money system, and we keep track of each transaction for verification purposes.
In order to read a user's balance, we've got an intermediary table in which the balance is updated on each transaction, so the transaction table is never read directly by the users. This reduces load under high traffic.
The transaction table is only read from time to time in the back office.
Here is the schema of the transaction table :
create table money_money_transaction (
`id` BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
`userID` INT UNSIGNED NOT NULL,
`amount` INT NOT NULL,
`transactionType` TINYINT NOT NULL,
`created` DATETIME NOT NULL,
CONSTRAINT money_money_transaction_userID FOREIGN KEY (`userID`) REFERENCES `user_user` (`id`)
ON DELETE CASCADE
);
We plan to have a lot of users, and the transaction table could grow up to 1 billion rows, so my questions are:
Will it affect the performance of other tables?
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read table?
Will MySQL be able to scale correctly up to this billion rows? Note that we mostly do inserts, that the only index is on the id (the id is needed for details), and that there is no "bulk insert" (there will not be 1M inserts to do concurrently on this table).
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
You might consider MyRocks (see http://myrocks.io), which is a third-party storage engine that is designed for fast INSERT speed and compressed data storage. I won't make a recommendation that you should switch to MyRocks, because I don't have enough information to make an unequivocal statement about it for your workload. But I will recommend that it's worth your time to evaluate it and see if it works better for your application.
If the database is too large to fit in RAM, does MySQL have some sort of optimisation, storing in RAM only the most-read table?
Yes, MySQL (assuming InnoDB storage engine) stores partial tables in RAM, in the buffer pool. It breaks down tables into pages, and fits pages in the buffer pool as queries request them. It's like a cache. Over time, the most requested pages stay in the buffer pool, and others get evicted. So it more or less balances out to serve most of your queries as quickly as possible. Read https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool.html for more information.
Will it affect the performance of other tables?
Tables don't have performance — queries have performance.
The buffer pool has fixed size. Suppose you have six tables that need to share it, their pages must fit into the same buffer pool. There's no way to set priorities for each table, or dedicate buffer pool space for certain tables or "lock" them in RAM. All pages of all tables share the same buffer pool. So as your queries request pages from various tables, they do affect each other in the sense that frequently-requested pages from one table may evict pages from another table.
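The way pages from different tables compete can be sketched with a toy cache. The model below uses plain LRU eviction and made-up table names and sizes, all of which are assumptions; InnoDB's real algorithm is a more scan-resistant LRU variant, so treat this as an illustration of the sharing, not of the exact eviction behaviour.

```python
from collections import OrderedDict

class BufferPool:
    """Toy page cache with plain LRU eviction, shared by every table."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # (table, page_no) -> None, oldest first

    def read(self, table, page_no):
        """Touch a page; return True on a cache hit, False on a miss."""
        key = (table, page_no)
        hit = key in self.pages
        if hit:
            self.pages.move_to_end(key)
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
            self.pages[key] = None
        return hit

    def cached_pages(self, table):
        """Count how many pages of one table are currently cached."""
        return sum(1 for t, _ in self.pages if t == table)

pool = BufferPool(capacity_pages=100)
for page in range(50):            # warm the pool with a small "balances" table
    pool.read("balances", page)
before = pool.cached_pages("balances")
for page in range(200):           # a large scan of "transactions" follows
    pool.read("transactions", page)
after = pool.cached_pages("balances")
print(f"balances pages cached: {before} before the scan, {after} after")
```

In this simplified model, a big enough pass over one table pushes every page of the other table out of the pool, which is the sense in which queries on one table can slow down queries on another.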
Will MySQL be able to scale correctly up to this billion rows?
MySQL has many features to try to help performance and scalability (those are not the same thing). Again, queries have performance, not tables. A table without queries just sits there. It's the queries that get optimized by different techniques.
Knowing we do mostly insert and that the only index is on the id (the id is needed for details) and that there is no "bulk insert" (there will not be 1M insert to do concurrently on this table)
Indexes do add overhead to inserts. You can't eliminate the primary key index, this is a necessary part of every table. But for example, you might find it worthwhile to drop your FOREIGN KEY, which includes an index.
Usually, most tables are read more than they are written to, so it's worth keeping an index to help reads (or even an UPDATE or DELETE that uses a WHERE clause). But if your workload is practically all INSERT, maybe the extra index for the foreign key is purely overhead and gives no benefit for any queries.
Also, we're on an RDS server, so we could switch to Aurora and try master-master or master-slave replication if needed. Do you think it would help in this case?
I worked on benchmarks of Aurora in early 2017, and found that for the application we tested, it was not good for high write traffic. You should always test it for your application, instead of depending on the guess of someone on the internet. But I predict that Aurora in its current form (circa 2017) will completely suck for your all-write workload.

Offline synchronization (Performance UUID as a primary key)

I'm working on a project where some clients have internet connection issues.
When the internet connection does not work, we store information in a database located on the client PC.
When the connection comes back, we synchronise the local DB with the central one.
To avoid conflicts in record ids between the 2 databases we will use UUID [char(36)] instead of autoincrements.
The databases are MySQL with the InnoDB engine.
My question is: will this have an impact on the performance of selects, joins, etc.?
Should we use varbinary(16) instead of char(36) to improve performance?
Note: we already have an existing database with 4 GB of data.
We are also open to other suggestions to resolve this offline/online issue.
Thanks
Since you didn't say which database engine is being used (MyISAM or InnoDB) then it's difficult to say what's the magnitude of the performance implication.
However, to cut the story short - yes, there will be performance implications for larger sets of data.
The reason for that is that you require 36 bytes for each primary key index entry, as opposed to 4 bytes (8 if BIGINT) for an integer.
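For the varbinary(16) option raised in the question, the size difference can be checked by packing the UUID on the application side. This is a minimal sketch using Python's standard uuid module; the helper names are made up for illustration. (MySQL 8.0 also provides UUID_TO_BIN() and BIN_TO_UUID() for doing the conversion in SQL, if that version is available.)

```python
import uuid

def uuid_to_binary16(text: str) -> bytes:
    """Pack the 36-character text form of a UUID into 16 raw bytes."""
    return uuid.UUID(text).bytes

def binary16_to_uuid(raw: bytes) -> str:
    """Restore the canonical 36-character text form from 16 raw bytes."""
    return str(uuid.UUID(bytes=raw))

text_id = str(uuid.uuid4())
raw_id = uuid_to_binary16(text_id)
print(len(text_id), "characters as text,", len(raw_id), "bytes packed")
```

The packed form is less than half the size of the text form, and since InnoDB secondary indexes carry the primary key along with each entry, the saving applies to every index on the table.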
I'll give you a hint how you can avoid conflicts:
The first is to use a different autoincrement offset on each database. If you have 2 databases, you'd have autoincrement values be odd on one and even on the other.
The second is to use a compound primary key. If you define your primary key as PRIMARY KEY(id, server_id) then you won't get any clashes when you replicate the data into the central DB.
You'll also know where it came from.
The downside is that you need to supply the server_id to every query you do.
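Both suggestions can be sketched in a few lines. In MySQL the odd/even scheme is normally configured through the auto_increment_increment and auto_increment_offset system variables; the Python below only simulates the resulting id sequences, and the function name and server numbering are assumptions for illustration.

```python
from itertools import count, islice

def id_stream(offset: int, increment: int):
    """Yield offset, offset + increment, ... like an interleaved auto-increment."""
    return count(start=offset, step=increment)

# Option 1: interleaved sequences, one per database.
central_ids = list(islice(id_stream(offset=1, increment=2), 5))
client_ids = list(islice(id_stream(offset=2, increment=2), 5))
print("central:", central_ids)
print("client: ", client_ids)

# Option 2: plain counters on each side, disambiguated by a server_id column.
compound_keys = {(row_id, server_id)
                 for server_id in (1, 2)
                 for row_id in range(1, 6)}
print("distinct compound keys:", len(compound_keys))
```

With the first option the two sequences never overlap, so rows can be merged into the central DB without renumbering. With the second, the same id can exist on both sides and the pair (id, server_id) stays unique.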

Insertion speed slowdown as the table grows in mysql

I am trying to get a better understanding of insertion speed and performance patterns in MySQL for a custom product. I have two tables to which I keep appending new rows. The two tables are defined as follows:
CREATE TABLE events (
added_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id BINARY(16) NOT NULL,
body MEDIUMBLOB,
UNIQUE KEY (id)) ENGINE InnoDB;
CREATE TABLE index_fpid (
fpid VARCHAR(255) NOT NULL,
event_id BINARY(16) NOT NULL UNIQUE,
PRIMARY KEY (fpid, event_id)) ENGINE InnoDB;
And I keep inserting new objects into both tables (for each new object, I insert the relevant information into both tables in one transaction). At first, I get around 600 insertions/sec, but after ~30000 rows, I get a significant slowdown (around 200 insertions/sec), and then a more gradual, but still noticeable, slowdown.
I can see that as the table grows, the IO wait numbers get higher and higher. My first thought was memory taken by the index, but the tests are run on a VM which has 768 MB and is dedicated to this task alone (2/3 of the memory is unused). Also, I have a hard time seeing 30000 rows taking so much memory, even more so just the indexes (the whole MySQL data dir is < 100 MB anyway). To confirm this, I allocated very little memory to the VM (64 MB), and the slowdown pattern is almost identical (i.e. the slowdown appears after the same number of insertions), so I suspect a configuration issue, especially since I am relatively new to databases.
The pattern looks as follows:
I have a self-contained python script which reproduces the issue, that I can make available if that's helpful.
Configuration:
Ubuntu 10.04, 32-bit, running on KVM, with 760 MB allocated to it.
MySQL 5.1, out-of-the-box configuration with separate files for tables
[EDIT]
Thank you very much to Eric Holmberg, he nailed it. Here are the graphs after fixing the innodb_buffer_pool_size to a reasonable value:
Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.
# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M
The primary keys are B-trees, so inserts will always take O(log N) time, and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.
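Why running out of cache hurts so much depends on the key pattern, and a small simulation can show it. The sketch below assumes fixed-size leaf pages, a plain LRU cache, and invented sizes (KEYS_PER_PAGE, CACHE_PAGES); it is a model of index page locality, not a measurement of InnoDB.

```python
import random
from collections import OrderedDict

KEYS_PER_PAGE = 100   # assumed number of index entries per leaf page
CACHE_PAGES = 50      # assumed number of leaf pages that fit in memory

def cache_misses(keys):
    """Count leaf-page misses for a stream of inserted keys under LRU caching."""
    cache, misses = OrderedDict(), 0
    for key in keys:
        page = key // KEYS_PER_PAGE   # leaf page that owns this key range
        if page in cache:
            cache.move_to_end(page)
        else:
            misses += 1
            if len(cache) >= CACHE_PAGES:
                cache.popitem(last=False)  # evict least recently used page
            cache[page] = None
    return misses

n = 100_000
rng = random.Random(0)
sequential = range(n)                              # AUTO_INCREMENT-like keys
scattered = [rng.randrange(n) for _ in range(n)]   # random keys, e.g. a hash id

seq_misses = cache_misses(sequential)
rand_misses = cache_misses(scattered)
print("sequential misses:", seq_misses)
print("random misses:    ", rand_misses)
```

Sequential keys keep hitting the most recent page, so each page is loaded once. Random keys, such as the BINARY(16) id behind the UNIQUE KEY in the events table, touch pages all over the index, and in this model most inserts miss once the index outgrows the cache.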
Good luck!
Your indexes may just need to be analyzed and optimized during the insert; they gradually get out of shape as you go along. The other option, of course, is to disable indexes entirely while you're inserting and rebuild them later, which should give more consistent performance.
Great link about insert speed.
ANALYZE. OPTIMIZE
Verifying that the insert doesn't violate a key constraint takes some time, and that time grows as the table gets larger. If you're interested in flat out performance, using LOAD DATA INFILE will improve your insert speed considerably.

MySQL vs SQLite + UNIQUE Indexes

For reasons that are irrelevant to this question, I'll need to run several SQLite databases instead of the more common MySQL for some of my projects. I would like to know how SQLite compares to MySQL in terms of speed and performance regarding disk I/O (the database will be hosted on a USB 2.0 pen drive).
I've read the Database Speed Comparison page at http://www.sqlite.org/speed.html and I must say I was surprised by the performance of SQLite, but since those benchmarks are a bit old I was looking for a more up-to-date benchmark (SQLite 3 vs MySQL 5). Again, my main concern is disk performance, not CPU/RAM.
Also, since I don't have that much experience with SQLite, I'm curious whether it has anything similar to the TRIGGER (on update, delete) events in the InnoDB MySQL engine. I also couldn't find any way to declare a field as being UNIQUE like MySQL has, only PRIMARY KEY. Is there anything I'm missing?
As a final question I would like to know if a good (preferably free or open source) SQLite database manager exists.
A few questions in there:
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference. There might be a few small things, but I think it's mostly just whether the database can read/write data as fast as your application wants it to. Since you'd be using the same amount of data with either MySQL or SQLite, I'd think it won't change much.
SQLite does support triggers: CREATE TRIGGER Syntax
SQLite does support UNIQUE constraints: column constraint definition syntax.
To manage my SQLite databases, I use the Firefox Add-on SQLite Manager. It's quite good, does everything I want it to.
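As a quick check of the trigger and UNIQUE points above, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table, column, and trigger names are hypothetical and chosen only for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE          -- column-level UNIQUE constraint
    );
    CREATE TABLE audit_log (user_id INTEGER, action TEXT);

    -- Triggers that fire on UPDATE and DELETE.
    CREATE TRIGGER users_after_update AFTER UPDATE ON users
    BEGIN
        INSERT INTO audit_log VALUES (NEW.id, 'update');
    END;
    CREATE TRIGGER users_after_delete AFTER DELETE ON users
    BEGIN
        INSERT INTO audit_log VALUES (OLD.id, 'delete');
    END;
""")

conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
try:
    # A second row with the same email should violate the UNIQUE constraint.
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")
log = conn.execute(
    "SELECT user_id, action FROM audit_log ORDER BY rowid").fetchall()
print("duplicate rejected:", duplicate_rejected)
print("audit log:", log)
```

The duplicate insert raises sqlite3.IntegrityError, and the audit table records one row per trigger firing, so both features behave much like their MySQL counterparts for this simple case.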
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference.
In MySQL/MyISAM the data is stored unordered, so range reads on the primary key will theoretically need to issue several HDD seek operations.
In MySQL/InnoDB the data is sorted by primary key, so range reads on the primary key will be done using one disk seek operation (in theory).
To sum that up:
MyISAM - data is written to the HDD unordered. Slow primary key range reads if the primary key is not an AUTO_INCREMENT unique field.
InnoDB - data is ordered, which is bad for flash drives (as data needs to be re-ordered after an insert, meaning additional writes). Very fast for primary key range reads, slow for writes.
InnoDB is not suitable for flash memory: seeks are very fast (so you won't get much benefit from reordering the data), and the additional writes needed to maintain the order are damaging to flash memory.
MyISAM vs InnoDB makes a huge difference for conventional and flash drives (I don't know about SQLite), but I'd rather use MySQL/MyISAM.
I actually prefer using SQLiteSpy http://www.portablefreeware.com/?id=1165 as my SQLite interface.
It supports things like REGEXP which can come in handy.