Understand Sharding in MYSQL cluster with different storage engines


After studying MySQL, I learned that there are two popular types of cluster: InnoDB Cluster and NDB Cluster. What I want to discuss is sharding.
InnoDB Cluster does not really distribute data by partitioning it across the nodes. It only partitions the data locally (each node holds the same copy of the data, kept in sync by replication), whereas NDB Cluster does distribute it. Furthermore, the downside of InnoDB Cluster is application-level partitioning, which means the application has to decide which PARTITION to use.
e.g. SELECT * FROM table PARTITION (p1).
Do I understand it right?

Short Answer: InnoDB Cluster does not provide sharding. (That is, splitting table(s) across multiple servers.) NDB does.
Long Answer:
For any "ordinary" database, simply use InnoDB. Perhaps only 1% of MySQL users "need" NDB. Don't even consider it until you have discussed your application with someone familiar with both NDB and InnoDB.
Perhaps only 1% of InnoDB users ever "need" PARTITIONing. When I encounter that in this forum, I usually spend time explaining why they would actually be better off without Partitioning. Again, let's hear what your application is.
"Partitioning" is often confused with "Sharding". For MySQL, Sharding, not partitioning, involves putting different rows on different physical servers. Sharding is also a 1% feature. Again, let's discuss whether it is even relevant.
I am happy to discuss any of the above in more detail, but only in a more focused context.
In general, it is best to prototype in InnoDB and grow the dataset until you can see a real need for NDB / Partitioning / Sharding. By then, you will have a better feel for which you need and how to do it (server topology, partition / shard key, etc.).

You seem to have the idea that you must specify the partition in a query:
SELECT * FROM table PARTITION (p1);
This is not required. One of the features of partitioning is that if it can infer which partition to read from the logic of your query, it does that automatically.
Suppose your table were partitioned by the created_at column. A query that references a specific date in that column would know which partition to access, without you needing to specify it in the table hint syntax.
SELECT * FROM table WHERE created_at = '2020-11-28';
Which partition it accesses depends on the way you defined the partitioning when you created the table. But it is deterministic, as long as your search condition references the column used as the partition key. See https://dev.mysql.com/doc/refman/8.0/en/partitioning-pruning.html to read more about this.
If you run a query that does not reference the partition key column, then it cannot make this inference. Say you partitioned by created_at but you ran this query:
SELECT * FROM table WHERE user_id = 12345;
Rows for that user_id could occur in any or even all of the partitions. There's no way the partition engine can guess which partitions contain the matching rows without reading the partitions. So that's what it does — it reads all of the partitions.
But if you somehow knew that you are only interested in the rows in partition p1, that's when you would specify it in your query as you showed.
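To make pruning concrete, here is a minimal Python sketch of the decision the optimizer makes. It is not MySQL's implementation; it assumes a hypothetical table RANGE-partitioned on created_at with made-up yearly partitions p2019, p2020 and p2021. A condition on the partition key narrows the read to one partition, while a condition on user_id forces a read of all of them.

```python
from datetime import date

# Hypothetical RANGE partitioning on created_at: (partition name, exclusive upper bound).
PARTITIONS = [
    ("p2019", date(2020, 1, 1)),
    ("p2020", date(2021, 1, 1)),
    ("p2021", date(2022, 1, 1)),
]

def partition_for(created_at):
    """Return the single partition whose range contains created_at."""
    for name, upper in PARTITIONS:
        if created_at < upper:
            return name
    raise ValueError(f"no partition for value {created_at!r}")

def partitions_to_read(where):
    """Prune when the WHERE clause pins the partition key; otherwise read every partition."""
    if "created_at" in where:
        return [partition_for(where["created_at"])]
    return [name for name, _ in PARTITIONS]

print(partitions_to_read({"created_at": date(2020, 11, 28)}))  # ['p2020']
print(partitions_to_read({"user_id": 12345}))  # ['p2019', 'p2020', 'p2021']
```

Real pruning also handles ranges, IN lists and other operators; the sketch only covers the equality case discussed above.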
You are correct that InnoDB Cluster does not do sharding for you. All nodes have a copy of all data. It's meant as a solution for redundancy, not scalability.
NDB Cluster isn't for sharding either. All data is stored in the same cluster, but the cluster may have multiple data nodes. The purpose of NDB having multiple data nodes is not scalability; it's primarily high availability (HA). As a secondary benefit, it gives you a way to expand storage by adding more nodes.
But if you're not careful about designing your database tables and queries, you might cause it to run queries slower than if you stored all data on the same physical node.
I've seen it happen before: a MySQL user designed their database to run on a single node, then a salesperson told them that NDB Cluster is faster, so the user moved their database to NDB Cluster without any regard to matching their tables and queries to the distributed architecture. The result was that their queries had to gather data from every storage node, and their performance degraded.
This is characteristic of every distributed database architecture.
Sometimes it's referred to as "cross-shard queries" or "fan-out queries." But the basic principle is that you can get scalability only if your query can get its results by visiting only one (or at least a small subset) of the shards. If it has to "fan-out," then you've lost any scalability benefits.
So it requires you to design your tables very carefully, keeping in mind the queries you're going to run against the data.
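The difference between a single-shard query and a fan-out query can be sketched in a few lines of Python. This is only an illustration, not how NDB or any real router works: it assumes four hypothetical shards keyed by user_id with a simple modulo hash, and counts how many shards each query has to visit.

```python
NUM_SHARDS = 4
# Each shard holds a list of row dicts; the shard key is user_id (assumption).
shards = {i: [] for i in range(NUM_SHARDS)}

def shard_for(user_id):
    return user_id % NUM_SHARDS

def insert(row):
    shards[shard_for(row["user_id"])].append(row)

def query(column, value):
    """Return (matching rows, number of shards visited)."""
    if column == "user_id":
        targets = [shard_for(value)]   # shard key known: visit exactly one shard
    else:
        targets = list(shards)         # no shard key in the condition: fan out to all
    rows = [r for s in targets for r in shards[s] if r[column] == value]
    return rows, len(targets)

for uid in range(100):
    insert({"user_id": uid, "country": "NZ" if uid % 10 == 0 else "US"})

rows, visited = query("user_id", 42)
print(len(rows), visited)   # 1 1
rows, visited = query("country", "NZ")
print(len(rows), visited)   # 10 4
```

The second query returns few rows but still touches every shard, which is exactly the cost that grows as shards are added.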

Related

Improving MySQL performance on RDS by partitioning

I am trying to improve the performance of some large tables (potentially millions of records) in a MySQL 8.0.20 DB on RDS.
Scaling up DB instance and IOPS is not the way to go, as it is very expensive (the DB is live 24/7).
Proper indexes (including composite ones) do already exist to improve the query performance.
The DB is mostly read-heavy, with occasional massive writes - when these writes happen, reads can be just as massive at the same time.
I thought about doing partitioning. Since MySQL doesn't support vertical partitioning, I considered doing horizontal partitioning - which should work very well for these large tables, as they contain activity records from dozens/hundreds of accounts, and storing each account's records in a separate partition makes a lot of sense to me.
But these tables have foreign key constraints, which rules out MySQL's horizontal partitioning. From Restrictions and Limitations on Partitioning:
Foreign keys not supported for partitioned InnoDB tables. Partitioned tables using the InnoDB storage engine do not support foreign keys. More specifically, this means that the following two statements are true:
No definition of an InnoDB table employing user-defined partitioning may contain foreign key references; no InnoDB table whose definition contains foreign key references may be partitioned.
No InnoDB table definition may contain a foreign key reference to a user-partitioned table; no InnoDB table with user-defined partitioning may contain columns referenced by foreign keys.
What are my options, other than doing "sharding" by using separate tables to store activity records on a per account basis? That would require a big code change to accommodate such tables. Hopefully there is a better way, that would only require changes in MySQL, and not the application code. If the code needs to be changed - the less the better :)

storing each account's records in a separate partition makes a lot of sense to me
Instead, have the PRIMARY KEY start with acct_id. This provides performance at least as good as PARTITION BY acct_id, saves disk space, and "clusters" an account's data together for "locality of reference".
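Why a PRIMARY KEY that starts with acct_id gives locality can be modeled with a sorted list: InnoDB stores rows in PRIMARY KEY order (the clustered index), so all rows for one account are contiguous, and a lookup is one seek plus a short range read. This Python sketch uses bisect over (acct_id, activity_id) tuples as a stand-in for the clustered B-tree; activity_id is a hypothetical second PK column.

```python
import bisect

# Rows kept sorted by the composite key (acct_id, activity_id), like a clustered index.
rows = sorted(
    (acct, act_id)
    for acct in range(1, 6)          # 5 hypothetical accounts
    for act_id in range(1, 1001)     # 1000 activity rows each
)

def rows_for_account(acct_id):
    """One 'seek' to the first key for acct_id, then a contiguous range read."""
    lo = bisect.bisect_left(rows, (acct_id,))      # (3,) sorts before (3, 1)
    hi = bisect.bisect_left(rows, (acct_id + 1,))  # first key of the next account
    return rows[lo:hi]

result = rows_for_account(3)
print(len(result), result[0], result[-1])   # 1000 (3, 1) (3, 1000)
```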
The DB is mostly read-heavy
Replicas allow 'infinite' scaling of reads. But if you are not overloading the single machine now, there may be no need for this.
with occasional massive writes
Let's discuss techniques to help with that. Please explain what those writes entail -- hourly/daily/sporadic? replace random rows / whole table / etc? keyed off what? Etc.
Proper indexes (including composite ones) do already exist to improve the query performance.
Use the slowlog (with long_query_time = 1 or lower) to verify. Use pt-query-digest to find the top one or two queries. Show them to us -- we can help you "think out of the box".
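The core idea behind a tool like pt-query-digest is to normalize each query into a "fingerprint" (literals replaced by placeholders) and rank fingerprints by total time. The sketch below shows only that idea and is a much-simplified stand-in for the real tool; the (sql, seconds) input pairs are hypothetical rather than parsed from an actual slow log.

```python
import re
from collections import defaultdict

def fingerprint(sql):
    """Crude normalization: lowercase, replace string/number literals with ?, squeeze spaces."""
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)   # quoted string literals
    s = re.sub(r"\b\d+\b", "?", s)             # standalone integer literals
    s = re.sub(r"\s+", " ", s)
    return s

def top_queries(entries, n=2):
    """entries: iterable of (sql, query_time_seconds). Top-n fingerprints by total time."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sql, secs in entries:
        fp = fingerprint(sql)
        totals[fp] += secs
        counts[fp] += 1
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(fp, counts[fp], totals[fp]) for fp in ranked[:n]]

log = [
    ("SELECT * FROM activity WHERE acct_id = 12", 1.5),
    ("SELECT * FROM activity WHERE acct_id = 99", 2.0),
    ("SELECT name FROM users WHERE email = 'a@b.c'", 1.1),
]
for fp, count, total in top_queries(log):
    print(f"{total:6.2f}s  {count}x  {fp}")
```

Grouping by fingerprint is what surfaces a query that is individually fast but runs thousands of times.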
read-heavy
Is the working set size less than innodb_buffer_pool_size? That is, are you CPU-bound and not I/O-bound?
More on PARTITION
PRIMARY KEY(acct_id, ..some other columns..) orders the data primarily on acct_id and makes this efficient: WHERE acct_id=123 AND ....
PARTITION BY .. (acct_id) -- A PARTITION is implemented as a separate "table". "Partition pruning" is the act of deciding which partition(s) are needed for the query. So WHERE acct_id=123 AND ... will first do that pruning, then look for the row(s) in that "table" to handle the AND .... Hopefully, there is a good index (perhaps the PRIMARY KEY) to handle that part of the filtering.
The pruning sort of takes the place of one level of the BTree. It is hard to predict which will be slower or faster.
Note that when partitioning by, say, acct_id, it is usually not efficient to start the index with that column. (However, it would still need to appear later in the PK.)
Big Deletes
There are several ways to do a "big delete" while minimizing the impact on the system. Partitioning by date is optimal but does not sound viable for your type of data. Check out the others listed here: http://mysql.rjweb.org/doc.php/deletebig
Since you say that the deletion is usually less than 15%, the "copy over what needs to be kept" technique is not applicable either.
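A common technique for big deletes is to walk the PRIMARY KEY in chunks, so each statement holds locks only briefly and replication does not stall behind one huge transaction. Here is a rough sketch using Python's built-in sqlite3 as a stand-in for MySQL; the activity table, its integer id and created columns, and the chunk size are all assumptions for illustration.

```python
import sqlite3

def delete_in_chunks(conn, cutoff, chunk_size=1000):
    """Delete rows with created < cutoff, one PK range of up to chunk_size rows at a time."""
    deleted = 0
    last_id = 0  # assumption: positive integer PRIMARY KEY named id
    while True:
        # Upper PK bound of the next chunk: the chunk_size-th id after last_id.
        row = conn.execute(
            "SELECT id FROM activity WHERE id > ? ORDER BY id LIMIT 1 OFFSET ?",
            (last_id, chunk_size - 1),
        ).fetchone()
        if row is None:
            # Fewer than chunk_size rows remain: finish the tail in one statement.
            cur = conn.execute(
                "DELETE FROM activity WHERE id > ? AND created < ?", (last_id, cutoff))
        else:
            cur = conn.execute(
                "DELETE FROM activity WHERE id > ? AND id <= ? AND created < ?",
                (last_id, row[0], cutoff))
        conn.commit()  # keep each transaction short
        deleted += cur.rowcount
        if row is None:
            return deleted
        last_id = row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, created INTEGER)")
conn.executemany("INSERT INTO activity VALUES (?, ?)",
                 [(i, i % 100) for i in range(1, 10001)])
conn.commit()
n = delete_in_chunks(conn, cutoff=10)
print(n)  # 1000
```

Ranging on the PK, rather than repeating "DELETE ... LIMIT 1000" on an unindexed condition, avoids rescanning the already-visited part of the table on every chunk.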

Before sharding or partitioning, first analyze your queries to make sure they are as optimized as you can make them. This usually means designing indexes specifically to support the queries you run. You might like my presentation How to Design Indexes, Really (video).
Partitioning isn't as much a solution as people think. It has many restrictions, including the foreign key issue you found. Besides that, it only improves queries that can take advantage of partition pruning.
Also, I've done a lot of benchmarking of Amazon RDS for my current job and also a previous job. RDS is slow. It's really slow. It uses remote EBS storage, so it's bound to incur overhead for every read from storage or write to storage. RDS is just not suitable for any application that needs high performance.
Amazon Aurora is significantly better on latency and throughput. But it's also very expensive. The more you use it, the more I/O requests you make, and they charge extra for those. For a busy app, you end up spending as much as you did for RDS with high provisioned IOPS.
The only way I found to get high performance in the cloud is to forget about managed databases like RDS and Aurora, and instead install and run your own instance of MySQL on an EC2 instance with locally attached NVMe storage. This means the i3 family of EC2 instances. But local storage is ephemeral instance storage, so if the instance restarts, you lose your data. So you must add one or more replicas and have a failover plan.
If you need an OLTP database in the cloud, and you also need top-tier performance, you either have to spend $$$ for a managed database, or else you need to hire full-time DevOps and DBA staff to run it.
Sorry to give you the bad news, but the TANSTAAFL adage remains true.

Database servers, partitions and instances

In MySQL (and PostgreSQL), what exactly constitutes a DB instance and a DB partition?
For example, do different DB partitions need to necessarily live on different database instances? Or can a single DB instance manage multiple partitions? If the latter, what's the point of calling it a "partition"? Would the DB have any knowledge of it in this case?
Here's a quote from a document describing a system design from an online course:
How can we plan for the future growth of our system?
We can have a large number of logical partitions to accommodate future data growth, such that in the beginning, multiple logical partitions reside on a single physical database server. Since each database server can have multiple database instances on it, we can have separate databases for each logical partition on any server. So whenever we feel that a particular database server has a lot of data, we can migrate some logical partitions from it to another server. We can maintain a config file (or a separate database) that can map our logical partitions to database servers; this will enable us to move partitions around easily. Whenever we want to move a partition, we only have to update the config file to announce the change.

These terms are confusing, misused, and inconsistently defined.
For MySQL:
A Database has multiple definitions:
A "schema" (as used by other vendors/standards). This is a collection of tables. There are one or more "databases" in an instance.
The instance. You should use "server" or "database server" to be clearer.
The data. "Dataset" might be a better term.
An instance refers to a copy of mysqld running on some machine somewhere.
You can have multiple instances on a single piece of hardware. (Rare)
You can have multiple instances on a single piece of hardware, with the instances in different VMs or Dockers. (handy for testing)
Usually "instance" refers to one server with one copy of MySQL on it. (Typical for larger-scale situations)
A PARTITION is a specific way to lay out a table (in a database).
It is seen in CREATE TABLE (...) PARTITION BY ....
It is a "horizontal" split of the data, often by date, but could be by some other 'column'.
It has no direct impact on performance, making it rarely useful.
Sharding is not implemented in MySQL, but can be done on top of MySQL.
It is also a "horizontal" split of the data, but in this case across multiple "instances".
The use case is, for example, social media where there are millions of "users" that are mostly handled by themselves. That is, most of the queries focus on a single slice of the data, hence it is practical to put a bunch of users on one server and do all those queries there.
It can be called "horizontal partitioning" but should not be confused with PARTITIONs of a table.
Vertical partitioning is where some columns are pulled out of a table and put into a parallel table.
Both tables would (normally) have the same PRIMARY KEY, thereby facilitating JOINs.
Vertical partitioning would (normally) be done only in a single "instance".
The purposes include splitting off big TEXT/BLOB columns and splitting off optional columns (using LEFT JOIN to get NULLs).
Vertical partitioning was somewhat useful in MyISAM, but is rarely useful in InnoDB, since that engine already does something similar automatically (it can store large TEXT/BLOB values off-page).
Replication and Clustering
Multiple instances contain the same data.
Used for "High Availability" (HA).
Used for scaling out reads.
Orthogonal to partitioning or sharding.
Does not make sense to have the instances on the same server (except for testing/experimenting/staging/etc).
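The vertical-partitioning pattern described above (the same PRIMARY KEY in both tables, LEFT JOIN to bring optional columns back as NULLs) can be sketched with Python's built-in sqlite3 standing in for MySQL. The users and user_profile tables and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Frequently used, narrow columns.
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Big or optional columns split off into a parallel table with the same PRIMARY KEY.
    CREATE TABLE user_profile (id INTEGER PRIMARY KEY REFERENCES users(id), bio TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO user_profile VALUES (1, 'A long biography.');
""")

# LEFT JOIN on the shared key; users without a profile row get NULL (None in Python).
rows = conn.execute("""
    SELECT u.id, u.name, p.bio
    FROM users u LEFT JOIN user_profile p ON p.id = u.id
    ORDER BY u.id
""").fetchall()
print(rows)  # [(1, 'alice', 'A long biography.'), (2, 'bob', None)]
```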

Partitions, in terms of MySQL and PostgreSQL feature set, are physical segmentations of data. They exist within a single database instance, and are used to reduce the scope of data you're interacting with at a particular time, to cope with high data volume situations.
The document you're quoting from is speaking of a more abstract concept of a data partition at the system design level.
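The scheme in the quoted design document (many logical partitions plus a config map from logical partition to physical server) is application-level bookkeeping rather than a MySQL feature. A minimal Python sketch, assuming a hypothetical 64 logical partitions, a CRC32 hash of the key, and made-up server names db1 to db3:

```python
import zlib

NUM_LOGICAL = 64  # fixed up front and never changed, so keys never re-hash

# The "config file": logical partition -> database server (hypothetical hostnames).
partition_map = {p: ("db1" if p < 32 else "db2") for p in range(NUM_LOGICAL)}

def logical_partition(key):
    """Stable hash of the key onto one of NUM_LOGICAL logical partitions."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_LOGICAL

def server_for(key):
    return partition_map[logical_partition(key)]

def move_partition(p, new_server):
    """After copying the partition's data, 'announce' the move by updating the map."""
    partition_map[p] = new_server

key = "user:12345"
p = logical_partition(key)
print(p, server_for(key))
move_partition(p, "db3")
print(p, server_for(key))   # same logical partition, now served by db3
```

CRC32 is used instead of Python's built-in hash() because the latter is randomized per process for strings, which would break a mapping that must stay stable across application restarts.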

MySQL sharding and partition in distributed system

I read this from MySQL:
"Unlike other sharded databases, users do not lose the ability to perform JOIN operations, sacrifice ACID-guarantees or referential integrity (Foreign Keys) when performing queries and transactions across shards."
My understanding is this, when choosing between SQL and NoSQL:
You choose NoSQL for easy horizontal scaling (sharding and partitioning), for example when you have more data than a single database can hold, but you sacrifice transaction-level ACID and database-level joins.
You choose SQL for ACID guarantees and database joins, but you sacrifice easy horizontal scaling. (You can add another layer on top of MySQL to handle partitioning and sharding yourself, but you will still lose ACID and joins if you do that.)
But the statement above presents MySQL as a "perfect" database that scales while keeping the benefits of an SQL database. Did I miss anything here, or is it just advertising?
Also, I can't find any information about what MySQL's sharding architecture looks like.

As already replied, the excerpt is about MySQL Cluster (NDB).
MySQL Cluster stores the data in a set of NDB data nodes that can be accessed from any MySQL Server connected to the NDB Cluster.
NDB uses transactions to update data and follows the ACID principle, with some special optimisations around the D. We provide Network Durable, meaning that the transaction is committed in memory on all live replicas before the commit is sent to the application. It also becomes consistently durable on durable media on all live replicas within about 1 second.
The data nodes are grouped into node groups (more or less a synonym for shard). All nodes in one node group contain all the data in that node group. As long as one node in each node group is alive, the cluster is alive.
Transactions can span all node groups (shards). It is possible to perform join operations that span all node groups (shards). The join operations are executed by the MySQL Server, but many joins are pushed down to the NDB data nodes so that they are automatically parallelised.
There are a number of base access methods:
1) Primary key access
2) Unique key access (== 2 primary key accesses)
3) Partition pruned scan access (the partition key is provided in the condition). This can be either an ordered index scan or a full scan. It will only scan one partition of the table.
4) Ordered index scan. This scan will scan all partitions in parallel using an ordered index.
5) Full table scan. This scan will scan all partitions in the table and check each row.
All of those access types can have conditions pushed down that are evaluated in the data nodes while accessing the data.
So with MySQL Cluster you get SQL and ACID in a sharded system.
Whether it is appropriate for your needs depends, as usual, on your use case.
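A toy model of the node-group rules described above: rows are assigned to a node group (shard) by hashing the partition key, every node in a group holds that group's data, and the cluster survives as long as each group keeps at least one live node. This is only a Python sketch; NDB's real hashing and partition layout differ, and the node ids are made up.

```python
import zlib

# Two node groups, each with two replicas (hypothetical node ids).
NODE_GROUPS = {0: {"node1", "node2"}, 1: {"node3", "node4"}}

def node_group_for(partition_key):
    """Assign a row to a node group by hashing its partition key (illustrative hash only)."""
    return zlib.crc32(str(partition_key).encode()) % len(NODE_GROUPS)

def cluster_alive(live_nodes):
    """Alive iff every node group still has at least one live member."""
    return all(members & live_nodes for members in NODE_GROUPS.values())

all_nodes = {"node1", "node2", "node3", "node4"}
print(cluster_alive(all_nodes - {"node1"}))            # True: group 0 still has node2
print(cluster_alive(all_nodes - {"node1", "node3"}))   # True: one survivor per group
print(cluster_alive(all_nodes - {"node1", "node2"}))   # False: group 0 is gone
```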

The quote you excerpt is from the marketing copy for MySQL NDB Cluster, which is not the same as plain MySQL.
MySQL NDB Cluster is a distributed database built primarily for high availability by making every component redundant. The storage is distributed, and you can have multiple mysqld instances that apply SQL operations to the data on many storage nodes.
But there are disadvantages too. NDB Cluster is more efficient when you do queries for individual rows by primary key (sounds a bit like a distributed key-value store like Cassandra, right?).

MyISAM vs InnoDB for BI / batch query performance (ie, _NOT_ OLTP)

Sure, for a transactional database InnoDB is a slam dunk. MyISAM doesn't support transactions or row-level locking.
But what if I want to do big nasty batch queries that touch hundreds of millions of rows?
Are there areas where MyISAM has a relative advantage over InnoDB?
E.g., one (minor) advantage that I know of: "select count(*) from my_table;" MyISAM knows the answer to this instantly, whereas InnoDB may take a minute or more to make up its mind.
--- Dave

MyISAM scales better with very large datasets. InnoDB outperforms MyISAM in many situations until it can't keep the indexes in memory; then performance drops drastically.
MyISAM also supports MERGE tables, which is sort of a "poor man's" sharding. You can add/remove very large sets of data instantaneously. For example, if you have 1 table per business quarter, you can create a merge table of the last 4 quarters, or a specific year, or any range you want. Rather than exporting, deleting and importing to shift data around, you can just redeclare the underlying MERGE table contents. No code change required since the name of the table doesn't change.
MyISAM is also better suited for logging, when you are only adding to a table. Like MERGE tables, you can easily swap out (rotate "logs") a table and/or copy it.
You can copy the DB files associated with a MyISAM table to another computer and just put them in the MySQL data directory, and MySQL will automatically add them to the available tables. You can't do that with InnoDB; you need to export/import.
These are all specific cases, but I've taken advantage of each one a number of times.
Of course, with replication, you could use both. A table can be InnoDB on the master and MyISAM on the slave. The structure has to be the same, not the table type. Then you can get the best of both. The BLACKHOLE table type works this way.

Here's a great article comparing various performance points http://www.mysqlperformanceblog.com/2007/01/08/innodb-vs-myisam-vs-falcon-benchmarks-part-1/ - you'll have to evaluate this from quite a few angles, including how you intend to write your queries and what your schema looks like. It's simply not a black and white question.

According to this article, as of v5.6, InnoDB has been developed to the point where it is better in all scenarios. The author is probably a bit biased, but it clearly outlines which tech is seen as the future direction of the platform.

MySQL sharding approaches?

What is the best approach for sharding MySQL tables?
The approaches I can think of are :
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?

The best approach for sharding MySQL tables is to not do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions no longer fits onto a single database server instance, the reason being either writes or reads.
The write case is either a) the frequency of writes is permanently overloading this server's disks, or b) there are so many writes going on that replication permanently lags in this replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Only when you have to shard do you do it.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, on nodes A and B, and are joining data from it against data that is on nodes B and C. You start writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative, but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
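What such an application-side hash join looks like in practice: fetch rows from each shard separately, build a hash table on one side, then probe it with the other. A Python sketch; the two lists stand in for result sets fetched from node A and node B, and the table and column names are hypothetical.

```python
from collections import defaultdict

# Result of "SELECT id, name FROM users" on node A (hypothetical data).
users_node_a = [(1, "alice"), (2, "bob"), (3, "carol")]
# Result of "SELECT user_id, total FROM orders" on node B.
orders_node_b = [(1, 30), (1, 12), (3, 7), (4, 99)]

def hash_join(build_rows, probe_rows):
    """Inner join build.id = probe.user_id, performed in the application."""
    table = defaultdict(list)
    for uid, name in build_rows:              # build phase: hash the smaller side
        table[uid].append(name)
    out = []
    for uid, total in probe_rows:             # probe phase: look up each row's key
        for name in table.get(uid, []):
            out.append((uid, name, total))
    return out

print(hash_join(users_node_a, orders_node_b))
# [(1, 'alice', 30), (1, 'alice', 12), (3, 'carol', 7)]
```

In plain SQL on a single server this is one JOIN the optimizer plans for you; here the application owns the algorithm, the memory for the hash table, and the network round trips to both nodes.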
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster that is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no working API for asynchronous queries.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
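The parallel "ask every node and aggregate results as they arrive" pattern can be sketched with Python's concurrent.futures, using threads instead of forked processes. The per-shard query is simulated with a sleep; query_shard, the shard names, and the days-since-last-login data are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Days since last login per user, spread over three hypothetical shards.
SHARDS = {"A": [5, 120, 200], "B": [3, 95], "C": [91, 10, 400]}

def query_shard(name):
    """Stand-in for running the same SELECT on one shard over the network."""
    time.sleep(0.1)                            # simulated network + query latency
    return [days for days in SHARDS[name] if days >= 90]

def scatter_gather():
    found = []
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        futures = [pool.submit(query_shard, name) for name in SHARDS]
        for fut in as_completed(futures):      # aggregate as each shard answers
            found.extend(fut.result())
    return sorted(found)

start = time.perf_counter()
print(scatter_gather())                        # [91, 95, 120, 200, 400]
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Because the three simulated calls overlap, the elapsed time stays close to one shard's latency instead of the sum of all three; a sequential loop over the shards would grow linearly with the number of nodes, as described above.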
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding also has the advantage that it can be applied relatively easily to an existing codebase, with a number of changes that is not overly large. Booking.com has done it multiple times in the past years and it worked well for them.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.

Application Level sharding: dbShards is the only product that I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
Central lookup server: this is really the same as #2.

Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/

Application level, of course.
The best approach I've ever read is in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you could split your data into many parts and store ~50 parts on each server. This helps you avoid the second biggest problem of sharding, rebalancing: just move some of the parts to the new server and everything will be fine :)
I strongly recommend buying it and reading the "MySQL scaling" part.
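With many small parts per server, rebalancing becomes a matter of moving whole parts rather than re-hashing individual rows. A rough Python sketch of choosing which parts to move when a server is added, assuming a hypothetical 150 parts spread over three servers (about 50 each). The greedy strategy here is just one simple choice, not the book's algorithm.

```python
from collections import Counter

def rebalance(assignment, servers):
    """Move whole parts from the busiest server to the idlest until loads differ by <= 1.

    assignment: dict part_id -> server (updated in place). Returns (part, src, dst) moves.
    """
    moves = []
    while True:
        load = Counter({s: 0 for s in servers})   # include empty servers
        load.update(assignment.values())
        busiest = max(servers, key=lambda s: load[s])
        idlest = min(servers, key=lambda s: load[s])
        if load[busiest] - load[idlest] <= 1:
            return moves
        part = next(p for p, s in assignment.items() if s == busiest)
        assignment[part] = idlest                 # in practice: copy data, then flip the map
        moves.append((part, busiest, idlest))

servers = ["db1", "db2", "db3"]
assignment = {p: servers[p % 3] for p in range(150)}   # 50 parts per server
servers.append("db4")                                   # new, empty server
moves = rebalance(assignment, servers)
print(len(moves), sorted(Counter(assignment.values()).items()))
```

Only the moved parts change servers; every other key keeps its location, which is what keeps rebalancing cheap compared with changing a hash modulus over all rows.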

As of 2018, there seem to be MySQL-native solutions for this. There are actually at least two: InnoDB Cluster and NDB Cluster (there is a commercial and a community version of it).
Since most people who use MySQL Community Edition are more familiar with the InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and is based on MySQL Router for different routing/load-balancing options.
The syntax for creating your tables would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
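MySQL's documented rule for HASH partitioning is that a row goes to partition MOD(expr, N), where N is the number of partitions. Note that the statement above has no PARTITIONS clause, so MySQL would create just one partition; a real table would add something like PARTITIONS 4. This Python sketch assumes 4 partitions (named p0 to p3, MySQL's default names) and mirrors only that MOD rule.

```python
from datetime import datetime

def hash_partition(col3, num_partitions):
    """Partition index for PARTITION BY HASH(YEAR(col3)): MOD(YEAR(col3), num_partitions)."""
    return col3.year % num_partitions

for ts in (datetime(2018, 5, 1), datetime(2019, 1, 1), datetime(2021, 7, 4)):
    print(ts.year, "-> p%d" % hash_partition(ts, 4))
# 2018 -> p2
# 2019 -> p3
# 2021 -> p1
```

Because every row from the same year lands in the same partition, this particular hash expression spreads data only across years, not within one.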
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.

Shard-Query is an OLAP-based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross-shard joins or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has a minimum response time on the order of 100ms, even for simple queries, so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed source solutions include ScaleDB and DBShards. Open source OLTP solutions include JetPants, Cubrid and Flock/Gizzard (Twitter infrastructure).

Do you know of any interesting projects or tools in this area?
As of 2022, here are two tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)

You can consider this middleware: ShardingSphere
