I've read much of the MySQL Cluster docs and some tutorials, yet some things are still unclear to me. The main ones right now are:
When a data node restarts (crashes and comes up again), will its data still be available? Will updates/additions work as usual? Will it "sync"?
Does a cluster work faster than a standalone server? In particular, I update rows many times, but one at a time, so network latency might have an impact on performance. Is there any pattern I can follow to make things faster, such as adding more SQL nodes or adding more data nodes?
Regarding question #2, an update of a row uses the following syntax:
UPDATE db_accounts.tbl_items SET items=items+%lld WHERE id_account=%u
"id_account" is an index (unique).
MySQL Cluster is an in-memory database (although some columns can be stored on disk, indexed columns cannot). If the plug gets pulled, your data goes away. The recovery process for a node re-joining the cluster is that it pulls the data it lost from a surviving node (requiring good fast links between nodes) and then carefully applies replication events until it catches up and can actively participate. If a recent backup is available, it can be rebuilt from that instead of another node, but the principle is the same: the node has to be repopulated with data from scratch.
MySQL Cluster is basically a distributed hash table. The NDB node that holds a particular row of data is determined by a hash algorithm applied to the primary key. Performance increases by adding nodes, assuming that your data spreads nicely across the nodes. Performance can be badly affected if queries have to touch multiple nodes - i.e. complex joins - but it is lightning fast for retrieving a specific row given its primary key.
Obviously, given that the nodes are distributed, a slow or congested network will badly affect performance.
Even if your MySQL Cluster table is in-memory, by default any writes are asynchronously checkpointed to disk (this can be turned off on a per-table basis).
If the entire Cluster fails (e.g. power to the data center is lost), then when you bring it back up the data will be restored from those disk checkpoints. The downside is that, because the checkpoints were written asynchronously, you may be missing the last handful of updates.
If a single data node fails then, as well as recovering from its local disk copy, it catches up by applying the latest updates from its peer data node.
To further increase the availability of your data, you can use MySQL asynchronous replication to a second site (on the other side of the world if required).
There are a few large tables in one of a customer's databases (each table is ~50M rows and not too wide). The intent is to infrequently read these tables in full. As there are no reasonable CDC indices present, the plan is to read the tables by querying them:
SELECT * FROM large_table;
The reads will be performed using a JDBC driver. With the following fetch configuration, the intent is to read the data approximately one record at a time (it may take a significant amount of time) so that the client code is never overwhelmed.
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
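For reference, a rough sketch of what the complete streaming read could look like (the connection details are placeholders; this relies on MySQL Connector/J's streaming behaviour triggered by the fetch size above):

import java.sql.*;

// Sketch of the full streaming read; jdbcUrl, user and password are
// placeholders. With MySQL Connector/J, a fetch size of Integer.MIN_VALUE on
// a forward-only, read-only statement makes the driver stream rows one at a
// time instead of buffering the whole result set in client memory.
static void streamLargeTable(String jdbcUrl, String user, String password) throws SQLException {
    try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
         PreparedStatement stmt = connection.prepareStatement(
                 "SELECT * FROM large_table",
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(Integer.MIN_VALUE);    // request row-by-row streaming
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                // ... hand each row to the client code here, one at a time ...
            }
        }
    }
}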
I was going through the execution path of a query in High Performance MySQL; however, some questions seem unanswered:
With no temporary tables being explicitly created and the query cache not being used, how are the streamed reads tracked on the server?
Is any temporary data created (in main memory or files on disk) whatsoever? If so, where is it created and how much?
If temporary data is not created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? In case there are several such queries running on the server, are the earliest "tracked" files purged in favor of queries submitted more recently?
PS: I want to understand the effect of this approach on the MySQL server (I'm not saying there aren't better ways of reading the tables).
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the JDBC details for specifying which.
You may want to page through the table. If so, don't use OFFSET; use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
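In code, that keyset pattern might look roughly like this (a sketch; the table, column names and the connection parameter are assumptions, not part of the original question):

import java.sql.*;

// Sketch of keyset pagination: walk large_table in PRIMARY KEY order,
// remembering the last id seen instead of using OFFSET.
static void pageThroughTable(Connection connection) throws SQLException {
    long lastId = 0;
    boolean more = true;
    while (more) {
        try (PreparedStatement stmt = connection.prepareStatement(
                "SELECT id, payload FROM large_table WHERE id > ? ORDER BY id LIMIT 1000")) {
            stmt.setLong(1, lastId);
            try (ResultSet rs = stmt.executeQuery()) {
                more = false;
                while (rs.next()) {
                    lastId = rs.getLong("id");   // remember where we left off
                    more = true;
                    // ... process the row ...
                }
            }
        }
    }
}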
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (eg power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.
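If you want to see that snapshot behaviour for yourself, here is a rough sketch of an experiment (the table, column and connection names are invented; it assumes InnoDB's default REPEATABLE READ isolation and that connB is in autocommit mode):

import java.sql.*;

// Sketch: demonstrate the consistent-read snapshot with two connections to
// the same InnoDB database.
static void snapshotDemo(Connection connA, Connection connB) throws SQLException {
    connA.setAutoCommit(false);                  // open a transaction on connection A
    try (Statement a = connA.createStatement();
         Statement b = connB.createStatement()) {
        ResultSet r1 = a.executeQuery("SELECT val FROM demo WHERE id = 1");
        r1.next();
        long before = r1.getLong(1);             // the first read establishes the snapshot

        b.executeUpdate("UPDATE demo SET val = val + 10 WHERE id = 1"); // committed immediately

        ResultSet r2 = a.executeQuery("SELECT val FROM demo WHERE id = 1");
        r2.next();
        long after = r2.getLong(1);              // still equals 'before': A sees its snapshot
        connA.commit();                          // a new transaction on A would now see the update
        System.out.println(before + " / " + after);
    }
}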
Very soon I will be building a database structure that will contain 2 million rows. Generally no more than 200 rows are queried per minute, and of those 200 it will be the same 10-20 rows being queried repeatedly.
Given the size of the table, I'd like to "store" a queried row somewhere so that any other end users querying that row can get its data more quickly. I then want the row to be accessed from there for a while and put back into the main table once it's no longer in use. I believe this will make access quicker and more efficient.
Using the below schema, I'll provide an example. In this case row 1 has been accessed from the application layer. The application layer queries the "accessed" table to see if the row is there. If it is, it uses this and updates the "accessed" table with any changed data. If it isn't, it is queried from the main large table and dropped into the "accessed" table until the cron runs (say 10 minutes later) when all "accessed" data is copied into the main table and deleted from the accessed table.
http://sqlfiddle.com/#!2/d76f6/2
I'm trying to work out the following:
1) Will this show an increase in efficiency (I would imagine each query against "accessed" instead of the main table will be significantly faster)?
2) What technology should be used for the "accessed" data storage? It's likely the main table will be stored in MariaDB/MySQL; however, I'm happy to run it in flat files, SQLite, a different instance, or keep it within the same instance. I'm open to suggestions that will make this more efficient, and in theory there's no reason the application layer couldn't act as an intermediary between any technologies.
Premature optimization, and an overcomplex design to start with. What you want to implement is a most-frequently-accessed cache. However, the duty of a DBMS is precisely to do this kind of optimization for you; there are already caches at the disk level, the file system level, and the database level. What you are saying is that, even before having the system in place, you already know it is not going to perform as expected.
Maybe you know more than you state in your question, but on the face of it, optimizations should be done afterwards, guided by suitable profiling.
There are a lot of ways to cache data.
In MySQL you can use MEMORY tables, which are much faster than InnoDB or MyISAM tables.
You can use memory-based key-value stores such as Redis or Memcached.
At the application layer you can cache your data on the filesystem.
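As a rough illustration of the application-layer option, here is a cache-aside sketch (an in-process map stands in for Redis/Memcached purely for illustration; the table and column names are invented):

import java.sql.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache-aside lookup: check the cache first, fall back to the
// database on a miss, then populate the cache.
class RowCache {
    private final ConcurrentHashMap<Long, String> cache = new ConcurrentHashMap<>();
    private final Connection db;

    RowCache(Connection db) { this.db = db; }

    String getRow(long id) throws SQLException {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;                              // cache hit
        }
        try (PreparedStatement stmt = db.prepareStatement(
                "SELECT data FROM big_table WHERE id = ?")) {
            stmt.setLong(1, id);
            try (ResultSet rs = stmt.executeQuery()) {
                if (!rs.next()) return null;
                String data = rs.getString("data");
                cache.put(id, data);                    // populate on miss
                return data;
            }
        }
    }
}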
According to the "Guide to Scaling Web Databases with MySQL Cluster", MySQL Cluster 7.3 can achieve 99.999% availability while using synchronous update replication.
This seems to contradict the CAP theorem, which states that perfect availability (99.999% can be seen as such, no?) and consistency are not achievable together in a distributed system.
How would the cluster react to an update if the data node responsible for the replica is not reachable? With synchronous update replication it would have to block, which would affect availability.
The Guide states:
The data within a data node is synchronously replicated to all nodes
within the Node Group. If a data node fails, then there is always at
least one other data node storing the same information.
In the event of a data node failure, the MySQL Server or application
node can use any other data node in the node group to execute
transactions. The application simply retries the transaction and the
remaining data nodes will successfully satisfy the request.
But how can this work if a Node Group consists of two nodes and one crashes (example here)? There would be no node left to replicate an update to, which, as far as I understand, would make the update fail under synchronous update replication. Or is replication simply suspended for as long as there is no node to write a replica to?
With master-master replication, if the connection between the hosts is down and you try to alter data in any database on either host, then to achieve this kind of availability you necessarily break consistency: the hosts are no longer in sync, so the data is not consistent. Please look at the cases below:
Case 1: Getting A and C but not P
For example, if I don't replicate a database, then the whole database lives on a single host. Here we get Consistency and Availability but not Partition tolerance.
Case 2: Getting C and P but not A
For example, if I replicate a database (master-master) and keep one copy on each of two hosts: part P1 is on host H1 and part P2 is on host H2. To get partition tolerance I can cut the connection between H1 and H2. Now, to keep consistency I must not allow anyone to change either P1 or P2, and so we lose Availability.
Case 3: Getting A and P but not C
For example, if I replicate a database (master-master) and keep one copy on each of two hosts: part P1 is on host H1 and part P2 is on host H2. To get partition tolerance I can cut the connection between H1 and H2. Now, to keep availability I allow anyone to change either P1 or P2, and so we lose Consistency.
In your example question, the problem does not involve a partition. A partition means half of the data stays on one node and the other half on another node (it does not have to be a 50/50 split, but the data needs to be spread across several nodes).
Also in your example question, if one of the nodes crashes, the other is still working; hence you have availability. And because one of the nodes is a replica of the other, you should have no problems with consistency.
Just because the update fails, it does not mean the data is not consistent. If you try to access the data from the cluster you will have consistent data, because you cannot retrieve the inconsistent data from the dead node.
In other words, you only have inconsistent data if you query the cluster and the data retrieved is inconsistent.
I have recently been looking at NoSQL solutions for our quite big upcoming database and found that Cassandra is good, but there are very few resources available online about new releases of Cassandra; most of the blogs and articles relate to version 0.6, whereas it has since also implemented support for Hadoop and Hive. MySQL Cluster, on the other hand, is specifically made to run on a horizontally scaled setup using commodity servers.
We have been used to the relational model for years, and moving to Cassandra will require some rewiring of the brain, while the product is still not very mature and the community is not big enough to respond quickly to any particular problem. I have checked the website of DataStax (one of the professional support providers) and their forums are pretty much dead.
So, how do MySQL Cluster and Cassandra compare, putting the relational vs. non-relational question aside?
Though Cassandra is schemaless, it still provides fairly tabular features such as super columns and sub columns, so a record can be searched by multiple column values.
I have also tried my best to find out how Cassandra physically stores updates: for example, when a sub column of a row is edited and quite a big chunk of data is added, how does it physically store that record, and how does it access that record quickly? In MySQL, columns have a fixed length allocated, so this is not a big issue.
Here are some areas where I suspect Cassandra has an advantage:
Excellent support for larger-than-memory data sets
Replication: Cassandra supports arbitrary numbers of fully-distributed replicas instead of just partitioned replicas (so, you don't have to have a number of nodes divisible by your replica count in Cassandra, and there are no corner cases to deal with around primary failover), best-in-class support for multiple datacenters, support for synchronous replication as well as asynchronous (important if you're concerned about full durability), and robust self-healing (hinted handoff, read repair, anti-entropy) to make sure you never have to blow away a backup replica and rebuild it from scratch
No locking during ALTER TABLE, index creation, etc
Substantially simpler and less error-prone administration (compare http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-online-add-node.html and http://wiki.apache.org/cassandra/Operations#Bootstrap). In particular, I would call your attention to how many client or other nodes need to be restarted in the Cassandra scenario: none.
To elaborate on the last point a little: most people who haven't actually run Cassandra on a multi-node cluster don't realize just how well Cassandra has been designed for this. For a two-minute taste, see Jake Luciani's demo.
To answer your physical storage question, the key feature that makes Cassandra writes fast is that they are append-only. That is, Cassandra only ever writes sequential blocks to disk; it doesn't need to do any slow seeks to random disk locations during a write.
When a column is updated, two things happen: the write is appended to the commit log (for failure recovery), and the in-memory Memtable is updated. Once the Memtable is full, it is flushed out to disk as a new SSTable. Thus, the length of the data doesn't matter, since you're not trying to fit it into a fixed-length disk structure.
SSTables are read-only - you never go back and overwrite an old value on an update, you just write new ones. On a read, Cassandra first looks in the Memtable for the key. If it doesn't find it, Cassandra scans the SSTables in order from newest to oldest and stops when it finds the key. This gives you the most recent value.
There are a few optimizations as well. Each SSTable has an associated Bloom filter for its keys, which is a compact probabilistic index that can produce false positives but never false negatives. If the key is not in the Bloom filter, you can safely skip that SSTable as it is guaranteed not to contain the key, although you may occasionally read an SSTable that you didn't have to.
When you get too many SSTables, they are merged together into a bigger one in a process called compaction. Essentially this does a big merge sort on the SSTables. This lets Cassandra reclaim the space for values that have been overwritten or deleted, and defragment rows that were spread across multiple SSTables.
See http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/ and http://wiki.apache.org/cassandra/MemtableSSTable for more information.
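To make the read path concrete, here is a toy model of it (this is only an illustration of the idea, not Cassandra's actual implementation; all names and the flush threshold are invented):

import java.util.*;

// Toy model of the LSM read path described above: check the in-memory
// memtable first, then scan immutable "SSTables" from newest to oldest and
// return the first value found. Real Cassandra also consults Bloom filters
// to skip SSTables that cannot contain the key.
class ToyLsmStore {
    private final NavigableMap<String, String> memtable = new TreeMap<>();
    private final Deque<Map<String, String>> sstables = new ArrayDeque<>(); // newest first

    void write(String key, String value) {
        memtable.put(key, value);                       // writes only touch memory (+ commit log)
        if (memtable.size() >= 1000) {
            sstables.addFirst(new TreeMap<>(memtable)); // flush memtable as a new SSTable
            memtable.clear();
        }
    }

    String read(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (Map<String, String> sstable : sstables) {  // newest to oldest
            v = sstable.get(key);
            if (v != null) return v;                    // most recent value wins
        }
        return null;
    }
}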
First, a disclaimer: I work as part of the MySQL Cluster product team.
If you are looking at Cluster, it would be worth starting with the latest 7.2 Development Release, which includes new capabilities that significantly enhance JOIN performance, as well as a new memcached interface that bypasses the SQL layer:
http://dev.mysql.com/tech-resources/articles/mysql-cluster-labs-dev-milestone-release.html
If you are familiar already with MySQL, then the following documentation highlights differences between InnoDB and the current GA 7.1 release:
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndb-innodb-workloads.html
While these don't provide direct comparisons with Cassandra, they do at least provide the latest information on Cluster on which to base any comparison.
Another option these days is a relational model in Cassandra with playORM: as long as you partition your really, really big tables, you can do joins and all the stuff you are familiar with using its Scalable SQL, like so:
#NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS p(:partId) select p FROM TABLE as p INNER JOIN p.security as s where s.securityType = :type and p.numShares = :shares"),
NOTE: TABLE here is a Trades table and p.security references the Security table. Trades is partitioned, so it can have unlimited partitions; the Security table is smaller, so it is not partitioned, but you can do all the Scalable SQL with joins that you want.
What is the best approach for sharding MySQL tables?
The approaches I can think of are :
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?
The best approach for sharding MySQL tables is not to do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition, and then assign partitions to different hosts (= shard), only when the sum of all these partitions no longer fits onto a single database server instance - the reason for that being either writes or reads.
The write case is either a) the frequency of writes permanently overloading this server's disks, or b) too many writes going on, so that replication permanently lags in this replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Only when you have to shard do you do it.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, spread across nodes A and B, and are joining data from it against data that is on nodes B and C. You start writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
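To make that concrete, the kind of procedural code you end up writing looks roughly like this (a sketch; the shard connections, tables and columns are invented):

import java.sql.*;
import java.util.*;

// Sketch of an application-side hash join across two shards: build a hash map
// from rows on shard A, then probe it with rows from shard B. This is the kind
// of work a single-node optimizer would normally do for you.
static void joinAcrossShards(Connection shardA, Connection shardB) throws SQLException {
    Map<Long, String> usersById = new HashMap<>();
    try (Statement a = shardA.createStatement();
         ResultSet rs = a.executeQuery("SELECT id, name FROM users")) {
        while (rs.next()) {
            usersById.put(rs.getLong("id"), rs.getString("name"));  // build side
        }
    }
    try (Statement b = shardB.createStatement();
         ResultSet rs = b.executeQuery("SELECT user_id, total FROM orders")) {
        while (rs.next()) {
            String name = usersById.get(rs.getLong("user_id"));     // probe side
            if (name != null) {
                // ... emit the joined row (name, total) ...
            }
        }
    }
}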
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than a local situation, and it is more complicated - especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster it is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no API for asynchronous queries that is in working order.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
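Without such an API, a scatter-gather from the application typically looks something like the sketch below (the shard connections and the query are placeholders):

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

// Hypothetical scatter-gather: run the same horizontal query on every shard
// in parallel and aggregate the results as they come in, so total latency is
// roughly that of the slowest shard rather than the sum of all shards.
static List<Long> findStaleAccounts(List<Connection> shards) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(shards.size());
    List<Future<List<Long>>> futures = new ArrayList<>();
    for (Connection shard : shards) {
        futures.add(pool.submit(() -> {
            List<Long> ids = new ArrayList<>();
            try (Statement st = shard.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT id FROM accounts WHERE last_login < NOW() - INTERVAL 90 DAY")) {
                while (rs.next()) ids.add(rs.getLong("id"));
            }
            return ids;
        }));
    }
    List<Long> all = new ArrayList<>();
    for (Future<List<Long>> f : futures) all.addAll(f.get());   // gather
    pool.shutdown();
    return all;
}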
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding has the advantage that it is relatively easily done to an existing codebase, with a number of changes that is not overly large. Booking.com has done it multiple times in the past years and it worked well for them.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
Application Level sharding: dbShards is the only product that I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application-aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
This is the same as #2 really.
Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/
Application level of course.
The best approach I've ever read I found in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you split your data into many parts and store ~50 parts on each server. This helps you avoid the second biggest problem of sharding - rebalancing. Just move some of the parts to a new server and everything will be fine :)
I strongly recommend that you buy it and read the "mysql scaling" part.
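For illustration, that "many small parts" idea could be sketched like this (the part count, mapping and class are my own assumptions, not from the book):

import java.util.*;

// Hypothetical virtual-shard routing: hash the key into one of many small
// "parts" (virtual shards), then look up which server currently owns that
// part. Rebalancing then just means moving a few parts and updating the map.
class VirtualShardRouter {
    private static final int PARTS = 400;                 // e.g. 8 servers * ~50 parts each
    private final Map<Integer, String> partToServer = new HashMap<>();

    VirtualShardRouter(List<String> servers) {
        for (int part = 0; part < PARTS; part++) {
            partToServer.put(part, servers.get(part % servers.size()));
        }
    }

    String serverFor(long primaryKey) {
        int part = (int) ((Long.hashCode(primaryKey) % PARTS + PARTS) % PARTS);
        return partToServer.get(part);
    }

    void moveToNewServer(int part, String newServer) {
        partToServer.put(part, newServer);                // rebalancing = remapping parts
    }
}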
As of 2018, there seems to be a MySQL-native solution for that. There are actually at least two: InnoDB Cluster and NDB Cluster (the latter has a commercial and a community version).
Since most people who use MySQL Community Edition are more familiar with the InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and relies on MySQL Router for different routing/load-balancing options.
The syntax for your table creation would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.
Shard-Query is an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed-source solutions include ScaleDB and dbShards. Open-source OLTP solutions include JetPants, Cubrid, and Flock/Gizzard (Twitter infrastructure).
Do you know of any interesting projects or tools in this area?
As of 2022, here are two tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)
You can also consider this middleware:
shardingsphere