What are the limitations of implementing MySQL NDB Cluster? - mysql

I want to implement NDB Cluster with MySQL Cluster 6, for a very large data set with a minimum of 2 million records.
I want to know whether there are any limitations to implementing NDB Cluster, for example on RAM size, number of databases, or database size.

2 million databases? I assume you meant "rows".
Anyway, concerning limitations: one of the most important things to keep in mind is that NDB/MySQL Cluster is not a general-purpose database. Most notably, join operations, but also subqueries and range operations (queries like: orders created between now and a week ago), can be considerably slower than you might expect. This is in part because the data is distributed across multiple nodes. Although some improvements have been made, join performance can still be very disappointing.
On the other hand, if you need to deal with many (preferably small) concurrent transactions (typically single-row updates, inserts, deletes, and lookups by primary key) and you manage to keep all of your data in memory, then it can be a very scalable and performant solution.
You should ask yourself why you want Cluster. If you simply want the ordinary database you have now, except with added 99.999% availability, then you may be disappointed. Certainly MySQL Cluster can provide you with great availability and uptime, but the workload of your app may not be well suited to the things Cluster is good at. Plus, you may be able to use another high-availability solution to increase the uptime of your otherwise traditional database.
BTW - here's a list of limitations as per the doc: http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations.html
But whatever you do, try out Cluster and see if it's good for you. MySQL Cluster is not "MySQL + 5 nines". You'll find out when you try.

NDB Cluster comes with two types of storage:
1. In-memory storage.
2. Disk storage.
NDB was introduced as an in-memory data store; disk storage (Disk Data tables) was added later, in the MySQL 5.1-based releases.
The current version, 7.5 (MySQL 5.7), supports disk storage. Non-indexed columns of Disk Data tables reside on disk, so their size is limited mainly by the disk space you have available; indexed columns and the indexes themselves are still kept in memory.
Disk Storage configurations - https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-disk-data-symlinks.html
In-memory storage in NDB Cluster is also quite mature, and you can define memory usage in the management node's config.ini file.
Example:
DataMemory=3072M
IndexMemory=384M
For an average table (depending on the data stored in the columns), the total database size should be less than 1 GB, which can easily be configured.
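To sanity-check memory settings like the ones above, here is a rough sizing sketch. It uses a commonly cited rule of thumb for NDB (database size times number of replicas times 1.1, divided by the number of data nodes). The 300-byte average row size, two replicas, and two data nodes are assumptions for illustration only, not figures from the question.

```python
def data_memory_per_node_mb(rows, avg_row_bytes, replicas=2, data_nodes=2, overhead=1.1):
    """Rough per-data-node memory need in MB.

    Rule of thumb: (database size * replicas * 1.1) / number of data nodes.
    All inputs are estimates; measure real row sizes before relying on this.
    """
    db_bytes = rows * avg_row_bytes
    return db_bytes * replicas * overhead / data_nodes / (1024 * 1024)


# Assumed values: 2 million rows of about 300 bytes each (hypothetical).
estimate_mb = data_memory_per_node_mb(rows=2_000_000, avg_row_bytes=300)
print(f"approx. {estimate_mb:.0f} MB per data node")
```

Under these assumptions the estimate comes out well below the 3072M DataMemory setting shown above, but secondary indexes add to it.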
Note: in my own implementation I faced one performance challenge: NDB performance degrades as the number of rows in a table increases.
Under highly concurrent load, read performance degrades as the number of rows grows.
Make sure you avoid full table scans by providing sufficiently selective WHERE clause predicates.
For good performance, define secondary indexes that match your query patterns.
Each secondary index also increases memory consumption, so plan your query patterns and memory resources accordingly.

Related

Why does my MySQL search take so long?

I have a MySQL table with about 150 million rows. When I perform a search with one clause, it takes about 1.5 minutes to get the result. Why does it take so long? I am running Debian in VirtualBox with 2 CPU cores and 4 GB of RAM. I am using MySQL and Apache 2.
I am a bit new to this, so I don't know what more information to provide.
Searches, or rather queries, in databases like MySQL or any other Relational Database Management System (RDBMS) are subject to a number of factors for performance including:
Structure of the WHERE clause and Indexing to support it
Contention for system resources such as Memory and CPU
The amount of data being retrieved and how it is delivered
Some quick wins and strategies for each:
Structure of the WHERE clause and Indexing to support it
The order in which you write the conditions in the WHERE clause does not matter much to MySQL's optimizer; what matters is having an index whose column order matches the way the query filters. If you're searching a large table with SELECT * FROM TABLE WHERE SomeID = 5 AND CreatedDate > '10-01-2015', be sure you have a composite index on SomeID and CreatedDate in the order that makes the most sense. As a rule, put the column that cuts down the results by the biggest margin first, and put equality columns before range columns: if SomeID is highly selective, or likely to match far fewer rows than CreatedDate > '10-01-2015', create the index as (SomeID, CreatedDate).
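A minimal sketch of the composite-index idea, using Python's built-in sqlite3 module as a stand-in because it needs no server. SQLite's planner is not MySQL's, so treat this only as an illustration; on MySQL you would run EXPLAIN on the real table. The table name, column names, and the ISO date literal are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, some_id INTEGER, "
    "created_date TEXT, payload TEXT)"
)
# Equality column first, range column second.
conn.execute(
    "CREATE INDEX idx_orders_someid_created ON orders (some_id, created_date)"
)

# Ask the planner how it would execute the query; the last column of each
# row is the human-readable plan detail.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM orders WHERE some_id = 5 AND created_date > '2015-10-01'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan_rows)
print(plan_text)
```

If the plan text names the index, the query is being served by an index search rather than a full table scan.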
Contention for system resources such as Memory and CPU
Are you using a table that is constantly updated? There are transactional databases (OLTP) and databases meant for analysis (OLAP). If you're hitting a table that is being constantly updated you may be slowing things down for everyone including yourself. Remember you're a citizen in an environment and as such you need to respect the other use cases. This includes knowing a bit about how the system is used, what resources are available and making sure you are mindful of how your queries will affect others.
The amount of data being retrieved and how it is delivered
Even the best query cannot escape the time it takes to get data from one place to another. You can optimize the RDBMS settings and have incredible bandwidth, but many factors, including disk IOPS and network bandwidth, still play into the cost of doing business. Make sure you're using the right transfer protocols, have good disk IOPS, and follow the best practices around MySQL.
Some final thoughts:
If you're using AWS and hosting your database in the cloud, you may consider Amazon Aurora, a MySQL-compatible RDBMS that AWS advertises as substantially faster than stock MySQL.

Storage engine for large amounts of constantly inserted data which should be available instantly

Our server (several Java applications on Debian) handles incoming data (GNSS observations) that should be:
immediately (delay <200ms) delivered to other applications,
stored for further use.
Sometimes (maybe several times a day) about a million archived records will be fetched from the database. Each record is about 12 double-precision fields plus a timestamp and some IDs. There are no UPDATEs; DELETEs are very rare but massive. The incoming flow is up to a hundred records per second. So I had to choose a storage engine for this data.
I tried using MySQL (InnoDB). One application inserts; the others constantly check the last record ID and, if it has changed, fetch the new records. This part works fine. But I've run into the following issues:
Records are quite large (about 200-240 bytes per record).
Fetching a million archived records is unacceptably slow (tens of minutes or more).
File-based storage would work just fine (since there are no inserts in the middle of the DB and selections are mostly like 'WHERE ID=1 AND TIME BETWEEN 2000 AND 3000'), but there are other problems:
Looking for new data might not be so easy.
Other data like logs and configs are stored in the same database, and I prefer to have one database for everything.
Can you advise a suitable database engine (SQL preferred, but not necessary)? Maybe it is possible to fine-tune MySQL to reduce the record size and the fetch time for continuous strips of data?
MongoDB is not acceptable since DB size is limited on 32-bit machines. Any engine that does not provide quick access to recently inserted data is not acceptable either.
I'd recommend using the TokuDB storage engine for MySQL. It's free for up to 50GB of user data, and its pricing model isn't terrible, making it a great choice for storing large amounts of data.
It has higher insert speed than InnoDB and MyISAM and scales much better as the dataset grows (InnoDB tends to deteriorate once the working dataset no longer fits in RAM, making its performance dependent on the I/O of the HDD subsystem).
It's also ACID compliant and supports multiple clustered indexes (which would be a great choice for massive DELETEs you're planning to do). Also, hot schema changes are supported (ALTER TABLE doesn't lock the tables, and changes are quick on huge tables - I'm talking gigabyte-sized tables being altered in mere seconds).
From my personal use, I experienced about 5 - 10 times less disk usage due to TokuDB's compression, and it's much, much faster than MyISAM or InnoDB.
Even though it sounds like I'm trying to advertise this product - I'm not, it's just simply amazing since you can use monolithic data-store without expensive scaling plans like partitioning across nodes to scale the writes.
There really is no getting around how long it takes to load millions of records from disk. Your 32-bit requirement means you are limited in how much RAM you can use for memory based data structures. But, if you want to use MySQL, you may be able to get good performance using multiple table types.
If you need really fast non-blocking inserts, you can use the BLACKHOLE table type and replication. The server where the inserts occur has a BLACKHOLE table that replicates to another server where the table is InnoDB or MyISAM.
Since you don't do UPDATEs, I think MyISAM would be better than InnoDB in this scenario. You can use the MERGE table type with MyISAM (not available for InnoDB). I'm not sure what your data set is like, but you could have one table per day (hour, week?); your MERGE table would then be a superset of those tables. Assuming you want to delete old data by day, just redeclare the MERGE table to not include the old tables. This action is instantaneous. Dropping old tables is also extremely fast.
To check for new data, you can look at today's table directly rather than going through the MERGE table.
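The "redeclare the MERGE table" step could be scripted roughly as follows. This sketch only builds the SQL text; it does not connect to a server. The table names (obs_all, obs_YYYYMMDD) and the three-day retention are hypothetical, and the ALTER TABLE ... UNION=(...) form is the MERGE engine's syntax for redefining member tables.

```python
from datetime import date, timedelta


def daily_table_names(last_day, days_to_keep, prefix="obs_"):
    """Names of the per-day tables to keep, oldest first."""
    return [
        f"{prefix}{(last_day - timedelta(days=offset)):%Y%m%d}"
        for offset in range(days_to_keep - 1, -1, -1)
    ]


def redeclare_merge_sql(merge_table, member_tables):
    """Build the statement that redefines which tables the MERGE table spans."""
    return f"ALTER TABLE {merge_table} UNION=({', '.join(member_tables)})"


# Hypothetical example: keep the three most recent daily tables.
tables = daily_table_names(date(2024, 1, 3), days_to_keep=3)
statement = redeclare_merge_sql("obs_all", tables)
print(statement)
```

Tables that fall out of the list can then be dropped separately; the MERGE table itself holds no rows.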

MySQL Memory Table, Memcached or anything else?

A dataset, currently over 1 million records and growing, requires constant lookups and updates of user-specific data.
I'm looking for the fastest, most scalable option with high TPS.
Memcached/MemcacheDB versus MySQL MEMORY tables: I'm confused about which to choose for implementation and scaling.
Can anyone provide scaling, TPS, and performance information to help decide which one to go with?
Does the integrity of this data matter? If it does, you can immediately rule out memcached and MySQL memory tables, as neither one is persisted to durable storage. memcachedb is at least persisted, but it doesn't make the same sorts of reliability guarantees that a normal (R)DBMS would.
If you've got a large dataset, you don't scale it by throwing hardware at it. Since you didn't say what's your growth rate, it's difficult to suggest anything.
If you need to scale writes - you partition your table.
If you need to scale reads - you create master > multiple slaves replication cluster.
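As a rough illustration of the read-scaling setup, here is a toy router that sends SELECT statements to replicas in round-robin order and everything else to the master. The host names are hypothetical, and a real deployment would normally rely on a proxy or driver feature rather than string inspection.

```python
import itertools


class ReplicationRouter:
    """Toy read/write splitter: reads go to replicas, writes to the master."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, sql):
        # Only plain SELECT statements are treated as reads in this sketch.
        first_word = sql.lstrip().split(None, 1)[0].upper()
        if first_word == "SELECT":
            return next(self._replica_cycle)
        return self.primary


# Hypothetical host names.
router = ReplicationRouter("master-db", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users WHERE id = 1"))
print(router.route("UPDATE users SET name = 'x' WHERE id = 1"))
```

Keep in mind that replicas can lag behind the master, so a read issued right after a write may return stale data.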
Also, there's an engine called TokuDB available for MySQL - more info at www.tokutek.com. It's extremely fast for certain things (updates, hot index addition and similar) but not that excellent when it comes to mass updating. It's worth checking out.

Can I expect a significant performance boost by moving a large key value store from MySQL to a NoSQL DB?

I'm developing a database that holds large scientific datasets. Typical usage scenario is that on the order of 5GB of new data will be written to the database every day; 5GB will also be deleted each day. The total database size will be around 50GB. The server I'm running on will not be able to store the entire dataset in memory.
I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.
Queries are typically for around 100 consecutive values,
eg. SELECT Value WHERE ID BETWEEN 7000000 AND 7000100;
I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.
Before I start doing lots of work installing the new software and rewriting the whole database I wanted to get a rough idea of whether I am likely to see a significant performance boost when using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these types of retrievals.
Thanks
Please consider also OrientDB. It uses indexes with RB+Tree algorithm. In my tests with 100GB of database reads of 100 items took 0.001-0.015 seconds on my laptop, but it depends how the key/value are distributed inside the index.
Running your own test with it should take less than an hour.
One piece of bad news is that OrientDB does not support a clustered configuration yet (planned for September 2010).
I use MongoDB in production for a write-intensive operation where I do well over the rates you are referring to for both WRITE and READ operations. The database is around 90GB and a single instance (Amazon m1.xlarge) does 100 QPS. I can tell you that a typical key->value query takes about 1-15ms on a database with 150M entries, with query times reaching 30-50ms under heavy load.
At any rate, 200ms is way too much for a key/value store.
If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn.
If you are looking for a distributed solution, you can try any Dynamo clone:
Cassandra (Facebook) or Project Voldemort (LinkedIn) being the most popular.
Keep in mind that requiring strong consistency slows these systems down quite a bit.
I would expect Cassandra to do better where the dataset does not fit in memory than a B-tree based system like TC (Tokyo Cabinet), MySQL, or MongoDB. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.

Will a MySQL table with 20,000,000 records be fast with concurrent access?

I ran a lookup test against an indexed MySQL table containing 20,000,000 records, and according to my results, it takes 0.004 seconds to retrieve a record given an id--even when joining against another table containing 4,000 records. This was on a 3GHz dual-core machine, with only one user (me) accessing the database. Writes were also fast, as this table took under ten minutes to create all 20,000,000 records.
Assuming my test was accurate, can I expect performance to be as snappy on a production server, with, say, 200 users concurrently reading from and writing to this table?
I assume InnoDB would be best?
That depends on the storage engine you're going to use and what's the read/write ratio.
InnoDB will be better if there are a lot of writes. If it's mostly reads with very occasional writes, MyISAM might be faster. MyISAM uses table-level locking, so it locks the whole table whenever you update. InnoDB uses row-level locking, so you can have concurrent updates on different rows.
InnoDB is definitely safer, so I'd stick with it anyhow.
BTW. remember that right now RAM is very cheap, so buy a lot.
Depends on any number of factors:
Server hardware (Especially RAM)
Server configuration
Data size
Number of indexes and index size
Storage engine
Writer/reader ratio
I wouldn't expect it to scale that well. More importantly, this kind of thing is too important to speculate about. Benchmark it and see for yourself.
Regarding storage engine, I wouldn't dare to use anything but InnoDB for a table of that size that is both read and written to. If you run any write query that isn't a primitive insert or single row update you'll end up locking the table using MyISAM, which yields terrible performance as a result.
There's no reason that MySQL couldn't handle that kind of load without any significant issues. There are a number of other variables involved though (otherwise, it's a 'how long is a piece of string' question). Personally, I've had a number of tables in various databases that are well beyond that range.
How large is each record (on average)?
How much RAM does the database server have, and how much is allocated to the various MySQL/InnoDB settings?
A default configuration may only allow for a default 8MB buffer between disk and client (which might work fine for a single user) - but trying to fit a 6GB+ database through that is doomed to failure. That problem was real btw - and was causing several crashes a day of a database/website till I was brought in to trouble-shoot it.
If you are likely to do a great deal more with that database, I'd recommend getting someone with a little more experience, or at least doing what you can to learn how to optimise it. Reading 'High Performance MySQL, 2nd Edition' is a good start, as is looking at some tools like Maatkit.
As long as your schema design and DAL are constructed well enough, you understand query optimization inside out, can adjust all the server configuration settings at a professional level, and have "enough" hardware properly configured, yes (except for sufficiently pathological cases).
Same answer for both engines.
You should probably perform a load test to verify, but as long as the index was created properly (meaning indexes are optimized to your query statements), the SELECT queries should perform at an acceptable speed (the INSERTS and/or UPDATES may be more of a speed issue though depending on how many indexes you have, and how large the indexes get).
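A load test of the kind suggested here could start from a small harness like the one below. It is only a sketch: fake_lookup is a hypothetical stand-in that sleeps for a millisecond, and you would replace it with a function that runs your real query through whatever MySQL driver you use. The user and request counts are arbitrary.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(query_fn, concurrent_users=200, requests_per_user=20):
    """Call query_fn from many threads; return per-call latencies in seconds."""

    def one_user(user_index):
        user_latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            query_fn()
            user_latencies.append(time.perf_counter() - start)
        return user_latencies

    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))
    return [latency for user in per_user for latency in user]


def fake_lookup():
    # Hypothetical stand-in: replace with a real query against your server.
    time.sleep(0.001)


latencies = run_load_test(fake_lookup, concurrent_users=20, requests_per_user=5)
print(f"calls: {len(latencies)}")
print(f"median: {statistics.median(latencies) * 1000:.2f} ms")
print(f"max: {max(latencies) * 1000:.2f} ms")
```

Run it against a copy of the production data with a realistic mix of reads and writes; a single-user timing says little about behaviour under contention.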