Couchbase index sizing

I have one cluster with one node.
Community Edition 5.1.1 build 5723.
I am trying to figure out disk requirements.
I have around 320 million docs, 250 GB of data (after compaction)
and 4 GSI indexes.
The document key is 60 characters long; the index key is a 42-character string plus a number (long).
CREATE INDEX index_tx_from ON `tx-history`(`from`, blockNumber) WITH { "defer_build": true }
When I used a single 2 TB SSD, I ran out of disk space.
I deleted the indexes and built only one of them (as in the example above); it reached 800 GB, but after compaction it sits at only about 100 GB.
This specific index covers all the docs; the rest of the indexes will be smaller.
As I see it, I need a 1 TB SSD for the data and 2 TB (or even more, on a separate disk) for indexing, but that is only because of compaction requirements.
My questions are:
How can I calculate the required disk size as accurately as possible?
What is the best approach to reduce the size?
(The doc key can't be any shorter.)
Thanks,
Ady.

Here are the general sizing guidelines for Couchbase:
https://docs.couchbase.com/server/5.5/install/sizing-general.html
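As a rough back-of-envelope check against your own numbers: 320M entries x (60-byte doc key + 42-byte index key + 8-byte number) comes to roughly 35 GB of raw key data, and the ~100 GB you see after compaction is that plus the index structure's per-entry overhead. The figure to provision for, though, is the pre-compaction peak (800 GB in your test), so plan on several times the steady-state size of each index rather than the compacted figure.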
Couchbase stores documents in a compressed format using the Snappy library. It's done that for quite some time (since version 3.x, if I'm not mistaken). The new feature in CB 5.5 Enterprise Edition deals with compressing documents in RAM; it's a per-bucket setting.
There are a few performance issues with your particular setup:
Running a large dataset on a single node with a mix of key/value operations and N1QL queries. At the very least, you should consider multiple nodes with ample RAM, CPU, and disk space.
You should consider optimizing your indexes. The index definition in your post will effectively have 320M+ records. Here is an article that can help you get started (check out the partial indexes section): https://blog.couchbase.com/database-indexing-best-practices/. A sketch of a partial index follows below.
There is also a plethora of N1QL information in Couchbase N1QL guides (available for download in PDF): https://blog.couchbase.com/a-guide-to-n1ql-features-in-couchbase-5-5-special-edition/
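For example, if your queries only ever touch transactions above some block height, a partial index along these lines would index far fewer than 320M entries (the WHERE predicate here is purely hypothetical; substitute whatever subset your queries actually filter on):
CREATE INDEX index_tx_from_recent ON `tx-history`(`from`, blockNumber)
WHERE blockNumber > 5000000
WITH { "defer_build": true };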

Related

mysql cluster catching up with cassandra?

I have recently been looking at NoSQL solutions for our quite big upcoming database and found that Cassandra is good, but there are very few resources available online about new releases of Cassandra; most of the blogs and articles relate to version 0.6, while it has since added support for Hadoop and Hive. On the other hand, MySQL Cluster is specifically made to run on a horizontally scaled setup using commodity servers.
As we have been used to the relational model for years, moving to Cassandra will require some rewiring of the brain, while the product is still not very mature and the community is not big enough to respond quickly to any particular problem. I have checked the DataStax website (one of the professional support providers) and their forums are pretty much dead.
So, how do MySQL Cluster and Cassandra compare, setting the relational vs. non-relational comparison aside?
Though Cassandra is schemaless, it still provides fairly tabular features like super columns and sub columns, so a record can be searched by multiple column values.
I have also tried my best to find out how Cassandra physically stores updates: when a sub column of a row is edited and quite a big chunk of data is added, how does it physically store that record, and how does it access that record fast? In MySQL, columns have a fixed length allocated, so this is not a big issue.
Here are some areas where I suspect Cassandra has an advantage:
Excellent support for larger-than-memory data sets
Replication: Cassandra supports arbitrary numbers of fully-distributed replicas instead of just partitioned replicas (so, you don't have to have a number of nodes divisible by your replica count in Cassandra, and there are no corner cases to deal with around primary failover), best-in-class support for multiple datacenters, support for synchronous replication as well as asynchronous (important if you're concerned about full durability), and robust self-healing (hinted handoff, read repair, anti-entropy) to make sure you never have to blow away a backup replica and rebuild it from scratch
No locking during ALTER TABLE, index creation, etc
Substantially simpler and less error-prone administration (compare http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-online-add-node.html and http://wiki.apache.org/cassandra/Operations#Bootstrap). In particular, I would call your attention to how many client or other nodes need to be restarted in the Cassandra scenario: none.
To elaborate on the last a little, most people who haven't actually run Cassandra on a multi-node cluster don't realize just how well Cassandra has been designed for this. For a two-minute taste, see Jake Luciani's demo.
To answer your physical storage question, the key feature that makes Cassandra writes fast is that they are append-only. That is, Cassandra only ever writes sequential blocks to disk; it doesn't need to do any slow seeks to random disk locations during a write.
When a column is updated, two things happen: the write is appended to the commit log (for failure recovery), and the in-memory Memtable is updated. Once the Memtable is full, it is flushed out to disk as a new SSTable. Thus, the length of the data doesn't matter, since you're not trying to fit it into a fixed-length disk structure.
SSTables are read-only - you never go back and overwrite an old value on an update, you just write new ones. On a read, Cassandra first looks in the Memtable for the key. If it doesn't find it, Cassandra scans the SSTables in order from newest to oldest and stops when it finds the key. This gives you the most recent value.
There are a few optimizations as well. Each SSTable has an associated Bloom filter for its keys, which is a compact probabilistic index that can produce false positives but never false negatives. If the key is not in the Bloom filter, you can safely skip that SSTable as it is guaranteed not to contain the key, although you may occasionally read an SSTable that you didn't have to.
When you get too many SSTables, they are merged together into a bigger one in a process called compaction. Essentially this does a big merge sort on the SSTables. This lets Cassandra reclaim the space for values that have been overwritten or deleted, and defragment rows that were spread across multiple SSTables.
See http://www.mikeperham.com/2010/03/13/cassandra-internals-writing/ and http://wiki.apache.org/cassandra/MemtableSSTable for more information.
First, a disclaimer: I work as part of the MySQL Cluster product team.
If you are looking at Cluster, it would be worth starting with the latest 7.2 Development Release, which includes new capabilities that significantly enhance JOIN performance, as well as a new memcached interface that bypasses the SQL layer:
http://dev.mysql.com/tech-resources/articles/mysql-cluster-labs-dev-milestone-release.html
If you are familiar already with MySQL, then the following documentation highlights differences between InnoDB and the current GA 7.1 release:
http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndb-innodb-workloads.html
While these don't provide direct comparisons with Cassandra, they do at least provide the latest information on Cluster on which you can base any comparison.
Another option these days is a relational model in Cassandra with PlayOrm: as long as you partition your really big tables, you can do joins and all the stuff you are familiar with using Scalable SQL, like so:
#NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS p(:partId) select p FROM TABLE as p INNER JOIN p.security as s where s.securityType = :type and p.numShares = :shares"),
NOTE: The TABLE here is a Trades table, and p.security references the Security table. Trades is partitioned so it can have unlimited partitions, while the Security table is smaller, so it is not partitioned; but you can do all the Scalable SQL joins you want.

MySQL Cluster is much slower than InnoDB

I have a denormalized table product with about 6 million rows (~ 2GB) mainly for lookups. Fields include price, color, unitprice, weight, ...
I have BTREE indexes on color etc. Query conditions are dynamically generated from the Web, such as
select count(*)
from product
where color = 1 and price > 5 and price < 100 and weight > 30 ... etc
and
select *
from product
where color = 2 and price > 35 and unitprice < 110
order by weight
limit 25;
I used to use InnoDB and tried MEMORY tables, then switched to NDB hoping more concurrent queries could be done faster. I have 2 tables with the same schema, indexes, and data; one is InnoDB while the other is NDB. But the results are very disappointing: for the queries mentioned above, InnoDB is about 50 times faster than NDB, roughly 0.8 seconds vs 40 seconds. For this test I was running only a single select query repeatedly. Both the InnoDB and NDB queries use the same index on color.
I am using mysql-5.1.47 ndb-7.1.5 on a dual Xeon 5506 (8 cores total) with 32 GB memory running CentOS 5. I set up 2 NDB data nodes, one MGM node and one MySQL node on the same box. For each node I allocated about 9 GB of memory, and also tried MaxNoOfExecutionThreads=8, LockPagesInMainMemory, LockExecuteThreadToCPU and many other config parameters, but no luck. While NDB was running the query, my peak CPU load was only around 200%, i.e., only 2 out of 8 cores were busy; most of the time it was around 100%. I was using ndbmtd, and verified in the data node log that the LQH threads were indeed spawned.
I also tried EXPLAIN and profiling; they just show that "Sending data" was consuming most of the time. I also went through some MySQL Cluster tuning documents available online, but they were not very helpful in my case.
Can anybody shed some light on this? Is there a better way to tune an NDB database? Appreciate it!
You need to pick the right storage engine for your application.
myISAM -- read frequently / write infrequently. Ideal for data lookups in big tables. Does reasonably well with complex indexes and is quite good for batch reloads.
MEMORY -- good for fast access to relatively small and simple tables.
InnoDB -- good for transaction processing. Also good for a mixed read / write workload.
NDB -- relatively less mature. Good for fault tolerance.
The mySQL server is not inherently multiprocessor software. So adding cores isn't necessarily going to jack up performance. A good host for mySQL is a decent two-core system with plenty of RAM and the fastest disk IO channels and disks you can afford. Do NOT put your mySQL data files on a networked or shared file system, unless you don't care about query performance.
If you're running on Linux issue these two commands (on the machine running the mySQL server) to see whether you're burning all your cpu, or burning all your disk IO:
sar -u 1 10
sar -d 1 10
Your application sounds like a candidate for myISAM. It sounds like you have plenty of hardware. In that case you can build a master server and an automatically replicated slave server, but you may be fine with just one server, which will be easier to maintain.
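If you want to compare engines on your existing data, a quick way to produce a MyISAM copy of the table for testing is something like the following (the product_myisam name is just a placeholder):
CREATE TABLE product_myisam LIKE product;
ALTER TABLE product_myisam ENGINE=MyISAM;
INSERT INTO product_myisam SELECT * FROM product;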
Edit: It's eight years later and this answer is now basically obsolete.

Can I expect a significant performance boost by moving a large key value store from MySQL to a NoSQL DB?

I'm developing a database that holds large scientific datasets. Typical usage scenario is that on the order of 5GB of new data will be written to the database every day; 5GB will also be deleted each day. The total database size will be around 50GB. The server I'm running on will not be able to store the entire dataset in memory.
I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.
Queries are typically for around 100 consecutive values,
eg. SELECT Value WHERE ID BETWEEN 7000000 AND 7000100;
I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.
Before I start doing lots of work installing the new software and rewriting the whole database I wanted to get a rough idea of whether I am likely to see a significant performance boost when using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these types of retrievals.
Thanks
Please consider also OrientDB. It uses indexes based on an RB+Tree algorithm. In my tests with a 100 GB database, reads of 100 items took 0.001-0.015 seconds on my laptop, but it depends on how the key/values are distributed inside the index.
Making your own test with it should take less than an hour.
One piece of bad news is that OrientDB does not support a clustered configuration yet (planned for September 2010).
I use MongoDB in production for a write-intensive workload where I do well over the rates you are referring to for both WRITE and READ operations. The size of the database is around 90 GB, and a single instance (Amazon m1.xlarge) does 100 QPS. I can tell you that a typical key->value query takes about 1-15 ms on a database with 150M entries, with query times reaching 30-50 ms under heavy load.
At any rate, 200 ms is way too much for a key/value store.
If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn.
If you are looking for a distributed solution, you can try any Dynamo clone:
Cassandra (Facebook) or Project Voldemort (LinkedIn) being the most popular.
Keep in mind that requiring strong consistency slows these systems down quite a bit.
I would expect Cassandra to do better where the dataset does not fit in memory than a b-tree based system like TC, MySQL, or MongoDB. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.

Can mysql handle a dataset of 50gb?

Can MySQL handle a dataset of 50 GB (only text) efficiently? If not, what database technologies should I use?
thanks
Technically, I would say yes. MySQL can handle 50GB of data, efficiently.
If you are looking for a few examples, Facebook moved to Cassandra only after it was storing over 7 Terabytes of inbox data.
Source: Lakshman, Malik: Cassandra - A Decentralized Structured Storage System.
Wikipedia also handles hundreds of Gigabytes of text data in MySQL.
Any backend that uses b-trees (like all the popular ones for MySQL) gets dramatically slower when the index doesn't fit in RAM anymore. Depending on your query needs, Cassandra might be a good fit, or Lucandra (Lucene + Cassandra) -- http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

What are the limitations of implementing MySQL NDB Cluster?

I want to implement NDB Cluster for MySQL Cluster 6. I want to do it for a very large data structure with a minimum of 2 million records.
I want to know whether there are any limitations to implementing NDB Cluster, for example RAM size, number of databases, or size of the database for an NDB cluster.
2 million databases? I assume you meant "rows".
Anyway, concerning limitations: one of the most important things to keep in mind is that NDB/MySQL Cluster is not a general purpose database. Most notably, join operations, but also subqueries and range operations (queries like: orders created between now and a week ago), can be considerably slower than what you might expect. This is in part due to the fact that the data is distributed across multiple nodes. Although some improvements have been made, join performance can still be very disappointing.
On the other hand, if you need to deal with many (preferably small) concurrent transactions (typically single-row updates/inserts/deletes/lookups by primary key) and you manage to keep all of your data in memory, then it can be a very scalable and performant solution.
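To illustrate the contrast with a hypothetical orders table: NDB handles the first query below very well, while the second is the kind of range scan that can disappoint:
SELECT * FROM orders WHERE order_id = 12345;  -- single-row primary key lookup, NDB's sweet spot
SELECT * FROM orders WHERE created_at BETWEEN '2011-01-01' AND '2011-01-08';  -- range scan across data nodes, much slower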
You should ask yourself why you want Cluster. If you simply want the ordinary database you have now, except with 99.999% availability added, then you may be disappointed. Certainly MySQL Cluster can provide you with great availability and uptime, but the workload of your app may not be very well suited to the things Cluster is good for. Plus, you may be able to use another high-availability solution to increase the uptime of your otherwise traditional database.
BTW - here's a list of limitations as per the doc: http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations.html
But whatever you do, try out Cluster and see if it's good for you. MySQL Cluster is not "MySQL + 5 nines". You'll find out when you try.
NDB Cluster comes with two types of storage options:
1. In-memory storage.
2. Disk storage.
NDB was introduced as an in-memory data store, and from version 7.4 (MySQL 5.6) onwards it started supporting disk storage.
The current version, 7.5 (MySQL 5.7), supports disk storage, and in that case there are no size constraints, since the data resides on disk and the limit depends on the disk space available to you.
Disk Storage configurations - https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-disk-data-symlinks.html
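For reference, a minimal sketch of the disk-data DDL (all names and sizes below are placeholders, not recommendations):
CREATE LOGFILE GROUP lg_1
  ADD UNDOFILE 'undo_1.log'
  INITIAL_SIZE 128M
  ENGINE NDBCLUSTER;
CREATE TABLESPACE ts_1
  ADD DATAFILE 'data_1.dat'
  USE LOGFILE GROUP lg_1
  INITIAL_SIZE 256M
  ENGINE NDBCLUSTER;
-- indexed columns still live in memory; only non-indexed columns are stored on disk
CREATE TABLE tx_history (
  id BIGINT NOT NULL PRIMARY KEY,
  payload VARBINARY(255)
) TABLESPACE ts_1 STORAGE DISK ENGINE NDBCLUSTER;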
In-memory storage in NDB Cluster is also quite mature, and you can define memory usage in the management node's config.ini file.
Example:
DataMemory=3072M
IndexMemory=384M
With an average table (depending on the data stored in the columns), a total DB size of less than 1 GB can easily be accommodated with this configuration.
Note: in my own implementation I faced one performance challenge, as NDB performance degrades with an increasing number of rows in a table.
Under high load, concurrent reads will degrade as the number of rows increases.
Make sure you don't trigger full table scans; provide sufficiently selective WHERE clause predicates.
For proper performance, define secondary indexes properly according to your query pattern (a sketch follows below).
Defining secondary indexes will again increase memory consumption, so plan your query patterns and memory resources accordingly.
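As a sketch (the table and column names are hypothetical), a secondary index on an NDB table is defined just like on any other engine, but keep in mind that its ordered index lives in memory:
CREATE TABLE orders (
  order_id BIGINT NOT NULL PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  amount DECIMAL(12,2)
) ENGINE=NDBCLUSTER;
-- secondary index so lookups by customer_id avoid a full table scan
CREATE INDEX idx_orders_customer ON orders (customer_id);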