How does Couchbase Server support high concurrency and high throughput?

I am curious to know how Couchbase Server supports high concurrency and high throughput.

It's a very broad question to answer, but I'll try to cover some of the key reasons why Couchbase is fast and scalable.
Writes in Couchbase are asynchronous by default: replication and persistence happen in the background, and the smart clients (SDKs) are notified of success or failure. Any new document or mutation to a document is written to RAM and then asynchronously flushed to disk and replicated to other nodes in the background, so there is no waiting on or contention for disk I/O. (It also means a node could fail after a write has landed in RAM but before it has been persisted to disk or replicated to a second or third node.) Writes can be made synchronous, but that slows down throughput considerably.
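As a concrete illustration, here is a minimal sketch using the Couchbase Java SDK 2.x; the host, bucket and document names are made up, and the durability constants shown are just one possible choice:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.PersistTo;
import com.couchbase.client.java.ReplicateTo;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class WriteDurabilityExample {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1");
        Bucket bucket = cluster.openBucket("default");

        JsonDocument doc = JsonDocument.create("user::123",
                JsonObject.create().put("name", "example"));

        // Default: returns once the mutation is in RAM on the active node;
        // persistence and replication continue in the background.
        bucket.upsert(doc);

        // "Synchronous" variant: block until the mutation is persisted on the
        // active node and replicated to one other node -- safer, but slower.
        bucket.upsert(doc, PersistTo.MASTER, ReplicateTo.ONE);

        cluster.disconnect();
    }
}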
When dealing with RAM, writes and reads are very fast (we've only pushed our cluster to 20k operations a second), but large companies easily hit upwards of 400k operations a second. LinkedIn sustains that ops rate with only 4 nodes (see http://www.couchbase.com/customer-stories).
In traditional database architectures the setup is usually a master DB (MySQL/Postgres/Oracle) coupled with a slave DB for data redundancy, and writes/reads can be split between the two as load gets higher. Couchbase is meant to be used as a distributed system (Couchbase recommends at least 3 nodes in production). Data is automatically sharded between the nodes in a cluster, spreading writes and reads across multiple machines. If you need higher throughput, adding a node in Couchbase is as simple as clicking "Add Node" and then rebalancing the cluster; the data is automatically partitioned across the new cluster map.
So essentially: writing and reading from RAM with asynchronous disk persistence, plus distributed reads and writes, equals high throughput.
Hope that helps!

#scalabilitysolved already gave a great overview, but if you want a longer (and more detailed) description, take a look at the Couchbase Server Architecture Review on couchbase.com.

Related

Resize Amazon RDS storage

We are currently working with a 200 GB database and we are running out of space, so we would like to increase the allocated storage.
We are using General Purpose (SSD) and a MySQL 5.5.53 database (without Multi-AZ deployment).
If I go to the Amazon RDS menu and change the Allocated storage to a bit more (from 200 to 500) I get the following "warnings":
Deplete the initial General Purpose (SSD) I/O credits, leading to longer conversion times: What does this mean?
Impact instance performance until operation completes: And this is the most important question for me. Can I resize the instance with 0 downtime? I mean, I don't care if the queries are a bit slower as long as they keep working while it's resizing, but what I don't want to do is stop all my production websites, resize the instance, and open them again (i.e., have downtime).
Thanks in advance.
You can expect degraded performance but you should really test the impact in a dev environment before running this on production so you're not caught off guard. If you perform this operation during off-peak hours you should be fine though.
To answer your questions:
RDS instances can burst with something called I/O credits. Bursting means performance can go above the baseline to meet spikes in demand. Burning through them shouldn't be a big deal unless your instance relies on them (you can determine this from the RDS instance metrics). Have a read through I/O Credits and Burst Performance.
Changing the disk size will not result in a complete RDS instance outage, just performance degradation, so it's better to do it during off-peak hours to minimise the impact as much as possible.
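If you prefer to script the change instead of using the console, the same storage increase can be requested through the AWS CLI (the instance identifier below is a placeholder); without --apply-immediately the change is deferred to the next maintenance window:

aws rds modify-db-instance --db-instance-identifier mydb --allocated-storage 500 --apply-immediately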
First, according to the RDS FAQs, there should be no downtime at all as long as you are only increasing the storage size and not upgrading the instance tier.
Q: Will my DB instance remain available during scaling?
The storage capacity allocated to your DB Instance can be increased
while maintaining DB Instance availability.
Second, according to RDS documentation:
Baseline I/O performance for General Purpose SSD storage is 3 IOPS for
each GiB, which means that larger volumes have better performance....
Volumes below 1 TiB in size also have ability to burst to 3,000 IOPS
for extended periods of time (burst is not relevant for volumes above
1 TiB). Instance I/O credit balance determines burst performance.
I cannot say for certain why, but I guess that when RDS increases the disk size it may defragment the data or rearrange data blocks, which causes heavy I/O. If your server is under heavy usage during the resizing, it may fully consume the I/O credits, resulting in less I/O and longer conversion times. However, given that you started with 200 GB, I suppose it should be fine.
Finally, I would suggest using a Multi-AZ deployment if you are worried about downtime or performance impact. During maintenance windows or snapshots there will be a brief I/O suspension of a few seconds, which can be avoided with a standby or read replicas.
The technical answer is that AWS supports no downtime when scaling storage.
However, in the real world you need to factor in how busy your current database is and how the slowdown will affect users. Consider the possibility that connections might time out or that the site may appear slower than usual for the duration of the scaling event.
In my experience, RDS storage resizing has been smooth and problem-free. However, we pick the least busy time of day to do it, and we also go through a backup procedure: we take a snapshot and bring up a standby server that we can switch over to manually, just in case.

Getting very bad performance with Galera compared to a standalone MariaDB server

I am getting unacceptably low performance with the Galera setup I created. In my setup there are 2 nodes in active-active, and I am doing reads/writes on both nodes in a round-robin fashion using an HAProxy load balancer.
I was easily able to get over 10000 TPS on my application with a single MariaDB server with the following configuration:
36 vCPUs, 60 GB RAM, SSD, 10 Gb dedicated pipe
With Galera I am hardly getting 3500 TPS, although I am using 2 DB nodes (36 vCPUs, 60 GB RAM) load balanced by HAProxy. For information, HAProxy is hosted as a standalone node on a different server. I have removed HAProxy for now, but there is no improvement in performance.
Can someone please suggest some tuning parameters in my.cnf that I should consider to tune this severely under-performing setup?
I am using the below my.cnf file:
I was easily able to get over 10000 TPS on my application with the
single mariadb server with the below configuration: 36 vpcu, 60 GB
RAM, SSD, 10Gig dedicated pipe
With galera i am hardly getting 3500 TPS although i am using 2
nodes(36vcpu, 60 GB RAM) of DB load balanced by ha-proxy.
Clusters based on Galera are not designed to scale writes the way I see you intend to; in fact, as Rick mentioned above, sending writes to multiple nodes for the same tables will end up causing certification conflicts, which show up as deadlocks for your application and add huge overhead.
I am getting an unacceptable low performance with the galera setup i
created. In my setup there are 2 nodes in active-active and i am doing
read/writes on both the nodes in a round robin fashion using HA-proxy
load balancer.
Please send all writes to a single node and see if that improves performance. There will always be some overhead due to the nature of the virtually synchronous replication that Galera uses, which literally adds network overhead to each write you perform (although true clock-based parallel replication offsets this impact quite a bit, you are still bound to see slightly lower throughput).
Also make sure to keep your transactions short and COMMIT as soon as you are done with an atomic unit of work, since the replication/certification process is single-threaded and will stall writes on the other nodes. (If you see your writer node showing transactions in the wsrep pre-commit stage, that means the other nodes are doing certification for a large transaction, or that a node is suffering performance problems of some sort: swap, a full disk, abusively large reads, etc.)
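A quick way to check for that on the writer node is to watch the thread states and the wsrep status counters, for example:

SHOW PROCESSLIST;                 -- look for sessions stuck in the "wsrep in pre-commit stage" state
SHOW GLOBAL STATUS LIKE 'wsrep%'; -- e.g. wsrep_flow_control_paused and wsrep_cert_deps_distance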
Hope that helps, and let us know how it goes when you move to a single writer node.
Turn off the Query Cache (QC):
query_cache_size = 0 -- not 22 bytes
query_cache_type = OFF -- QC is incompatible with Galera
Increase innodb_io_capacity
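In my.cnf terms, that is roughly the following (the innodb_io_capacity number is only a placeholder; the right value depends on your storage):

[mysqld]
query_cache_size = 0
query_cache_type = OFF
innodb_io_capacity = 2000   # placeholder; raise from the default of 200 for SSD-backed storage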
How far apart (ping time) are the two nodes?
Suggest you pretend that it is Master-Slave. That is, have HAProxy send all traffic to one node, leaving the other as a hot backup. Certain things can run faster in this mode; I don't know about your app.
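A minimal HAProxy sketch of that master-with-hot-standby arrangement (addresses and names are made up) could look like this:

# haproxy.cfg (sketch): all traffic goes to db1; db2 is only used if db1 fails
listen galera
    bind *:3306
    mode tcp
    server db1 10.0.0.1:3306 check
    server db2 10.0.0.2:3306 check backup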

How to make program NUMA ready?

My program uses shared memory as a data store. This data must be available to any running application, and fetching it must be fast. But some applications may run on different NUMA nodes, and data access for them is really expensive. Is duplicating the data for every NUMA node the only way to do this?
There are two primary sources of slowdown that can be attributed to NUMA. The first is the increased latency of remote access which can vary depending on the platform. On the platforms that I work with, there is about a 30% hit in latency.
The other source of performance loss can come from contention over the communication links and controllers between NUMA nodes.
The default allocation scheme for Linux is to allocate data on the node where it was created. If the majority of the data in the application is initialized by a single thread, then it will generate a lot of cross-NUMA-node traffic and contention for that one memory node.
If your data is read only, then replication is a good solution.
Otherwise, interleaving the data allocation across all your nodes will distribute the requests across all the nodes and will help relieve congestion.
To interleave the data, you can use set_mempolicy() from numaif.h if you are using Linux.

Coherence Topology Suggestion

Data to be cached:
100 Gb data
Objects of size 500-5000 bytes
1000 objects updated/inserted on average per minute (peak 5000)
Need suggestion for Coherence topology in production and test (distributed with backup)
number of servers
nodes per server
heap size per node
Questions
How much free memory is needed per node compared to the memory used by cached data (assuming 100% usage is not possible)?
How much overhead will 1-2 additional indexes per cache element generate?
We do not know how many read operations will be done; the solution will be used by clients for whom low response times are critical (more so than data consistency), and this depends on each use case. The cache will be updated from the DB by polling at a fixed frequency and repopulating the cache (since the cache is the data master, not the system using the cache).
The rule of thumb for sizing a JVM for Coherence is that the data is 1/3 the heap assuming 1 backup: 1/3 for cache data, 1/3 for backup, and 1/3 for index and overhead.
The biggest difficulty in sizing is that there are no good ways to estimate index sizes. You have to try with real-world data and measure.
A rule of thumb for JDK 1.6 JVMs is to start with 4 GB heaps, so you would need 75 cache server nodes. Some people have been successful with much larger heaps (16 GB), so it is worth experimenting. With large heaps (e.g., 16 GB) you should not need as much as 1/3 for overhead and can hold more than 1/3 for data. With heaps greater than 16 GB, garbage collector tuning becomes critical.
For maximum performance, you should have 1 core per node.
The number of server machines depends on practical limits of manageability, capacity (cores and memory), and failure. For example, even if you have a server that can handle 32 nodes, what happens to your cluster when a machine fails? The cluster will be machine safe (backups are not on the same machine) but recovery would be very slow given the massive amount of data to be moved to new backups. On the other hand 75 machines is hard to manage.
I've seen Coherence achieve latencies of 250 microseconds (not milliseconds) for a 1 KB object put, including network hops and backup. So the number of inserts and updates you are looking for should be achievable. Test with multiple threads inserting/updating, and make sure your test client is not the bottleneck.
A few more "rules of thumb":
1) For high availability, three nodes is a good minimum.
2) With Java 7, you can use larger heaps (e.g. 27GB) and the G1 garbage collector.
3) For 100GB of data, using David's guidelines, you will want 300GB total of heap. On servers with 128GB of memory, that could be done with 3 physical servers, each running 4 JVMs with 27GB heap each (~324GB total).
4) Index memory usage varies significantly with data type and arity. It is best to test with a representative data set, both with and without indexes, to see what the memory usage difference is.
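A rough way to run that with-and-without-indexes comparison using the standard Coherence API (the cache name and extractor method below are made up for illustration):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.extractor.ReflectionExtractor;

public class IndexSizingTest {
    public static void main(String[] args) {
        NamedCache cache = CacheFactory.getCache("test-objects");

        // Load a representative sample of real objects here, e.g.
        // for (MyObject o : loadSample()) cache.put(o.getId(), o);   // loadSample() is hypothetical

        // Measure cluster heap usage (e.g. via JMX), then add the index and measure again.
        cache.addIndex(new ReflectionExtractor("getCustomerId"), /*sorted*/ false, /*comparator*/ null);

        CacheFactory.shutdown();
    }
}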

Scalability comparison between different DBMSs

By what factor does the performance (read queries/sec) increase when a machine is added to a cluster of machines running either:
a Bigtable-like database
MySQL?
Google's research paper on Bigtable suggests that near-linear scaling can be achieved with Bigtable. This page here, featuring MySQL's marketing jargon, suggests that MySQL is capable of scaling linearly.
Where is the truth?
Having built and benchmarked several applications using VoltDB, I consistently measure between 90% and 95% of additional transactional throughput as each new server is added to the cluster. So if an application performs 100,000 transactions per second (TPS) on a single server, I measure 190,000 TPS on 2 servers, 280,000 TPS on 3 servers, and so on. At some point we expect the server-to-server networking to become a bottleneck, but our largest cluster (30 servers) is still above 90%.
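Put as a rough formula, those measurements correspond to TPS(n) ≈ TPS(1) × (1 + 0.9 × (n − 1)) for an n-server cluster, i.e. each added server contributes roughly 90% of a single server's throughput until the interconnect becomes the bottleneck.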
If you don't do that many writes to the database, MySQL may be a good and easy solution, especially if coupled with memcached in order to increase the read speed.
OTOH, if your data is constantly changing, you should probably look somewhere else:
Cassandra
VoltDB
Riak
MongoDB
CouchDB
HBase
These systems have been designed to scale linearly with the number of computers added to the system.
A full list is available here.