Why is separating writes and reads better? - mysql

I can't understand why separating writes and reads is better than handling both on one server.
For example, I have a MySQL cluster with three machines: node1, node2, node3.
One possible architecture is:
All write requests go to node1, and all read requests go to node2 and node3.
The second possible architecture is:
All three nodes handle both writes and reads.
In architecture one, the write pressure on node1 is very high, so I prefer architecture two.
Also, why does MongoDB send writes to the primary node and reads to the secondary nodes?

This is an issue of scale for both MySQL and MongoDB. In the simplest application, with a small dataset and low traffic volume, having all writes and reads go to one server gives you a simple architecture. In a very high volume read application with low write volume, a single write node replicating to more than one read node lets you scale reads just by adding another node. In an application with high read AND write volume you might consider sharding: in MySQL you do it yourself or find a tool to help; in MongoDB you run mongos, which handles sharding for you. Sharding puts each record on a specific instance based on a key that you define; the key determines which instance each record is stored on. You can imagine that sharding is more complicated to manage than a single server for read/write access. You would be right, even in a case like MongoDB that does the sharding for you once you define a key (or just use the default key).
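As a toy illustration of how a shard key maps records to instances (not how mongos or any particular MySQL tool actually implements it), here is a minimal Python sketch; the node names and key format are assumptions for the example.

```python
import hashlib

# Hypothetical shard nodes; in a real deployment these would be
# connection strings for your MySQL shards or mongod instances.
NODES = ["node1", "node2", "node3"]

def shard_for(key: str) -> str:
    """Hash the shard key and map it onto one of the nodes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every record with the same key always lands on the same node,
# so reads for that key know exactly which instance to ask.
print(shard_for("user:42"))   # e.g. node2
print(shard_for("user:43"))   # possibly a different node
```

Real sharding layers add rebalancing, replication and routing metadata on top of this idea, which is why they are harder to operate than a single read/write server.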

MySQL Cluster also supports auto-sharding: by default it hashes the primary key, but users can supply their own keys to provide more distribution awareness (a sketch of this follows the demo link below). Each node in the cluster is a master, and internal load balancing distributes the load across nodes.
While very high level, the short demo posted here introduces you to the concepts of sharding in MySQL Cluster:
http://www.oracle.com/pls/ebn/swf_viewer.load?p_shows_id=11464419
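To make the "supply your own keys" point concrete, here is a hedged sketch of creating an NDB table with user-defined KEY partitioning from Python; the table, columns and connection details are invented placeholders, not part of the original answer.

```python
import mysql.connector  # pip install mysql-connector-python

# Placeholder connection details for a SQL node of the cluster.
conn = mysql.connector.connect(
    host="node1", user="app", password="secret", database="test"
)
cur = conn.cursor()

# By default NDB distributes rows by hashing the primary key.
# PARTITION BY KEY lets you choose the column(s) used for distribution,
# e.g. to keep all rows of one customer in the same partition.
cur.execute("""
    CREATE TABLE orders (
        customer_id INT NOT NULL,
        order_id    INT NOT NULL,
        amount      DECIMAL(10,2),
        PRIMARY KEY (customer_id, order_id)
    ) ENGINE=NDBCLUSTER
    PARTITION BY KEY (customer_id)
""")
conn.commit()
cur.close()
conn.close()
```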

Related

How efficiently use MySQL for Stock/TimeSeries related data?

I use Python and MySQL to ingest data via an API, generate signals, and execute orders. Currently things are functional but coupled: a single script fetches data, stores it in MySQL, generates signals, and then executes orders. Tightly coupled does not mean all the logic is in one file; there are separate functions for different tasks, but if the script breaks somehow, everything halts. The DB tables are generated on the fly based on the instruments available after running a filter mechanism: the Python code creates a separate table with the same schema but a different name for each instrument.
Now I am willing to separate the parts:
Data Ingestion (A Must)
Signal Generation
Order Execution
Reporting
I am mainly focusing on the first three. My concern is that if separate processes are running and acting on the same tables, will that generate locks or other contention? How do I handle it smoothly? Also, is MySQL good enough for this, or should I move to another DB like Postgres?
We are already using a DigitalOcean instance; MySQL is currently installed on the same instance.
If you intend to ingest/query time-series data at scale, a conventional RDBMS will fall short at one point or another. These databases are designed for a use case in which reads are more frequent than writes, and they optimise for that.
There is a whole family of databases designed specifically for working with Time-Series data. These time-series databases can ingest data at high throughput while running queries on top, and they usually give you lifecycle capabilities so you can decide what to do when data keeps growing.
There are many options available, both open source and proprietary. Out of those databases I would recommend trying QuestDB for a few reasons:
It is open source and Apache 2.0 licensed, so you can use it anywhere for anything
It is a single binary (or docker container) to operate
You query data using SQL (with extensions for time series)
You can insert data using SQL, but you will experience locks if using concurrent clients. However, you can also ingest data using the ILP protocol, which is designed for ingestion speed; there are official clients in 7 languages, so you don't have to deal with the low-level details (see the sketch after this list)
It is blazingly fast. I have seen over 2 million inserts per second on a single instance and some users report sustained workloads of over 100,000 events per second
It is well supported on Digital Ocean
There are a lot of public references (and many users who are not a reference) in the finance/trading/crypto industries
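As a sketch of ILP ingestion from Python, assuming the official questdb client (`pip install questdb`) and a QuestDB instance listening on the default ILP port; the table and column names are invented, and the exact constructor may differ between client versions, so check the client docs for your installed release.

```python
from questdb.ingress import Sender, TimestampNanos  # pip install questdb

# Assumes QuestDB is accepting ILP on localhost:9009 (the default);
# newer client versions may prefer a configuration-string constructor.
with Sender("localhost", 9009) as sender:
    sender.row(
        "trades",                                    # table is auto-created
        symbols={"instrument": "BTC-USDT", "side": "buy"},
        columns={"price": 64250.5, "size": 0.002},
        at=TimestampNanos.now(),
    )
    sender.flush()
```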

Replication via Kafka vs. mysql events

I have a need to maintain a copy of an external database (including some additional derived data). With the same set of hardware, which of the following solutions would give me faster consistency (lower lag) with high availability? Assume updates to the external database happen at 1,000 records per second.
a) Create local mysql replica of an external db using mysql 5.7 replication (binary log file mechanism).
OR
b) Consume real-time Kafka events from the external system, do an HTTP GET to fetch the updated object's details, and use those details to maintain a local MySQL replica.
The first will almost certainly give you lower lag (since there are just two systems and not three). Availability is about the same: Kafka is highly available, but you have two databases on both sides anyway.
The second is better if you think you'll want to send the data in real time to additional systems. That is:
MySQL1 -> Kafka -> (MySQL2 + Elastic Search + Cassandra + ...)
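A minimal sketch of option (b), assuming kafka-python, requests and mysql-connector-python; the topic name, external API URL and the local `objects(id, payload)` table are invented for the example.

```python
import json
import requests
import mysql.connector
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "object-updates",                          # hypothetical topic name
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="replica")
cur = db.cursor()

for event in consumer:
    object_id = event.value["id"]
    # Fetch the full, current object from the external system.
    detail = requests.get(
        f"https://external.example.com/objects/{object_id}"
    ).json()
    # Upsert into the local replica table.
    cur.execute(
        "INSERT INTO objects (id, payload) VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE payload = VALUES(payload)",
        (object_id, json.dumps(detail)),
    )
    db.commit()
```

Note that each event costs an extra HTTP round trip, which is one reason this path tends to lag behind native binlog replication.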
I hate to answer questions with 'just use this oddball thing instead', but I do worry you're gearing up heavier than you may need to -- or maybe you do need it, and I misread.
Consider a gossip-based tool like serf.io. It's almost finished, and it could give you exactly what you may need with something lighter than a Kafka cluster or MySQL pair.

Where can I find the clear definitions for a Couchbase Cluster, Couchbase Node and a Couchbase Bucket?

I am new to Couchbase and NoSQL terminologies. From my understanding a Couchbase node is a single system running a Couchbase Server application and a collection of such nodes having the same data by replication form a Couchbase Cluster.
Also, a Couchbase Bucket is somewhat like a table in an RDBMS wherein you put your documents. But how can I relate the Node with the Bucket? Can someone please explain it to me in simple terms?
A Node is a single machine (one IP/hostname) that runs Couchbase Server.
A Cluster is a group of Nodes that talk to each other. Data is distributed between the nodes automatically, so the load is balanced. The cluster can also provide replication of data for resilience.
A Bucket is the "logical" entity where your data is stored. It is both a namespace (like a database schema) and a table, to some extent. You can store multiple types of data in a single bucket; it doesn't care what form the data takes as long as it is a key and its associated value (so you can store users, apples and oranges in the same Bucket).
The bucket gives the level of granularity for things like configuration (how much of the available memory do you want to dedicate to this bucket?), replication factor (how many backup copies of each document do you want on other nodes?), password protection...
Note that I said Buckets were a "logical" entity? They are in fact divided into 1024 virtual fragments which are spread across all the nodes of the cluster (that's how data distribution is achieved).
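To tie nodes, cluster and bucket together in code, here is a hedged sketch using the Couchbase Python SDK (`pip install couchbase`); the hostnames, credentials and bucket name are placeholders, and the import paths may differ slightly between SDK 3.x and 4.x.

```python
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.auth import PasswordAuthenticator

# You connect to the cluster, not to a specific node: the SDK learns the
# node map and routes each key to the node that owns its virtual fragment.
cluster = Cluster(
    "couchbase://node1.example.com",                 # any node works as a seed
    ClusterOptions(PasswordAuthenticator("app_user", "secret")),
)
bucket = cluster.bucket("app-data")                  # the logical data container
collection = bucket.default_collection()

# Documents of different "types" can live in the same bucket.
collection.upsert("user::42", {"type": "user", "name": "Alice"})
collection.upsert("apple::1", {"type": "apple", "color": "red"})
print(collection.get("user::42").content_as[dict])
```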

Mysql cluster for dummies

So what's the idea behind a cluster?
You have multiple machines with the same copy of the DB across which you spread the reads/writes? Is this correct?
How does this idea work? When I make a SELECT query, does the cluster analyze which server has fewer reads/writes and point my query to that server?
When should you start using a cluster? I know this is a tricky question, but maybe someone can give me an example, like 1 million visits and a 100-million-row DB.
1) Correct, with a caveat: each data node does not hold a full copy of the cluster data, but every single bit of data is stored on at least two nodes.
2) Essentially correct. MySQL Cluster supports distributed transactions.
3) When vertical scaling is not possible anymore, and replication becomes impractical :)
As promised, some recommended readings:
Setting Up Multi-Master Circular Replication with MySQL (simple tutorial)
Circular Replication in MySQL (higher-level warnings about conflicts)
MySQL Cluster Multi-Computer How-To (step-by-step tutorial, it assumes multiple physical machines, but you can run your test with all processes running on the same machine by following these instructions)
The MySQL Performance Blog is a reference in this field
1 -> Your first point is correct in a way, but if multiple machines simply shared the same data, that would be replication rather than clustering.
In clustering, the data is divided among the various machines. This is horizontal partitioning: the division is based on rows, and the records are distributed among the machines using some algorithm.
The data is divided in such a way that each record gets a unique key, just as in a key-value pair, and each machine has a unique machine id that is used to decide which key-value pair goes to which machine (see the sketch after this answer).
Each cluster node consists of an individual MySQL server, its own data, and a cluster manager. There is also data sharing between the cluster nodes, so all of the data is available to every node at any time.
Retrieval of data can go through memcached servers for fast access, and there can also be a replication server for a particular cluster to back up the data.
2 -> Yes, that is possible because all of the data is shared among the cluster nodes. You can also use a load balancer to balance the load, though load balancers are quite common since most server setups use them. If you are only trying this for your own learning, there is no need: you will not see the kind of load that requires a load balancer, and the cluster manager itself can do the whole thing.
3 -> RandomSeed is right. You feel the need for a cluster when replication becomes impractical: if you use the master server for writes and slaves for reads, at some point the traffic becomes so heavy that the servers can no longer work smoothly, and then you will feel the need for clustering, simply to speed up the whole process.
This is not the only case, just one example scenario.
Hope this is helpful for you!
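Here is a toy sketch of the key-to-machine mapping described above, combined with the "stored on at least two nodes" point from the earlier answer; it is an illustration of the idea, not of MySQL Cluster's actual partitioning code, and the node names are invented.

```python
import hashlib

NODES = ["machine_a", "machine_b", "machine_c", "machine_d"]

def placement(record_key: str, replicas: int = 2):
    """Return the nodes that hold a record: a primary plus replica copies."""
    h = int(hashlib.sha1(record_key.encode()).hexdigest(), 16)
    primary = h % len(NODES)
    # Replica copies go to the next node(s) along, so every record
    # survives the loss of any single machine.
    return [NODES[(primary + i) % len(NODES)] for i in range(replicas)]

print(placement("order:1001"))  # e.g. ['machine_c', 'machine_d']
print(placement("order:1002"))  # a different primary, different replica
```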

Does the MySQL NDB Cluster consider node distance? Will it use the replicates if they are nearer?

I'm building a very small NDB cluster with only 3 machines. This means that machine 1 will serve as MGM server, MySQL server, and NDB data node at the same time. The database is only 7 GB, so I plan to replicate each node at least once. Now, since a query might end up using data that is cached in the NDB node on machine 1, even if that node isn't the primary source for that data, access would be much faster (for obvious reasons).
Does the NDB cluster work like that? Every example I see has at least 5 machines. The manual doesn't seem to mention how to handle node differences like this one.
There are a couple of questions here :
Availability / NoOfReplicas
MySQL Cluster can give high availability when data is replicated across 2 or more data node processes. This requires that the NoOfReplicas configuration parameter is set to 2 or greater. With NoOfReplicas=1, each row is stored in only one data node, and a data node failure would mean that some data is unavailable and therefore the database as a whole is unavailable.
Number of machines / hosts
For HA configurations with NoOfReplicas=2, there should be at least 3 separate hosts. One is needed for each of the data node processes, each of which has a copy of all of the data. A third is needed to act as an 'arbitrator' when communication between the 2 data node processes fails. This ensures that only one of the data nodes continues to accept write transactions, and avoids data divergence (split brain). With only two hosts, the cluster is only resilient to the failure of one particular host; if the other host fails instead, the whole cluster fails. The arbitration role is very lightweight, so this third machine can be used for almost any other task as well.
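For concreteness, here is a hedged sketch of a minimal config.ini for the 3-host layout described above (2 data nodes plus a lightweight management/arbitrator host), written out from Python; the hostnames are placeholders, and a real deployment also needs DataDir, memory sizing and other parameters from the MySQL Cluster manual.

```python
# Sketch of a minimal MySQL Cluster config.ini for the layout above.
# host1/host2/host3 are placeholders for your actual machines.
CONFIG = """\
[ndbd default]
# Each row is stored on 2 data nodes
NoOfReplicas=2

# Management node, also acts as arbitrator (lightweight, can share a host
# with other non-cluster workloads)
[ndb_mgmd]
HostName=host3

# The two data node processes, one per dedicated host
[ndbd]
HostName=host1

[ndbd]
HostName=host2

# SQL node (mysqld) colocated with a data node
[mysqld]
HostName=host1
"""

with open("config.ini", "w") as handle:
    handle.write(CONFIG)
```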
Data locality
In a 2 node configuration with NoOfReplicas=2, each data node process stores all of the data. However, this does not mean that only one data node process is used to read/write data. Both processes are involved with writes (as they must maintain copies), and generally, either process could be involved in a read.
Some work to improve read locality in a 2-node configuration is under consideration, but nothing is concrete.
This means that when MySQLD (or another NdbApi client) is colocated with one of the two data nodes, there will still be quite a lot of communication with the other data node.