I have a need to maintain a copy of an external database (including some additional derived data). With the same set of hardware, which one of the following solutions would give me faster consistency (low lag) with high availability? Assume updates to external database happen at 1000 records per second.
a) Create local mysql replica of an external db using mysql 5.7 replication (binary log file mechanism).
OR
b) Get real time Kafka events from external system, doing HTTP GET to fetch updated object details and use these details to maintain a local mysql replica.
The first will almost certainly give you lower lag, since there are just two systems involved instead of three. Availability is about the same: Kafka is highly available, but either way you have a database on each side.
The second is better if you think you'll want to send the data in real time to additional systems. That is:
MySQL1 -> Kafka -> (MySQL2 + Elastic Search + Cassandra + ...)
I hate to answer questions with 'just use this oddball thing instead', but I do worry you're gearing up heavier than you may need to. Or maybe you do need it, and I've misread.
Consider a gossipy tool like serf.io. It's almost finished, and could give you exactly what you may need with something lighter than a Kafka cluster or a MySQL pair.
Related
I have been trying to read data out of MySQL with Kafka Connect, using the MySQL source connector for the database and the Debezium connector for the binlogs. I am trying to understand which would be the better way to pull the change data. Binlogs have the overhead of writing to the logs, while reading from the database has the overhead of querying it. What other major advantages and disadvantages are associated with these two approaches? Which would be a better way of capturing change data? Also, starting from MySQL 8 the binlogs are enabled by default. Does this mean the log-based approach could be the better way of doing things?
This question can be summarized as follows:
What are the pros and cons of a log-based CDC (represented by Debezium Connector) versus a polling-based CDC (represented by JDBC Source Connector)?
Query-based CDC:
✓ Usually easier to set up
✓ Requires fewer permissions
✗ Impact of polling the DB
✗ Needs specific columns in source schema to track changes
✗ Can't track deletes
✗ Can't capture multiple changes to the same row between polls
Log-based CDC:
✓ All data changes are captured
✓ Low event latency without the CPU load of frequent polling
✓ No impact on data model
✓ Can capture deletes
✓ Can capture old record state and additional metadata
✗ More setup steps
✗ Higher system privileges required
✗ Can be expensive for some proprietary DB
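As a concrete illustration of the query-based drawbacks above, here is a minimal polling sketch. sqlite3 stands in for MySQL, and the table and column names are invented; note how the delete is invisible to the poll:

```python
# Query-based CDC sketch: poll for rows whose updated_at is newer than a
# high-water mark kept from the previous poll. sqlite3 stands in for MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 9.99, 100), (2, 5.00, 100)")

last_seen = 100  # high-water mark from the previous poll

# Two changes happen between polls: an update and a delete.
conn.execute("UPDATE orders SET total = 12.50, updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# The poll only sees the update; the delete leaves no row behind to select.
changes = conn.execute(
    "SELECT id, total FROM orders WHERE updated_at > ?", (last_seen,)
).fetchall()
print(changes)  # [(1, 12.5)] -- the delete of id=2 is invisible
```

This also shows the "needs specific columns" point: without the updated_at column there would be nothing to filter on.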
Reference:
Five Advantages of Log-Based Change Data Capture by Gunnar Morling
No More Silos: How to Integrate Your Databases with Apache Kafka and CDC by Robin Moffatt
StackOverflow: Kafka Connect JDBC vs Debezium CDC
The list given by @Iskuskov Alexander is great. I'd add a few more points:
Log-based CDC also requires writing to logs (you mentioned this in your question). This has a cost not only in performance but also in storage space.
Log-based CDC requires a continuous stream of logs. If the CDC misses a log, then the replica cannot be kept in sync, and the whole replica must be replaced by a new replica initialized by a new snapshot of the database.
If your CDC is offline periodically, you need to keep logs until it next runs, and how long that will be can be hard to predict. This leads to needing more storage space.
That said, query-based CDC has its own drawbacks. At my company we have used a query-based CDC, but we found it inconvenient, and we're working on replacing it with a Debezium log-based solution, for many of the reasons in the other answer and also these:
Query-based CDC makes it hard to keep schema changes in sync with the replica: if a schema change occurs in the source database, the replica may have to be trashed and replaced with a fresh snapshot.
The replica is frequently in a "rebuilding" state for hours when it needs to be reinitialized from a snapshot, and users don't like this downtime. Snapshot transfers also increase the network bandwidth requirements.
Neither solution is "better" than the other. Both have pros and cons. Your job as an engineer is to select the option that fits your project's requirements the best. In other words, choose the one whose disadvantages are least bad for your needs.
We can't make that choice for you, because you know your project better than we do.
Re your comments:
Enabling binary logs has no overhead for read queries, but significant overhead for write queries. The overhead became greater in MySQL 8.0, as measured by Percona CTO Vadim Tkachenko and reported here: https://www.percona.com/blog/2018/05/04/how-binary-logs-affect-mysql-8-0-performance/
He concludes the overhead of binary logs is about 13% for MySQL 5.7, and up to 30% for MySQL 8.0.
Can you also explain "The replica is frequently in a "rebuilding" state for hours, when it needs to be reinitialized from a snapshot"? Do you mean building a replication database?
Yes, if you need to build a new replica, you acquire a snapshot of the source database and import it to the replica. Every step of this takes time:
Create the snapshot of the source
Transfer the snapshot to the host where the replica lives
Import the snapshot into the replica instance
How long depends on the size of the database, but it can be hours or even days. While waiting for this, users can't use the replica database, at least not if they want their queries to analyze a complete copy of the source data. They have to wait for the import to finish.
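To get a feel for why a rebuild takes so long, a back-of-the-envelope sketch of the three steps; every number below is an illustrative assumption, not a measurement:

```python
# Rough replica-rebuild estimate for the three steps listed above.
# All rates and sizes here are made-up illustrative assumptions.
db_size_gb = 500
dump_rate_gb_per_hr = 100      # 1. creating the snapshot on the source
transfer_rate_gb_per_hr = 200  # 2. copying it to the replica host
import_rate_gb_per_hr = 50     # 3. importing is usually the slowest step

hours = (db_size_gb / dump_rate_gb_per_hr
         + db_size_gb / transfer_rate_gb_per_hr
         + db_size_gb / import_rate_gb_per_hr)
print(f"~{hours:.1f} hours of replica downtime")  # ~17.5 hours
```

Even with optimistic rates, a few hundred gigabytes translates into most of a day during which the replica is unusable.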
I have a system where there's a mysql database to which changes are made. Then I have other machines that connect to this mysql database every ten minutes or so and re-download the tables concerning them (for example, one machine might download tables A, B, C, while another machine might download tables A, D, E).
Without using Debezium or Kafka, is there a way to get all MySQL changes made after a certain timestamp, so that only those changes are sent to a machine requesting the updates, instead of the whole tables? ... For example, machine X might want all mysql changes made since it last contacted the mysql database, and then apply those changes to its own old data to update it.
Is there some way to do this ?
MySQL can be set up to replicate databases, tables, etc. automatically. If the connection is lost, it will catch up when the connection is restored.
Take a look at this page MySQL V5.5 Replication, or this one MySQL V8.0 Replication
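To give a flavor of what those pages describe, here is a minimal sketch of classic binlog replication setup. The hostname, user, password, and log coordinates are placeholders you'd substitute with your own (and note that MySQL 8.0.23+ renames the statement to CHANGE REPLICATION SOURCE TO):

```sql
-- On the master (my.cnf): enable the binary log and set a unique server ID
-- [mysqld]
-- log-bin=mysql-bin
-- server-id=1

-- On the replica, point it at the master:
CHANGE MASTER TO
  MASTER_HOST = 'master.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '<password>',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS = 4;
START SLAVE;
```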
You can use Debezium as a library embedded into your application, if you don't want to, or can't, deploy a Kafka cluster.
Alternatively, you could directly use the MySQL Binlog Connector (it's used by the Debezium connector underneath, too), it lets you read the binlog from given offset positions. But then you'd have to deal with many things yourself, which are handled by solutions such as Debezium already, e.g. the correct handling of schema metadata and changes (e.g. additions of columns to existing tables). Usually this involves parsing the MySQL DDL, which by itself is quite complex.
Disclaimer: I'm the lead of Debezium
Just some context: in our old data pipeline system, we are running MySQL 5.6 or Aurora on Amazon RDS. The bad thing about our old data pipeline is that it runs a lot of heavy computation on the database servers, because we are handcuffed by the original design: transactional databases are treated as a data warehouse, and our backend API directly "fishes" the databases heavily. We are currently patching this old data pipeline while redesigning the new data warehouse in Snowflake.
In the old system, the data pipeline calculation is a series of sequential MySQL queries. As our data grows bigger and bigger, the problem now is that the calculation might just hang forever at, for example, the step-3 MySQL query, while all the metrics we monitor in Amazon CloudWatch/Grafana (CPU, database connections, freeable memory, network throughput, swap usage, read latency, available storage, write latency, etc.) look normal. The MySQL slow query log is not really helpful here, because every query in the pipeline is essentially slow anyway (a query can take hours, since the pipeline runs so much heavy computation on the database servers). The way we usually solve these problems is to "blindly" upgrade the MySQL/Aurora RDS instance and hope it solves the issue. I am wondering:
(1) Which database metrics in MySQL 5.6 or Aurora on Amazon RDS should we monitor in real time to help us identify why a query freezes forever? Something like innodb_buffer_pool_size?
(2) Is there any existing tool or in-house approach to predict how many hardware resources we need before we can confidently execute a query and know it will succeed? Could someone share their two cents?
One thought: since Amazon RDS is sometimes a bit of a black box, one possible way is to host our own MySQL server on an Amazon EC2 instance in parallel with our Amazon MySQL 5.6/Aurora RDS production server, so we can ssh into the MySQL host and run command-line tools like mytop (https://www.tecmint.com/mysql-performance-monitoring/) to gather much more real-time MySQL metrics, which could help us triage the issue. Open to any two cents from gurus. Thank you!
None of the tools mentioned at that link needs to run on the database server itself, and their behavior is no different when they don't. Run them on any Linux server, giving the appropriate --host, --user, and --password arguments (in whatever form they expect). Even mysqladmin works remotely. Most of the MySQL command-line tools do (such as the mysql CLI, mysqldump, mysqlbinlog, and even mysqlcheck).
There is no magic coupling that administrative utilities gain by running on the same server as MySQL Server itself. This is a common misconception: even when running on the same machine, they still have to make a connection to the server, just like any other client. They may connect over the local Unix socket rather than TCP, but it's still an ordinary client connection, and it provides no extra capabilities.
It is also possible to run an external replica of an RDS/MySQL or Aurora/MySQL server on your own EC2 instance (or in your own data center, even). But this isn't likely to tell you a whole lot that you can't learn from the RDS metrics, particularly in light of the above. (Note also, that even replica servers acquire their replication streams using an ordinary client connection back to the master server.)
Avoid the temptation to tweak server parameters. On RDS, most of the defaults are quite sane, and unless you know specifically and precisely why you want to adjust a parameter... don't do it.
The most likely explanation for slow queries... is poorly written queries and/or poorly designed indexes.
If you are not familiar with EXPLAIN SELECT, then you need to learn it, live it, and love it. SQL is declarative, not procedural. That is, SQL tells the server what you want, not specifically how to obtain it internally. For example, SELECT ... FROM x JOIN y tells the server to match up the rows from tables x and y ON certain criteria, but does not tell the server whether to read from x and then find the matching rows in y, or read from y and find the matching rows in x. The net result is the same either way; it doesn't matter which table the server examines first internally. But if the query or the indexes don't allow the server to deduce the optimum path to the results you've requested, it can spend countless hours churning through unnecessary effort.
Take, as an extreme and oversimplified example, a table with millions of rows and a table with 1 row. It makes sense to read the small table first, so you know which single value you're trying to join against in the large table. It makes no sense to read through each of the millions of rows in the large table and then check the small table for a match on every one. The order in which you write the joins can differ from the order in which the joining is actually done.
And that's where EXPLAIN comes in. It lets you inspect the query plan: the strategy the internal query optimizer has concluded will get it to the answer you need with the least effort. This is the core of the magic of relational database systems: finding the correct result in the optimal time, based on what the server knows about the data. EXPLAIN shows you the order in which the tables are accessed, how they're joined, which indexes are used, and an estimate of the number of rows involved from each table; these numbers multiply together to give an estimate of the number of row combinations examined while resolving your query. Two small tables, each with 50,000 rows, joined without a proper index, mean an entirely unreasonable 2,500,000,000 unique combinations that must be evaluated: every row compared to every other row. In short, if that turns out to be the kind of thing you are (unknowingly) asking the server to do, then you are definitely doing something wrong. Inspecting your query plan should be second nature any time you write a complex query, to ensure that the server is using a sensible strategy to resolve it.
The output is cryptic, but secret decoder rings are available.
https://dev.mysql.com/doc/refman/5.7/en/explain.html#explain-execution-plan
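The underlying question EXPLAIN answers ("full scan or index lookup?") can be demonstrated with SQLite's EXPLAIN QUERY PLAN as a runnable stand-in (chosen only because it needs no server; MySQL's EXPLAIN output format is different):

```python
# Same query, before and after adding an index: the plan flips from a
# full table scan to an index search. Table and index names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER, val TEXT)")

before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM big WHERE id = 5").fetchone()[3]
conn.execute("CREATE INDEX idx_big_id ON big (id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM big WHERE id = 5").fetchone()[3]

print(before)  # a full table scan, e.g. "SCAN big"
print(after)   # an index lookup, e.g. "SEARCH big USING INDEX idx_big_id (id=?)"
```

Reading the plan before and after adding an index is exactly the habit the answer above recommends.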
So what's the idea behind a cluster?
You have multiple machines with the same copy of the DB, and the reads/writes are spread across them? Is this correct?
How does this idea work? When I issue a SELECT query, does the cluster analyze which server has fewer reads/writes in flight and route my query to that server?
When should you start using a cluster? I know this is a tricky question, but maybe someone can give me an example, like 1 million visits and a 100-million-row DB.
1) Correct, although no single data node holds a full copy of the cluster data; every single bit of data is stored on at least two nodes.
2) Essentially correct. MySQL Cluster supports distributed transactions.
3) When vertical scaling is not possible anymore, and replication becomes impractical :)
As promised, some recommended readings:
Setting Up Multi-Master Circular Replication with MySQL (simple tutorial)
Circular Replication in MySQL (higher-level warnings about conflicts)
MySQL Cluster Multi-Computer How-To (step-by-step tutorial, it assumes multiple physical machines, but you can run your test with all processes running on the same machine by following these instructions)
The MySQL Performance Blog is a reference in this field
1) Your first point is correct in a way, but if multiple machines all held the same copy of the data, that would be replication rather than clustering. In clustering, the data is divided among the machines by horizontal partitioning: the rows are distributed across the machines by some algorithm. Each record gets a unique key, as in a key-value pair, and each machine has a unique machine ID that determines which key-value pairs go to which machine. Each cluster node consists of its own MySQL server, its own slice of the data, and a cluster manager, and data is shared between the cluster nodes so that all of it is available to every node at any time. Retrieval can go through memcached servers for fast access, and a cluster can also have a replication server to back up its data.
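The key-to-machine mapping described above can be sketched in a few lines. This is a toy illustration with made-up names; real cluster software uses more sophisticated schemes (e.g. consistent hashing, so that adding a machine doesn't remap every key):

```python
# Toy horizontal partitioning: hash each record's key to pick a machine_id.
import hashlib

NUM_MACHINES = 4

def machine_for(key: str) -> int:
    """Deterministically map a record key to one of the machines."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_MACHINES

rows = ["order:1001", "order:1002", "user:17", "user:42"]
placement = {k: machine_for(k) for k in rows}
print(placement)  # each key lands deterministically on one of the 4 machines
```

The point is only that the mapping is a pure function of the key, so any node (or a router in front) can compute where a record lives.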
2) Yes, that is possible, because all of the data is shared among the cluster nodes. You can also use a load balancer to spread the load. Load balancers are quite common, since most server setups use them, but if you are just experimenting for your own learning there is no need: you won't generate the kind of load that requires one, and the cluster manager can handle the whole thing by itself.
3) RandomSeed is right: you feel the need for a cluster when replication becomes impractical. If you are using the master server for writes and the slaves for reads, then at some point the traffic becomes so heavy that the servers can no longer work smoothly, and you will feel the need for clustering, simply to speed up the whole process. That's not the only case, just one scenario.
Hope this is helpful!
We are running a Java PoS (Point of Sale) application at various shops, with a MySql backend. I want to keep the databases in the shops synchronised with a database on a host server.
When some changes happen in a shop, they should get updated on the host server. How do I achieve this?
Replication is not very hard to set up.
Here are some good tutorials:
http://www.ghacks.net/2009/04/09/set-up-mysql-database-replication/
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
http://www.lassosoft.com/Beginners-Guide-to-MySQL-Replication
Here are some simple rules you will have to keep in mind (there's more, of course, but that is the main concept):
Setup 1 server (master) for writing data.
Setup 1 or more servers (slaves) for reading data.
This way, you will avoid errors.
For example:
If your scripts insert into the same tables on both master and slave, you will get duplicate primary key conflicts.
You can view the slave as a "backup" server which holds the same information as the master but cannot accept writes directly; it only follows the master server's instructions.
NOTE: Of course you can read from the master, and you can write to the slave, but make sure you don't write to the same tables from both sides (master to slave and slave to master).
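The duplicate-primary-key conflict described above can be simulated locally (sqlite3 stands in for MySQL, and the scenario is hypothetical): two sides independently write the same key, and the merge fails.

```python
# Simulating what happens when master and slave both insert the same
# primary key: the second write is rejected with an integrity error.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, src TEXT)")
db.execute("INSERT INTO t VALUES (1, 'written on master')")

try:
    # the row the slave wrote independently arrives with the same key
    db.execute("INSERT INTO t VALUES (1, 'written on slave')")
    conflict = False
except sqlite3.IntegrityError:
    conflict = True
print(conflict)  # True
```

This is exactly why the rules above route all writes through a single master.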
I would recommend to monitor your servers to make sure everything is fine.
Let me know if you need additional help
Three different approaches:
Classic client/server approach: don't put any database in the shops; simply have the applications access your server. It's better if you set up a VPN, but simply wrapping the connection in SSL or ssh is reasonable. Pro: it's the way databases were originally conceived. Con: if you have high latency, complex operations can get slow; you might have to use stored procedures to reduce the number of round trips.
Replicated master/master: as @Book Of Zeus suggested. Cons: somewhat more complex to set up (especially if you have several shops); breaking into any shop machine could potentially compromise the whole system. Pros: better responsiveness, as read operations are fully local and write operations are propagated asynchronously.
Offline operations + sync step: do all the work locally and, from time to time (once an hour, daily, weekly, whatever), write a summary of all new/modified records since the last sync operation and send it to the server. Pros: works without a network, fast, easy to check (if the summary is readable). Cons: you don't have real-time information.
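The sync-step idea above can be sketched as follows (sqlite3 stands in for the shop's local MySQL; table names, column names, and timestamps are all invented):

```python
# Offline + sync sketch: each shop keeps a last_synced marker and
# periodically ships only the rows changed since then.
import sqlite3, json

shop = sqlite3.connect(":memory:")
shop.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, modified_at INTEGER)")
shop.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 3.50, 10), (2, 7.25, 20), (3, 1.00, 30)])

last_synced = 15  # timestamp of the previous successful sync

summary = [dict(id=i, amount=a, modified_at=t)
           for (i, a, t) in shop.execute(
               "SELECT id, amount, modified_at FROM sales"
               " WHERE modified_at > ? ORDER BY id", (last_synced,))]
payload = json.dumps(summary)  # this is what gets sent to the host server
print(payload)
```

Only rows 2 and 3 are shipped, not the whole table, which is the entire point of the summary.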
SymmetricDS is the answer. It supports multiple subscribers with one-directional or bi-directional asynchronous data replication. It uses web and database technologies to replicate tables between relational databases, in near real time if desired.
It has a comprehensive and robust Java API to suit your needs.
Have a look at the Schema and Data Comparison tools in dbForge Studio for MySQL. These tools will help you compare two databases, see the differences, generate a synchronization script, and synchronize them.