A project I am currently working on has been using MySQL exclusively as our RDBMS. We are now looking to segment the database into two different databases: one will move to Redshift (which is based on a modified PostgreSQL), while the other will continue using MySQL.
My concern is not with splitting the data itself, but with how applications will interact with the segmented data. Effectively, our current application will be reading static data from Redshift and writing to the MySQL database, and I am curious whether it is bad practice to intermingle these query languages.
Would it be better to migrate the MySQL DB to Postgres to limit complications arising from their differences?
We (Looker) work with many customers (100s) that have both MySQL and Redshift. The progression as their needs grow is usually:
MySQL
MySQL + MySQL slave
MySQL + MySQL Writable Slave
MySQL + MySQL Writable Slave + Redshift
So your best bet, if you haven't done so already, is to set up a MySQL replica (slave) database. The replica follows your master write database and is essentially an exact copy of your master.
You can also make your replica writable. This becomes really useful for building summary tables. Here are some instructions on how to make a writable replica in RDS, but you can do it in other systems too:
http://www.looker.com/docs/setup-and-management/database-config/mysql-rds
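For example, a nightly summary-table build against the writable replica might look like the minimal sketch below (schema, table, and credential values are hypothetical, and the pymysql driver is assumed):

```python
import pymysql

# Connect to the *writable replica*, not the master.
conn = pymysql.connect(host="replica.example.com", user="etl",
                       password="secret", database="analytics",
                       autocommit=True)

with conn.cursor() as cur:
    # Rebuild a daily revenue summary from the replicated orders table.
    cur.execute("DROP TABLE IF EXISTS daily_revenue")
    cur.execute("""
        CREATE TABLE daily_revenue AS
        SELECT DATE(created_at) AS day,
               COUNT(*)         AS orders,
               SUM(total)       AS revenue
        FROM shop.orders
        GROUP BY DATE(created_at)
    """)

conn.close()
```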
If you have big event data that you want to integrate with your transactional data, the next step is to set up a process that migrates all your MySQL data into Redshift and pumps in data from other sources (your event data, for example). Moving all the data gives you the ability to ask any question of Redshift.
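That migration process is commonly shaped as "export the MySQL tables to files in S3, then COPY them into Redshift." A minimal, hedged sketch of the load side, assuming psycopg2 and placeholder cluster, bucket, role, and table names:

```python
import psycopg2

# Redshift speaks the PostgreSQL wire protocol, so psycopg2 works as a client.
conn = psycopg2.connect(host="mycluster.example.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="loader", password="secret")
conn.autocommit = True

with conn.cursor() as cur:
    # The files were exported from MySQL earlier (e.g. SELECT ... INTO OUTFILE)
    # and uploaded to S3 as gzipped CSVs.
    cur.execute("""
        COPY orders
        FROM 's3://my-etl-bucket/mysql/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV GZIP
        TIMEFORMAT 'auto'
    """)

conn.close()
```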
Redshift will lag hours or more behind the MySQL database. If you need to answer real-time questions, query MySQL. If you want general insights, query the Redshift database.
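Tying this back to the original question, mixing the two at the application layer is mostly a matter of keeping two connection handles and routing queries deliberately. A hedged sketch (pymysql/psycopg2 as drivers; hosts, credentials, and tables are placeholders):

```python
import pymysql
import psycopg2

# Transactional side: all writes and real-time reads go to MySQL.
mysql_conn = pymysql.connect(host="mysql.example.com", user="app",
                             password="secret", database="shop",
                             autocommit=True)

# Analytical side: Redshift is PostgreSQL-compatible on the wire.
redshift_conn = psycopg2.connect(host="mycluster.example.redshift.amazonaws.com",
                                 port=5439, dbname="analytics",
                                 user="app", password="secret")

def record_order(customer_id, total):
    with mysql_conn.cursor() as cur:
        cur.execute("INSERT INTO orders (customer_id, total) VALUES (%s, %s)",
                    (customer_id, total))

def revenue_by_month():
    with redshift_conn.cursor() as cur:
        cur.execute("""
            SELECT DATE_TRUNC('month', created_at) AS month, SUM(total)
            FROM orders
            GROUP BY 1
            ORDER BY 1
        """)
        return cur.fetchall()
```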
I have a system where there's a MySQL database to which changes are made. Then I have other machines that connect to this MySQL database every ten minutes or so and re-download the tables concerning them (for example, one machine might download tables A, B, C, while another machine might download tables A, D, E).
Without using Debezium or Kafka, is there a way to get all MySQL changes made after a certain timestamp, so that only those changes are sent to a machine requesting the updates, instead of the whole tables? ... For example, machine X might want all MySQL changes made since it last contacted the MySQL database, and then apply those changes to its own old data to update it.
Is there some way to do this?
MySQL can be set up to replicate databases, tables, etc. automatically. If the connection is lost, it will catch up when the connection is restored.
Take a look at this page, MySQL V5.5 Replication, or this one, MySQL V8.0 Replication.
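If you go that route, attaching a replica to the master boils down to two statements run on the replica. A hedged sketch executed through pymysql (host, credentials, and binlog coordinates are placeholders; on MySQL 8.0.23+ the equivalent CHANGE REPLICATION SOURCE TO / START REPLICA syntax is preferred):

```python
import pymysql

# Run against the machine that should receive the changes (the replica).
replica = pymysql.connect(host="127.0.0.1", user="root", password="secret")

with replica.cursor() as cur:
    # Point the replica at the master and at the binlog position to start from.
    cur.execute("""
        CHANGE MASTER TO
            MASTER_HOST = 'master.example.com',
            MASTER_USER = 'repl',
            MASTER_PASSWORD = 'repl-secret',
            MASTER_LOG_FILE = 'mysql-bin.000001',
            MASTER_LOG_POS = 4
    """)
    cur.execute("START SLAVE")

replica.close()
```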
You can use Debezium as a library embedded into your application if you don't want to, or cannot, deploy a Kafka cluster.
Alternatively, you could directly use the MySQL Binlog Connector (it's what the Debezium connector uses underneath, too); it lets you read the binlog from given offset positions. But then you'd have to deal with many things yourself that are already handled by solutions such as Debezium, e.g. the correct handling of schema metadata and changes (such as columns added to existing tables). Usually this involves parsing the MySQL DDL, which is by itself quite complex.
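The Binlog Connector referenced here is a Java library; if the consuming application happens to be Python, the python-mysql-replication package reads the same binlog stream. A minimal sketch resuming from a stored offset (connection settings and the offset values are placeholders):

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,                 # must be unique among replicas
    blocking=True,                 # wait for new events instead of exiting
    resume_stream=True,
    log_file="mysql-bin.000003",   # binlog offset saved from the previous run
    log_pos=4,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
)

for event in stream:
    for row in event.rows:
        # Each row change arrives with its schema and table name attached.
        print(event.schema, event.table, row)

stream.close()
```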
Disclaimer: I'm the lead of Debezium
I have one MySQL server and one PostgreSQL server. I need to replicate or re-insert a set of data from multiple MySQL tables to be streamed/synced into the PostgreSQL tables. This replication can be based on time (sync) or on an event such as a new insert in a table (stream).
I tried the replication tools below, but all of them can only sync table to table. They do not allow choosing columns from different tables of the source database (MySQL) and inserting them into different tables in the destination database (PostgreSQL):
SymmetricDS
DBConvert
pgloader
PostgreSQL FDW
Now I have to write an application that queries the data from MySQL and inserts it into PostgreSQL as a cron job. This is cumbersome and error prone for keeping the data in sync, and it cannot stream the data (event-based) for real-time replication.
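Roughly, such a cron job is a join on the MySQL side followed by upserts on the PostgreSQL side; a hedged sketch with pymysql and psycopg2 (all table, column, and credential names are placeholders):

```python
import pymysql
import psycopg2

# In practice the high-water mark would be persisted between runs;
# a literal timestamp stands in for it here.
last_sync = "2024-01-01 00:00:00"

def sync_once():
    src = pymysql.connect(host="mysql.example.com", user="sync",
                          password="secret", database="shop")
    dst = psycopg2.connect(host="pg.example.com", dbname="warehouse",
                           user="sync", password="secret")
    with src.cursor() as mcur, dst.cursor() as pcur:
        # Pick columns from two MySQL tables and load them into one
        # PostgreSQL table -- the reshaping the listed tools don't allow.
        mcur.execute("""
            SELECT o.id, o.total, c.email
            FROM orders o JOIN customers c ON c.id = o.customer_id
            WHERE o.updated_at > %s
        """, (last_sync,))
        for order_id, total, email in mcur.fetchall():
            pcur.execute("""
                INSERT INTO order_report (order_id, amount, customer_email)
                VALUES (%s, %s, %s)
                ON CONFLICT (order_id) DO UPDATE
                    SET amount = EXCLUDED.amount,
                        customer_email = EXCLUDED.customer_email
            """, (order_id, total, email))
    dst.commit()
    src.close()
    dst.close()

if __name__ == "__main__":
    sync_once()
```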
It would be great if some tool already solved this problem. Please let me know if there is an open-source library or tool that can do this for me. Thanks in advance.
To achieve replication with one of the tools you proposed, you can do the following:
Create a separate schema in PostgreSQL and add views so that they completely copy the table structure of MySQL. You will then add rules or triggers to the views to handle inserts/updates/deletes and redirect them to the tables of your choice.
This way you have complete freedom to transform your data during replication, yet can still use the common tools.
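A minimal sketch of that setup on the PostgreSQL side, run here through psycopg2 (schema, view, and target-table names are placeholders; an INSTEAD OF trigger is used, though a rule would work as well):

```python
import psycopg2

conn = psycopg2.connect(host="pg.example.com", dbname="warehouse",
                        user="admin", password="secret")
conn.autocommit = True

ddl = """
CREATE SCHEMA IF NOT EXISTS mysql_mirror;

-- A view with the same column layout as the MySQL table "orders".
CREATE VIEW mysql_mirror.orders AS
    SELECT NULL::bigint  AS id,
           NULL::bigint  AS customer_id,
           NULL::numeric AS total
    WHERE false;

-- Redirect whatever the replication tool inserts into the view
-- to the table (and columns) of our choice.
CREATE FUNCTION mysql_mirror.orders_redirect() RETURNS trigger AS $$
BEGIN
    INSERT INTO reporting.order_facts (order_id, customer_id, amount)
    VALUES (NEW.id, NEW.customer_id, NEW.total);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_insert
    INSTEAD OF INSERT ON mysql_mirror.orders
    FOR EACH ROW EXECUTE PROCEDURE mysql_mirror.orders_redirect();
"""

with conn.cursor() as cur:
    cur.execute(ddl)

conn.close()
```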
Maybe this tool can help you: https://github.com/the4thdoctor/pg_chameleon
pg_chameleon is a replication tool from MySQL to PostgreSQL developed in Python 2.7/3.5. The system relies on the mysql-replication library to pull the changes from MySQL and convert them into a jsonb object. A plpgsql function decodes the jsonb and replays the changes into the PostgreSQL database.
The tool can initialise the replica by pulling the data out of MySQL, but this requires FLUSH TABLES WITH READ LOCK; to work properly.
The tool can pull the data from a cascading replica when the MySQL slave is configured with log-slave-updates.
Currently I am facing an issue syncing data from MySQL to HBase: I need a near real-time data sync from MySQL to HBase, and I need to merge multiple MySQL tables into one HBase table during the sync.
I tried Sqoop, but it looks like it cannot fit our requirements.
So, are there any existing tools/libs that can be used for my case, or any other solutions I can try with Spark?
Consider using Apache Phoenix on HBase. It will give you low-latency SQL queries (so it is suitable for OLTP and easy to use for OLAP) on data stored in HBase, so you don't have to worry about syncing. It also has NoSQL features, such as the ability to dynamically add columns at query time.
To satisfy your use case, you could run Phoenix for OLTP, and a second instance of Phoenix on a read replica to run table joins for OLAP.
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hbase_read_replicas.html
Secondary replicas are refreshed at intervals controlled by a timer (hbase.regionserver.storefile.refresh.period), and so are guaranteed to be at most that interval of milliseconds behind the primary RegionServer.
This solution satisfies your requirements for OLTP, OLAP, and near real-time syncing while giving your transactional database scalability that you would not easily have with MySQL. Apache Phoenix also offers full integration with the Hadoop ecosystem so it will integrate well with your current analytics stack.
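If you would rather talk to Phoenix from Python than through the JDBC driver, the phoenixdb package provides a DB-API client for the Phoenix Query Server. A small, hedged sketch (the Query Server URL and the table are placeholders and assume the thin-client setup):

```python
import phoenixdb

# Phoenix Query Server endpoint (thin client); adjust host/port as needed.
conn = phoenixdb.connect("http://phoenix-queryserver.example.com:8765/",
                         autocommit=True)
cur = conn.cursor()

# Phoenix maps this SQL table onto an HBase table under the hood.
cur.execute("""
    CREATE TABLE IF NOT EXISTS order_events (
        order_id    BIGINT NOT NULL,
        customer_id BIGINT,
        total       DOUBLE,
        CONSTRAINT pk PRIMARY KEY (order_id)
    )
""")

# Phoenix uses UPSERT rather than INSERT.
cur.execute("UPSERT INTO order_events VALUES (1, 42, 19.99)")

cur.execute("SELECT customer_id, SUM(total) FROM order_events GROUP BY customer_id")
print(cur.fetchall())

conn.close()
```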
I have a task to implement a particular database structure:
Multiple MySQL servers with data using the same schema. Each server can see and edit only its particular part of the data.
And
One master server with its own data that can run queries using data from all the previously mentioned servers, but cannot edit their data.
An example would be multiple hospitals with their patients' data, and a master server that can use the combined data from all hospitals.
The previous system was written using MySQL Cluster, so naturally I tried that first. I can create a MySQL Cluster with multiple nodes and maybe even partition the data so that a particular set of data lives on a particular node, but as far as I know I can't connect to a single node using MySQL, because it is already connected to the cluster.
Can it be done with MySQL Cluster? Is there another framework that can do this easily?
You could try http://galeracluster.com/. With Galera you can perform updates on any node, and every server holds all the data, so it does not give you per-server data isolation, but it might still meet your requirements.
Two machines, each running MySQL, each synchronized to the other peer-to-peer. I do not want a master DB that is replicated. Rather, I want two users to be able to work on the data offline (each running a MySQL server on their own machine) and then, when reconnected, synchronize to each other. Is there any way to do this with MySQL? Any other database I should be looking at to accomplish this better than MySQL?
Two-way replication is provided by various database systems (e.g. SQL Server, Sybase, etc.), but there are always problems with such a setup.
For example, if the same row is updated at the same time on the two databases, which update wins?
If your aim is to provide a highly-available MySQL database, then there are better options than using replication. MySQL has a clustering solution (though I've not had much success with it) or you can use things like DRBD and heartbeat to provide automatic failover with no loss of data.
If you mean synchronous writing back and forth, this would cause serious data consistency issues. I think you may be referring to MySQL replication, wherein a master server sends its updates to one or more slave database servers, which can be queried.
As for "Other Database Options" SQLServer supports a fairly advanced "replication" process for synchronizing the data between two or more db's. Looks like MySql has something like this as well though.