I want to create a connector (something like Debezium in Kafka Connect) to reflect every change in a MySQL source database in BigQuery tables.
There is one problem: the source database is dropped and re-created every 10 minutes. Some of the rows are the same, some are updated, and some are totally new. So I cannot do it via Debezium, because every 10 minutes I would get all the records in Kafka again.
I want to migrate only new or updated values to the BQ tables. The mechanism would "copy" the whole source database but deduplicate records (which will not be exactly the same, because this will be a new database). So, for example, create a hash from every record and check it: if the hash is already in BQ, skip the record; if it is not, add it.
I think it should be this:
Best effort de-duplication
but how do I create the whole pipeline with MySQL as a source?
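To make the idea concrete, here is a minimal sketch of the hash check I have in mind (the dataset/table name and the record_hash column are placeholders, and scanning all existing hashes is only viable for small tables):

    import hashlib
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    TABLE = "my_project.my_dataset.my_table"  # placeholder target table

    def row_hash(row: dict) -> str:
        """Stable hash of a record, independent of column order."""
        canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def load_new_rows(rows: list) -> None:
        # Hashes already stored in BigQuery (for big tables this comparison
        # would be pushed into a MERGE on a staging table instead).
        existing = {
            r["record_hash"]
            for r in client.query(f"SELECT record_hash FROM `{TABLE}`").result()
        }
        fresh = []
        for row in rows:
            h = row_hash(row)
            if h not in existing:
                fresh.append({**row, "record_hash": h})
        if fresh:
            client.insert_rows_json(TABLE, fresh)  # streaming insert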
Cloud Data Fusion Replication lets you replicate your data continuously and in real time from operational data stores, such as SQL Server and MySQL, into BigQuery.
To use Replication, you create a new instance of Cloud Data Fusion and add the Replication app.
In short, you do the following:
Set up your MySQL database to enable replication.
Create and run a Cloud Data Fusion Replication pipeline.
View the results in BigQuery.
You can see more at Replicating data from MySQL to BigQuery.
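As an aside (this is not part of the Data Fusion docs), a quick sanity check that the MySQL source is ready for the first step could look like this with the pymysql driver; connection details are placeholders:

    import pymysql  # pip install pymysql

    # Connection details are placeholders.
    conn = pymysql.connect(host="mysql-host", user="repl_user", password="***")
    with conn.cursor() as cur:
        # Replication pipelines generally need row-based binary logging.
        cur.execute(
            "SHOW VARIABLES WHERE Variable_name IN "
            "('log_bin', 'binlog_format', 'server_id')"
        )
        print(dict(cur.fetchall()))
        # Expect something like:
        # {'log_bin': 'ON', 'binlog_format': 'ROW', 'server_id': '1'}
    conn.close()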
Related
I have a server with a database and a table. Upon insert, I want to be able to copy the data to another table in another database on another server. I have read intensively on the subject; almost all the articles suggest I use tables with federated engines together with triggers, and some others suggest I use the replication feature of MySQL databases. Unfortunately, the two servers/databases I am working with have some limitations:
There is no replication feature on either of them; it has been disabled and I can't get them to enable it.
Neither database can create tables with federated engines.
The client does not want a server script file, such as a PHP file, constantly checking for new data and sending it over to populate the remote database.
However, each server has whitelisted the other's IP, and communication through a PHP file is possible between the two servers on either side. I am able to create triggers on both tables as well. I want to write to the remote database whenever there is new data in the local database. There are no foreign key constraints on any of the tables. There are no read-latency requirements from when a row is updated to when it appears on the other server. The updated table does not have an updated timestamp/datetime column. PHP version: 7.1.14. MariaDB version: 5.5.62-0ubuntu0.14.04.1.
Question:
Is there any way to copy data before or after an insert using triggers when the tables are not federated-engine tables (they use other engines, e.g. InnoDB)? In other words, can triggers open a connection to another server?
Is there any other way to copy data to another server without using a federated engine table, replication, or a script that checks for new data every minute or at regular intervals?
I am using an Oracle database but am open to using other databases, so I am tagging all of them.
I am designing a system in which I have to inject all the data of an existing database table into a new database, and whatever changes happen in the existing database should be reflected in the new database on a daily basis. My approach is:
I will copy all the data of the existing database to the new database.
Then I will create a trigger which will record all the changes in the table and store them in another table (all the DML operations).
Once a day my API will read the data generated by the trigger and copy it into the new system. I don't need live data, so I will schedule the job only once a day to copy data into the new database.
Is this the proper approach? Any suggestions?
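To make the daily copy step concrete, here is a rough sketch of the job I have in mind; it assumes a hypothetical change_log table populated by the trigger, and the driver (python-oracledb), connection details, and column names are just placeholders:

    import oracledb  # pip install oracledb; any DB-API driver looks similar

    # Connection details, table and column names are placeholders.
    src = oracledb.connect(user="app", password="***", dsn="old-db-host/ORCL")
    dst = oracledb.connect(user="app", password="***", dsn="new-db-host/ORCL")

    def apply_daily_changes():
        """Replay the rows captured by the trigger onto the new database."""
        with src.cursor() as read_cur, dst.cursor() as write_cur:
            read_cur.execute(
                "SELECT change_id, operation, row_id, col_a, col_b "
                "FROM change_log WHERE applied = 'N' ORDER BY change_id"
            )
            applied = []
            for change_id, op, row_id, col_a, col_b in read_cur.fetchall():
                if op in ("INSERT", "UPDATE"):
                    # MERGE keeps the job idempotent if it is ever re-run.
                    write_cur.execute(
                        """MERGE INTO target_table t
                           USING (SELECT :1 AS row_id, :2 AS col_a, :3 AS col_b
                                  FROM dual) s
                           ON (t.row_id = s.row_id)
                           WHEN MATCHED THEN UPDATE
                               SET t.col_a = s.col_a, t.col_b = s.col_b
                           WHEN NOT MATCHED THEN
                               INSERT (row_id, col_a, col_b)
                               VALUES (s.row_id, s.col_a, s.col_b)""",
                        [row_id, col_a, col_b],
                    )
                elif op == "DELETE":
                    write_cur.execute(
                        "DELETE FROM target_table WHERE row_id = :1", [row_id]
                    )
                applied.append(change_id)
            dst.commit()
            # Mark the copied changes so the next daily run skips them.
            if applied:
                read_cur.executemany(
                    "UPDATE change_log SET applied = 'Y' WHERE change_id = :1",
                    [[cid] for cid in applied],
                )
                src.commit()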
Common practice would be to back up your primary instance and restore it on the secondary once a day.
You could schedule the backup and restore in sequence as daily jobs.
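A hedged sketch of what those two jobs could look like, using mysqldump purely as an example (the Oracle equivalent would be Data Pump export/import); hosts, credentials, and the database name are placeholders:

    import subprocess
    from datetime import date

    # Hosts, credentials and the database name are placeholders.
    dump_file = f"/backups/primary_{date.today()}.sql"

    # 1. Back up the primary instance.
    with open(dump_file, "w") as out:
        subprocess.run(
            ["mysqldump", "-h", "primary-host", "-u", "backup_user",
             "-psecret", "appdb"],
            stdout=out,
            check=True,
        )

    # 2. Restore the dump onto the secondary instance.
    with open(dump_file) as dump:
        subprocess.run(
            ["mysql", "-h", "secondary-host", "-u", "restore_user",
             "-psecret", "appdb"],
            stdin=dump,
            check=True,
        )

Scheduling both with cron (or whatever scheduler you already use) gives you the once-a-day cadence.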
If your copy database is SQL Server, then I suggest you use a Linked Server. From the documentation:
Linked servers enable you to implement distributed databases that can fetch and update data in other databases. They are a good solution in the scenarios where you need to implement database sharding without need to create a custom application code or directly load from remote data sources. Linked servers offer the following advantages:
The ability to access data from outside of SQL Server.
The ability to issue distributed queries, updates, commands, and transactions on heterogeneous data sources across the enterprise.
The ability to address diverse data sources similarly.
You can find more information in the documentation: https://learn.microsoft.com/en-us/sql/relational-databases/linked-servers/linked-servers-database-engine?view=sql-server-ver15
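Purely as an illustration (server, database, and table names are made up), registering a linked server and issuing a distributed query could be driven from Python with pyodbc; the same T-SQL works from SSMS:

    import pyodbc  # pip install pyodbc

    # Connection string, server names and table names are placeholders.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=primary-host;"
        "DATABASE=AppDb;UID=app_user;PWD=secret",
        autocommit=True,
    )
    cur = conn.cursor()

    # Register the remote instance as a linked server (login mapping via
    # sp_addlinkedsrvlogin is omitted for brevity).
    cur.execute("""
        EXEC sp_addlinkedserver
            @server = N'REPORTING',
            @srvproduct = N'',
            @provider = N'SQLNCLI',
            @datasrc = N'secondary-host';
    """)

    # Distributed query using four-part naming: server.database.schema.table.
    cur.execute("""
        INSERT INTO REPORTING.CopyDb.dbo.Orders (OrderId, Amount)
        SELECT o.OrderId, o.Amount
        FROM dbo.Orders AS o
        WHERE NOT EXISTS (
            SELECT 1 FROM REPORTING.CopyDb.dbo.Orders r
            WHERE r.OrderId = o.OrderId
        );
    """)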
I am looking to move some MySQL databases to the cloud in Amazon Redshift. Currently I am creating a Python script to convert the tables to CSVs, encrypt them, put them in S3, then COPY the data into Redshift. However, the way it is set up I would have to copy the data one table at a time. I have read that you can split your data into multiple files and upload them in parallel, however I believe this is still only for loading data into one table. Is there a way to use COPY on multiple tables at once? Having to copy data over from each table individually seems very inefficient.
All of your statements are correct.
The COPY command can load from multiple files in parallel (in fact, that is recommended because it can then spread the load job across multiple nodes), but it only loads one table per COPY command.
You could connect to Redshift via multiple sessions and run a COPY command in each session to load multiple tables simultaneously (but be careful of the impact on production users).
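For example, a rough sketch of the multi-session approach using psycopg2, with one thread (and therefore one Redshift session) per table; the DSN, table list, S3 prefix, and IAM role are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    import psycopg2  # pip install psycopg2-binary

    # DSN, table list, S3 prefix and IAM role are placeholders.
    DSN = ("host=my-redshift-endpoint port=5439 dbname=dw "
           "user=loader password=secret")
    TABLES = ["orders", "customers", "products"]

    def copy_table(table: str) -> None:
        """One connection (session) per table so the COPYs run in parallel."""
        with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
            cur.execute(f"""
                COPY {table}
                FROM 's3://my-bucket/exports/{table}/'
                IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoader'
                FORMAT AS CSV
                GZIP;
            """)

    # Keep the pool small so production queries are not starved.
    with ThreadPoolExecutor(max_workers=3) as pool:
        list(pool.map(copy_table, TABLES))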
If you are wishing to migrate data from an on-premises database to Amazon Redshift, consider using:
AWS Schema Conversion Tool
AWS Database Migration Service
The Database Migration Service can even perform ongoing updates to Redshift whenever data is updated in the source database.
I am working for a client who uses multiple RDS (MySQL) instances on AWS and wants me to consolidate data from there and other sources into a single instance and do reporting off that.
What would be the most efficient way to transfer selective data from other AWS RDS MySQL instances to mine?
I don't want to migrate the entire DB, rather just a few columns and rows based on which have relevant data and what was last created/updated.
One option would be to use a PHP script that'd read from one DB and insert it into another, but it'd be very inefficient. Unlike SQL Server or ORACLE, MySQL also does not have the ability to write queries across servers, else I'd have just used that in a stored procedure.
I'd appreciate any inputs regarding this.
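For reference, the kind of script I am trying to avoid would look roughly like this, pulling only the selected columns and only rows changed since the last run; hosts, credentials, table, and column names are placeholders:

    import pymysql  # pip install pymysql

    # Hosts, credentials, table and column names are placeholders.
    source = pymysql.connect(host="rds-source-1", user="ro_user",
                             password="***", database="app")
    target = pymysql.connect(host="rds-reporting", user="rw_user",
                             password="***", database="reporting")

    def sync_since(last_sync: str) -> None:
        """Copy only the selected columns for rows changed after last_sync."""
        with source.cursor() as src_cur, target.cursor() as dst_cur:
            src_cur.execute(
                "SELECT id, customer_id, amount, updated_at "
                "FROM orders WHERE updated_at > %s",
                (last_sync,),
            )
            rows = src_cur.fetchall()
            if rows:
                # Upsert so overlapping windows are harmless on re-runs.
                dst_cur.executemany(
                    "INSERT INTO orders (id, customer_id, amount, updated_at) "
                    "VALUES (%s, %s, %s, %s) "
                    "ON DUPLICATE KEY UPDATE customer_id=VALUES(customer_id), "
                    "amount=VALUES(amount), updated_at=VALUES(updated_at)",
                    rows,
                )
        target.commit()

    sync_since("2020-01-01 00:00:00")  # persist the high-water mark in practice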
If your overall objective is reporting and analytics, the standard practice is to move your transactional data from RDS to Redshift which will become your data warehouse. This blog article by AWS provides an approach to do it.
For the consolidation operation, you can use AWS Database Migration Service (DMS), which will allow you to migrate data column-wise with the following options (a minimal API sketch is shown further below):
Migrate existing data
Migrate existing data & replicate ongoing changes
Replicate data changes only
For more details read this whitepaper.
Note: If you need to process the data while moving, use AWS Data Pipeline.
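As a rough illustration of those three options, the DMS MigrationType parameter maps to them directly; the task identifier, ARNs, and table mapping below are placeholders:

    import boto3  # pip install boto3

    dms = boto3.client("dms")

    # MigrationType maps to the three options above:
    #   "full-load"         -> migrate existing data
    #   "full-load-and-cdc" -> migrate existing data & replicate ongoing changes
    #   "cdc"               -> replicate data changes only
    dms.create_replication_task(
        ReplicationTaskIdentifier="consolidate-orders",
        SourceEndpointArn="arn:aws:dms:region:account:endpoint:SOURCE",
        TargetEndpointArn="arn:aws:dms:region:account:endpoint:TARGET",
        ReplicationInstanceArn="arn:aws:dms:region:account:rep:INSTANCE",
        MigrationType="full-load-and-cdc",
        # Table mappings are where the selective (table/column) rules go.
        TableMappings="""{
          "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "orders-only",
            "object-locator": {"schema-name": "app", "table-name": "orders"},
            "rule-action": "include"
          }]
        }""",
    )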
Did you take a look at the RDS migration tool?
Currently I have a script which first deletes the table and then re-uploads it from MySQL to BigQuery. It has failed many times, and it runs only once a day. I am looking for a scalable, real-time solution. Your help will be much appreciated :)
Read this series of posts from WePay, where they detail how they sync their MySQL databases to BigQuery using Airflow:
https://wecode.wepay.com/posts/wepays-data-warehouse-bigquery-airflow
https://wecode.wepay.com/posts/airflow-wepay
(3rd one is about BigQuery)
As a summary (quoting):
Setup authentication, connections, DAG.
Define which columns to pull from MySQL and load into BigQuery.
Choose how to load the data: incrementally, or fully.
De-duplicating.
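A heavily simplified DAG along those lines could look like the sketch below. This is not WePay's actual code; the operator import paths vary by Airflow version, and the connection IDs, bucket, table, and dataset names are all placeholders:

    from datetime import datetime

    from airflow import DAG
    # Import paths assume a recent apache-airflow-providers-google package;
    # older Airflow versions keep these operators under airflow.contrib.
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
        GCSToBigQueryOperator,
    )
    from airflow.providers.google.cloud.transfers.mysql_to_gcs import (
        MySQLToGCSOperator,
    )

    with DAG(
        dag_id="mysql_to_bq_incremental",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",  # much more frequent than once a day
        catchup=False,
    ) as dag:
        # Pull only the chosen columns, and only rows changed since the run date.
        extract = MySQLToGCSOperator(
            task_id="mysql_to_gcs",
            mysql_conn_id="mysql_default",          # placeholder connection
            sql="""
                SELECT id, status, amount, updated_at
                FROM orders
                WHERE updated_at >= '{{ ds }}'
            """,
            bucket="my-staging-bucket",             # placeholder bucket
            filename="orders/{{ ds }}/part-{}.json",
            export_format="NEWLINE_DELIMITED_JSON",
        )

        # Incremental load: append into a staging table; de-duplication can
        # then be a scheduled MERGE / SELECT DISTINCT into the final table.
        load = GCSToBigQueryOperator(
            task_id="gcs_to_bq",
            bucket="my-staging-bucket",
            source_objects=["orders/{{ ds }}/part-*.json"],
            destination_project_dataset_table="my_project.staging.orders",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_APPEND",
        )

        extract >> load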