Currently I have a script that first deletes the table and then uploads it from MySQL to BigQuery. It has failed many times, and it runs only once a day. I am looking for a scalable and real-time solution. Your help will be much appreciated :)
Read this series of posts from WePay, where they detail how they sync their MySQL databases to BigQuery using Airflow:
https://wecode.wepay.com/posts/wepays-data-warehouse-bigquery-airflow
https://wecode.wepay.com/posts/airflow-wepay
(The third post in the series is about BigQuery.)
As a summary (quoting; a rough DAG sketch follows the list):
Setup authentication, connections, DAG.
Define which columns to pull from MySQL and load into BigQuery.
Choose how to load the data: incrementally, or fully.
De-duplicating.
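A rough sketch of such a DAG, using the Google provider's MySQL-to-GCS and GCS-to-BigQuery transfer operators on a recent Airflow 2.x install; the connection IDs, bucket, query and table names below are placeholders rather than anything taken from the WePay posts:

# Sketch of a daily MySQL -> GCS -> BigQuery load; assumes the
# apache-airflow-providers-google and apache-airflow-providers-mysql packages are installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="mysql_to_bigquery",      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Export the chosen columns from MySQL into newline-delimited JSON on GCS.
    extract = MySQLToGCSOperator(
        task_id="mysql_to_gcs",
        mysql_conn_id="mysql_default",                  # connection set up in Airflow
        sql="SELECT id, name, updated_at FROM orders",  # placeholder query
        bucket="my-staging-bucket",                     # placeholder bucket
        filename="orders/{{ ds }}/part-{}.json",
        export_format="json",
    )

    # Load the exported files into BigQuery.
    load = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-staging-bucket",
        source_objects=["orders/{{ ds }}/*.json"],
        destination_project_dataset_table="my_project.my_dataset.orders",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",  # full reload; WRITE_APPEND for incremental
        gcp_conn_id="google_cloud_default",
    )

    extract >> load

Switching write_disposition to WRITE_APPEND and adding a WHERE clause on a modified-date column is the usual route to the incremental load mentioned in the summary, with de-duplication handled afterwards in BigQuery.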
I want to create a connector (something like Debezium in Kafka Connect) that reflects every change in the MySQL source database in BigQuery tables.
There is one problem: the source database is dropped and re-created every 10 minutes. Some of the rows are the same, some are updated, and some are totally new. So I cannot do it with Debezium, because every 10 minutes I would get all the records in Kafka again.
I want to migrate only new or updated values into the BigQuery tables. The mechanism would "copy" the whole source database but de-duplicate the records (which will not be exactly the same, because each time it is a new database). So, for example, create a hash of every record and check it: if the hash is already in BigQuery, skip the record; if it is not, insert it.
I think the relevant feature is this:
Best effort de-duplication
But how do I build the whole pipeline with MySQL as the source?
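One way to express that hash idea is to load each 10-minute snapshot into a staging table and then run a BigQuery MERGE keyed on a row hash, so only unseen rows land in the target table. A minimal sketch with the google-cloud-bigquery client; the project, dataset and table names are invented, and the target table is assumed to have the same columns as the staging table plus a row_hash column:

# Sketch: append-only de-duplication in BigQuery using a per-row hash.
# Assumes google-cloud-bigquery is installed and the latest snapshot has just
# been loaded into the (hypothetical) staging table my_dataset.orders_staging.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.my_dataset.orders` AS target
USING (
  SELECT s.*, TO_HEX(MD5(TO_JSON_STRING(s))) AS row_hash
  FROM `my_project.my_dataset.orders_staging` AS s
) AS source
ON target.row_hash = source.row_hash
WHEN NOT MATCHED THEN
  INSERT ROW
"""

client.query(merge_sql).result()  # blocks until the MERGE has finished

FARM_FINGERPRINT over TO_JSON_STRING would work just as well as MD5; the point is simply that rows whose hash already exists in the target are skipped.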
Cloud Data Fusion Replication lets you replicate your data continuously and in real time from operational data stores, such as SQL Server and MySQL, into BigQuery.
To use Replication, you create a new instance of Cloud Data Fusion and add the Replication app.
In short, you do the following:
Set up your MySQL database to enable replication.
Create and run a Cloud Data Fusion Replication pipeline.
View the results in BigQuery.
You can see more at Replicating data from MySQL to BigQuery. A quick check of the MySQL-side prerequisites is sketched below.
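Before creating the Replication pipeline, it is worth confirming that the source MySQL instance has binary logging switched on and set to ROW format, since log-based replication reads the binlog. A small check with mysql-connector-python; host and credentials are placeholders:

# Sketch: verify the MySQL settings that log-based replication needs.
# Assumes mysql-connector-python is installed; connection details are placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql.example.com",
    user="replication_check",
    password="secret",
)
cur = conn.cursor()

# Binary logging must be ON and binlog_format must be ROW for change data capture.
cur.execute("SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')")
for name, value in cur.fetchall():
    print(f"{name} = {value}")

cur.close()
conn.close()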
I have an application that runs on a MySQL database, and the application is somewhat resource-intensive on the DB.
My client wants to connect QlikView to this DB for reporting. I was wondering if someone could point me to a white paper or URL on the best way to do this without causing locks etc. on my DB.
I have searched Google to no avail.
QlikView is an in-memory tool that works on preloaded data, so your client only needs to hit the database during periodic reloads, not all the time.
The best approach is for your client to schedule the reload once per night and make it incremental. If your tables only receive new records, each night load only the records whose primary key is greater than the last one already loaded.
If your tables have records that get modified, you need to add a last_modified_time field in MySQL, and ideally also an index on that field:
last_modified_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
If rows get deleted, it is best to mark them with deleted=1 in MySQL (a soft delete); otherwise your client will need to reload everything from those tables just to find out which rows were deleted.
Additionally, to save resources, your client should load the data in a really simple style, table by table, without JOINs:
SELECT [fields] FROM TABLE WHERE `id` > $(vLastId);
QlikView is really good and fast at data modelling and joins, so your client can build the whole data model inside QlikView.
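The incremental pattern above, sketched in Python purely for illustration (in QlikView the equivalent WHERE clauses would go into the load script); the table, column and remembered values are made up:

# Sketch of the incremental extract: only pull rows added or modified
# since the previous reload. Assumes mysql-connector-python is installed.
import mysql.connector

# Values remembered from the previous reload (persisted wherever is convenient).
last_id = 125000
last_modified = "2024-01-01 00:00:00"

conn = mysql.connector.connect(
    host="mysql.example.com", user="report_reader", password="secret", database="shop"
)
cur = conn.cursor()

# New rows: primary key greater than the last one already loaded.
cur.execute("SELECT id, customer, total FROM orders WHERE id > %s", (last_id,))
new_rows = cur.fetchall()

# Changed rows: relies on the last_modified_time column suggested above.
cur.execute(
    "SELECT id, customer, total FROM orders WHERE last_modified_time > %s",
    (last_modified,),
)
changed_rows = cur.fetchall()

cur.close()
conn.close()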
Reporting can indeed cause problems on a busy transactional database.
One approach you might want to examine is to have a replica (slave) of your database. MySQL supports this very well, and the replica can be as up to date as you require. You can then attach any reporting system to the replica and run heavy reports there without affecting your main database. The replica also gives you a second copy of your data, and it can be used to take offline backups, again without touching the main database.
There's lots of information on the setup of MySQL replicas so that's not too hard.
I hope that helps.
I've been assigned the task of data warehousing for reporting and data analysis. Let me first explain what we are going to do.
Step 1. Replicate production server MySQL database.
Step 2. Scheduled ETL: Read replicated database (MySQL) and push data to PostgreSQL.
Now I need your help on Step 2.
Note: I want saveOrUpdate functionality: if the id already exists, update the row; otherwise insert it. Data will be picked up based on the modified date.
So, is there any tool available for scheduled data pushes into PostgreSQL, considering my requirements?
If there is no such tool, which programming language should I use for the ETL? Any other pointers to achieve this would also help.
I asked the same question at https://dba.stackexchange.com/questions/203460/data-warehousing-etl-scheduled-data-migration-from-mysql-to-postgresql on dba.stackexchange.com, but I guess it has a low user base, so I am posting it here.
On AWS you have DMS (Database Migration Service). I don't know whether you can use it with external services, but it works pretty well.
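Outside AWS, the saveOrUpdate behaviour described in the question can also be hand-rolled: pull rows from the MySQL replica by modified date and upsert them into PostgreSQL with INSERT ... ON CONFLICT. A rough sketch; the schemas, columns and credentials are invented for illustration:

# Sketch: scheduled MySQL -> PostgreSQL upsert keyed on id, picked up by modified date.
# Assumes mysql-connector-python and psycopg2 are installed; all names are placeholders.
import mysql.connector
import psycopg2

last_run = "2024-01-01 00:00:00"  # normally persisted between runs

mysql_conn = mysql.connector.connect(
    host="replica.example.com", user="etl", password="secret", database="app"
)
pg_conn = psycopg2.connect(
    host="warehouse.example.com", user="etl", password="secret", dbname="reporting"
)

src = mysql_conn.cursor()
src.execute(
    "SELECT id, name, amount, modified_at FROM orders WHERE modified_at > %s",
    (last_run,),
)

upsert = """
    INSERT INTO orders (id, name, amount, modified_at)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (id) DO UPDATE
    SET name = EXCLUDED.name,
        amount = EXCLUDED.amount,
        modified_at = EXCLUDED.modified_at
"""

with pg_conn, pg_conn.cursor() as dst:
    for row in src.fetchall():
        dst.execute(upsert, row)  # insert if new, update if the id already exists

src.close()
mysql_conn.close()
pg_conn.close()

Run it from cron or any scheduler; the only state that has to be persisted between runs is last_run.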
I am working for a client who uses multiple RDS (MySQL) instances on AWS and wants me to consolidate data from there and other sources into a single instance and do reporting off that.
What would be the most efficient way to transfer selective data from other AWS RDS MySQL instances to mine?
I don't want to migrate the entire DB, just a few columns and rows, selected by which data is relevant and when it was last created/updated.
One option would be to use a PHP script that reads from one DB and inserts into another, but that would be very inefficient. Unlike SQL Server or Oracle, MySQL also does not have the ability to write queries across servers, or else I would have just used that in a stored procedure.
I'd appreciate any inputs regarding this.
If your overall objective is reporting and analytics, the standard practice is to move your transactional data from RDS to Redshift which will become your data warehouse. This blog article by AWS provides an approach to do it.
For the consolidation operation, you can use AWS Database Migration Service (DMS), which lets you migrate data column-wise with the following options (a sketch of the table mappings follows below):
Migrate existing data
Migrate existing data & replicate ongoing changes
Replicate data changes only
For more details read this whitepaper.
Note: If you need to process the data while moving, use AWS Data Pipeline.
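For the column-wise selection, a DMS replication task takes a table-mapping document. The sketch below shows its general shape as a Python dict passed to boto3's create_replication_task; the schema, table and column names and the ARNs are placeholders, and the endpoints and replication instance would have to be created beforehand:

# Sketch: DMS table mappings that replicate only one table, filter its rows by
# an updated_at column, and drop a column that is not needed at the target.
# All identifiers and ARNs are placeholders.
import json
import boto3

table_mappings = {
    "rules": [
        {
            # Only replicate the orders table from the app schema.
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-orders",
            "object-locator": {"schema-name": "app", "table-name": "orders"},
            "rule-action": "include",
            "filters": [
                {
                    "filter-type": "source",
                    "column-name": "updated_at",
                    "filter-conditions": [
                        {"filter-operator": "gte", "value": "2024-01-01"}
                    ],
                }
            ],
        },
        {
            # Drop a column that is not needed in the consolidated instance.
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "drop-notes",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "app",
                "table-name": "orders",
                "column-name": "notes",
            },
            "rule-action": "remove-column",
        },
    ]
}

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="consolidate-orders",
    SourceEndpointArn="arn:aws:dms:...:endpoint/source",    # placeholder ARN
    TargetEndpointArn="arn:aws:dms:...:endpoint/target",    # placeholder ARN
    ReplicationInstanceArn="arn:aws:dms:...:rep/instance",  # placeholder ARN
    MigrationType="full-load-and-cdc",  # "existing data & replicate ongoing changes"
    TableMappings=json.dumps(table_mappings),
)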
Did you take a look at the RDS migration tool?