I've been assigned the task of data warehousing for reporting and data analysis. Let me first explain what we are going to do.
Step 1. Replicate the production server's MySQL database.
Step 2. Scheduled ETL: read from the replicated MySQL database and push the data to PostgreSQL.
Now I need your help on Step 2.
Note: I want saveOrUpdate functionality: if the id already exists, update the row, otherwise insert it. Rows will be picked up based on their modified date.
So, considering my requirements, is there any tool available for a scheduled data push into PostgreSQL?
If no such tool is available, which programming language should I use for the ETL? Any other pointers to achieve this are also welcome.
I asked the same question at https://dba.stackexchange.com/questions/203460/data-warehousing-etl-scheduled-data-migration-from-mysql-to-postgresql on dba.stackexchange.com, but I guess it has a low user base, so I'm posting it here.
On AWS you have DMS (Database Migration Service). I don't know whether you can use it with external services, but it works pretty well.
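If DMS isn't an option and you end up scripting Step 2 yourself, here is a minimal sketch of what the scheduled job could look like in Python. The table, the `modified_at` column, and the pymysql/psycopg2 choices are my own assumptions for illustration; only the id/modified-date upsert idea comes from the question.

```python
# Hypothetical sketch: incremental pull from the MySQL replica, upsert into PostgreSQL.
# Assumes an `id` primary key and a `modified_at` column on the source table.
import pymysql
import psycopg2

# In a real job this high-water mark would be persisted (file or control table)
# and updated after every successful run; hard-coded here to keep the sketch short.
last_run = "2018-01-01 00:00:00"

mysql_conn = pymysql.connect(host="replica-host", user="etl", password="secret", database="prod")
pg_conn = psycopg2.connect(host="dwh-host", user="etl", password="secret", dbname="warehouse")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    src.execute(
        "SELECT id, name, amount, modified_at FROM orders WHERE modified_at > %s",
        (last_run,),
    )
    for row in src.fetchall():
        # saveOrUpdate: insert the row, or update it if the id already exists
        # (ON CONFLICT requires PostgreSQL 9.5+)
        dst.execute(
            """
            INSERT INTO orders (id, name, amount, modified_at)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (id) DO UPDATE
               SET name = EXCLUDED.name,
                   amount = EXCLUDED.amount,
                   modified_at = EXCLUDED.modified_at
            """,
            row,
        )
pg_conn.commit()
```

Run it from cron (or any scheduler) at whatever interval suits you; an ETL tool such as Pentaho Data Integration covers the same ground if you'd rather not write code.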
Related
I provide a solution that handles operations for brick-and-mortar shops. My next step is to provide analytics for my customers.
As I am in the starting phase, I am hoping to find a free way to do it myself instead of using third-party solutions. I am not expecting massive scale at this point, but I would like to get it done right instead of running queries off the production database.
For performance reasons, I am thinking of running the analytics queries against separate tables in the same database, with a cron job running every night to replicate the data from the production tables to the analytics tables.
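To make that concrete, the nightly job I have in mind would be something along these lines (a rough Python sketch; the table names are just examples):

```python
# Illustrative nightly job, run from cron (e.g. "0 2 * * * python rebuild_analytics.py"):
# rebuild the analytics copies from the production tables in the same MySQL database.
import pymysql

conn = pymysql.connect(host="mydb.rds.amazonaws.com", user="etl", password="secret", database="shop")

with conn.cursor() as cur:
    for table in ("orders", "order_items", "customers"):   # example table names
        cur.execute(f"DROP TABLE IF EXISTS analytics_{table}")
        cur.execute(f"CREATE TABLE analytics_{table} LIKE {table}")
        cur.execute(f"INSERT INTO analytics_{table} SELECT * FROM {table}")
conn.commit()
```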
Is that the proper way to do this?
The other option I have in mind is to run the analytics from a different database (as opposed to just different tables). I am using Amazon RDS with MySQL, if that makes it more convenient.
It depends on how much analytics you want to provide.
I am a DWH manager and would start off with a small (free) BI (Business Intelligence) solution.
Your production DB and analytics DB should always be separate.
Take a look at Pentaho Data Integration (Community Edition). It's a free ETL tool that will help you get your data from your production database to your analytics database, and it can also perform transformations.
Check out some free reporting software like Jaspersoft to provide a reporting platform for your customers (if that's what you want; otherwise just use Excel).
BI never wants to throw away data. If you think the data in your analytics DB is going to grow large (2 TB+), don't use MySQL but rather PostgreSQL; MySQL does not handle big data well.
If you are really serious about this, read "The Data Warehouse Toolkit" by Ralph Kimball. That will set you up with some basic data warehouse knowledge.
Amazon RDS provides something called a read replica, which automatically performs replication and is optimized for reading.
I like this solution for its convenience. Downside: the price tag.
I currently have two MySQL servers running on different machines. One of them is a staging environment (A) and the other is a production environment (B). What I need to do is take data from A and update/insert it into B based on certain conditions. If MySQL had a linked-server option, I could simply create a stored procedure that does the work for me, and that would solve my whole problem. Unfortunately, a great product like MySQL does not have this necessary feature.
But since I can't write a procedure to do that, what application can I use to do the integration for me? Note that this integration needs to be automatic so it can run daily and sometimes hourly.
My question: is there an integration application out there that will move data from one MySQL server to another automatically?
Thanks
I'm not a MySQL guy, and I don't know whether it has a linked/federated option, but I understand it has very good replication. You could replicate tables from A into copy tables in B, then put your stored procedure on B and run it whenever you want.
You could also write an application/batch script/ETL job that transfers the data and applies the conditions, and just run it from a scheduler. If you do this kind of thing frequently (meaning you have many other processes like this), I'd lean toward an ETL tool. I use Pentaho Data Integration; there are several others.
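To illustrate the batch-script route (not Pentaho), here's a rough sketch of what such a job could look like in Python. The table, columns, and the "approved only" condition are made-up placeholders, since the real conditions live in your business rules:

```python
# Hypothetical transfer job: copy rows matching your conditions from staging (A)
# into production (B), inserting new ids and updating existing ones.
# Schedule it with cron or Windows Task Scheduler to run daily/hourly.
import pymysql

a = pymysql.connect(host="server-a", user="etl", password="secret", database="staging")
b = pymysql.connect(host="server-b", user="etl", password="secret", database="production")

with a.cursor() as src, b.cursor() as dst:
    # your "conditions" go into this WHERE clause; approved-only is just an example
    src.execute("SELECT id, sku, price, status FROM products WHERE status = 'approved'")
    for row in src.fetchall():
        dst.execute(
            """
            INSERT INTO products (id, sku, price, status)
            VALUES (%s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE
                sku = VALUES(sku), price = VALUES(price), status = VALUES(status)
            """,
            row,
        )
b.commit()
```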
I have a large MySQL database under heavy load and would like to replicate the data in this database to HBase in order to do analytical work on it.
Edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc.).
I've read that this can be done using Flume, with MySQL as a source (possibly the MySQL binlogs) and HBase as a sink, but I haven't found any detail (high- or low-level). What are the major tasks to make this work?
Similar questions were asked and answered earlier, but they didn't really explain how, or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase
You are better off using Sqoop for this, IMHO; it was developed for exactly this purpose. Flume was made for rather different use cases, such as aggregating log data, data generated by sensors, etc.
See this for more details.
So far there are three options worth considering:
Sqoop: after the initial bulk import, it supports two types of incremental imports: append and lastmodified. That being said, it won't give you real-time or even near-real-time replication. It's not that Sqoop can't run that fast; it's that you don't want to plug a Sqoop pipe into your MySQL server and pull data every one or two minutes.
Trigger: this is a quick-and-dirty solution: add triggers to the source RDBMS and update HBase accordingly (a rough sketch of this approach is at the end of this answer). It gives you real-time updates, but you have to touch the source DB by adding triggers. It might be OK as a temporary solution, but long term it just won't do.
Flume: this one requires the most development effort. It doesn't need to touch the DB, and it doesn't add read traffic to the DB either (it tails the transaction logs).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to your HBase, it also lets you do something with the data while it is streaming through your Flume pipe (e.g. transformation, notification, alerting, etc.).
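If you do go the trigger route as a stop-gap, note that a MySQL trigger can't write to HBase directly; the usual pattern is to have the trigger append changes to a changelog table and let a small external process drain that table into HBase. A rough, hypothetical sketch (the table names, the 'd' column family, and the pymysql/happybase libraries are all my own assumptions):

```python
# Hypothetical drain script for the trigger approach: a MySQL trigger appends changed
# rows to `orders_changelog`, and this process applies them to HBase, then trims the log.
import happybase
import pymysql

mysql_conn = pymysql.connect(host="mysql-host", user="etl", password="secret", database="shop")
hbase_conn = happybase.Connection("hbase-host")    # connects via the HBase Thrift server
orders = hbase_conn.table("orders")                # table and column family 'd' must already exist

with mysql_conn.cursor() as cur:
    cur.execute("SELECT change_id, order_id, status, total FROM orders_changelog ORDER BY change_id")
    changes = cur.fetchall()
    for change_id, order_id, status, total in changes:
        # HBase row keys and cell values are plain bytes
        orders.put(str(order_id).encode(), {
            b"d:status": str(status).encode(),
            b"d:total": str(total).encode(),
        })
    if changes:
        # drop what was just applied so the changelog stays small
        cur.execute("DELETE FROM orders_changelog WHERE change_id <= %s", (changes[-1][0],))
mysql_conn.commit()
```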
I'm looking for a possible solution for the following problem.
First, the situation I'm in:
I have two databases, one Oracle DB and one MySQL DB. Although they have a lot of similarities, they are not identical. A lot of tables exist on both the Oracle DB and the MySQL DB, but the Oracle tables are often more extensive and contain more columns.
The situation with the databases can't be changed, so I have to deal with it.
Now I'm looking for the following:
I want to synchronise data from Oracle to MySQL and vice versa. This has to be done in real time, or as close to real time as possible: when changes are made in one DB, they have to be synced to the other DB as quickly as possible.
Also, not every table has to be in sync, so the solution must offer a way of selecting which tables are synced and which are not.
Because the databases are not identical, I don't think replication is an option. But what is?
I hope you can help me find a way of doing this, or a tool which does exactly what I need. Maybe you know some good papers/articles I can use?
Thanks!
Thanks for the comments.
I did some further research on ETL and EAI.
I found out that I am searching for an ETL tool.
I read your question and your answer. I have worked with Oracle, SQL, ETL, and data warehouses, and here are my suggestions:
It is good to have a ready-made ETL tool. But if your application is big enough to need a tailor-made ETL process, I suggest building your own.
If your transactional database is on Oracle, you can set up triggers on the key tables that in turn call an external procedure written in C, C++, or Java.
The reason for using an external procedure is to be able to communicate with both databases at the same time: Oracle and MySQL.
You can read more about Oracle External Procedures here.
If not through an external procedure (ExtProc), you can develop a separate application in Java or .NET that extracts data from the first database, transforms it according to your business rules, and loads it into your warehouse.
Whichever approach you choose, you will have greater control over the ETL process if you implement your own tool rather than going for a ready-made one.
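As a very rough illustration of the second approach (the standalone application), the skeleton could look like this. I'm sketching it in Python rather than Java/.NET purely for brevity; the table list, key columns, and the `modified_at` watermark are placeholders, and the reverse (MySQL-to-Oracle) leg would repeat the same pattern:

```python
# Rough skeleton of a standalone sync application (Oracle -> MySQL direction shown).
# Only the tables listed in SYNC_TABLES are synchronised, which covers the
# "not every table has to be in sync" requirement.
import datetime
import cx_Oracle
import pymysql

SYNC_TABLES = {
    # table name -> (key column, other columns); placeholders for illustration
    "customers": ("id", ["name", "email"]),
    "invoices": ("id", ["customer_id", "amount"]),
}

# In practice this watermark is persisted per table and per direction.
last_sync = datetime.datetime(2018, 1, 1)

ora = cx_Oracle.connect("etl", "secret", "oracle-host/ORCL")
my = pymysql.connect(host="mysql-host", user="etl", password="secret", database="shop")

ora_cur = ora.cursor()
with my.cursor() as dst:
    for table, (key, cols) in SYNC_TABLES.items():
        col_list = ", ".join([key] + cols)
        placeholders = ", ".join(["%s"] * (1 + len(cols)))
        updates = ", ".join(f"{c} = VALUES({c})" for c in cols)
        # assumes each synced table carries a modified_at timestamp column
        ora_cur.execute(
            f"SELECT {col_list} FROM {table} WHERE modified_at > :1", [last_sync]
        )
        for row in ora_cur.fetchall():
            dst.execute(
                f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
                f"ON DUPLICATE KEY UPDATE {updates}",
                row,
            )
my.commit()
```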
I have a production DB running on Oracle 10g. I want to set up a data warehouse using a MySQL 5.5 database, and ideally I would like to use CDC (change data capture) to identify changes to the live DB and propagate those changes to the warehouse.
Has anyone done this?
Is it possible without the use of a third-party ETL tool? If not, can anyone recommend any software for the job?
You can also use Oracle Database Gateway for ODBC.
It all depends on how many tables you want to replicate and on the amount of data changed daily.
You may end up writing a lot of triggers, but that might slow down your database.
If you have creation and last-modification fields, you can use those as well.
Plus, you can copy only the modified data, on a schedule during off-peak hours.
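For that last point, here is a minimal sketch of a scheduled, last-modified-based pull; the table, columns, and library choices (cx_Oracle, pymysql) are assumptions, and you would run it from cron during off-peak hours (e.g. 3 a.m.):

```python
# Hypothetical off-peak CDC-by-timestamp job: pull rows changed since the last run
# from Oracle 10g and upsert them into the MySQL 5.5 warehouse.
import datetime
import cx_Oracle
import pymysql

last_run = datetime.datetime(2018, 1, 1)  # persist this between runs in a real job

ora = cx_Oracle.connect("etl", "secret", "prod-host/ORCL")
dwh = pymysql.connect(host="warehouse-host", user="etl", password="secret", database="dwh")

src = ora.cursor()
src.execute(
    "SELECT id, customer_id, amount, last_modified FROM orders WHERE last_modified > :1",
    [last_run],
)
with dwh.cursor() as dst:
    for row in src.fetchall():
        dst.execute(
            """
            INSERT INTO orders (id, customer_id, amount, last_modified)
            VALUES (%s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE
                customer_id = VALUES(customer_id),
                amount = VALUES(amount),
                last_modified = VALUES(last_modified)
            """,
            row,
        )
dwh.commit()
```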