Replicating data from MySQL to HBase using Flume: how?

I have a large MySQL database under heavy load and would like to replicate the data in this database to HBase in order to do analytical work on it.
Edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc.).
I've read that this can be done using Flume, with MySQL as a source (possibly the MySQL binlogs) and HBase as a sink, but I haven't found any detail (high or low level). What are the major tasks to make this work?
Similar questions were asked and answered earlier, but they didn't really explain how or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase

You are better off using Sqoop for this, IMHO; it was developed for exactly this purpose. Flume was made for a rather different purpose, such as aggregating log data, data generated from sensors, etc.
See this for more details.

So far there are three options worth considering:
Sqoop: After the initial bulk import, it supports two types of incremental imports: APPEND and LAST-MODIFIED. That said, it won't give you real-time or even near-real-time replication. It's not that Sqoop can't run that fast; it's that you don't want to plug a Sqoop pipe into your MySQL server and pull data every 1 or 2 minutes.
Trigger: This is a quick-and-dirty solution: add triggers to the source RDBMS and update your HBase accordingly. It gives you real-time satisfaction, but you have to mess with the source DB by adding triggers. It might be OK as a temporary solution, but long term it just won't do.
Flume: This one needs the most development effort, but it doesn't have to touch the DB and it doesn't add read traffic to the DB either (it tails the transaction logs).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to your HBase, but you can also do something with the data while it is streaming through your Flume pipe (transformation, notification, alerting, etc.).
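To make the Flume option more concrete: Flume itself does not ship a MySQL binlog source, so you either write a custom source or put a binlog-tailing tool in front of Flume's HBase sink. The sketch below shows the core of that idea in Python, using the python-mysql-replication and happybase libraries directly rather than Flume; the host names, credentials, the id primary key and the d column family are all assumptions for illustration, and the MySQL server must have row-based binary logging enabled.

# Sketch: tail the MySQL binlog and mirror row changes into HBase.
# Assumes python-mysql-replication and happybase, row-based binlogging,
# a primary key column named "id", and an HBase table with column family "d".
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)
import happybase

MYSQL = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

hbase = happybase.Connection("hbase-thrift-host")
table = hbase.table("mytable")

stream = BinLogStreamReader(
    connection_settings=MYSQL,
    server_id=100,                 # any id unique among the server's replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=True,                 # keep tailing as new binlog events arrive
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        if isinstance(event, DeleteRowsEvent):
            table.delete(str(row["values"]["id"]))
        else:
            values = row["values"] if isinstance(event, WriteRowsEvent) else row["after_values"]
            # primary key becomes the HBase row key; every other column goes into "d"
            table.put(str(values["id"]),
                      {"d:" + k: str(v) for k, v in values.items() if k != "id"})

A Flume-based pipeline would do essentially the same thing, with a custom source feeding Flume's HBase sink, but the binlog-reading logic above is the part none of the stock components give you for free.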

Related

Logging high volume of impression data (50 million records/month)

We are currently logging impression data for several websites using MySQL and are seeking a more appropriate replacement for logging the high volume of traffic our sites now see.
By "high volume" I mean that we are logging about 50 million entries per month for this impression data. It is important to note that this table activity is almost exclusively write and only rarely read. (Different from this use-case on SO: Which NoSQL database for extremely high volumes of data). We have worked around some of the MySQL performance issues by partitioning the data by range and performing bulk inserts, but in the big picture, we shouldn't be using MySQL.
What we ultimately need in the MySQL database is aggregated data and I believe there are other technologies much better suited for the high-volume logging portion of this use-case. I have read about mongodb, HBase (with map reduce), Cassandra, and Apache Flume and I feel like I'm on the right track, but need some guidance on what technology (or combination) I should be looking at.
What I would like to know specifically is what platforms are best suited for high-volume logging and how to get an aggregated/reduced data set fed into MySQL on a daily basis.
Hive doesn't store information itself; it only allows you to query "raw" data with an SQL-like language (HQL).
If your aggregated data is small enough to be stored in MySQL and that is the only use of your data, then HBase could be too much for you.
My suggestion is to use Hadoop (HDFS and MapReduce):
Create log files (text files) with the impression events.
Then move them into HDFS (an alternative could be to use Kafka or Storm if you require a near-real-time solution).
Create a MapReduce job capable of reading and aggregating your logs, and in the reduce output use a DBOutputFormat to store the aggregated data into MySQL.
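To make step 3 concrete, here is a minimal sketch of the aggregation as a Hadoop Streaming job in Python (Streaming is swapped in here purely for brevity; a Java job with DBOutputFormat, as suggested above, writes straight into MySQL instead). It assumes tab-separated log lines whose first two fields are an ISO timestamp and a site id; both assumptions are illustrative.

#!/usr/bin/env python
# impressions_agg.py -- used as both mapper and reducer with Hadoop Streaming, e.g.
#   hadoop jar hadoop-streaming.jar -input /logs/2015-01-01 -output /agg/2015-01-01 \
#       -mapper "impressions_agg.py map" -reducer "impressions_agg.py reduce" ...
# (file name, paths and log layout are placeholders)
import sys

def mapper():
    # emit "day,site<TAB>1" for every impression line
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue                      # skip malformed lines
        day, site = fields[0][:10], fields[1]
        print(f"{day},{site}\t1")

def reducer():
    # input arrives sorted by key, so a running total per key is enough
    current, count = None, 0
    for line in sys.stdin:
        key, n = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                day, site = current.split(",")
                print(f"{day}\t{site}\t{count}")
            current, count = key, 0
        count += int(n)
    if current is not None:
        day, site = current.split(",")
        print(f"{day}\t{site}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

The reducer output (day, site, count) is tiny compared with the raw logs and can be loaded into MySQL daily with LOAD DATA INFILE or a small insert script.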
One approach could be to simply dump the raw impression log into flat files. A daily batch job would then process these files using a MapReduce program, and the aggregated MapReduce output could be stored in Hive or HBase.
Please let me know if you see any problem with this approach. The big-data technology stack has many options depending on the type of data and the way it needs to be aggregated.

How to dump data from Oracle to MySQL

We ran into serious performance problems with our Oracle database, and we would like to try to migrate it to a MySQL-based database (either MySQL directly or, more preferably, Infobright).
The thing is, we need to let the old and the new system overlap for at least some weeks, if not months, before we actually know whether all the features of the new database match our needs.
So, here is our situation:
The Oracle database consists of multiple tables, each with millions of rows. During the day, there are literally thousands of statements, which we cannot stop for migration.
Every morning, new data is imported into the Oracle database, replacing some thousands of rows. Copying this process is not a problem, so we could, in theory, import into both databases in parallel.
But, and here lies the challenge, for this to work we need an export from the Oracle database with a consistent state from one day. (We cannot export some tables on Monday and some others on Tuesday, etc.) This means that at least the export should finish in less than one day.
Our first thought was to dump the schema, but I wasn't able to find a tool to import an Oracle dump file into MySQL. Exporting the tables to CSV files might work, but I'm afraid it could take too long.
So my question now is:
What should I do? Is there any tool to import Oracle dump files into MySQL? Does anybody have any experience with such a large-scale migration?
PS: Please don't suggest performance optimization techniques for Oracle, we already tried a lot :-)
Edit: We already tried some ETL tools before, only to find out that they were not fast enough: exporting only one table already took more than 4 hours ...
2nd Edit: Come on folks ... did nobody ever try to export a whole database as fast as possible and convert the data so that it can be imported into another database system?
Oracle does not supply an out-of-the-box unload utility.
Keep in mind that, without comprehensive info about your environment (Oracle version? server platform? how much data? what datatypes?), everything here is YMMV, and you would want to give it a go on your system for performance and timing.
My points 1-3 are just generic data movement ideas. Point 4 is a method that will reduce downtime or interruption to minutes or seconds.
1) There are 3rd party utilities available. I have used a few of these, but it's best for you to check them out yourself for your intended purpose. A few 3rd party products are listed here: OraFaq. Unfortunately a lot of them run on Windows, which would slow down the data unload process unless your DB server was on Windows and you could run the load utility directly on the server.
2) If you don't have any complex datatypes like LOBs, then you can roll your own with SQL*Plus. If you do it a table at a time, then you can easily parallelize it. The topic has been visited on this site probably more than once; here is an example: Linky (a Python variant of the same idea is sketched after point 4).
3) If you are 10g+ then External Tables might be a performant way to accomplish this task. If you create some blank external tables with the same structure as your current tables and copy the data to them, the data will be converted to the external table format (a text file). Once again, OraFAQ to the rescue.
4) If you must keep systems in parallel for days/weeks/months then use a change data capture/apply tool for near-zero downtime. Be prepared to pay $$$. I have used Golden Gate Software's tool, which can mine the Oracle redo logs and supply insert/update statements to a MySQL database. You can migrate the bulk of the data with no downtime the week before go-live. Then during your go-live period, shut down the source database, have Golden Gate catch up the last remaining transactions, then open up access to your new target database. I have used this for upgrades and the catch-up period was only a few minutes. We already had a site license for Golden Gate so it wasn't anything out of pocket for us.
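As a concrete illustration of point 2, here is a small unload sketch using Python and cx_Oracle instead of a SQL*Plus spool script; the table name, connection string and output file are placeholders, and running one copy per table is an easy way to parallelize. The resulting CSVs can then be loaded into MySQL with LOAD DATA INFILE, which is usually the fastest way in.

# Dump one Oracle table to CSV with cx_Oracle (placeholder names throughout).
import csv
import cx_Oracle

TABLE = "MY_TABLE"                                      # placeholder table name
conn = cx_Oracle.connect("user/password@dbhost/ORCL")   # placeholder connect string
cur = conn.cursor()
cur.arraysize = 5000                                    # fetch in large batches

cur.execute("SELECT * FROM " + TABLE)
with open(TABLE + ".csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])   # header row
    while True:
        rows = cur.fetchmany()
        if not rows:
            break
        writer.writerows(rows)

cur.close()
conn.close()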
And I'll play the role of Cranky DBA here and say if you can't get Oracle performing well I would love to see a write up of how MySQL fixed your particular issues. If you have an application where you can't touch the SQL, there are still lots of possible ways to tune Oracle. /soapbox
I have built a C# application that can read an Oracle dump (.dmp) file and pump its tables of data into a SQL Server database.
This application is used nightly on a production basis to migrate a PeopleSoft database to SQL Server. The PeopleSoft database has 1100+ database tables and the Oracle dump file is greater than 4.5GB in size.
This application creates the SQL Server database and tables and then loads all 4.5GB of data in less than 55 minutes running on a dual-core Intel server.
I don't believe it would be too difficult to modify this application to work with other databases provided they have an ADO.NET provider.
yeah, Oracle is pretty slow. :)
You can use any number of ETL tools to move data from Oracle into MySQL. My favourite is SQL Server Integration Services.
If you have Oracle9i or higher, you can implement Change Data Capture. Read more here http://download-east.oracle.com/docs/cd/B14117_01/server.101/b10736/cdc.htm
Then you can take a delta of changes from Oracle to your MySQL or Infobright using any ETL technologies.
We had the same issue: we needed to get tables and data from an Oracle DBMS to a MySQL DBMS.
We used this tool we found online... It worked well.
http://www.sqlines.com/download
This tool will basically help you:
Connect to your source DBMS (Oracle)
Connect to your destination DBMS (MySQL)
Specify the schema and tables in the Oracle DBMS you want to migrate
Press a "Transfer" button to run the migration process (running built-in migration queries)
Get a transfer log, which will tell you how many records were read from the source and written to the destination database, and which queries failed.
Hope this will help others that will land on this question.
I've used Pentaho Data Integration to migrate from Oracle to MySQL (I also migrated the same data to PostgreSQL, which was about 50% quicker, which I guess was largely due to the different JDBC drivers being used). I followed Roland Bouman's instructions here, almost to the letter, and was very pleasantly surprised at how easy it was:
Copy Table data from one DB to another
I don't know whether it will be appropriate for your data load, but it's worth a shot.
I recently released etlalchemy to accomplish this task. It is an open-sourced solution which allows migration between any 2 SQL databases with 4 lines of Python, and was initially designed to migrate from Oracle to MySQL. Support has been added for MySQL, PostgreSQL, Oracle, SQLite and SQL Server.
This will take care of migrating schema (arguably the most challenging), data, indexes and constraints, with many more options available.
To install:
$ pip install etlalchemy
On El Capitan: pip install --ignore-installed etlalchemy
To run:
from etlalchemy import ETLAlchemySource, ETLAlchemyTarget

# SQLAlchemy-style connection URLs for the Oracle source and the MySQL target
orcl_db_source = ETLAlchemySource("oracle+cx_oracle://username:password@hostname/ORACLE_SID")
mysql_db_target = ETLAlchemyTarget("mysql://username:password@hostname/db_name", drop_database=True)

# register the source, then run the migration (schema, data, indexes, constraints)
mysql_db_target.addSource(orcl_db_source)
mysql_db_target.migrate()
Concerning performance, this tool utilizes BULK import tools across various RDBMS such as mysqlimport and COPY FROM (postgresql) to carry out migrations efficiently. I was able to migrate a 5GB SQL Server database with 33,105,951 rows into MySQL in 40 minutes, and a 3GB 7,000,000 row Oracle database to MySQL in 13 minutes.
To get more background on the origins of the project, check out this post. If you get any errors running the tool, open an issue on the github repo and I'll patch it up in less than a week!
(To install the "cx_Oracle" Python driver, follow these instructions)
You can use a Python, SQL*Plus and mysql.exe (MySQL client) script to copy a whole table or just query results.
It will be portable because all those tools exist on Windows and Linux.
When I had to do it I implemented the following steps using Python:
Extract the data into a CSV file using SQL*Plus.
Load the dump file into MySQL using mysql.exe.
You can improve performance by performing parallel load using Tables/Partitions/Sub-partitions.
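The MySQL half of that pipeline can be as small as the sketch below, which assumes the CSV produced in step 1, a pre-created target table, a server with local_infile enabled, and the mysql-connector-python driver; all names and paths are placeholders.

# Bulk-load a CSV dump into MySQL with LOAD DATA LOCAL INFILE.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql-host", user="loader", password="secret",
    database="warehouse", allow_local_infile=True,
)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'MY_TABLE.csv'
    INTO TABLE my_table
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
conn.commit()
cur.close()
conn.close()

Loading one file per table (or per partition) in parallel is the natural counterpart of the parallel extract mentioned above.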
Disclosure: Oracle-to-MySQL-Data-Migrator is the script I wrote for data integration between Oracle and MySQL on Windows OS.

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as our primary data store for our application that collects a very large amount of data and requires large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load into a relational database (like mySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself when running. Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This will also scale easily. You need a Hadoop cluster to get Sqoop working, and the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
As far as the load on the Cassandra cluster, I am not certain, since I have not used the Cassandra connector with Sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, do it once a day at a time when traffic is lowest, so that if your Cassandra availability drops the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case sqoop works too because it can export to Hive directly. And once it's in Hive you can use the same cluster as used by sqoop to run your analytics jobs.
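For the final hop into MySQL, once the extracted data is sitting in HDFS (or behind a Hive table), a plain sqoop export does the bulk load; the Cassandra-to-Hadoop hop depends on which connector you use. A minimal sketch driven from Python, with every name a placeholder and the target MySQL table created in advance:

# HDFS -> MySQL bulk load via the sqoop CLI (placeholder names throughout).
import subprocess

subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://mysql-host/analytics",
    "--username", "etl", "--password-file", "/user/etl/.mysql.pw",
    "--table", "events_daily",
    "--export-dir", "/data/events/2015-01-01",
    "--input-fields-terminated-by", "\\t",
    "--num-mappers", "8",          # more mappers = more parallel export streams
], check=True)                     # raise if sqoop exits non-zero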
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command and CQL queries -- all of them time out, irrespective of changing the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a 'where' clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.

Pulling data from MySQL into Hadoop

I'm just getting started with learning Hadoop, and I'm wondering the following: suppose I have a bunch of large MySQL production tables that I want to analyze.
It seems like I have to dump all the tables into text files, in order to bring them into the Hadoop filesystem -- is this correct, or is there some way that Hive or Pig or whatever can access the data from MySQL directly?
If I'm dumping all the production tables into text files, do I need to worry about affecting production performance during the dump? (Does it depend on what storage engine the tables are using? What do I do if so?)
Is it better to dump each table into a single file, or to split each table into 64mb (or whatever my block size is) files?
Importing data from MySQL can be done very easily. I recommend you use Cloudera's Hadoop distribution; it comes with a program called 'sqoop' which provides a very simple interface for importing data straight from MySQL (other databases are supported too).
Sqoop can be used with mysqldump or a normal MySQL query (select * ...).
With this tool there's no need to manually partition tables into files. But for Hadoop it's much better to have one big file.
Useful links:
Sqoop User Guide
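A scheduled import can be as small as the sketch below, which shells out to the sqoop CLI from Python; the connection details, table name and target directory are placeholders, and sqoop plus the MySQL JDBC driver are assumed to be installed on a Hadoop client node.

# Nightly MySQL -> HDFS import via the sqoop CLI (placeholder names throughout).
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host/proddb",
    "--username", "etl", "--password-file", "/user/etl/.mysql.pw",
    "--table", "orders",
    "--target-dir", "/data/orders/2015-01-01",
    "--num-mappers", "4",          # 4 parallel map tasks each read a slice of the table
    # add "--direct" to use mysqldump under the hood, as mentioned above
], check=True)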
2)
Since I don't know your environment I will err on the safe side - YES, worry about affecting production performance.
Depending on the frequency and quantity of data being written, you may find that it processes in an acceptable amount of time, particularly if you are just writing new/changed data. [subject to the complexity of your queries]
If you don't require real time, or your servers typically have periods when they are underutilized (overnight?), then you could create the files at that time.
Depending on how you have your environment set up, you could replicate/log-ship to specific DB server(s) whose sole job is to create your data file(s).
3)
No need for you to split the file; HDFS will take care of partitioning the data file into blocks and replicating it over the cluster. By default it will automatically split into 64 MB data blocks.
see - Apache - HDFS Architecture
re: Wojtek's answer - SQOOP clicky (the link doesn't work in comments)
If you have more questions or specific environment info, let us know
HTH
Ralph
