Pulling data from MySQL into Hadoop

I'm just getting started with learning Hadoop, and I'm wondering the following: suppose I have a bunch of large MySQL production tables that I want to analyze.
It seems like I have to dump all the tables into text files, in order to bring them into the Hadoop filesystem -- is this correct, or is there some way that Hive or Pig or whatever can access the data from MySQL directly?
If I'm dumping all the production tables into text files, do I need to worry about affecting production performance during the dump? (Does it depend on what storage engine the tables are using? What do I do if so?)
Is it better to dump each table into a single file, or to split each table into 64mb (or whatever my block size is) files?

Importing data from MySQL can be done very easily. I recommend using Cloudera's Hadoop distribution; it comes with a program called 'sqoop' which provides a very simple interface for importing data straight from MySQL (other databases are supported too).
Sqoop can be used with mysqldump or with a normal MySQL query (select * ...).
With this tool there's no need to manually partition tables into files, and for Hadoop it's much better to have one big file anyway.
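As a rough sketch of what such an import looks like (the host, database, table and HDFS paths below are placeholders, and --password-file assumes a reasonably recent Sqoop; older releases use -P to prompt for the password):

    sqoop import \
      --connect jdbc:mysql://dbhost/proddb \
      --username analytics \
      --password-file /user/me/sqoop.password \
      --table orders \
      --target-dir /data/proddb/orders \
      --num-mappers 4

By default Sqoop splits the table on its primary key across the mappers and writes one output file per mapper under the target directory, so the "partition into files" step is handled for you.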
Useful links:
Sqoop User Guide

2)
Since I don't know your environment I will err on the safe side - YES, worry about affecting production performance.
Depending on the frequency and quantity of data being written, you may find that it processes in an acceptable amount of time, particularly if you are just writing new/changed data. [subject to the complexity of your queries]
If you don't require real time, or your servers typically have periods when they are under-utilized (overnight?), then you could create the files at that time.
Depending on how your environment is set up, you could replicate/log-ship to specific DB server(s) whose sole job is to create your data file(s).
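As a sketch of that last idea, assuming a hypothetical read replica called replica-host that your dump user can reach, you keep the load off the primary by dumping from the replica instead:

    # Consistent, streaming dump taken from the replica, not the production primary
    mysqldump --single-transaction --quick -h replica-host proddb orders > /data/exports/orders.sql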
3)
No need for you to split the file; HDFS will take care of partitioning the data file into blocks and replicating them over the cluster. By default it will automatically split it into 64mb data blocks.
see - Apache - HDFS Architecture
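If you want to see how HDFS actually split and replicated a file you loaded, a quick check (the path here is a placeholder) is:

    # Lists the blocks and replica locations for one file
    hadoop fsck /data/proddb/orders -files -blocks -locations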
re: Wojtek's answer - here's the Sqoop link (links don't work in comments)
If you have more questions or specific environment info, let us know
HTH
Ralph

Related

How can I dump mysql table in parts?

I have a Linux server and a huge MySQL table that I need to dump. The thing is, the server is production and I don't want to crash it by dumping it all at once. I also intend to pipe it over ssh to another server, because I don't want to fill up the disk space. I know about the mysqldump --where clause, but I don't want to script those IDs. Is there any native functionality in MySQL that allows dumping in parts? It doesn't have to be mysqldump, but it needs to be in parts so I don't crash the server, and I'll need to pipe it over ssh.
Additional info: records are never updated in this table; they are only added.
MySQL documentation: as outlined in their docs, mysqldump is not suited for large databases. They suggest backing up the raw data files.
If your concern really is the load and not crashing production, then maybe you should take a look at this post: How can I slow down a MySQL dump as to not affect current load on the server?
It covers how to back up large production databases using the right mysqldump arguments.
Slicing a production database may end up being more dangerous in the end.
Also, I don't know how often entries get updated in the db, but slicing the export would give you an inconsistent dump of the data, with slices of the same table taken at different times.
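As a hedged sketch of the gentler single-pass approach (host, database and table names are placeholders): stream the dump straight over ssh so nothing lands on the local disk, and optionally rate-limit it with pv so the dump never saturates the server's I/O.

    # --single-transaction gives a consistent InnoDB snapshot without locking the table;
    # --quick streams rows instead of buffering the whole table in memory.
    mysqldump --single-transaction --quick proddb bigtable \
      | gzip \
      | pv -L 10m \
      | ssh backup@otherhost 'cat > /backups/bigtable.sql.gz'

Since your rows are append-only, one consistent pass like this also avoids the inconsistency problem that slicing the export would introduce.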

How to restore a MySQL database "safely"

How should you go about restoring (and backing) up a MySQL database "safely"? By "safely" I mean: the restore should create/overwrite a desired database, but not risk altering anything outside that database.
I have already read https://dev.mysql.com/doc/refman/5.7/en/backup-types.html.
I have external users. They & I may want to exchange backups for restore. We do not have a commercial MySQL Enterprise Backup, and are not looking for a third-party commercial offering.
In Microsoft SQL Server there are BACKUP and RESTORE commands. BACKUP creates a file containing just the database you want; both its rows and all its schema/structure are included. RESTORE accepts such a file, and creates or overwrites its structure. The user can restore to a same-named database, or specify a different database name. This kind of behaviour is just what I am looking for.
In MySQL I have come across 3 possibilities:
Most people seem to use mysqldump to create a "dump file", and mysql to read that back in. The dump file contains a list of arbitrary MySQL statements, which are simply executed by mysql. This is quite unacceptable: the file could contain any SQL statements. (Limiting the access rights of the restoring user to try to ensure it cannot do anything "naughty" is not acceptable.) There is also the issue that the user may have created the dump file with the "Include CREATE Schema" option (MySQL Workbench), which hard-codes the original database name for recreation. This "dump" approach is totally unsuitable for me, and I find it surprising that anyone would use it in a production environment.
I have come across MySQL's SELECT ... INTO OUTFILE and LOAD DATA INFILE statements. At least they do not contain SQL code to execute. However, they look like a lot of work, deal with a table at a time rather than the whole database, and don't deal with the structure of the tables; you have to know that yourself when restoring. There is a mysqlimport helper command-line utility, but I don't see anything for the export side, and I don't see how to restore a complete database with it.
The last is to use what MySQL refers to as "Physical (Raw)" rather than "Logical" Backups. This works on the database directories and files themselves. It is the equivalent of SQL Server's detach/attach method for backing up/restoring. But, as per https://dev.mysql.com/doc/refman/5.7/en/backup-types.html, it has all sorts of caveats, e.g. "Backups are portable only to other machines that have identical or similar hardware characteristics." (I have no idea, e.g. some users are Windows versus Windows, I have no idea about their architecture) and "Backups can be performed while the MySQL server is not running. If the server is running, it is necessary to perform appropriate locking so that the server does not change database contents during the backup." (let alone restores).
So can anything satisfy (what I regard as) my modest requirements, as outlined above, for MySQL backup/restore? Am I really the only person who finds the above 3 as the only, yet unacceptable, possible solutions?
1 - mysqldump - I use this quite a bit, usually in environments where I am handling all the details myself. I do have one configuration where I use that to send copies of a development database - to be dumped/restored in its entirety - to other developers. It is probably the fastest solution, has some reasonable configuration options (e.g., to include/exclude specific tables) and generates very functional SQL code (e.g., each INSERT batch is small enough to avoid locking/speed issues). For a "replace entire database" or "replace key tables in a specific database" solution, it works very well. I am not too concerned about the "arbitrary SQL commands" problem - if that is an issue then you likely have other issues with users trying to "do their own thing".
2 - SELECT ... INTO OUTFILE and LOAD DATA INFILE - The problem with these is that if you have any really big tables then the LOAD DATA INFILE statement can cause problems because it is trying to load everything all at once. You also have to add code to create (if needed) or empty the tables before LOAD DATA.
3 - Physical (raw) file transfer. This can work but under limited circumstances. I had one situation with a multi-gigabyte database and decided to compress the raw files, move them to the new machine, uncompress and just tell MySQL "everything is already there". It mostly worked well. But I would not recommend it for any unattended/end-user process due to the MANY possible problems.
What do I recommend?
1 - mysqldump - live with its limitations and risks, set up a script to call mysqldump and compress the file (mysqldump itself does not compress its output, but piping it through gzip is trivial), include the date in the file name so that there is less confusion as the files are sent around, and make a simple script for users to load the file (a minimal wrapper sketch follows these recommendations).
2 - Write your own program. I have done this a few times. This is more work initially but allows you to control every aspect of the process and transfer a file that only contains data without any actual SQL code. You can control the specific database, tables, etc. One catch is that if you make any changes to the table structure, indexes, etc., you will need to make sure that information is somehow transmitted to the receiving end so that it can change the structures as needed - that is not a problem with mysqldump, since it normally replaces the tables, creating the new structures, indexes, etc. This can be written in any language that can connect to MySQL - it does not have to be the same language as your application.
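For the first recommendation, a minimal sketch of such a wrapper, with the database name, output path and naming convention as placeholders:

    #!/bin/sh
    # Hypothetical dump-and-compress wrapper with a date-stamped file name
    DB=mydb
    OUT="/backups/${DB}_$(date +%Y%m%d).sql.gz"
    mysqldump --single-transaction --routines "$DB" | gzip > "$OUT"
    echo "wrote $OUT"

The matching load script for the other end can be as small as: gunzip -c <file>.sql.gz | mysql mydb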
If you're not going to use third-party tools (like innobackupex, for example) then you're limited to using ... mysqldump, which is in the mysql package.
I can't understand why it is not acceptable to you, or why you don't like the SQL commands in those dumps. Best practice, when restoring a single db into a server which already contains other databases, is to use a separate user with rights only to write into the restored db. Then even if the user performing the restore changed the SQL commands and tried to write to another db, they would not be able to.
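A minimal sketch of that kind of restricted restore account (user, host, password and database names are placeholders):

    # Create a user that can only write to the database being restored
    mysql -u root -p -e "CREATE USER 'restorer'@'localhost' IDENTIFIED BY 'pick-a-password';
      GRANT ALL PRIVILEGES ON restored_db.* TO 'restorer'@'localhost';"

    # Run the restore as that user; statements touching other schemas fail with a privilege error
    mysql -u restorer -p restored_db < dump.sql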
When doing a raw backup (a physical copy of the database files) you need to have all the instances down, i.e. the MySQL server not running. "Similar hardware" also means you need to have the same directory layout as the source server (unless you change my.cnf before starting the server and put all the files into the right directories).
When coming to MySQL, try not to compare it to SQL Server - it's a totally different approach and philosophy.
But if you do convince yourself to use a third-party tool anyway - I recommend innobackupex from Percona, which is free, by the way.
The export tool that complements mysqlimport is mysqldump --tab. This outputs delimited text data files (tab-separated by default), like SELECT ... INTO OUTFILE does. It also outputs the table structure in much smaller .sql files. So there are two files for each table.
Once you recreate your tables from the .sql files, you can use mysqlimport to import all the data files. You can even use the mysqlimport --use-threads option to make it load multiple data files in parallel.
This way you have more control over which schema to load the data into, and it should run a lot faster than loading a large SQL dump.
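A sketch of that flow, assuming hypothetical source and target database names; note that the server itself writes the data files, so the --tab directory must be writable by mysqld and you need the FILE privilege:

    # Export: one .sql (structure) file and one .txt (data) file per table
    mysqldump --tab=/var/tmp/dump proddb

    # On the target: recreate the tables, then load the data files in parallel
    cat /var/tmp/dump/*.sql | mysql targetdb
    mysqlimport --use-threads=4 --local targetdb /var/tmp/dump/*.txt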

Replicating data from mySQL to Hbase using flume: how?

I have a large mySQL database with heavy load and would like to replicate the data in this database to Hbase in order to do analytical work on it.
edit: I want the data to replicate relatively quickly, and without any schema changes (no timestamped rows, etc)
I've read that this can be done using flume, with mySQL as a source, possibly the mySQL bin logs, and Hbase as a sink, but haven't found any detail (high or low level). What are the major tasks to make this work?
Similar questions were asked and answered earlier, but they didn't really explain how or point to resources that would:
Flume to migrate data from MySQL to Hadoop
Continuous data migration from mysql to Hbase
You are better off using Sqoop for this, IMHO. It was developed for exactly this purpose. Flume was made for rather different use cases, like aggregating log data, data generated by sensors, etc.
See this for more details.
So far there are three options worth considering:
Sqoop: After the initial bulk import, it supports two types of incremental import: append and lastmodified (a sketch follows this answer). That said, it won't give you real-time or even near-real-time replication. It's not that Sqoop can't run that fast; it's that you don't want to plug a Sqoop pipe into your MySQL server and pull data every 1 or 2 minutes.
Trigger: This is a quick-and-dirty solution: add triggers to the source RDBMS and update your HBase accordingly. This one gives you real-time satisfaction, but you have to mess with the source DB by adding triggers. It might be OK as a temporary solution, but long term it just won't do.
Flume: This one will need the most development effort. It doesn't need to touch the DB, and it doesn't add read traffic to the DB either (it tails the transaction logs).
Personally I'd go for Flume: not only does it channel the data from the RDBMS to your HBase, but you can also do something with the data while it streams through your Flume pipe (e.g. transformation, notification, alerting, etc.).
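For completeness, a minimal Sqoop incremental-import sketch for option 1 (the connection string, table and column names are placeholders; wrapping it in a saved Sqoop job lets Sqoop track the growing --last-value for you):

    sqoop import \
      --connect jdbc:mysql://dbhost/proddb \
      --username etl -P \
      --table events \
      --incremental lastmodified \
      --check-column updated_at \
      --last-value '2014-01-01 00:00:00' \
      --merge-key id \
      --target-dir /data/events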

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as our primary data store for our application that collects a very large amount of data and requires large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load into a relational database (like mySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However being very new to this field, I am not sure how well they would scale and also how much load they would put on the Cassandra cluster itself when running? Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This will also scale easily. You need a Hadoop cluster to get Sqoop working, and the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper will be responsible for transferring 1 slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep increasing your cluster the throughput will keep increasing. It's all a matter of what resources you have.
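The Cassandra-side options depend on which Sqoop build/connector you run (for example the one bundled with DataStax Enterprise), so treat the following only as a sketch of the parallelism knob on an ordinary Sqoop job loading MySQL from HDFS; the connection details, table and paths are placeholders:

    # Each mapper pushes one slice of the export; raise --num-mappers as the cluster grows
    sqoop export \
      --connect jdbc:mysql://dbhost/analytics \
      --username etl -P \
      --table daily_events \
      --export-dir /data/daily_events \
      --num-mappers 8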
As far as the load on the Cassandra cluster, I am not certain since I have not used the Cassandra connector with sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could for example do it once a day at a certain time where the traffic is lowest, so that in case your Cassandra availability drops the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case sqoop works too because it can export to Hive directly. And once it's in Hive you can use the same cluster as used by sqoop to run your analytics jobs.
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command or CQL queries - all of them time out, regardless of how I change the timeout parameter in cassandra.yaml. Cassandra experts say you cannot query the data without a 'where' clause. This is a big restriction for me, and it may be one of the main reasons not to use Cassandra, at least for me.

How to bring files in a filesystem in/out MySQL DB?

The application that I am working on generates files dynamically as it is used. This makes backup and synchronization between staging, development and production a really big challenge. One way we might get a smooth solution (if feasible) is to have a script that, at the moment of backing up the database, also backs up the dynamically generated files into the database, and at restore time brings those files out of the database and back into the filesystem.
I am wondering if there are any available (paid or free) applications that could be used as scripts to make this happen.
Basically if I have
/usr/share/appname/server/dynamicdir
/usr/share/appname/server/otherdir/etc/resource.file
Then, taking the examples above, the script would put them into the MySQL database.
Please let me know if you need more information.
Do you mean that the application is storing files as blobs in the MySQL database, and/or creating lots of temporary tables? Or that you just want temporary files - themselves unrelated to a database - to be stored in MySQL as a backup?
I'm not sure that trying to use MySQL as a net-new intermediary for backups of files is a good idea. If the app already uses it that way, that's one thing; if not, MySQL isn't the right tool here.
Anyway. If you are interested in capturing a filesystem at a point in time, the answer is to use LVM snapshots. You would likely have to rebuild your server to get your filesystems onto LVM, and have enough free storage there for as many snapshots as you think you'd need.
I would recommend having a new mount point just for this app's temporary files. If your MySQL tables are using InnoDB, a simple script that runs mysqldump --single-transaction in the background and then takes the LVM snapshot could get the two synced up to within less than a second of each other.
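A rough sketch of that pairing, assuming the generated files live on a hypothetical LVM volume vg0/appdata and the database (here called appdb) is InnoDB:

    # Start a consistent InnoDB dump in the background...
    mysqldump --single-transaction appdb | gzip > /backups/appdb_$(date +%Y%m%d).sql.gz &

    # ...and snapshot the filesystem holding the generated files at (almost) the same moment
    lvcreate --snapshot --size 5G --name appdata_snap /dev/vg0/appdata
    wait

    # Mount the snapshot read-only, copy it off, then drop the snapshot
    mkdir -p /mnt/appdata_snap
    mount -o ro /dev/vg0/appdata_snap /mnt/appdata_snap
    rsync -a /mnt/appdata_snap/ backup@otherhost:/backups/appdata/
    umount /mnt/appdata_snap
    lvremove -f /dev/vg0/appdata_snap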
This should be trivial to accomplish using PHP, Perl, Python, etc. Are you looking for someone to write this for you?