Import csv file to Hadoop Hive table without schema - mysql

I want to import a CSV file without providing its header or datatype information in the Hive metastore.
Someone suggested loading the data into Hive directly from the database using Sqoop, without defining the table in Hive first:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import
However, each row of the table may have a different number of columns, and I would also like to search the whole table afterwards. Any suggestions?

Related

Sqoop - Manipulate a Mysql table before importing to HDFS

Can we edit the table by selecting specific columns or applying other conditions in MySQL, saving the result as a new table in the MySQL database, before importing it to HDFS?
Yes, we could save a new table to MySQL before exporting it to HDFS. We could also edit the file with a vi editor and then export it to HDFS, but it is much easier to use Sqoop.
You can use sqoop eval before sqoop import to run arbitrary SQL against the database, for example to build the new table first.
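As a minimal sketch, it could look something like this (the table and column names EMPLOYEES_CLEAN, name, and dept are hypothetical placeholders for your own schema):

sqoop eval --connect jdbc:mysql://db.foo.com/corp --username user -P \
    --query "CREATE TABLE EMPLOYEES_CLEAN AS SELECT name, dept FROM EMPLOYEES WHERE dept IS NOT NULL"
sqoop import --connect jdbc:mysql://db.foo.com/corp --username user -P \
    --table EMPLOYEES_CLEAN --target-dir /user/hive/warehouse/employees_clean -m 1

Alternatively, sqoop import itself accepts a free-form --query (with a $CONDITIONS placeholder) if you would rather not materialize a new table in MySQL at all.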

How to import encrypted column data from mysql to hdfs using sqoop?

Imagine I have a table containing a student ID column that is encrypted in MySQL. How can I import that data to HDFS using Sqoop?

Import large amount of MySQL data to Hadoop

I am planning to use a stack consisting of Hadoop, Hive and Impala for analysing big data. I have the setup ready and now I am trying to import data from a MySQL table. The table size is more than 500 GB and I am planning to use Sqoop as follows:
sqoop import --connect jdbc:mysql://remote_host_ip/database_name --username user_name -P --table table_name --hive-import --compression-codec=snappy --as-parquetfile --warehouse-dir=/user/hive/warehouse -m 1
Is there a better method for doing this import, since it involves transferring 500 GB of data over the network? Is it possible to compress the data during the import to Hive, so that Impala can be used to query it?
Sqoop is the best approach. It is very efficient at bulk loading.
Also read about the MySQL Hadoop Applier, which is designed to perform real-time replication of events from MySQL to Hadoop.
You can set "-m 4" instead of "-m 1". This allows the MySQL data to be imported in parallel: instead of one mapper transferring 500 GB, four mappers transfer the data in parallel (roughly 125 GB per mapper).
Sqoop is a good fit for importing 500 GB of data into a columnar HDFS format such as Parquet. You can also use '-m 12' to run more parallel mappers for the import, as sketched below.
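As a rough sketch, the original command with more mappers might look like the following (the split column id is an assumption; pick an evenly distributed, indexed column, or omit --split-by if the table has a primary key):

sqoop import --connect jdbc:mysql://remote_host_ip/database_name --username user_name -P \
    --table table_name --hive-import --compression-codec=snappy --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse --split-by id -m 12

Snappy-compressed Parquet keeps the data compact on HDFS and is a format Impala can query directly.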

How does hadoop handle changes to rows ingested from an RDBMS

I have a scenario where data is ingested into Hadoop from a MySQL database every day into a dated folder. A few rows are edited every day, and there might also be some schema changes. How do we handle this in Hadoop if I am only interested in the latest data and schema?
Here is the documentation for incremental imports in Sqoop. Also, Sqoop takes the table name while importing the data, so if the schema changes the Sqoop command should stay the same.
bin/sqoop import --connect jdbc:mysql://localhost/bigdata --table widgets -m 1
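A minimal incremental-import sketch could look like this (the check column last_updated, the last-value timestamp, and the merge key id are assumptions about your table):

bin/sqoop import --connect jdbc:mysql://localhost/bigdata --table widgets -m 1 \
    --incremental lastmodified --check-column last_updated \
    --last-value "2014-01-01 00:00:00" --merge-key id

With --incremental lastmodified, only rows whose check column is newer than the saved --last-value are fetched, and --merge-key folds the updated rows into the existing data instead of appending duplicates.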

Exporting HBase table to mysql

I am using hbase-0.90.6. I want to export the data from HBase to MySQL. I know the two-step process: first run a MapReduce job to pull the HBase data into flat files, then export the flat-file data into MySQL.
Is there any other tool I can use to reduce this two-step process to one step? Or can we use Sqoop to do the same in one step? Thanks.
I'm afraid Sqoop does not support exporting directly from HBase at the moment. Sqoop can help you with the second step of the two-step process - i.e. Sqoop can take data from HDFS and export it to MySQL.
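For that second step, a sketch of the export might look like this (the table name STUDENTS, the directory /user/hbase_dump, and the comma delimiter are assumptions about how your MapReduce job writes the flat files):

sqoop export --connect jdbc:mysql://db.foo.com/corp --username user -P \
    --table STUDENTS --export-dir /user/hbase_dump \
    --input-fields-terminated-by ','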
Yes, Sqoop is the tool that can be used for both importing and exporting your data between MySQL and HBase.
You can learn more about Sqoop at http://sqoop.apache.org