Import from MySQL to Hive using Sqoop

I have to import more than 400 million rows from a MySQL table (which has a composite primary key) into a PARTITIONED Hive table via Sqoop. The table holds two years of data, with a departure date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data by departure date.
The versions:
Apache Hadoop - 1.0.4
Apache Hive - 0.9.0
Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0
As far as I know, there are three approaches:
1. MySQL -> Non-partitioned Hive table -> INSERT from the non-partitioned Hive table into a partitioned Hive table
2. MySQL -> Partitioned Hive table
3. MySQL -> Non-partitioned Hive table -> ALTER the non-partitioned Hive table to add PARTITION
My notes on each:
1. This is the current, painful approach that I'm following.
2. I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example.
3. The syntax requires specifying partitions as key-value pairs, which is not feasible with millions of records where one cannot enumerate all the partition key-value pairs.
Can anyone provide inputs for approaches 2 and 3?

You can create a partitioned Hive table first and then write the Sqoop import for it.
For example:
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table <hive table> \
--connect jdbc<mysql path>/DATABASE=xxxx \
--table <mysql table> --username xxxx --password xxxx --num-mappers 1 --hive-partition-key <partition column> --hive-partition-value <partition value> --hive-import \
--fields-terminated-by ',' --lines-terminated-by '\n'

You have to create the partitioned table structure first, before you move your data into it. In the Sqoop command there is no need to specify --hive-partition-key and --hive-partition-value; use --hcatalog-table instead of --hive-table.
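A minimal sketch of the HCatalog route described above. It assumes a Sqoop release with HCatalog support (as far as I know that means 1.4.4 or later, so not the 1.4.2 build from the question) and a partitioned Hive table that already exists. The host, database, table, user, and split column names are hypothetical. The script only prints the command so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical names: dbhost, flights, bookings, bookings_partitioned,
# etl_user, booking_id. Replace them with your own.
hcat_cmd=(
  sqoop import
  --connect "jdbc:mysql://dbhost/flights"
  --username etl_user -P
  --table bookings
  --split-by booking_id
  --hcatalog-database default
  --hcatalog-table bookings_partitioned
  --num-mappers 4
)

# Print the command for review. To execute it, run: "${hcat_cmd[@]}"
printf '%q ' "${hcat_cmd[@]}"; echo
```

With a composite primary key it seems safer to name the split column explicitly, which is why --split-by appears here.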
Manu

If this is still something people want to understand, they can use:
sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee --num-mappers <numeral> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date
Notes from the patch:
sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"
Some limitations:
It only allows for one partition key/value.
The type of the partition key is hardcoded to string.
With auto partitioning in hive 0.7 we may want to adjust this to just have one command line option for the key name and use that column from the db table to partition.
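Because the option accepts only one partition key/value per run, one way to cover many dates is to issue one import per day. The sketch below is a dry run that prints one command per date. All names except departure_date (host, database, tables, user, column list) are hypothetical, and the partition column is left out of --columns on the assumption that Hive rejects a partition key that is also a regular column.

```shell
#!/usr/bin/env bash
# Hypothetical names: dbhost, flights, bookings, bookings_by_day, etl_user,
# booking_id, passenger_id, fare. Only departure_date comes from the question.

# Print the sqoop command that loads one day's rows into one Hive partition.
print_import_for_day() {
  local day="$1"
  local cmd=(
    sqoop import
    --connect "jdbc:mysql://dbhost/flights"
    --username etl_user -P
    --table bookings
    --columns "booking_id,passenger_id,fare"
    --where "departure_date = ${day}"
    --split-by booking_id
    --hive-import --hive-table bookings_by_day
    --hive-partition-key departure_date
    --hive-partition-value "${day}"
  )
  printf '%q ' "${cmd[@]}"; echo
}

# Dry run for two sample days; review the output before executing anything.
for day in 20120605 20120606; do
  print_import_for_day "$day"
done
```

For two years of data this means roughly 730 separate imports, so an unattended run would need a non-interactive credential option in place of -P.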

Related

Updating hive table with sqoop from mysql table

I already have a Hive table called roles. I need to update this table with data coming from MySQL. I used this script, thinking that it would add new rows and update existing ones in my Hive table:
sqoop import --connect jdbc:mysql://nn01.itversity.com/retail_export --username retail_dba --password itversity \
--table roles --split-by id_emp --check-column id_emp --last-value 5 --incremental append \
--target-dir /user/ingenieroandresangel/hive/roles --hive-import --hive-database poc --hive-table roles
Unfortunately, that only inserts the new data; it does not update records that already exist. Before you ask, a couple of points:
the table doesn't have a PK
if I don't specify --last-value as a parameter, I get duplicate records for rows that already exist.
How can I solve this without truncating the table or recreating it with a PK? Is there a way?
Thanks, guys.
Hive does not support UPDATE queries on ordinary (non-transactional) tables.
You have to drop or truncate the old table and reload it.
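A sketch of the full reload, reusing the connection details from the question. It assumes that replacing the whole table on each run is acceptable, and it uses --hive-overwrite so the Hive table contents are replaced rather than appended to. The script only prints the command.

```shell
#!/usr/bin/env bash
# Connection details and names are taken from the question; -P (prompt for
# the password) is an assumption that replaces the inline --password.
reload_cmd=(
  sqoop import
  --connect "jdbc:mysql://nn01.itversity.com/retail_export"
  --username retail_dba -P
  --table roles
  --split-by id_emp
  --hive-import --hive-overwrite
  --hive-database poc --hive-table roles
)

# Print the command for review. To execute it, run: "${reload_cmd[@]}"
printf '%q ' "${reload_cmd[@]}"; echo
```

The incremental options (--incremental, --check-column, --last-value) are deliberately absent, since a full reload re-reads every row.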

How to automatically sync a MySQL table with a Hive external table using Sqoop?

I already have a MySQL table on my local machine (Linux), and I have a Hive external table with the same schema as the MySQL table.
I'm trying to import data from the MySQL table into my Hive external table, and I'm using Sqoop for this.
The problem is that whenever a new record is added to the MySQL table, the Hive external table is not updated automatically.
This is the Sqoop import command I'm using:
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name,age,salary --table customer --target-dir /user/chamith/mysqlhivetest/ --fields-terminated-by "," --hive-import --hive-table test.customers
Am I missing something here? How can this be done?
Any help would be appreciated.
In your case, new rows are appended to the table, so you need to use the incremental append approach.
When to use append mode?
Works for numerical data that is incrementing over time, such as
auto-increment keys
When importing a table where new rows are continually being added
with increasing row id values
Here is what you need to add to the command:
--check-column Specifies the column to be examined when determining which rows to import.
--incremental Specifies how Sqoop determines which rows are new.
--last-value Specifies the maximum value of the check column from the previous import.
The ideal way to do this is with a sqoop job, because the Sqoop metastore then remembers the last value automatically.
Step 1: Initially load the data with a normal import command.
Step 2: Create a saved job for the incremental import:
sqoop job --create incrementalImportJob -- import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root \
-P \
--split-by id \
--columns id,name,age,salary \
--table customer \
--incremental append \
--check-column id \
--last-value 5 \
--fields-terminated-by "," \
--target-dir hdfs://ip:8020/path/to/table/
Hope this helps.
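Creating the saved job does not run it. A possible way to get the "automatic" part is to execute the job on a schedule, sketched below. The job name comes from the answer above; the script path, log path, and 15-minute interval are hypothetical. Note that a job created with -P prompts for the password on each execution unless the metastore is configured to store it, so an unattended run needs that sorted out first.

```shell
#!/usr/bin/env bash
# Job name from the answer above; everything else here is a hypothetical example.
JOB_NAME="incrementalImportJob"

# Execute the saved job once. After a successful run, Sqoop updates the
# stored last-value so the next run picks up only newer rows.
run_incremental_import() {
  sqoop job --exec "$JOB_NAME"
}

# Build a crontab line that runs a script every 15 minutes and appends its
# output to a log file. Arguments: script path, log path.
cron_line() {
  printf '*/15 * * * * %s >> %s 2>&1\n' "$1" "$2"
}

# Only touch Sqoop when explicitly asked to, e.g.: ./sync.sh --run
if [[ "${1:-}" == "--run" ]]; then
  run_incremental_import
fi
```

For example, `cron_line /opt/etl/sync.sh /tmp/sync.log` prints a line that can be added with `crontab -e`. `sqoop job --list` and `sqoop job --show incrementalImportJob` can be used to inspect what is stored.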

Optimizing Sqoop data import from MySQL to Hive using import-all-tables

I am using Sqoop 1.4.6 to import data from MySQL to Hive using the import-all-tables option. The result is ok, but the import process itself is quite slow. For example one of the databases contains 40-50 tables with well under 1 million rows in total, and takes around 25-30 minutes to complete. Upon investigating, it seems most of the time is spent initialising Hive for each imported table. Testing a plain mysqldump on the same database completes in under 1 minute. So the question is how to reduce this initialisation time, if that is the case, for example using a single Hive session?
The import command is:
sqoop import-all-tables -Dorg.apache.sqoop.splitter.allow_text_splitter=true --compress --compression-codec=snappy --num-mappers 1 --connect "jdbc:mysql://..." --username ... --password ... --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims --hive-import --hive-overwrite --hive-database ... --as-textfile --exclude-tables ... --warehouse-dir=...
Update:
Sqoop version: 1.4.6.2.5.3.0-37
Hive version: 1.2.1000.2.5.3.0-37
Could be related to:
https://issues.apache.org/jira/browse/HIVE-10319
Remove the --num-mappers 1 option so the import runs with the default 4 mappers, or raise it (for example --num-mappers 8, if the hardware allows). This runs more parallel tasks for tables that have a primary key. Also add the --autoreset-to-one-mapper option, which makes Sqoop fall back to 1 mapper for tables without a primary key. Also try --direct mode:
sqoop import-all-tables \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:mysql://..." --username ... \
--password ... \
--compress --compression-codec=snappy \
--num-mappers 8 \
--autoreset-to-one-mapper \
--direct \
--null-string '\\N'
...
Let us know if this improves the performance.
Update:
--fetch-size=<n>, where <n> is the number of entries that Sqoop fetches at a time. The default is 1000.
Increase the value of the fetch-size argument based on the volume of data that needs to be read. Set the value based on the available memory and bandwidth.
You can also increase the mapper memory from its current value to a higher number, for example:
sqoop import-all-tables -D mapreduce.map.memory.mb=2048 -D mapreduce.map.java.opts=-Xmx1024m <sqoop options>
Sqoop Performance Tuning Best Practices
Tune the following Sqoop arguments in the JDBC connection or Sqoop mapping to optimize performance:
batch (for export)
split-by and boundary-query (not needed here since we are using --autoreset-to-one-mapper; they can't be used with import-all-tables)
direct
fetch-size
num-mappers

Sqoop import all tables not syncing with Hive database

I imported MySQL database tables into Hive with Sqoop, using the script below.
sqoop import-all-tables --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username=retail_dba --password=cloudera --hive-import --hive-overwrite --create-hive-table --warehouse-dir=/user/hive/warehouse/
But when I check the databases in Hive, there is no retail.db.
If you want to import all tables into a specific Hive database (already created), use:
--hive-database retail
in your sqoop command.
As dev said, if you want to sqoop everything into a particular database, use
--hive-database retail_db; otherwise every table will be sqooped under the default warehouse dir/tablename.
Your command sqoops everything into this directory: /user/hive/warehouse/retail.db/
To import into Hive, use the --hive-import argument. Also, why are you using --as-textfile?
If you want to store the data as text files, use --as-textfile and then create external tables in Hive with the Hive external table command.
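Putting the --hive-database suggestion together, a sketch of the two steps might look like this. The connection string and user come from the question and the database name retail from the answer; using -P instead of an inline password is an assumption. The script prints both steps and does not run them.

```shell
#!/usr/bin/env bash
# "retail" is the database name used in the answer; it has to exist in Hive
# before the import, so step 1 creates it if needed.
HIVE_DB="retail"
CREATE_DB_HQL="CREATE DATABASE IF NOT EXISTS ${HIVE_DB};"

import_cmd=(
  sqoop import-all-tables
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db"
  --username retail_dba -P
  --hive-import --hive-overwrite
  --hive-database "$HIVE_DB"
)

# Print both steps for review instead of running them.
printf 'hive -e %q\n' "$CREATE_DB_HQL"
printf '%q ' "${import_cmd[@]}"; echo
```

After the import, `SHOW TABLES IN retail;` in Hive should list the imported tables if everything went as expected.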

How to transfer mysql table to hive?

I have a large mysql table that I would like to transfer to a Hadoop/Hive table. Are there standard commands or techniques to transfer a simple (but large) table from Mysql to Hive? The table stores mostly analytics data.
First, download mysql-connector-java-5.0.8 and put the jar in Sqoop's lib and bin folders.
Create the table definition in Hive with exactly the same field names and types as in MySQL.
sqoop import --verbose --connect jdbc:mysql://localhost/test --table employee --hive-import --warehouse-dir /user/hive/warehouse --fields-terminated-by ',' --split-by id --hive-table employee
test - Database name
employee - Table name (present in test)
/user/hive/warehouse - Directory in HDFS where the data has to be imported
--split-by id - id can be the primary key of the table 'employee'
--hive-table employee - employee table whose definition is present in Hive
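For the step of creating the table definition in Hive, a sketch could look like the following. The column names and types are hypothetical, since the MySQL schema is not shown; the field delimiter matches the --fields-terminated-by ',' used in the import command.

```shell
#!/usr/bin/env bash
# Hypothetical columns for the employee table; adjust names and types so
# they match the MySQL table exactly.
EMPLOYEE_DDL="CREATE TABLE IF NOT EXISTS employee (
  id INT,
  name STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;"

# Print the statement for review. To apply it, run: hive -e "$EMPLOYEE_DDL"
printf '%s\n' "$EMPLOYEE_DDL"
```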
Sqoop User Guide (one of the best guides for learning Sqoop)
Apache Sqoop is a tool that solves this problem:
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.