I have a large MySQL table that I would like to transfer to a Hadoop/Hive table. Are there standard commands or techniques to transfer a simple (but large) table from MySQL to Hive? The table stores mostly analytics data.
First, download mysql-connector-java-5.0.8 and put the jar in the lib and bin folders of Sqoop.
Create the table definition in Hive with the exact field names and types as in MySQL.
sqoop import --verbose --fields-terminated-by ',' --connect jdbc:mysql://localhost/test --table employee --hive-import --warehouse-dir /user/hive/warehouse --split-by id --hive-table employee
test - Database name
employee - Table name (present in test)
/user/hive/warehouse - Directory in HDFS where the data has to be imported
--split-by id - id can be the primary key of the table 'employee'
--hive-table employee - employee table whose definition is present in Hive
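If you end up scripting this, the flags explained above can be assembled programmatically. The following is only a minimal sketch: the helper name build_hive_import and its default values are my own assumptions, not part of Sqoop, and it only builds the argument list without running anything.

```python
import shlex


def build_hive_import(database, table, split_by, host="localhost",
                      warehouse_dir="/user/hive/warehouse", delimiter=","):
    """Return the sqoop argv for a MySQL -> Hive import (nothing is executed).

    build_hive_import is a hypothetical helper; only the flags themselves
    come from the command shown above.
    """
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--table", table,                    # source table in MySQL
        "--split-by", split_by,              # column used to split work across mappers
        "--fields-terminated-by", delimiter,
        "--warehouse-dir", warehouse_dir,    # parent HDFS directory for the import
        "--hive-import",
        "--hive-table", table,               # Hive table whose definition already exists
    ]


cmd = build_hive_import("test", "employee", "id")
print(shlex.join(cmd))
```

On a machine where Sqoop is installed, the list could be passed to subprocess.run(cmd, check=True). Keeping it as a list rather than one string avoids shell-quoting problems with the delimiter.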
Sqoop User Guide (one of the best guides for learning Sqoop)
Apache Sqoop is a tool that solves this problem:
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Related
I am new to Apache Hudi. Please let me know if Apache Hudi provides any configuration for writing data to a MySQL database.
If you want to write data from a Hudi table to MySQL, you can read the Hudi table as a dataframe and then use the Spark JDBC driver to write the data to MySQL:
# read hudi table
df = spark.read.format("hudi").load("/path/to/table")
# write dataframe to MySQL
df.write.format('jdbc').options(
url='jdbc:mysql://<MySQL server url>/<database name>',
driver='com.mysql.jdbc.Driver',
dbtable='<table name>',
user='<user name>',
password='<password>'
).mode('append').save()
But if you want to use MySQL as a file system for your Hudi tables, the answer is no. Hudi manages the storage layer of its datasets on an HCFS (Hadoop Compatible File System), and since MySQL is not an HCFS, Hudi cannot use it.
You can try options like Sqoop, which is an interface for ETL of data from Hadoop (HDFS, Hive) to SQL tables and vice versa.
If we have a CSV/text-file table in Hive, Sqoop can easily export the table from HDFS to a MySQL table on RDS with a command like the one below:
sqoop export --table <mysql_table_name> --export-dir hdfs:///user/hive/warehouse/csvdata --connect jdbc:mysql://<host>:3306/<db_name> --username <username> --password-file hdfs:///user/test/mysql.password --batch -m 1 --input-null-string "\\N" --input-null-non-string "\\N" --columns <column names to be exported, without whitespace in between the column names>
I already have a MySQL table on my local machine (Linux), and I have a Hive external table with the same schema as the MySQL table.
I'm trying to import data from the MySQL table into my Hive external table, and I'm using Sqoop for this.
The problem is that whenever a new record is added to the MySQL table, the Hive external table is not updated automatically.
This is the Sqoop import command I'm using:
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name,age,salary --table customer --target-dir /user/chamith/mysqlhivetest/ --fields-terminated-by "," --hive-import --hive-table test.customers
Am I missing something over here? Or how can this be done?
Any help would be appreciated.
In your case, new rows are appended to the MySQL table, so you need the incremental append approach.
When to use append mode?
It works for numerical data that increases over time, such as auto-increment keys.
Use it when importing a table where new rows are continually being added with increasing row id values.
Now, here is what you need to add to the command:
--check-column Specifies the column to be examined when determining which rows to import.
--incremental Specifies how Sqoop determines which rows are new.
--last-value Specifies the maximum value of the check column from the previous import.
The ideal way to do this is with a sqoop job, because the Sqoop metastore then remembers the last value automatically.
Step 1: Initially load the data with a normal import command.
Step 2:
sqoop job --create incrementalImportJob -- import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root \
-P \
--split-by id \
--columns id,name,age,salary \
--table customer \
--incremental append \
--check-column id \
--last-value 5 \
--fields-terminated-by "," \
--target-dir hdfs://ip:8020/path/to/table/
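To make the append semantics concrete, here is a small Python simulation of what --incremental append does with --check-column and --last-value. It is only an illustration of the selection rule (rows whose check column is greater than the last value), not Sqoop's actual implementation, and the function name and sample rows are made up.

```python
def incremental_append(rows, check_column, last_value):
    """Mimic --incremental append on an in-memory list of rows.

    Returns (new_rows, new_last_value). new_last_value is the value a saved
    sqoop job would remember for its next run.
    """
    # Only rows strictly greater than the previous maximum count as new.
    new_rows = [r for r in rows if r[check_column] > last_value]
    # If nothing new arrived, the remembered value stays unchanged.
    new_last = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, new_last


# Hypothetical contents of the customer table: ids 1..7.
customer = [{"id": i, "name": f"user{i}"} for i in range(1, 8)]

# First incremental run, equivalent to --check-column id --last-value 5.
batch, last = incremental_append(customer, "id", 5)
print([r["id"] for r in batch], last)  # [6, 7] 7

# A new record is added in MySQL; the next run only picks up that row.
customer.append({"id": 8, "name": "user8"})
batch, last = incremental_append(customer, "id", last)
print([r["id"] for r in batch], last)  # [8] 8
```

Note that nothing happens automatically: the saved job still has to be run each time (for example on a schedule) with sqoop job --exec incrementalImportJob.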
Hope this helps.
I imported MySQL database tables into Hive with Sqoop, using the script below.
sqoop import-all-tables --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username=retail_dba --password=cloudera --hive-import --hive-overwrite --create-hive-table --warehouse-dir=/user/hive/warehouse/
But when I check the databases in Hive, there is no retail.db.
If you want to import all tables into a specific Hive database (already created), use:
--hive-database retail
in your sqoop command.
As dev said, if you want to sqoop everything into a particular database, then use
--hive-database retail_db; otherwise every table will be sqooped under the default warehouse dir/tablename.
Your command sqoops everything into this directory: /user/hive/warehouse/retail.db/
To import into Hive, use this argument: --hive-import. And why are you using --as-textfile?
If you want to store the data as text files, use --as-textfile and then use the Hive external table command to create external tables in Hive.
I was able to use Sqoop to import a MySQL table "titles" to HDFS using a command like this:
sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table=titles --target-dir=titles --m=1
Now I want to import into Hive. If I use the following command:
sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table titles --hive-import
I get this message:
Output directory hdfs://localhost:9000/user/root/titles already exists
In hive, if I do a show tables I get the following:
hive> show tables;
OK
dept_emp
emp
myfirsthivetable
parted1emp
partitionedemp
You can see there is no table called titles in Hive.
I am confused by this. For data imported with Sqoop, is there a one-to-one relationship between HDFS and Hive? What does the message mean?
Thank you for enlightening me.
As Amit has pointed out, since you already created the HDFS directory in your first command, Sqoop refuses to overwrite the folder titles since it already contains data.
In your second command, you are telling Sqoop to import (once again) the whole table (which was already imported in the first command) into Hive. Since you are not specifying --target-dir with the HDFS destination, Sqoop will try to create the folder titles under /user/root/. Since this folder already exists, an error is raised.
When you tell Hive to show the tables, titles doesn't appear because the second command (with hive-import) was not successful, and Hive doesn't know anything about the data. When you add the flag --hive-import, what Sqoop does under the hood is update the Hive metastore which is a database that has the metadata of the Hive tables, partitions and HDFS location.
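The directory clash described above can be sketched in a few lines of Python. This is only an approximation of how the output path is derived (relative paths resolve against the user's HDFS home directory, and the table name is used when no --target-dir is given). The function name and the set standing in for HDFS are my own assumptions.

```python
import posixpath


def resolve_target_dir(user, table, target_dir=None):
    """Approximate where a plain `sqoop import` writes in HDFS.

    Relative paths resolve against /user/<user>; with no --target-dir
    the table name is used as the directory name.
    """
    home = posixpath.join("/user", user)
    # posixpath.join keeps an absolute target_dir unchanged.
    return posixpath.join(home, target_dir or table)


existing = set()  # stand-in for the directories already present in HDFS

# First command: --table=titles --target-dir=titles
first = resolve_target_dir("root", "titles", target_dir="titles")
existing.add(first)

# Second command: --table titles --hive-import, with no --target-dir
second = resolve_target_dir("root", "titles")
print(second, second in existing)  # /user/root/titles True
```

Both commands resolve to the same directory, which is why the second one stops with "Output directory ... already exists" before Hive is ever touched.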
You could do the data import with just one Sqoop command instead of two different ones. Delete the titles HDFS folder and run something like this:
sqoop import --connect jdbc:mysql://localhost/employees --username=root \
-P --table=titles --target-dir /user/root/titles --hive-import --m=1
This way, you are pulling the data from MySQL, creating the /user/root/titles HDFS directory and updating the metastore, so that Hive knows where the table (and the data) is.
But what if you don't want to delete the folder with the data that you already imported? In that case, you could create a new Hive table titles and specify the location of the data using something like this:
CREATE [TEMPORARY] [EXTERNAL] TABLE titles
[(col_name data_type [COMMENT col_comment], ...)]
(...)
LOCATION '/user/root/titles'
This way, you wouldn't need to re-import the whole data again, since it's already in HDFS.
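For illustration, here is a small Python helper that renders such a statement. Treat it as a sketch under assumptions: the column list is what I would expect for the titles table of the MySQL employees sample database (check it against your own schema), the dates are kept as STRING to stay on the safe side, and the comma delimiter assumes the first import used Sqoop's default text format.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Render a CREATE EXTERNAL TABLE statement over an existing HDFS directory.

    columns is a list of (name, hive_type) pairs. The delimiter has to match
    the one used by the import that wrote the files.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Assumed schema for illustration only; adjust to the real table.
titles_columns = [
    ("emp_no", "INT"),
    ("title", "STRING"),
    ("from_date", "STRING"),
    ("to_date", "STRING"),
]

ddl = external_table_ddl("titles", titles_columns, "/user/root/titles")
print(ddl)
```

The printed statement can be pasted into the hive shell. Because the table is EXTERNAL, dropping it later leaves the files under /user/root/titles in place.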
When you create a table in Hive, it eventually creates a directory on HDFS. Since you ran the plain Hadoop import first, a directory named "titles" had already been created on HDFS.
You can either delete the /user/root/titles directory from HDFS and run the Hive import command again, or use the --hive-table option during the import.
You can refer to the sqoop documentation.
Hope this helps.
I have to import > 400 million rows from a MySQL table (having a composite primary key) into a PARTITIONED Hive table via Sqoop. The table has data for two years, with a departure date column ranging from 20120605 to 20140605 and thousands of records per day. I need to partition the data based on the departure date.
The versions :
Apache Hadoop - 1.0.4
Apache Hive - 0.9.0
Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0
As per my knowledge, there are 3 approaches:
1. MySQL -> Non-partitioned Hive table -> INSERT from Non-partitioned Hive table into Partitioned Hive table
2. MySQL -> Partitioned Hive table
3. MySQL -> Non-partitioned Hive table -> ALTER Non-partitioned Hive table to add PARTITION
Notes on each approach:
1. This is the current painful one that I'm following.
2. I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example.
3. The syntax dictates specifying partitions as key-value pairs, which is not feasible with millions of records where one cannot think of all the partition key-value pairs.
Can anyone provide inputs for approaches 2 and 3?
I guess you can create a hive partitioned table.
Then write the sqoop import code for it.
for example:
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" --hive-table <hive table> \
--connect jdbc< mysql path>/DATABASE=xxxx \
--table <table> --username xxxx --password xxxx --num-mappers 1 --hive-partition-key <partition key> --hive-partition-value <partition value> --hive-import \
--fields-terminated-by ',' --lines-terminated-by '\n'
You have to create the partitioned table structure first, before you move your data into the partitioned table. In the sqoop command there is no need to specify --hive-partition-key and --hive-partition-value; use --hcatalog-table instead of --hive-table.
Manu
If this is still something people want to understand, they can use
sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee --num-mappers <numeral> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date
Notes from the patch:
sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"
Some limitations:
It only allows for one partition key/value
The type of the partition key is hardcoded to string
With auto partitioning in hive 0.7 we may want to adjust this to just have one command line option for the key name and use that column from the db table to partition.
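Given the one-key/one-value limitation above, one workaround is to run one import per departure date and filter the source rows with --where. Below is a rough Python sketch that only generates the commands. The connection string, table name and column list are hypothetical, credentials are left out, and the WHERE clause assumes the date is stored as a number like 20120605. As far as I know, Sqoop also refuses a partition key that is among the imported columns, so --columns lists everything except departure_date.

```python
from datetime import date, timedelta


def daily_partition_imports(connect, table, columns, key, start, end):
    """Yield one sqoop argv per day in [start, end]; nothing is executed."""
    day = start
    while day <= end:
        value = day.strftime("%Y%m%d")
        yield [
            "sqoop", "import",
            "--connect", connect,
            "--table", table,
            "--columns", ",".join(columns),   # must not include the partition key
            "--where", f"{key} = {value}",    # pull only this day's rows
            "--hive-import",
            "--hive-table", table,
            "--hive-partition-key", key,
            "--hive-partition-value", value,  # one static partition per run
        ]
        day += timedelta(days=1)


cmds = list(daily_partition_imports(
    "jdbc:mysql://localhost/flights",   # hypothetical database
    "bookings",                         # hypothetical table
    ["id", "passenger"],                # hypothetical non-partition columns
    "departure_date",
    date(2012, 6, 5), date(2014, 6, 5),
))
print(len(cmds))  # 731
```

That is 731 separate imports for the two years, so expect it to be slow. For a table this size, a single import into a non-partitioned staging table followed by a dynamic-partition INSERT may well be faster.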