How to export data from Hadoop to MySQL (or any DB)? - mysql

Most of the tutorials I've researched point out that I have to use Sqoop for export/import, and many manuals show how to export data from a DB to HDFS, but how can I do the reverse?
Let's say I have a company DB on localhost with an empty users table (columns: id, user), and Hadoop produces data like (id, user) but saves it to some hadoop-output.txt rather than to MySQL.
Is there a command-line way to import from HDFS into MySQL via Sqoop?

sqoop-export does this:
sqoop-export --connect jdbc:mysql://localhost/company \
  --username user --password passwd \
  --table users \
  --export-dir /path/to/HDFS_Source \
  --input-fields-terminated-by ','
See the Sqoop User Guide, section SqoopUserGuide.html#sqoop_export.
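One thing that trips people up: the files under --export-dir must match the target table's column order and the delimiter passed to --input-fields-terminated-by. A quick Python sketch of what a valid part file for the users (id, user) table would look like (the sample data is made up):

```python
import csv
import io

# Hypothetical contents of a part file under /path/to/HDFS_Source.
# Each line must list the columns in the MySQL table's order (id, user),
# separated by the delimiter given to --input-fields-terminated-by (',').
sample = "1,alice\n2,bob\n"

rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    assert len(row) == 2, "each line needs exactly two fields: id,user"
    int(row[0])  # id must parse as an integer, or the export job fails

print(rows)  # [['1', 'alice'], ['2', 'bob']]
```

If the delimiter or column order in the file disagrees with the command, sqoop-export fails with type-conversion errors rather than a clear message, so checking a sample line first saves debugging time.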

Related

Can I use mysql database as destination storage for apache hudi

I am new to Apache Hudi. Please let me know whether Apache Hudi provides any configuration for writing data to a MySQL database.
If you want to write data from a Hudi table to MySQL, you can read the Hudi table as a dataframe and then use the Spark JDBC writer to push the data to MySQL:
# read the Hudi table into a dataframe
df = spark.read.format("hudi").load("/path/to/table")
# write the dataframe to MySQL over JDBC
df.write.format('jdbc').options(
    url='jdbc:mysql://<MySQL server url>/<database name>',
    driver='com.mysql.jdbc.Driver',
    dbtable='<table name>',
    user='<user name>',
    password='<password>'
).mode('append').save()
But if you want to use MySQL as a file system for your Hudi tables, the answer is no. Hudi manages the storage layer of its datasets on an HCFS (Hadoop Compatible File System); since MySQL is not an HCFS, Hudi cannot use it.
You can try options like Sqoop, which is a tool for ETL of data from Hadoop (HDFS, Hive) to SQL tables and vice versa.
If we have a CSV/text-file table in Hive, Sqoop can export it from HDFS to a MySQL table (e.g. on RDS) with a command like this:
export --table <mysql_table_name> \
  --export-dir hdfs:///user/hive/warehouse/csvdata \
  --connect jdbc:mysql://<host>:3306/<db_name> \
  --username <username> \
  --password-file hdfs:///user/test/mysql.password \
  --batch -m 1 \
  --input-null-string "\\N" --input-null-non-string "\\N" \
  --columns <column names to be exported, without whitespace between the column names>
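The --input-null-string/--input-null-non-string "\\N" options exist because Hive text files conventionally encode SQL NULL as the literal token \N, and Sqoop needs to map that token back to NULL on export. A small Python illustration of the convention (the parsing logic here is mine, not Sqoop's):

```python
NULL_TOKEN = "\\N"  # the two literal characters \N used in Hive text files

def parse_field(field):
    # Map the \N token back to None (SQL NULL); everything else is data.
    return None if field == NULL_TOKEN else field

line = "3,\\N"  # a row whose second column is NULL
fields = [parse_field(f) for f in line.split(",")]
print(fields)  # ['3', None]
```

Without those options, Sqoop would insert the literal string "\N" into the MySQL column instead of a NULL.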
Feel free to check out my question post on Sqoop here

Export Data from HiveQL to MySQL using Oozie with Sqoop

I have a regularly updated table in Hive that I want to get into one of my tools, which has a MySQL database. I can't connect my application directly to the Hive database, so I want to export the data straight into the MySQL database.
I've searched a bit and found that this is possible with Sqoop, and I've been told to use Oozie since I want to regularly update the table and export it.
I've looked around for a while and tried some things, but so far I can't get it to work, and I just don't understand what I'm doing.
So far, the only code I understand, but which doesn't work, looks like this:
export --connect jdbc:mysql://myserver \
  --username username \
  --password password \
  --table theMySqlTable \
  --hive-table cluster.hiveTable
I've seen people use a temporary table and export it to a txt file before exporting that, but I'm not sure I can do that.
Should Oozie have specific parameters too? I'm not the administrator, so I'm not sure I'd be able to change them...
Thank you !
Try this.
sqoop export \
  --connect "jdbc:sqlserver://servername:1433;databaseName=EMP;" \
  --connection-manager org.apache.sqoop.manager.SQLServerManager \
  --username userid \
  -P \
  --table theMySqlTable \
  --input-fields-terminated-by '|' \
  --export-dir /hdfs path location of file/part-m-00000 \
  --num-mappers 1

Force sqoop to re-create hive schemas when importing

We are importing databases from MySQL into Hive using Sqoop (1.4.6). Everything works fine, except when table schemas are updated (mainly columns being added) in the source databases. The modifications do not end up in Hive. It seems the Hive schema is created only once and is not verified on each import. The rows load fine, but of course the new columns are missing. We can work around this by first dropping the databases to force a schema re-creation in Hive, but my question is: is there a way to do this from Sqoop directly?
Our import script resembles:
sqoop import-all-tables \
  --compress \
  --compression-codec=snappy \
  --connect "jdbc:mysql://HOST:PORT/DB" \
  --username USER \
  --password PASS \
  --hive-import \
  --hive-overwrite \
  --hive-database DB \
  --as-textfile
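The drop-and-recreate workaround described above boils down to one check: does the source schema still match what Hive created on the first import? A minimal Python sketch of that decision (the column lists are hard-coded stand-ins for what you would fetch via JDBC and DESCRIBE):

```python
def needs_recreate(source_cols, hive_cols):
    # Sqoop creates the Hive schema only on the first import, so any
    # difference between source and Hive columns means the Hive table
    # must be dropped and re-created to pick up the change.
    return list(source_cols) != list(hive_cols)

mysql_columns = ["id", "user", "created_at"]  # a column was added upstream
hive_columns = ["id", "user"]                 # schema Hive created originally

print(needs_recreate(mysql_columns, hive_columns))  # True
```

A wrapper script could run this check per table before the import and drop only the tables whose schemas have drifted, instead of dropping whole databases.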
You can use an HCatalog table instead of Hive; it will work.

Sqoop import all tables not syncing with Hive database

I imported MySQL database tables into Hive with the Sqoop tool, using the script below.
sqoop import-all-tables --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba --password=cloudera \
  --hive-import --hive-overwrite --create-hive-table \
  --warehouse-dir=/user/hive/warehouse/
But when I check the databases in Hive, there is no retail.db.
If you want to import all tables into a specific Hive database (one that already exists), use:
--hive-database retail
in your Sqoop command.
As dev said, if you want to sqoop everything into a particular DB, use --hive-database retail_db; otherwise every table will be sqooped under the default warehouse dir as <warehouse-dir>/<tablename>.
Your command sqoops everything into this directory: /user/hive/warehouse/retail.db/
To import into Hive, use the --hive-import argument. And why are you using --as-textfile?
If you want to store the data as text files, use --as-textfile and then create external tables in Hive with a CREATE EXTERNAL TABLE statement.

what's the difference of sqoop import to hdfs and to hive?

I was able to use Sqoop to import a MySQL table "titles" into HDFS using a command like this:
sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table=titles --target-dir=titles --m=1
Now I want to import to hive, if I use the following command:
sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table titles --hive-import
I get this message:
Output directory hdfs://localhost:9000/user/root/titles already exists
In Hive, if I run show tables I get the following:
hive> show tables;
OK
dept_emp
emp
myfirsthivetable
parted1emp
partitionedemp
You can see there is no table called titles in Hive.
I am confused by this: for Sqoop-imported data, is there any one-to-one relationship between HDFS and Hive? And what does that message mean?
Thank you for enlightening me.
As Amit has pointed out, since you already created the HDFS directory with your first command, Sqoop refuses to overwrite the folder titles because it already contains data.
In your second command, you are telling Sqoop to import the whole table (which was already imported by the first command) once again, this time into Hive. Since you are not specifying --target-dir with an HDFS destination, Sqoop will try to create the folder titles under /user/root/. Since this folder already exists, an error is raised.
When you tell Hive to show the tables, titles doesn't appear because the second command (with --hive-import) was not successful, so Hive doesn't know anything about the data. When you add the --hive-import flag, what Sqoop does under the hood is update the Hive metastore, a database that holds the metadata of Hive tables, partitions, and HDFS locations.
You could do the data import using just one Sqoop command instead of using two different ones. If you delete the titles HDFS folder and you perform something like this:
sqoop import --connect jdbc:mysql://localhost/employees --username=root \
  -P --table=titles --target-dir /user/root/titles --hive-import --m=1
This way, you are pulling the data from Mysql, creating the /user/root/titles HDFS directory and updating the metastore, so that Hive knows where the table (and the data) is.
But what if you don't want to delete the folder containing the data you already imported? In that case, you can create a new Hive table titles and specify the location of the data with something like this:
CREATE [TEMPORARY] [EXTERNAL] TABLE titles
[(col_name data_type [COMMENT col_comment], ...)]
(...)
LOCATION '/user/root/titles'
This way, you wouldn't need to re-import the whole data again, since it's already in HDFS.
When you create a table in Hive, it eventually creates a directory on HDFS; since you already ran the HDFS import first, a directory named "titles" had already been created on HDFS.
Either delete the /user/root/titles directory from HDFS and run the Hive import command again, or use the --hive-table option during the import.
You can refer to the sqoop documentation.
Hope this helps.