We are importing databases from MySQL to Hive using Sqoop (1.4.6). Everything works OK except when table schemas get updated (mainly columns being added) in the source databases: the modifications do not end up in Hive. It seems that the Hive schema is created only once and not verified on each import. The rows are loaded fine, but of course they are missing the new columns. We can work around this by first dropping the databases to force a schema re-creation in Hive, but my question is: is there a way to do this from Sqoop directly?
Our import script resembles:
sqoop import-all-tables \
  --compress \
  --compression-codec=snappy \
  --connect "jdbc:mysql://HOST:PORT/DB" \
  --username USER \
  --password PASS \
  --hive-import \
  --hive-overwrite \
  --hive-database DB \
  --as-textfile
You can use an HCatalog table instead of the Hive import; it will work.
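For example, a per-table import routed through HCatalog might look roughly like the sketch below. The connection placeholders mirror the script above, SOME_TABLE is just an illustrative table name, and the flags shown are Sqoop's standard HCatalog arguments; it is worth verifying how your Sqoop/HCatalog versions handle newly added columns.
# sketch: import one table into an HCatalog-managed table instead of using --hive-import
sqoop import \
  --connect "jdbc:mysql://HOST:PORT/DB" \
  --username USER \
  --password PASS \
  --table SOME_TABLE \
  --hcatalog-database DB \
  --hcatalog-table SOME_TABLE \
  --create-hcatalog-table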
Related
I am new to Apache Hudi. Please let me know if there is any configuration provided in Apache Hudi for writing data to a MySQL database.
If you want to write data from a Hudi table to MySQL, you can read your Hudi table as a dataframe and then use the Spark JDBC writer to push the data to MySQL:
# read hudi table
df = spark.read.format("hudi").load("/path/to/table")
# write dataframe to MySQL
df.write.format('jdbc').options(
    url='jdbc:mysql://<MySQL server url>/<database name>',
    driver='com.mysql.jdbc.Driver',
    dbtable='<table name>',
    user='<user name>',
    password='<password>'
).mode('append').save()
But if you want to use MySQL as a file system for your Hudi tables, the answer is no. Hudi manages the storage layer of its datasets on an HCFS (Hadoop Compatible File System), and since MySQL is not an HCFS, Hudi cannot use it.
You can try out options like Sqoop, which is a tool for ETL of data from Hadoop (HDFS, Hive) to SQL tables and vice versa.
If we have a CSV/text-file table in Hive, Sqoop can easily export it from HDFS to a MySQL table (e.g. on RDS) with a command like the one below:
export --table <mysql_table_name> --export-dir hdfs:///user/hive/warehouse/csvdata --connect jdbc:mysql://<host>:3306/<db_name> --username <username> --password-file hdfs:///user/test/mysql.password --batch -m 1 --input-null-string "\\N" --input-null-non-string "\\N" --columns <column names to be exported, without whitespace in between the column names>
Feel free to check out my question post on Sqoop here
I am new to Cassandra. I am trying to transfer my whole MySQL database to Cassandra using Sqoop, but after setting everything up, when I execute the following command:
bin/dse sqoop import-all-tables -m 1 --connect jdbc:mysql://127.0.0.1:3306/ABCDatabase --username root --password root --cassandra-thrift-host localhost --cassandra-create-schema --direct
I received the following error:
Sqoop functionality has been removed from DSE.
It says that Sqoop functionality has been removed from DataStax Enterprise. If it has been removed, is there any other way to do this?
Thanks
You can use Spark to transfer data - it should be easy, something like:
val table = spark.read.jdbc(jdbcUrl, "table", connectionProperties)
table.write.format("org.apache.spark.sql.cassandra").options(
Map("table" -> "TBL", "keyspace" -> "KS")).save()
Examples of JDBC URLs, options, etc. are described in the Databricks documentation, as they differ between databases.
Most of the tutorials I've researched tell me that I have to use Sqoop for export/import, and a lot of the manuals show how to get data from a DB into HDFS, but how can I do the reverse?
Let's say I have a company DB on localhost with an empty users table (columns: id, user), and I have a Hadoop job that produces data in the form (id, user) but saves it to some hadoop-output.txt instead of MySQL.
Are there commands I can run from the command line to load data from HDFS into MySQL via Sqoop?
sqoop-export does this.
sqoop-export --connect jdbc:mysql://localhost/company \
  --username user --password passwd \
  --table users \
  --export-dir /path/to/HDFS_Source \
  --input-fields-terminated-by ','
Refer to SqoopUserGuide.html#sqoop_export.
I imported MySQL database tables into Hive using the Sqoop tool with the script below.
sqoop import-all-tables --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username=retail_dba --password=cloudera --hive-import --hive-overwrite --create-hive-table --warehouse-dir=/user/hive/warehouse/
But when I check the databases in Hive, there is no retail.db.
If you want to import all tables into a specific Hive database (already created), use:
--hive-database retail
in your sqoop command.
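For example, the command from the question could be extended like this (a sketch, assuming the Hive database retail has already been created):
sqoop import-all-tables \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  --password=cloudera \
  --hive-import \
  --hive-overwrite \
  --create-hive-table \
  --hive-database retail \
  --warehouse-dir=/user/hive/warehouse/
With --hive-database set, the imported tables are registered under the retail database in Hive instead of the default one.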
As dev said, if you want to sqoop everything into a particular Hive database, use --hive-database retail_db; otherwise every table will be sqooped under the default warehouse dir as <warehouse dir>/<tablename>.
Your command sqoops everything into this directory: /user/hive/warehouse/retail.db/
To import into Hive, use the --hive-import argument. Also, why are you using --as-textfile?
If you want to store the data as text files, use --as-textfile and then create external tables in Hive with the CREATE EXTERNAL TABLE command.
I was able to use Sqoop to import a MySQL table "titles" into HDFS using a command like this:
sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table=titles --target-dir=titles --m=1
Now I want to import it into Hive. If I use the following command:
sqoop import --connect jdbc:mysql://localhost/employees --username=root -P --table titles --hive-import
I get the following message:
Output directory hdfs://localhost:9000/user/root/titles already exists
In hive, if I do a show tables I get the following:
hive> show tables;
OK
dept_emp
emp
myfirsthivetable
parted1emp
partitionedemp
You can see there is no table called titles in Hive.
I am confused by this: for Sqoop-imported data, is there a one-to-one relationship between HDFS and Hive? What does this message mean?
Thank you for enlightening me.
As Amit has pointed out, because you already created the HDFS directory titles with your first command, Sqoop refuses to overwrite it since it already contains data.
In your second command, you are telling Sqoop to import (once again) the whole table, which was already imported by the first command, into Hive. Since you are not specifying --target-dir with an HDFS destination, Sqoop tries to create the folder titles under /user/root/. Since this folder already exists, an error is raised.
When you tell Hive to show the tables, titles doesn't appear because the second command (with --hive-import) was not successful, so Hive doesn't know anything about the data. When you add the --hive-import flag, what Sqoop does under the hood is update the Hive metastore, which is a database that holds the metadata of Hive tables, partitions and HDFS locations.
You could do the data import with just one Sqoop command instead of two. If you delete the titles HDFS folder and run something like this:
sqoop import --connect jdbc:mysql://localhost/employees --username=root \
  -P --table=titles --target-dir /user/root/titles --hive-import --m=1
This way, you are pulling the data from MySQL, creating the /user/root/titles HDFS directory and updating the metastore, so that Hive knows where the table (and the data) is.
But what if you don't want to delete the folder with the data you already imported? In that case, you can create a new Hive table titles and specify the location of the data using something like this:
CREATE [TEMPORARY] [EXTERNAL] TABLE titles
[(col_name data_type [COMMENT col_comment], ...)]
(...)
LOCATION '/user/root/titles'
This way, you wouldn't need to re-import the whole data again, since it's already in HDFS.
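As a concrete sketch for this case, assuming titles has the columns of MySQL's employees sample database (emp_no, title, from_date, to_date; adjust names and types to your actual schema) and was imported as comma-delimited text (Sqoop's default for text imports):
-- external table over the data Sqoop already wrote to /user/root/titles
CREATE EXTERNAL TABLE titles (
  emp_no    INT,
  title     STRING,
  from_date STRING,
  to_date   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/root/titles';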
When you create a table in Hive, it eventually creates a directory on HDFS. Since you already ran the plain HDFS import first, a directory named "titles" has already been created on HDFS.
You can either delete the /user/root/titles directory from HDFS and run the Hive import again, or use the --hive-table option during the import.
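For example, the first option could look roughly like this (a sketch reusing the connection details from the commands above):
# remove the directory left over from the plain HDFS import
hdfs dfs -rm -r /user/root/titles
# re-import, this time registering the table in Hive
# (add --hive-table <name> if you want a different table name in Hive)
sqoop import --connect jdbc:mysql://localhost/employees --username=root \
  -P --table=titles --hive-import --m=1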
You can refer to the sqoop documentation.
Hope this helps.