Sqoop: copy data from a MySQL table to a partitioned Hive table

I have written a Sqoop script:
HADOOP_USER_NAME=hdfs sqoop import \
--connect jdbc:mysql://cmsmaster.cy9mnipcdof2.us-east-1.rds.amazonaws.com/db \
--username user \
--password-file /user/password/dbpass.txt \
--fields-terminated-by ',' \
--target-dir /user/db/sqoop_internal \
--delete-target-dir \
--hive-import \
--hive-overwrite \
--hive-table sqoop_internal \
--query 'SOME_QUERY where $CONDITIONS' \
--split-by id
This copies the result of the query and moves it to a Hive table, overwriting its previous content.
Now what I need is to modify this script so that it doesn't overwrite the whole Hive table. Instead, it should overwrite a single partition of that Hive table. How can I do that?

From your question, I understand that you might need to do a Sqoop merge.
You need to remove:
--delete-target-dir and --hive-overwrite
And add:
--incremental lastmodified --check-column modified --last-value '2018-03-08 00:00:00' --merge-key yourPrimaryKey
You can find more information from the official documentation.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_merge_literal
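For reference, the standalone merge tool described in that documentation looks roughly like this (a sketch only: the directory names, jar file and class name below are illustrative placeholders from a prior import/codegen run, and id stands in for your primary key):
# merge the newer extract onto the previous one, keeping one row per merge key
sqoop merge --new-data /user/db/sqoop_internal_new \
--onto /user/db/sqoop_internal_old \
--target-dir /user/db/sqoop_internal_merged \
--jar-file sqoop_internal.jar \
--class-name sqoop_internal \
--merge-key id
When you pass --merge-key together with --incremental lastmodified to sqoop import, Sqoop runs this merge step for you after pulling the rows whose check column is newer than --last-value.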

Related

Export data from HDFS to MySQL but data should be updated before going to MySQL

I need to import data from MySQL to HDFS, and I'm doing that with Apache Sqoop. But the thing is, I also need to export data from HDFS back to MySQL, and I need to update one column of that data (while it is in HDFS) before moving it to MySQL. How can I do this?
You can update the column directly from HDFS: store the Hive output to HDFS using INSERT OVERWRITE DIRECTORY "path" (a sketch of that step is below), then go with the Sqoop export command that follows:
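A minimal sketch of that Hive step, assuming a Hive table also named employee and a hypothetical salary adjustment as the column update (swap in your own columns and expression):
# write the updated rows to the directory the Sqoop export below reads (tab-delimited, to match --fields-terminated-by '\t')
hive -e "
INSERT OVERWRITE DIRECTORY '/user/hdfs/mysql/export.txt'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT id, name, salary * 1.1 AS salary
FROM employee;
"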
sqoop export \
--connect jdbc:mysql://master/poc \
--username root \
--table employee \
--export-dir /user/hdfs/mysql/export.txt \
--update-key id \
--update-mode allowinsert \
--fields-terminated-by '\t' \
-m 1
Hope this helps..

How to use 'create-hive-table' with Sqoop correctly?

I'm trying to import data from a MySQL table to Hive using Sqoop. From what I understood, there are 2 ways of doing that:
1. Import the data into HDFS, then create an external table in Hive and load the data into that table.
2. Use create-hive-table while running the Sqoop query to create a new table in Hive and directly load the data into it. I am trying to do this but can't get it to work for some reason.
This is my code:
sqoop import \
--connect jdbc:mysql://localhost/EMPLOYEE \
--username root \
--password root \
--table emp \
--m 1 \
--hive-database sqoopimport \
--hive-table sqoopimport.employee \
--create-hive-table \
--fields-terminated-by ',';
I tried using --hive-import as well, but got an error.
When I ran the above query, the job was successful, but no table was created in Hive; the data was stored in the /user/HDFS/emp/ location, where /HDFS/emp was created during the job.
PS: Also, I could not find any reason for using --m 1 with Sqoop. It's just there in all the queries.
I got the import working with the following query. There is no need to use --create-hive-table; you can just give a new table name with --hive-table and that table will be created. Also, if there is any issue, go to the hive-metastore location and run rm *.lck, then try the import again.
sqoop import \
--connect jdbc:mysql://localhost/EMPLOYEE \
--username root \
--password root \
--table emp4 \
--hive-import \
--hive-table sqoopimport.emp4 \
--fields-terminated-by "," ;

Updating hive table with sqoop from mysql table

I already have a Hive table called roles. I need to update this table with info coming from MySQL, so I used this script, thinking that it would add and update the new data in my Hive table:
sqoop import --connect jdbc:mysql://nn01.itversity.com/retail_export \
--username retail_dba --password itversity \
--table roles --split-by id_emp \
--check-column id_emp --last-value 5 --incremental append \
--target-dir /user/ingenieroandresangel/hive/roles \
--hive-import --hive-database poc --hive-table roles
Unfortunately, that only inserts the new data; I can't update the records that already exist. Before you ask, a couple of notes:
the table doesn't have a PK
if I don't specify --last-value as a parameter, I get duplicated records for the rows that already exist
How could I solve this without truncating the table or recreating it with a PK? Is there a way?
Thanks, guys.
Hive does not support update queries.
You have to drop/truncate the old table and reload it again.
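If a full reload is acceptable, a minimal sketch reusing the question's connection details with --hive-overwrite (this rewrites the whole roles table rather than updating individual rows):
# full reload: re-import everything and overwrite the existing Hive table contents
sqoop import --connect jdbc:mysql://nn01.itversity.com/retail_export \
--username retail_dba --password itversity \
--table roles --split-by id_emp \
--target-dir /user/ingenieroandresangel/hive/roles --delete-target-dir \
--hive-import --hive-overwrite --hive-database poc --hive-table roles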

How to automatically sync a MySQL table with a Hive external table using Sqoop?

I already have a MySQL table on my local (Linux) machine, and I have a Hive external table with the same schema as the MySQL table.
I'm trying to import data from the MySQL table into my Hive external table, and I'm using Sqoop for this.
But the problem is: whenever a new record is added to the MySQL table, it doesn't update the Hive external table automatically.
This is the Sqoop import command I'm using:
sqoop import --connect jdbc:mysql://localhost:3306/sqoop \
--username root -P \
--split-by id --columns id,name,age,salary \
--table customer \
--target-dir /user/chamith/mysqlhivetest/ \
--fields-terminated-by "," \
--hive-import --hive-table test.customers
Am I missing something here? Or how can this be done?
Any help would be appreciated.
In your case, new rows are appended to the table.
So you need to use the incremental append approach.
When to use append mode?
It works for numerical data that is incrementing over time, such as auto-increment keys.
Use it when importing a table where new rows are continually being added with increasing row id values.
Now, what you need to add to the command:
--check-column  Specifies the column to be examined when determining which rows to import.
--incremental  Specifies how Sqoop determines which rows are new.
--last-value  Specifies the maximum value of the check column from the previous import.
The ideal way to do this is with a sqoop job, since the Sqoop metastore then remembers the last value automatically.
Step 1: Initially load the data with a normal import command.
Step 2:
sqoop job --create incrementalImportJob -- import \
--connect jdbc:mysql://localhost:3306/sqoop \
--username root \
-P \
--split-by id \
--columns id,name,age,salary \
--table customer \
--incremental append \
--check-column id \
--last-value 5 \
--fields-terminated-by "," \
--target-dir hdfs://ip:8020/path/to/table/
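Step 3 (a usage sketch): run the saved job whenever you want to pull the new rows; the metastore updates the stored last value after each run, and you can inspect the saved parameters with --show:
sqoop job --exec incrementalImportJob   # run the incremental import; last-value is updated in the metastore
sqoop job --show incrementalImportJob   # inspect the saved job definition, including the stored last value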
Hope this helps..

Sqoop Import replace special characters of mysql

I have 1000 tables in MySQL, each with more than 100,000 records. The tables have 300-500 columns.
Some of the tables have columns with special characters, like .(dot) and space, in the column names.
Now I want to do a Sqoop import and create a Hive table in HDFS with a single query, like below:
sqoop import --connect ${domain}:${port}/${database} --username ${username} --password ${password} \
--table ${table} -m 1 --hive-import --hive-database ${hivedatabase} --hive-table ${table} --create-hive-table \
--target-dir /user/hive/warehouse/${hivedatabase}.db/${table}
After this the Hive table is created, but when I query the table it shows an error like the following (a sample of the error output):
Error while compiling statement: FAILED: RuntimeException java.lang.RuntimeException: cannot find field emp from [0:emp.id, 1:emp.name, 2:emp.salary, 3:emp.dno]
How can we replace the .(dot) with _(underscore) while doing the Sqoop import itself? I would like to do this dynamically.
Use sqoop import with the --query option rather than --table, and use the replace function in the query, i.e.:
sqoop import --connect ${domain}:${port}/${database} --username ${username} --password ${password} \
--query "SELECT col1, REPLACE(col2, '.', '_') AS col2 FROM ${table} WHERE \$CONDITIONS" \
--target-dir /user/hive/warehouse/${hivedatabase}.db/${table} -m 1
Or (not recommended) write a shell script that finds and replaces "." with "_" (e.g. using grep/sed) at /user/hive/warehouse/${hivedatabase}.db/${table}.