I already have a Hive table called roles. I need to update this table with information coming from MySQL, so I used this script thinking it would add new data and update existing data in my Hive table:
sqoop import --connect jdbc:mysql://nn01.itversity.com/retail_export --username retail_dba --password itversity \
  --table roles --split-by id_emp --check-column id_emp --last-value 5 --incremental append \
  --target-dir /user/ingenieroandresangel/hive/roles --hive-import --hive-database poc --hive-table roles
Unfortunately, that only inserts the new data; I can't update the records that already exist. Before you ask, a couple of statements:
the table doesn't have a PK
if I don't specify --last-value as a parameter, I will get duplicated records for the rows that already exist
How could I handle this without truncating the table or recreating it with a PK? Is there a way?
Thanks, guys.
Hive does not support UPDATE queries.
You have to drop/truncate the old table and reload it again.
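For reference, a minimal sketch of that full-reload approach, reusing the connection and table details from the question and assuming a complete re-import is acceptable for the table's size (--hive-overwrite replaces the existing Hive data instead of appending, and --delete-target-dir clears the staging directory):

sqoop import --connect jdbc:mysql://nn01.itversity.com/retail_export --username retail_dba --password itversity \
  --table roles --split-by id_emp \
  --target-dir /user/ingenieroandresangel/hive/roles --delete-target-dir \
  --hive-import --hive-database poc --hive-table roles --hive-overwrite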
I have created a Sqoop job with incremental append and a last value.
Job:
sqoop job --create myjob2 -- import --connect jdbc:mysql://host/DBnam --username user --password passwor \
  --table savingssmal --check-column id --incremental append --last-value 0 \
  --target-dir /user/xxxx/prac/sqoop --split-by id --as-parquetfile -m 1
My question is: how can I import both newly created records and updated records from the MySQL table?
Can you please help me with this?
You can use the lastmodified mode for incremental Sqoop import.
The append mode (used in your example) is used to import rows based on increasing row id values. So, when the job runs, it will import rows where --check-column (i.e. id) is greater than --last-value (i.e. 0). If a row is updated, the id will generally remain the same and the updated row won't be imported.
The lastmodified mode is used to import rows based on a timestamp column (for example, last_modified_time). When the job runs, it will import rows where --check-column is more recent than specified via --last-value. The application writing to the table should update the last_modified_time column on inserts and updates. This way, both newly inserted and updated rows will be imported when the Sqoop job runs.
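If the source table does not track modification times yet, one option is to add such a column on the MySQL side so that inserts and updates stamp it automatically. This is a sketch only; the column name last_update_time is an assumption, and the table and database names are taken from the job definition above:

mysql -u user -p DBnam -e "
  ALTER TABLE savingssmal
    ADD COLUMN last_update_time TIMESTAMP
    DEFAULT CURRENT_TIMESTAMP
    ON UPDATE CURRENT_TIMESTAMP;
"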
A sample invocation based on your example with the lastmodified mode would be as below:
sqoop job --create myjob2 -- import --connect jdbc:mysql://host/DBnam --username user --password passwor \
  --table savingssmal --check-column last_update_time --incremental lastmodified --last-value "2018-02-03 04:38:39.0" \
  --target-dir /user/xxxx/prac/sqoop --as-parquetfile -m 1
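Once the job is saved, you run it with sqoop job --exec. Because the job lives in the Sqoop metastore, the last value is tracked automatically between runs, so --last-value does not need to be updated by hand:

sqoop job --exec myjob2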
I have written a Sqoop script:
HADOOP_USER_NAME=hdfs sqoop import --connect jdbc:mysql://cmsmaster.cy9mnipcdof2.us-east-1.rds.amazonaws.com/db \
  --username user --password-file /user/password/dbpass.txt \
  --fields-terminated-by ',' --target-dir /user/db/sqoop_internal --delete-target-dir \
  --hive-import --hive-overwrite --hive-table sqoop_internal \
  --query 'SOME_QUERY where $CONDITIONS' --split-by id
This copies the result of the query and moves it to a Hive table, overwriting its previous content.
Now what I need is to modify this script so that it doesn't overwrite the whole Hive table. Instead, it should overwrite a partition of that Hive table. How to do that?
From your question I understand that you might need to do a Sqoop merge.
You need to remove:
--delete-target-dir and --hive-overwrite
And add:
--incremental lastmodified --check-column modified --last-value '2018-03-08 00:00:00' --merge-key yourPrimaryKey
You can find more information from the official documentation.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_merge_literal
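To illustrate how those flags fit together, here is a hedged sketch. It assumes a single source table, a modification-timestamp column named modified, and id as the merge key (all placeholders), and it drops --hive-import because some Sqoop versions do not allow lastmodified merges on Hive imports; the merged result lands in the --target-dir, and exposing that directory to Hive (for example via an external table) is a separate step:

sqoop import \
  --connect jdbc:mysql://cmsmaster.cy9mnipcdof2.us-east-1.rds.amazonaws.com/db \
  --username user --password-file /user/password/dbpass.txt \
  --table <source_table> \
  --fields-terminated-by ',' \
  --target-dir /user/db/sqoop_internal \
  --incremental lastmodified --check-column modified \
  --last-value '2018-03-08 00:00:00' --merge-key id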
I already have a MySQL table on my local machine (Linux) itself, and I have a Hive external table with the same schema as the MySQL table.
I'm trying to import data from the MySQL table into my Hive external table, and I'm using Sqoop for this.
But the problem is, whenever a new record is added to the MySQL table, the Hive external table isn't updated automatically.
This is the Sqoop import command I'm using:
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name,age,salary --table customer --target-dir /user/chamith/mysqlhivetest/ --fields-terminated-by "," --hive-import --hive-table test.customers
Am I missing something over here? Or how can this be done?
Any help would be appreciated.
In your case, new rows are appended to the table.
So you need to use the incremental append approach.
When to use append mode?
It works for numerical data that is incrementing over time, such as auto-increment keys, and when importing a table where new rows are continually being added with increasing row id values.
Now, what you need to add to the command:
--check-column: specifies the column to be examined when determining which rows to import.
--incremental: specifies how Sqoop determines which rows are new.
--last-value: specifies the maximum value of the check column from the previous import.
Ideally, perform this using a Sqoop job, since in that case the Sqoop metastore remembers the last value automatically.
Step 1: Initially load the data with a normal import command.
Step 2:
sqoop job --create incrementalImportJob -- import \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root \
  -P \
  --split-by id \
  --columns id,name,age,salary \
  --table customer \
  --incremental append \
  --check-column id \
  --last-value 5 \
  --fields-terminated-by "," \
  --target-dir hdfs://ip:8020/path/to/table/
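The saved job can then be inspected and re-run as needed; a small usage sketch (the stored last value is updated by the metastore after every run):

sqoop job --show incrementalImportJob   # lists the saved options, including the stored last value
sqoop job --exec incrementalImportJob   # runs the incremental import using that stored value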
Hope this helps..
I have to import > 400 million rows from a MySQL table (which has a composite primary key) into a PARTITIONED Hive table via Sqoop. The table has data for two years, with a departure date column ranging from 20120605 to 20140605 and thousands of records for one day. I need to partition the data based on the departure date.
The versions:
Apache Hadoop - 1.0.4
Apache Hive - 0.9.0
Apache Sqoop - sqoop-1.4.2.bin__hadoop-1.0.0
As per my knowledge, there are 3 approaches:
1. MySQL -> Non-partitioned Hive table -> INSERT from the non-partitioned Hive table into the partitioned Hive table
2. MySQL -> Partitioned Hive table
3. MySQL -> Non-partitioned Hive table -> ALTER the non-partitioned Hive table to add a PARTITION
Approach 1 is the current, painful one that I'm following.
For approach 2, I read that support for this was added in later(?) versions of Hive and Sqoop, but I was unable to find an example. Also, the syntax dictates specifying partitions as key-value pairs, which is not feasible in the case of millions of records where one cannot enumerate all the partition key-value pairs.
Can anyone provide inputs for approaches 2 and 3?
I guess you can create a partitioned Hive table first.
Then write the Sqoop import command for it.
For example:
sqoop import --hive-overwrite --hive-drop-import-delims --warehouse-dir "/warehouse" \
  --hive-table <hive_table> \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --table <table> --username xxxx --password xxxx --num-mappers 1 \
  --hive-partition-key <partition_column> --hive-partition-value <partition_value> --hive-import \
  --fields-terminated-by ',' --lines-terminated-by '\n'
You have to create the partitioned table structure first, before you move your data into the partitioned table. When sqooping, there is no need to specify --hive-partition-key and --hive-partition-value; use --hcatalog-table instead of --hive-table.
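A rough sketch of that HCatalog route, assuming a Sqoop release new enough to include HCatalog support (newer than the 1.4.2 listed in the question) and a hypothetical flights table partitioned by departure_date; all table, column, and connection names here are placeholders. The partitioned table is created first, then Sqoop writes through HCatalog, which places each row in the partition matching its departure_date value:

hive -e "CREATE TABLE flights_partitioned (id INT, origin STRING, destination STRING)
         PARTITIONED BY (departure_date STRING);"

sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <user> -P \
  --table flights \
  --hcatalog-database default \
  --hcatalog-table flights_partitioned \
  --num-mappers 1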
Manu
If this is still something people want to understand, they can use:
sqoop import --driver <driver name> --connect <connection url> --username <user name> -P --table employee --num-mappers <numeral> --warehouse-dir <hdfs dir> --hive-import --hive-table table_name --hive-partition-key departure_date --hive-partition-value $departure_date
Notes from the patch:
sqoop import [all other normal command line options] --hive-partition-key ds --hive-partition-value "value"
Some limitations:
It only allows for one partition key/value per run (a per-partition loop workaround is sketched after these notes).
The type of the partition key is hardcoded to be a string.
With auto partitioning in hive 0.7 we may want to adjust this to just have one command line option for the key name and use that column from the db table to partition.
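Given the one-partition-per-run limitation, a purely illustrative workaround is to loop over departure dates and run one import per partition. The host, credentials, table and column names, and the two sample dates below are all placeholders (in practice the date list would be generated), and the partition column is excluded from --columns because Hive does not allow the same column as both data and partition key:

for departure_date in 20120605 20120606; do
  sqoop import \
    --connect jdbc:mysql://<mysql_host>/<database> \
    --username <user> --password-file <hdfs_password_file> \
    --table flights \
    --columns id,origin,destination \
    --where "departure_date = ${departure_date}" \
    --hive-import --hive-table flights_by_date \
    --hive-partition-key departure_date \
    --hive-partition-value "${departure_date}" \
    --num-mappers 1
done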
I am trying to import data from MySQL to HBase using Sqoop.
I am running the following command:
sqoop import --connect jdbc:mysql://localhost/database --table users --columns "loginid,email" --username tester -P -m 8 --hbase-table hbaseTable --hbase-row-key user_id --column-family user_info --hbase-create-table
But I am getting the below error:
13/05/08 10:42:10 WARN hbase.ToStringPutTransformer: Could not insert row with null value for row-key column: user_id
Please help here.
Got the solution.
I was not including my row key, i.e. user_id, in the columns list.
After including it, it worked like a charm.
Thanks..
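For clarity, a sketch of how the corrected invocation from the question looks once user_id is added to --columns (everything else unchanged):

sqoop import --connect jdbc:mysql://localhost/database --table users \
  --columns "user_id,loginid,email" --username tester -P -m 8 \
  --hbase-table hbaseTable --hbase-row-key user_id --column-family user_info --hbase-create-table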
Your column names should be uppercase: not seq_id, but SEQ_ID.
I think Sqoop is considering it as a different column, which is (of course) null.