Moving a large, frequently updated MySQL table with a composite key to HDFS - mysql

I have a MySQL Inventory table which doesn't have an auto-increment ID but does have a composite key and a last-modified date (YYYY-mm-DD HH:MM:SS), and it is updated very frequently. It holds the last 3 years of data, around 10 million records.
I want to move this data to HDFS using Sqoop or some other way. Please suggest an approach.

Check the Sqoop command below (which I use in similar tasks), based on --incremental lastmodified. I'm assuming here that you have a date-like column to use with the --check-column argument.
sqoop import \
--connect jdbc:mysql://<server>:3306/db \
--username=your_username \
-P \
--table=your_table \
--append \
--incremental lastmodified \
--check-column creation_date \
--last-value "YYYY-mm-DD HH:MM:SS.x" \
--split-by some_numeric_id_column \
--target-dir /user/dir \
--num-mappers <MAPPER#>
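Because the table is updated very frequently, you may also want to wrap this in a saved Sqoop job, so Sqoop stores and updates the --last-value for you between runs instead of you passing it by hand. A minimal sketch, reusing the placeholders above (the job name inventory_incremental is mine):
sqoop job --create inventory_incremental -- import \
--connect jdbc:mysql://<server>:3306/db \
--username your_username \
-P \
--table your_table \
--append \
--incremental lastmodified \
--check-column creation_date \
--last-value "YYYY-mm-DD HH:MM:SS" \
--split-by some_numeric_id_column \
--target-dir /user/dir \
--num-mappers <MAPPER#>
Each subsequent run is then just sqoop job --exec inventory_incremental. Also, since your rows are updated in place, --merge-key <key_column> (instead of --append) may be closer to what you want, so that re-imported rows replace the old copies; note it takes a single column, so with a composite key you would need a surrogate or derived unique column to merge on.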

Related

Sqoop unable to import data from MYSQL to HBASE

Hi, I am new to big data and I am trying to import data from MySQL to HBase using Sqoop.
sqoop import --connect jdbc:mysql://xxx.xxx.xx.xx:3306/FBI_DB --table FBI_CRIME --hbase-table H_FBI_CRIME --column-family cd --hbase-row-key ID -m 1 --username root -P
ERROR tool.ImportTool: Import failed: java.io.IOException: No columns to generate for ClassWriter.
Once I had used --driver com.mysql.jdbc.Driver, but still had no success.
Please help; what is wrong?
The problem is that you have to specify the names of the columns that you want to import, like this:
sqoop import --connect jdbc:mysql://xxx.xxx.xx.xx:3306/FBI_DB \
--table FBI_CRIME \
--hbase-table H_FBI_CRIME \
--columns "columnA,columnB" \
--column-family cd \
--hbase-row-key ID \
-m 1 \
--username root -P

Hive import using sqoop from Mysql taking too long

I'm using Hive and Sqoop on top of Hadoop on Ubuntu 18.04.
Hadoop, Sqoop and Hive are working as expected, but whenever I try to import data into the Hive database I created, the job hangs for a very long time.
Sqoop command used:
sqoop import \
--connect "jdbc:mysql://localhost/project?zeroDateTimeBehavior=CONVERT_TO_NULL" \
--username hiveuser \
-P \
--table rooms \
--hive-import \
--hive-database sqoop \
--hive-table room_info
You can expedite the process by using multiple mappers. To do that, find a column whose values are evenly distributed, pass it with --split-by <column_name>, and increase the number of mappers with the -m <count> option, as in the command below.
sqoop import \
--connect "jdbc:mysql://localhost/project?zeroDateTimeBehavior=CONVERT_TO_NULL" \
--username hiveuser \
-P \
--table rooms \
--hive-import \
--hive-database sqoop \
--hive-table room_info \
--split-by <column_name> \
-m 5
Please read the following page for more details:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
in particular the section "7.2.4. Controlling Parallelism".
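To make the "Controlling Parallelism" part concrete: with --split-by, Sqoop first finds the minimum and maximum of that column and then divides the range into one slice per mapper. If that default boundary query is slow or the column is skewed, you can supply your own with --boundary-query. A sketch of how that could look here (the column name room_id is only an assumption):
sqoop import \
--connect "jdbc:mysql://localhost/project?zeroDateTimeBehavior=CONVERT_TO_NULL" \
--username hiveuser \
-P \
--table rooms \
--hive-import \
--hive-database sqoop \
--hive-table room_info \
--split-by room_id \
--boundary-query "SELECT MIN(room_id), MAX(room_id) FROM rooms" \
-m 5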

Sqoop Import: Command --password: command not found, --table: command not found

I have to write a Sqoop import script which pulls certain columns from a table la_crime, but only for the year 2016.
My script is below:
sqoop import \
--connect jdbc:mysql://XXXXXXXXXXX \
--driver com.mysql.jdbc.Driver \
--username XXXX \
--password XXXX \
--table la_crime \
--query "SELECT DR_NUMBER,DATE_REPORTED,DATE_OCCURED,TIME_OCCURED,AREA_ID,AREA_N AME,REPORTING_DISTRICT,CRIME_CODE,CRIME_CODE_DESC,VICTIM_AGE,VICTIM _GENDER,VICTIM_DESCENT,ADDRESS,CROSS_STREET,AUTO_ID FROM la_crime WHERE\$YEAR=2016\
--target-dir /user/sqoop_script \
-m 1
Could you tell me if my code is wrong somewhere? What changes do I have to make?
You may use the following syntax to import:
sqoop import \
--connect jdbc:mysql://localhost/yourdatabase \
--driver com.mysql.jdbc.Driver \
--username XXXX \
--password XXXX \
--table la_crime \
--query 'SELECT DR_NUMBER,DATE_REPORTED,DATE_OCCURED,TIME_OCCURED,AREA_ID,AREA_NAME,REPORTING_DISTRICT,CRIME_CODE,CRIME_CODE_DESC,VICTIM_AGE,VICTIM_GENDER,VICTIM_DESCENT,ADDRESS,CROSS_STREET,AUTO_ID FROM la_crime WHERE $YEAR=2016' \
--target-dir /user/sqoop_script \
-m 1
For more details, refer to the "Sqoop User Guide".
The command you're trying has a few things wrong in the query option. First, you need to close the double quote at the end. Second, it seems odd that you're using a shell variable to specify the column to filter the year on.
Third, when you use the query option it's mandatory to include the $CONDITIONS token, and since you're issuing the query in double quotes you need \$CONDITIONS instead of just $CONDITIONS to keep your shell from treating it as a shell variable.
Also, if you're using the query option you shouldn't use the table option.
I think this would be the command you're looking for:
sqoop import \
--connect jdbc:mysql://XXXXXXXXXXX \
--driver com.mysql.jdbc.Driver \
--username XXXX \
--password XXXX \
--query "SELECT DR_NUMBER,DATE_REPORTED,DATE_OCCURED,TIME_OCCURED,AREA_ID,AREA_N AME,REPORTING_DISTRICT,CRIME_CODE,CRIME_CODE_DESC,VICTIM_AGE,VICTIM _GENDER,VICTIM_DESCENT,ADDRESS,CROSS_STREET,AUTO_ID FROM la_crime WHERE YEAR = 2016 AND $CONDITIONS" \
--target-dir /user/sqoop_script \
-m 1
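One side note in case you later raise the mapper count: with a free-form --query and more than one mapper, Sqoop also requires a --split-by column. In the sketch below, DR_NUMBER is only an assumed choice; pick whichever numeric column is evenly distributed (the column list is abbreviated for brevity):
sqoop import \
--connect jdbc:mysql://XXXXXXXXXXX \
--driver com.mysql.jdbc.Driver \
--username XXXX \
--password XXXX \
--query "SELECT DR_NUMBER, DATE_REPORTED, CRIME_CODE FROM la_crime WHERE YEAR = 2016 AND \$CONDITIONS" \
--split-by DR_NUMBER \
--target-dir /user/sqoop_script \
-m 4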

Error during export: Mixed update/insert is not supported against the target database yet

sqoop export --connect "jdbc:mysql://localhost:3306/retail_db" \
--driver com.mysql.jdbc.Driver \
--username root \
--table departments \
--export-dir /user/root/departments_export \
--batch \
--outdir java_files \
-m 1 \
--update-key department_id \
--update-mode allowinsert
The upsert function is not working with the MySQL database. I found the reason for this issue (link), but I need a solution to resolve it.
You can try to get rid of --driver; my command worked that way.
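In other words, drop the --driver line so Sqoop picks its MySQL-specific connection manager, which, as far as I know, is what supports --update-mode allowinsert against MySQL. A sketch of the resulting command:
sqoop export --connect "jdbc:mysql://localhost:3306/retail_db" \
--username root \
--table departments \
--export-dir /user/root/departments_export \
--batch \
--outdir java_files \
-m 1 \
--update-key department_id \
--update-mode allowinsert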

TopoJson makefile ignoring external properties file

I am trying to make a TopoJSON file with CSV data embedded, using a makefile. I am using Mike Bostock's us-atlas as a guide.
topo/us-counties-10m-ungrouped.json: shp/us/counties.shp
	mkdir -p $(dir $@)
	topojson \
		-o us_counties.json \
		--no-pre-quantization \
		--post-quantization=1e6 \
		--external-properties=output.csv \
		--id-property=FIPS \
		--properties="County=County" \
		--properties="PerChildrenPos=+PerChildrenPos" \
		--simplify=7e-7 \
		-- $<
It creates the topojson I need but completely ignores the output.csv file.
Here is a glimpse at what it returns.
{"type":"Polygon","id":"53051","properties":{"code":"53051"},"arcs":[[-22,79,80,-75,81]]}
Here's what I need it to return.
{"type":"Polygon","id":"53051","properties":{"code":"53051", "County":"Los Angeles", "PerChildrenPos": 10},"arcs":[[-22,79,80,-75,81]]}
Any ideas why it might be ignoring the CSV file? I've tried moving it around to see if perhaps it was inaccessible or something.
Thanks in advance.
According to the documentation here: https://github.com/mbostock/topojson/wiki/Command-Line-Reference#external-properties
(If your CSV file uses a different column name for the feature identifier, you can specify multiple id properties, such as --id-property=+FIPS,+id.)
It seems that you need to change your --id-property=FIPS to something that corresponds to your CSV column names.
Also, for --properties="County=County" I think it should be --properties County=+County.
Same for --properties="PerChildrenPos=+PerChildrenPos": it should be --properties PerChildrenPos=+PerChildrenPos.
And --external-properties=output.csv should be --external-properties output.csv.
Basically, the parameter values should not be attached to the flags with an = sign.
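Assembling those suggestions, the topojson call inside the recipe might look like the sketch below. The --id-property +FIPS part is only an assumption based on the documentation example (that your CSV identifies counties with a FIPS column); the rest just restates the changes described above:
topojson \
	-o us_counties.json \
	--no-pre-quantization \
	--post-quantization=1e6 \
	--external-properties output.csv \
	--id-property +FIPS \
	--properties County=+County \
	--properties PerChildrenPos=+PerChildrenPos \
	--simplify=7e-7 \
	-- $<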