I have a Hadoop environment with 1 master and 4 nodes where I am saving all the data from a MySQL application with Sqoop.
I need to access this data saved in Hadoop through the application's web interface. In other words: if the user requests a record with a date older than 6 months, I want the application to select that data from Hadoop.
The data is relational (MySQL). I do not need to do any heavy analysis.
Is this viable?
What's the best way to do it?
What tool do you use?
Is putting the data in HDFS not feasible for this case?
Thank you in advance
As I understand your question, you are importing data from MySQL to HDFS using Sqoop.
Now you want to run queries over this data in HDFS.
You can do this using Hive. You can run HQL (similar to SQL) queries on your data.
You can import your data directly from MySQL into Hive using Sqoop. You then have a table in Hive similar to the one in MySQL, and you can run any query over it.
Sample command:
sqoop import \
--connect 'jdbc:mysql://myhost:3306/classicmodels' \
--driver com.mysql.jdbc.Driver \
--username root \
--password root \
--table abc \
--target-dir /user/dev/db/sqoop/temp_81323/ \
--hive-import \
--hive-table hive_abc \
--null-string '\\N' \
--null-non-string '\\N' \
--verbose
Check the Sqoop documentation for more details.
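For the original question (records with a date older than 6 months), the application can then simply query the Hive table. A minimal sketch, assuming the imported table is hive_abc and has a DATE column named order_date (the column name is just a placeholder for illustration):
# hypothetical check: order_date is an assumed column name
hive -e "SELECT * FROM hive_abc WHERE order_date < date_sub(current_date, 180);"
In practice the web application would run the same query through the HiveServer2 JDBC driver rather than the CLI.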
Related
I have a table (regularly updated) in Hive that I want to have in one of my tools that has a MySQL database. I can't just connect my application to the Hive database, so I want to export that data directly into the MySQL database.
I've searched a bit and found out that this is possible with Sqoop, and I've been told to use Oozie since I want to regularly update the table and export it.
I've looked around for a while and tried some stuff but so far I can't succeed, and I just don't understand what I'm doing.
So far, the only code I understand, but which doesn't work, looks like this:
export --connect jdbc:mysql://myserver
--username username
--password password
--table theMySqlTable
--hive-table cluster.hiveTable
I've seen people use a temporary table and export it to a text file in order to then export that, but I'm not sure I can do it.
Does Oozie need specific parameters too? I'm not the administrator, so I'm not sure I'm able to do that...
Thank you !
Try this.
sqoop export \
--connect "jdbc:mysql://servername:3306/databasename" \
--username userid \
-P \
--table theMySqlTable \
--input-fields-terminated-by '|' \
--export-dir /hdfs path location of file/part-m-00000 \
--num-mappers 1
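Since the table is updated regularly, one simple option (a sketch only, not Oozie-specific) is to wrap the export in a small shell script and call it from an Oozie shell action or a cron job; the password file path, the key column id, and the export directory below are assumptions for illustration:
#!/bin/bash
# Re-export the Hive table's files to MySQL on a schedule.
# --password-file keeps the password off the command line;
# --update-key/--update-mode allowinsert let repeated runs update existing rows instead of duplicating them.
# '\001' is Hive's default ^A field delimiter for text tables; adjust if the table uses something else.
sqoop export \
  --connect "jdbc:mysql://servername:3306/databasename" \
  --username userid \
  --password-file /user/youruser/mysql.password \
  --table theMySqlTable \
  --export-dir /user/hive/warehouse/cluster.db/hiveTable \
  --input-fields-terminated-by '\001' \
  --update-key id \
  --update-mode allowinsert
With Oozie specifically, the same command can also go into a sqoop action inside a workflow scheduled by a coordinator; submitting a workflow normally does not require administrator rights.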
I need to import data from MySQL to HDFS, and I'm doing that with Apache Sqoop. But the thing is, I also need to export data from HDFS back to MySQL, and I need to update one column of this data (which is in HDFS) before moving it to MySQL. How can I do this?
You can update the column using Hive and store the Hive output to HDFS using INSERT OVERWRITE DIRECTORY "path", then go with the below Sqoop command:
sqoop export \
--connect jdbc:mysql://master/poc \
--username root \
--table employee \
--export-dir /user/hdfs/mysql/export.txt \
--update-key id \
--update-mode allowinsert \
--fields-terminated-by '\t' \
-m 1
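For the Hive step mentioned above, a minimal sketch (the column list and the salary adjustment are assumptions for illustration; the output directory and the tab delimiter match the export command above):
# Hypothetical: rewrite one column (here salary) while writing the result out as tab-delimited text
hive -e "
INSERT OVERWRITE DIRECTORY '/user/hdfs/mysql/export.txt'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT id, name, salary * 1.1 FROM employee;
"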
Hope this helps..
I'm trying to import data from a MySQL table to Hive using Sqoop. From what I understood, there are 2 ways of doing that:
Import the data into HDFS, then create an external table in Hive and load the data into that table.
Use --create-hive-table while running the Sqoop query to create a new table in Hive and load the data directly into it. I am trying to do this but can't get it to work for some reason.
This is my code
sqoop import \
--connect jdbc:mysql://localhost/EMPLOYEE \
--username root \
--password root \
--table emp \
--m 1 \
--hive-database sqoopimport \
--hive-table sqoopimport.employee \
--create-hive-table \
--fields-terminated-by ',';
I tried using --hive-import as well but got an error.
When I ran the above query, the job was successful, but no table was created in Hive, and the data was stored in the /user/HDFS/emp/ location, where /HDFS/emp was created during the job.
PS: Also, I could not find any reason for using -m 1 with Sqoop. It's just there in all the queries.
I got the import working with the following query. There is no need to use --create-hive-table; we can just give a new table name with --hive-table and that table will be created. Also, if there is any issue, go to the hive-metastore location and run rm *.lck, then try the import again.
sqoop import \
--connect jdbc:mysql://localhost/EMPLOYEE \
--username root \
--password root \
--table emp4 \
--hive-import \
--hive-table sqoopimport.emp4 \
--fields-terminated-by "," ;
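To confirm the import landed where expected, a quick sanity check (assuming the same database and table names as above) could be:
# hypothetical verification step
hive -e "USE sqoopimport; SHOW TABLES; SELECT COUNT(*) FROM emp4;"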
I am using Sqoop 1.4.6 to import data from MySQL to Hive using the import-all-tables option. The result is ok, but the import process itself is quite slow. For example one of the databases contains 40-50 tables with well under 1 million rows in total, and takes around 25-30 minutes to complete. Upon investigating, it seems most of the time is spent initialising Hive for each imported table. Testing a plain mysqldump on the same database completes in under 1 minute. So the question is how to reduce this initialisation time, if that is the case, for example using a single Hive session?
The import command is:
sqoop import-all-tables \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--compress --compression-codec=snappy \
--num-mappers 1 \
--connect "jdbc:mysql://..." \
--username ... --password ... \
--null-string '\\N' --null-non-string '\\N' \
--hive-drop-import-delims \
--hive-import --hive-overwrite --hive-database ... \
--as-textfile \
--exclude-tables ... \
--warehouse-dir=...
Update:
Sqoop version: 1.4.6.2.5.3.0-37
Hive version: 1.2.1000.2.5.3.0-37
Could be related to:
https://issues.apache.org/jira/browse/HIVE-10319
Remove the option --num-mappers 1 to run the import with the default 4 mappers, or change it to a higher number such as --num-mappers 8 (if the hardware allows); this runs the import with more parallel jobs for tables that have a primary key. In addition, use the --autoreset-to-one-mapper option: it will fall back to 1 mapper for tables that have no primary key. Also use --direct mode:
sqoop import-all-tables \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:mysql://..." --username ... \
--password ... \
--compress --compression-codec=snappy \
--num-mappers 8 \
--autoreset-to-one-mapper \
--direct \
--null-string '\\N'
...
Let us know if this improves the performance.
Update:
--fetch-size=<n> - where <n> represents the number of entries that Sqoop must fetch at a time. The default is 1000.
Increase the value of the fetch-size argument based on the volume of data that needs to be read. Set the value based on the available memory and bandwidth.
Increase the mapper memory from its current value to a higher number, for example:
sqoop import-all-tables -D mapreduce.map.memory.mb=2048 -D mapreduce.map.java.opts=-Xmx1024m <sqoop options>
Sqoop Performance Tuning Best Practices
Tune the following Sqoop arguments in the JDBC connection or Sqoop mapping to optimize performance:
batch (for export; see the sketch after this list)
split-by and boundary-query (not needed here since we are using --autoreset-to-one-mapper, and they can't be used with import-all-tables anyway)
direct
fetch-size
num-mappers
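Batch mode applies to sqoop export rather than this import, but since it is on the list above, here is a minimal sketch; the connection details, table, and export directory are placeholders, and the two -D properties (which, as far as I know, control how many rows go into each INSERT statement and how many statements go into each transaction) should be verified against your Sqoop version:
sqoop export \
-Dsqoop.export.records.per.statement=100 \
-Dsqoop.export.statements.per.transaction=10 \
--connect "jdbc:mysql://..." --username ... --password ... \
--table some_table \
--export-dir /user/hive/warehouse/some_db.db/some_table \
--batch \
--num-mappers 4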
I have my data stored in a Hive table.
I want to transfer selected data from Hive tables to a MySQL table using Sqoop.
Please guide me on how to do this.
Check out the Sqoop user guide for details.
You need to use sqoop export; here is an example:
sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_rpt_db" \
--username retail_dba \
--password cloudera \
--table departments \
--export-dir /user/hive/warehouse/retail_ods.db/departments \
--input-fields-terminated-by '|' \
--input-lines-terminated-by '\n' \
--num-mappers 2
sqoop export exports data from Hadoop to MySQL. The arguments are:
--connect JDBC url
--username mysql username
--password password for mysql user
--table mysql table name
--export-dir valid hadoop directory
--input-fields-terminated-by column delimiter in Hadoop
--input-lines-terminated-by row delimiter in Hadoop
--num-mappers number of mappers to process the data
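Since the question is about exporting only selected data, one approach (a sketch only; the orders table, the CLOSED filter, and the staging directory are assumptions for illustration) is to materialize the selection with Hive first:
# hypothetical: stage only the selected rows as '|'-delimited text
hive -e "
INSERT OVERWRITE DIRECTORY '/user/hive/staging/orders_closed'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
SELECT order_id, order_date, order_status
FROM retail_ods.orders
WHERE order_status = 'CLOSED';
"
Then point --export-dir in the sqoop export command above at /user/hive/staging/orders_closed instead of the full warehouse directory.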