I am planning to use a stack built on Hadoop, Hive and Impala for analysing big data. I have the setup ready and now I am trying to import data from a MySQL table. The table is more than 500 GB in size and I am planning to use Sqoop as follows:
sqoop import --connect jdbc:mysql://remote_host_ip/database_name --username user_name -P --table table_name --hive-import --compression-codec=snappy --as-parquetfile --warehouse-dir=/user/hive/warehouse -m 1
Is there a better way to do this import, given that it involves transferring 500 GB of data over the network? Is it possible to compress the data in some way and import it into Hive, so that Impala can be used to query it?
Sqoop is the best approach. It's very efficient at bulk loading.
Do read about the MySQL Hadoop Applier, which is designed to perform real-time replication of events between MySQL and Hadoop.
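You can set "-m 4" instead of "-m 1". This allows the MySQL data to be imported in parallel: instead of 1 mapper transferring all 500 GB, 4 mappers transfer the data in parallel (roughly 125 GB per mapper, if the rows split evenly). With more than one mapper, Sqoop needs a column to split the table on; by default it uses the primary key, or you can name one with --split-by.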
Sqoop is a good choice for importing 500 GB of data into a columnar format on HDFS such as Parquet. You can also use '-m 12', which runs more mappers in parallel for the import.
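A rough sketch of the parallel variant suggested in the answers above. It assumes the table has an evenly distributed numeric column that Sqoop can split on; the column name id is hypothetical, and the host, database, user and table names are the placeholders from the question. By default the script only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# Print the command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Import with N parallel mappers (default 4). With more than one mapper,
# Sqoop splits the table on the primary key or on the --split-by column.
# "id" is a hypothetical column name; replace it with a real one.
import_table() {
    mappers="${1:-4}"
    run sqoop import \
        --connect jdbc:mysql://remote_host_ip/database_name \
        --username user_name -P \
        --table table_name \
        --split-by id \
        --num-mappers "$mappers" \
        --hive-import \
        --as-parquetfile \
        --compress --compression-codec snappy \
        --warehouse-dir /user/hive/warehouse
}

import_table 4
```

Each mapper opens its own connection to MySQL, so the mapper count is worth raising gradually while watching the load on the source database.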
How can I import a large CSV into HDFS via AWS RDS MySQL?
I tried MySQL Workbench, but it took more than 3 hours to insert just 1,000 rows out of 2.5 million. Is there a faster way to import a CSV into HDFS? Also, how much time should it take?
I am a newbie to Sqoop. As per my understanding, Sqoop commands are for importing data from a database like MySQL to HDFS and vice versa, and HDFS commands are for dealing with data in HDFS, such as getting data from HDFS to the local file system and vice versa. Can't we use Sqoop commands to deal with data in HDFS, to get data from the local file system to HDFS and vice versa? Please let me know the exact differences between Sqoop and HDFS commands. Why do we have two separate things? Why did they not put all these commands into one set? Apologies if my question does not make sense.
Sqoop commands serve the following purposes:
1) Import data from a relational database into HDFS/Hive/HBase, and export data back to the database. It is not restricted to HDFS import and export.
2) Data can be sqooped in one go if we need to move a whole database or a list of tables.
3) Incremental imports are supported, so only new or changed data needs to be imported.
4) It also requires a JDBC connection driver to connect to the database.
In short, it deals with tables/databases.
HDFS commands:
1) They are only used to transfer files of any type (csv, text, xls) from the local file system to HDFS or vice versa. They just serve the basic function of moving or copying data from one system to another, just like Unix commands.
Sqoop's only function is to import and export data between an RDBMS (structured data) and Hadoop. It does not provide any other operations inside HDFS. Once you have brought the data into HDFS using Sqoop, HDFS commands are used to manage it (copy, move, etc.).
For more on Sqoop's functionality, see http://hortonworks.com/apache/sqoop/
Yes, your understanding is correct.
Sqoop commands are for:
importing data from any relational database (like MySQL) to HDFS/Hive/HBase
exporting data from HDFS/Hive to any relational database (like MySQL)
HDFS commands are for:
copying/transferring any files (like .txt, .csv, .xls, etc.) from local to HDFS or vice versa.
As for:
Why do we have two separate things? Why did they not put all these commands into one set?
Answer:
Sqoop commands
(for copying structured data between two different systems)
HDFS commands
(for copying files between local and HDFS)
Using Sqoop we cannot copy files from local to HDFS and vice versa,
and also
using HDFS commands we cannot copy data from HDFS to any external database (like MySQL) and vice versa.
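To make the distinction in the answers above concrete, here is a small sketch that puts the two kinds of command side by side. All paths, host, database, user and table names are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Print the command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# File-level copy: local file system -> HDFS. No database is involved.
copy_file_to_hdfs() {
    run hdfs dfs -put "$1" "$2"
}

# File-level copy in the other direction: HDFS -> local file system.
copy_file_from_hdfs() {
    run hdfs dfs -get "$1" "$2"
}

# Table-level transfer: relational database -> HDFS over JDBC.
# Arguments: database name, table name, HDFS target directory.
import_table_to_hdfs() {
    run sqoop import \
        --connect "jdbc:mysql://db_host/$1" \
        --username db_user -P \
        --table "$2" \
        --target-dir "$3"
}

copy_file_to_hdfs /tmp/sales.csv /user/demo/sales.csv
copy_file_from_hdfs /user/demo/sales.csv /tmp/sales_copy.csv
import_table_to_hdfs shop orders /user/demo/orders
```

The hdfs commands only see files and paths, while the sqoop command needs a JDBC URL and a table name, which is why the two tool sets stay separate.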
Hi, I am trying to upload a CSV file that has 7 columns and 4,946,642,530 rows to a MySQL table for data analysis. It's taking forever to upload the file. It took 5 hrs to get to 261,897 rows.
Both the DB and the file are on the same machine.
I am using Windows 7 with MySQL Server 5.6 and I am importing the file via Workbench.
I need a faster approach, if you could suggest one, or an alternative way of setting up the database so that it can load and handle the data faster.
You are better off importing it from the command line.
Taken from http://dev.mysql.com/doc/refman/5.7/en/mysqlimport.html#c5680:
mysqlimport --fields-optionally-enclosed-by=""" --fields-terminated-by=, --lines-terminated-by="\r\n" --user=YOUR_USERNAME --password YOUR_DATABASE YOUR_TABLE.csv
I have a scenario where data is ingested into Hadoop from a MySQL database every day into a dated folder. A few rows will be edited every day and there might also be some schema changes. How do we handle this in Hadoop if I am only interested in the latest data and schema?
Here is the documentation for incremental imports in Sqoop. Also, Sqoop takes the table name when importing the data, so if the schema changes the Sqoop command should stay the same.
bin/sqoop import --connect jdbc:mysql://localhost/bigdata --table widgets -m 1
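A minimal sketch of what an incremental import for this scenario might look like, reusing the connection and table from the command above. It assumes the table has a last-modified timestamp column and a primary key; the names updated_at and id, and the target directory /data/widgets, are hypothetical. By default the script only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# Print the command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Import only rows modified since the given timestamp and merge them
# into the existing data set on the key column.
# "updated_at" and "id" are hypothetical column names.
import_changes() {
    last_value="$1"
    run sqoop import \
        --connect jdbc:mysql://localhost/bigdata \
        --table widgets \
        --incremental lastmodified \
        --check-column updated_at \
        --last-value "$last_value" \
        --merge-key id \
        --target-dir /data/widgets \
        -m 1
}

import_changes "2015-01-01 00:00:00"
```

The value passed to --last-value has to be tracked between runs. This sketch covers edited rows only; it does not by itself deal with schema changes.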
I am using hbase-0.90.6. I want to export data from HBase to MySQL. I know of a two-step process: first run a MapReduce job to pull the HBase data into flat files, then export the flat file data into MySQL.
Is there any other tool I can use to reduce these two steps to one? Or can we use Sqoop to do the same in one step? Thanks.
I'm afraid that Sqoop does not support exports directly from HBase at the moment. Sqoop can help you with the second step of the two-step process, i.e. Sqoop can take data from HDFS and export it to MySQL.
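As a sketch of that second step, assuming the MapReduce job has already written comma-separated flat files to an HDFS directory: the directory, host, database, user and table names below are all hypothetical, and the target table must already exist in MySQL. By default the script only prints the command; set DRY_RUN=0 to actually run it.

```shell
#!/bin/sh
# Print the command by default; set DRY_RUN=0 to execute it for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Export comma-separated files from an HDFS directory into a MySQL table.
# Argument: the HDFS directory that holds the flat files.
export_to_mysql() {
    run sqoop export \
        --connect jdbc:mysql://db_host/database_name \
        --username user_name -P \
        --table target_table \
        --export-dir "$1" \
        --input-fields-terminated-by ','
}

export_to_mysql /user/demo/hbase_dump
```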
Yes, Sqoop is the tool that can be used for both importing and exporting your data between MySQL and Hadoop. Note that it can import into HBase directly, but exports read from files in HDFS.
You can learn more about Sqoop at:
http://sqoop.apache.org