Importing CSV file into Hadoop - csv

I am new to Hadoop, and I have a file to import into Hadoop via the command line (I access the machine through SSH).
How can I import the file into Hadoop?
How can I check it afterwards (which command)?

Two steps to import a CSV file:
Move the CSV file to the Hadoop sandbox (/home/username) using WinSCP or Cyberduck.
Use the -put command to move the file from the local location to HDFS.
hdfs dfs -put /home/username/file.csv /user/data/file.csv
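To check afterwards, you can list the target directory and peek at the file contents (the paths below simply follow the example above):
hdfs dfs -ls /user/data                    # the file should be listed here
hdfs dfs -cat /user/data/file.csv | head   # print the first lines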

There are three flags that we can use to load data from the local machine into HDFS:
-copyFromLocal
We use this flag to copy data from the local file system to the Hadoop directory.
hdfs dfs -copyFromLocal /home/username/file.csv /user/data/file.csv
If the target folder does not already exist, we can create it (as the HDFS or root user):
hdfs dfs -mkdir /user/data
-put
As @Sam mentioned in the answer above, we also use the -put flag to copy data from the local file system to the Hadoop directory.
hdfs dfs -put /home/username/file.csv /user/data/file.csv
-moveFromLocal
We also use the -moveFromLocal flag to copy data from the local file system to the Hadoop directory, but this removes the file from the local directory (a quick check is shown after the command below).
hdfs dfs -moveFromLocal /home/username/file.csv /user/data/file.csv
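For example, one way to see how -moveFromLocal differs from -put and -copyFromLocal (same example paths as above) is to check both sides after the move:
hdfs dfs -ls /user/data          # file.csv now exists in HDFS
ls /home/username/file.csv       # "No such file or directory": the local copy is gone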

Related

How is an HDFS directory partitioned by year, month, and day created?

Following the question in this link, there is another question about creating the directory on Hadoop HDFS.
I am new to Hadoop/Flume, and I have picked up a project which uses Flume to save CSV data into HDFS. The setting for the Flume sink is as follows:
contract-snapshot.sinks.hdfs-sink-contract-snapshot.hdfs.path = /dev/wimp/contract-snapshot/year=%Y/month=%n/day=%e/snapshottime=%k%M
With this Flume setting, the corresponding CSV file will be saved into HDFS under the folder:
"/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv"
I am trying to set up the whole system locally, and I have Hadoop installed on my Windows PC. How can I create the directory "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/" on the local HDFS?
In the cmd terminal, this command:
hadoop fs -mkdir /wimp/contract-snapshot
creates the folder /wimp/contract-snapshot. However, the following command does not work in the cmd terminal:
hadoop fs -mkdir /wimp/contract-snapshot/year=2020
How can I create the HDFS directories by year, month, and day?
hadoop fs -mkdir "/wimp/contract-snapshot/year=2020"
Adding quotation marks solves the problem.
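Building on that, a minimal sketch for creating the full date-partitioned path in one go (the -p flag creates any missing parent directories, and the quotes keep the shell from mangling the = signs):
hadoop fs -mkdir -p "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055"
hadoop fs -ls "/wimp/contract-snapshot/year=2020/month=6/day=10"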

cloudera quick start load csv table hdfs with terminal

I am new to all this, as I am only in my second semester, and I just need help understanding a command. I am trying to load a local CSV file to HDFS on Cloudera using the terminal. I have to use that data and work with Pig for an assignment. I have tried everything and it still gives me 'no such file or directory'. I have turned off safe mode, checked the directories, and even made sure the file could be read. Here are the commands I have tried to load the data:
hadoop fs -copyFromLocal 2008.csv
hdfs dfs -copyFromLocal 2008.csv
hdfs dfs -copyFromLocal 2008.csv /user/root
hdfs dfs -copyFromLocal 2008.csv /home/cloudera/Desktop
Nothing at all has worked, and it keeps giving me '2008.csv': no such file or directory. What can I do to fix this? Thank you very much.
I have to use that data and work with Pig for an assignment
You can run Pig without HDFS.
pig -x local
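As a minimal sketch of what local mode looks like (the file path and the comma delimiter are assumptions about the assignment data), you can load the CSV straight from the local file system at the grunt> prompt:
pig -x local
grunt> A = LOAD '/home/cloudera/Desktop/2008.csv' USING PigStorage(',');
grunt> DUMP A;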
I have tried everything and it still gives me 'no such file or directory'
Well, that error is not from HDFS; it seems to be from your local shell.
Running ls shows you which files are in the current directory; those are the only files -copyFromLocal or -put will find without an absolute path.
To be completely sure of what you are copying, and where to, use full paths in both arguments. The second path is always in HDFS when using those two commands.
Try this
hadoop fs -mkdir -p /user/cloudera # just in case
hadoop fs -copyFromLocal ./2008.csv /user/cloudera/
Or even
hadoop fs -copyFromLocal /home/cloudera/Desktop/2008.csv /user/cloudera/
What I think you are having issues with is that /user/root is not correct unless you are running the commands as the root user, and neither is /home/cloudera/Desktop, because HDFS has no concept of a Desktop.
The default behavior without the second path is
hadoop fs -copyFromLocal <file> /user/$(whoami)/
(Without the trailing slash, or a pre-existing directory, it will copy <file> literally as a file named after the destination, which can be unexpected in certain situations, for example when trying to copy a file into a user directory that doesn't exist yet.)
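For instance, a hypothetical run as the cloudera user, just to illustrate where the file ends up by default:
hadoop fs -copyFromLocal 2008.csv    # lands in /user/cloudera/2008.csv
hadoop fs -ls /user/cloudera         # verify the copy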
I believe you have already checked and made sure that 2008.csv exists. That is why I think the permissions on this file are not allowing you to copy it.
Try: sudo -u hdfs cat 2008.csv
If you get a permission denied error, this is your issue: fix the permissions on the file, or create a new one. If you again get a "no file" error, try using the whole path for the file, like:
hdfs dfs -copyFromLocal /user/home/csvFiles/2008.csv /user/home/cloudera/Desktop
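A quick, hedged way to check and fix local read permissions before copying (the 644 mode is just an example, and the destination follows the earlier answer):
ls -l 2008.csv        # check the owner and read permission
chmod 644 2008.csv    # make the file readable if needed
hdfs dfs -copyFromLocal /home/cloudera/Desktop/2008.csv /user/cloudera/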

Hadoop load CSV file into HDFS - error Name node is in safe mode

I'm trying to load a .csv file into HDFS. For that I do this:
hdfs dfsadmin -safemode leave
sudo -u hdfs hadoop fs -copyFromLocal /home/cloudera/reduced.csv /user/cloudera
But when I submit this I'm getting:
copyFromLocal: Cannot create file/user/cloudera/reduced.csv._COPYING_. Name node is in safe mode.
I already saw this post:
Name node is in safe mode. Not able to leave
But I still get this error...
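One hedged sketch of the usual next checks, assuming a Cloudera-style setup where hdfs is the HDFS superuser: safe mode generally has to be left by that superuser, and you can confirm its status before retrying the copy.
sudo -u hdfs hdfs dfsadmin -safemode get     # check whether the NameNode is still in safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave   # leave safe mode as the HDFS superuser
sudo -u hdfs hadoop fs -copyFromLocal /home/cloudera/reduced.csv /user/cloudera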

Mysql Import Finding File in Directory

I'm trying to import a CSV file from the command line after connecting to my RDS database. Unfortunately, I'm fairly new to the mysql command and to file navigation commands. I'm looking to navigate to the directory where the CSV I'm importing is located; the path is ~/Desktop/images.csv. I know I should use mysqlimport, but I can't figure out the command to change directory.
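A minimal sketch, assuming a table named images already exists and that the host, user, and database names below are placeholders: mysqlimport loads a file into the table whose name matches the file name, so you can change into the directory first and pass --local so the file is read from the client machine.
cd ~/Desktop
mysqlimport --local --fields-terminated-by=',' -h your-rds-endpoint.rds.amazonaws.com -u your_user -p your_database images.csv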

Accessing csv file placed in hdfs using spark

I have placed a CSV file into the HDFS file system using the hadoop fs -put command. I now need to access the CSV file using PySpark. Its format is something like
`plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')`
I am a newbie to hdfs. How do I find the address to be placed in hdfs://x.x.x.x?
Here's the output when I entered
hduser@remus:~$ hdfs dfs -ls /input
Found 1 items
-rw-r--r-- 1 hduser supergroup 158 2015-06-12 14:13 /input/test.csv
Any help is appreciated.
You need to provide the full path of your file in HDFS, and the URL is whatever is defined in your Hadoop configuration (core-site or hdfs-site).
Check your core-site.xml and hdfs-site.xml to get the details about the URL.
An easy way to find the URL is to access your HDFS from the browser (the NameNode web UI) and note the path.
If you are using an absolute path on your local file system, use file:///<your path>.
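If you'd rather not dig through the XML by hand, one way (assuming the hdfs client is on your PATH) is to ask it for the configured default file system directly; its output is the hdfs://x.x.x.x part you prepend to the file path:
hdfs getconf -confKey fs.defaultFS    # e.g. prints hdfs://localhost:9000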
Try specifying the absolute path without hdfs://:
plaintext_rdd = sc.textFile('/input/test.csv')
Spark, when running on the same cluster as HDFS, uses hdfs:// as the default FS.
Start the spark shell or spark-submit by pointing to the package that can read CSV files, like below:
spark-shell --packages com.databricks:spark-csv_2.11:1.2.0
And in the spark code, you can read the csv file as below:
val data_df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.schema(<pass schema if required>)
.load(<location in HDFS/S3>)