I have a file sitting in my Cloudera project under "/home/cdsw/npi.json". I've tried the following PySpark commands to read it from my "local" CDSW project, but none of them work; they all throw a "Path does not exist" error:
npi = sc.read.format("json").load("file:///home/cdsw/npi.json")
npi = sc.read.format("json").load("file:/home/cdsw/npi.json")
npi = sc.read.format("json").load("home/cdsw/npi.json")
As per the documentation Accessing Data from HDFS:
From the terminal, copy the file from the local file system to HDFS, using either -put or -copyFromLocal:
hdfs dfs -put /home/cdsw/npi.json /destination
where /destination is a directory in HDFS.
Then, read the file in PySpark (using the SparkSession's DataFrame reader; a SparkContext has no read attribute):
npi = spark.read.format("json").load("/destination/npi.json")
For more information:
put
put [-f] [-p] [-l] <localsrc> ... <destination>
Copy files from the local file system into fs. Copying fails if the file already
exists, unless the -f flag is given.
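If you prefer to do both steps from inside the CDSW session itself, here is a minimal sketch of the same workflow, assuming Spark 2.x, that the hdfs CLI is on the PATH, and that /destination already exists in HDFS:
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-npi-json").getOrCreate()

# Copy the project-local file into HDFS (same as running `hdfs dfs -put` in a terminal);
# -f overwrites any existing copy.
subprocess.check_call(["hdfs", "dfs", "-put", "-f", "/home/cdsw/npi.json", "/destination/"])

# Read it back from HDFS with the DataFrame reader.
npi = spark.read.json("/destination/npi.json")
npi.printSchema()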
Related
I am trying to load Avro files in S3 into a table in Redshift. One of the Avro files doesn't have a correct format. The problem is that when the COPY command tries to load that file, it throws an exception and doesn't run the copy for the correct files. How can I skip the wrongly formatted file and copy the correct files? Here is my code for loading the files:
COPY tmp.table
FROM 's3://{BUCKET}/{PREFIX}'
IAM_ROLE '{ROLE}'
FORMAT AVRO 's3://{BUCKET}/{AVRO_PATH}'
The error that I am getting is:
code: 8001
context: Cannot init avro reader from s3 file Incorrect Avro container file magic number
query: 19308992
location: avropath_request.cpp:438
process: query0_125_19308992 [pid=23925]
You can preprocess the files under s3://{BUCKET}/{PREFIX} and create a manifest file that lists only the Avro files with the right format/schema. Redshift can't do this for you and will try to process every file on the s3://{BUCKET}/{PREFIX} path.
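As a rough sketch of that preprocessing step, assuming boto3 is available, that "valid" simply means the object starts with the Avro container magic bytes (Obj followed by 0x01), and using placeholder bucket, prefix, and manifest names:
import json
import boto3

BUCKET = "my-bucket"                            # placeholder
PREFIX = "my/avro/prefix/"                      # placeholder
MANIFEST_KEY = "manifests/good_avro.manifest"   # placeholder

s3 = boto3.client("s3")
entries = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Fetch only the first four bytes and compare them against the Avro magic number.
        head = s3.get_object(Bucket=BUCKET, Key=obj["Key"], Range="bytes=0-3")["Body"].read()
        if head == b"Obj\x01":
            entries.append({"url": "s3://{}/{}".format(BUCKET, obj["Key"]), "mandatory": True})

# Redshift manifest format: {"entries": [{"url": ..., "mandatory": ...}, ...]}
s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
              Body=json.dumps({"entries": entries}).encode("utf-8"))
Then point COPY at the manifest file instead of the prefix and add the MANIFEST keyword (keeping your FORMAT AVRO options); Redshift will then load only the files listed in the manifest.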
I used Sqoop to import a table from MySQL to the HDFS location /user/cloudera/table1. Now, what should the command be to load this table in my PySpark code? I am just writing simple code as below. I am using Cloudera CDH 5.13. Thanks.
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)
    data = ...  # code here to import the table from HDFS
Sqoop imports data in text format by default; you can also set this explicitly with the --as-textfile option.
Refer: Sqoop Documentation
Read 7.2.8. File Formats for a better understanding.
To check the file format manually, use the hdfs ls and cat commands:
ls lists the files under a path.
cat prints the contents of a selected file.
You can use hdfs dfs instead of hadoop fs in the commands below.
hadoop fs -ls /user/cloudera/table1
hadoop fs -cat /user/cloudera/table1/samplefile.txt
Note: if the data is human-readable, then it is in text format.
To read the data from HDFS in PySpark, you can use textFile:
textFile = sc.textFile("hdfs://namenodehost/user/cloudera/table1/samplefile.txt")
textFile.first()
Refer: reading-a-file-in-hdfs-from-pyspark
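As a minimal sketch of going from that text output to a DataFrame, assuming Sqoop's default comma field delimiter and made-up column names (id, name) purely for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark Count").getOrCreate()

# Each line of the Sqoop text output is one record with comma-separated fields.
lines = spark.sparkContext.textFile("hdfs://namenodehost/user/cloudera/table1")
rows = lines.map(lambda line: line.split(","))
df = rows.toDF(["id", "name"])  # placeholder column names; use your table's actual columns
df.show(5)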
In a program I have a CSV extracted from Excel, and I need to upload the CSV to HDFS and save it in Parquet format; the Python or Spark version doesn't matter, but no Scala please.
Almost all the discussions I came across are about Databricks; however, it seems Spark cannot find the file. Here are the code and the error:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load("file:///home/rxie/csv_out/wamp.csv")
Error:
java.io.FileNotFoundException: File file:/home/rxie/csv_out/wamp.csv does not exist
The file path:
ls -la /home/rxie/csv_out/wamp.csv
-rw-r--r-- 1 rxie linuxusers 2896878 Nov 12 14:59 /home/rxie/csv_out/wamp.csv
Thank you.
I found the issue now!
The reason it errors out with "file not found" is actually correct: I was creating the Spark context with setMaster("yarn-cluster"), which means all worker nodes look for the CSV file, and of course none of the worker nodes (except the one where the program starts and where the CSV resides) has this file, hence the error. What I really should do is use setMaster("local").
FIX:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('test').setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
csv = "file:///home/rxie/csv_out/wamp.csv"
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load(csv)
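To complete the original goal of storing the data in HDFS as Parquet, a write along these lines should work; the output path below is only a placeholder, and it assumes the cluster's Hadoop configuration is available so that the default file system is HDFS:
# Write the DataFrame out to HDFS in Parquet format; the target path is a placeholder.
df.write.mode("overwrite").parquet("/user/rxie/csv_out/wamp_parquet")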
Yes, you are right, the file should be present on all worker nodes.
Well, you can still read a local file in YARN cluster mode; you just need to add your file using addFile:
spark.sparkContext.addFile("file:///your local file path ")
Spark will copy the file to each node where an executor is created, so it can process your file in cluster mode as well.
I am using Spark 2.3, so you may need to adjust your Spark context accordingly, but the addFile method stays the same.
Try this with YARN (cluster mode) and let me know if it works for you.
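For completeness, here is a rough sketch of that mechanism, reusing the wamp.csv path from the question; inside a task, SparkFiles.get resolves the node-local copy that addFile distributed:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("addfile-demo").getOrCreate()

# Distribute the driver-local file to every node that runs an executor.
spark.sparkContext.addFile("file:///home/rxie/csv_out/wamp.csv")

def read_local_copy(_):
    # Inside a task, SparkFiles.get returns the path of this node's local copy.
    import csv
    with open(SparkFiles.get("wamp.csv")) as f:
        return list(csv.reader(f))

# Run a single task and collect the parsed rows back to the driver.
rows = spark.sparkContext.parallelize([0], numSlices=1).flatMap(read_local_copy).collect()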
I'm trying to import a CSV file into a Neo4j DB using this script:
LOAD CSV FROM "file:///dataframe6.txt" AS line
RETURN count(*)
But I get the following error:
Neo.ClientError.Statement.ExternalResourceFailed
Couldn't load the external resource at: file:/home/gaurav/sharing/dataframe6.txt
P.S.: I'm using an Ubuntu machine.
I added these lines to the Neo4j configuration:
dbms.directories.import=/home/gaurav/sharing/
dbms.security.allow_csv_import_from_file_urls=true
I also had to change the folder permissions for the neo4j user; it works now.
root@priyal-Inspiron-N5030:/home/priyal# hadoop dfs -copyFromLocal in /in
root@priyal-Inspiron-N5030:/home/priyal# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount in out
INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1424175893740_0008
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/root/in
Can someone please suggest what exactly the problem is and how I can resolve it? I am a new user.
The error
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/root/in
says that the input path /user/root/in is not present in your HDFS. You can browse your file system using the simple web UI:
http://localhost:50070/
Then click on the Utilities drop-down -> Browse the file system, and check your folder hierarchy, i.e. /user/root/in.
If it does not exist, you can create the folder with the command
hdfs dfs -mkdir -p hdfs://localhost:9000/user/root/in
Now try to execute your copy command again:
hdfs dfs -copyFromLocal /local/path/of/source/file hdfs://localhost:9000/user/root/in
Hope this helps you :)