Databrick csv cannot find local file - csv

In a program I have csv extracted from excel, I need to upload the csv to hdfs and save it as parquet format, doesn't matter with python version or spark version, no scala please.
Almost all discussions I came across are about databrick, however, it seems cannot find the file, here is the code and error:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load("file:///home/rxie/csv_out/wamp.csv")
Error:
java.io.FileNotFoundException: File file:/home/rxie/csv_out/wamp.csv
does not exist
The file path:
ls -la /home/rxie/csv_out/wamp.csv
-rw-r--r-- 1 rxie linuxusers 2896878 Nov 12 14:59 /home/rxie/csv_out/wamp.csv
Thank you.

I found the issue now!
The reason why it errors out of file not found is actually correct, because I was using Spark Context with setMaster("yarn-cluster"), that means all worker nodes will look for the csv file, of course all worker nodes (except the one starting the program where the csv resides) do not have this file and hence error out. What I really should do is to use setMaster("local").
FIX:
conf = SparkConf().setAppName('test').setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
csv = "file:///home/rxie/csv_out/wamp.csv"
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema","true").option("delimiter",",").load(csv)

Yes, you are right, the file should be present at all worker nodes.
well. you can still read a local file in yarn cluster mode. you just need to add your file using addFile.
spark.sparkContext.addFile("file:///your local file path ")
spark will copy the file to each node where executor will be created and can be able to process your file in cluster mode as well.
I am using spark 2.3 version so you can change your spark context accordingly but addFile method remains same.
try this with your yarn (cluster mode) and let me know if it works for you.

Related

Read file from Cloudera CDSW Project with PySpark

I have a file sitting in my Cloudera project under "/home/cdsw/npi.json". I've tried using the following commands to use PySpark for reading from my "local" CDSW project, but can't get at it with any of the following commands. They all throw the "Path does not exist: " error
npi = sc.read.format("json").load("file:///home/cdsw/npi.json")
npi = sc.read.format("json").load("file:/home/cdsw/npi.json")
npi = sc.read.format("json").load("home/cdsw/npi.json")
As per this documentation, Accessing Data from HDFS
From terminal, copy the file from local file system to HDFS. Either use -put or -copyFromLocal.
hdfs dfs -put /home/cdsw/npi.json /destination
where, /destination is in HDFS.
Then, read the file in PySpark.
npi = sc.read.format("json").load("/destination/npi.json")
For more information:
put
put [-f] [-p] [-l] <localsrc> ... <destination>
Copy files from the local file system into fs. Copying fails if the file already
exists, unless the -f flag is given.

How to import a packge from a local jar in pyspark?

I am using pyspark to do some work on a csv file, hence I need to import package from spark-csv_2.10-1.4.0.jar downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar
I downloaded the jar to my local due to proxy issue.
Can anyone tell me what is the right usage of referring to a local jar:
Here is the code I use:
pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar
it will take me to the pyspark shell as expected, however, when I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')
the route.dat is uploaded to hdfs already at hdfs://dev-icg/user/spark/routes.dat
It gives me error:
: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
If I run:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')
I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.lang.NoClassDefFoundError: Could not initialize class
com.databricks.spark.csv.package$
Can anyone help to sort it out for me? Thank you very much. Any clue is appreciated.
The correct way to do this would be to add the options (say if you are starting a spark shell)
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
I have not used the databricks csvjar directly, but I used a netezza connector to spark where they mention using this option
https://github.com/SparkTC/spark-netezza

Neo.ClientError.Statement.ExternalResourceFailed on ubuntu

I'm trying to import a csv file in NEO4j db using script :
LOAD CSV FROM "file:///dataframe6.txt" AS line
RETURN count(*)
But I get following error:
Neo.ClientError.Statement.ExternalResourceFailed
Couldn't load the external resource at: file:/home/gaurav/sharing/dataframe6.txt
P.S. : I'm using ubuntu machine
added this line dbms.directories.import=/home/gaurav/sharing/
and dbms.security.allow_csv_import_from_file_urls=true
I had to change folder permission for user NEO4j, it works now

Compress text file (CSV data) using LZ4 and read it in Spark

I was using the linux command line lz4 to compress the csv file.
example:-
lz4 input.csv
which results in input.csv.lz4 as output
But when I try to read the lz4 file in spark shell using following commands it always results in empty result.
val output = sparkSession.read.format("com.databricks.spark.csv").option("delimiter", "\t").load("s3:///input.csv.lz4")
output.count
res: Long = 0
I found somewhere that lz4 commandline tool might incompatible with spark
https://forums.databricks.com/questions/7957/how-can-i-read-in-lz4-compressed-json-files.html
Has anyone got it working on reading lz4 files in spark. If yes how was lz4 file created ?

Spark Read.json cant find file

Hey I all I have 1 Master and 1 Slave Node Standalone Spark Cluster on AWS. I have a folder my home directory called ~/Notebooks. This is were I launch jupyter notebooks and connect jupyter in my browser. I also have a file in there called people.json (simple json file).
I try running this code
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.json("people.json")
I get this error when i run that last line. I don't get it the file is right there... Any Ideas?-
Py4JJavaError: An error occurred while calling o238.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 37, ip-172-31-7-160.us-west-2.compute.internal): java.io.FileNotFoundException: File file:/home/ubuntu/Notebooks/people.json does not exist
Make sure the file is available on the worker nodes. Best way is to use a shared files system (NFS, HDFS). Read External Datasets documentation