Why is pyspark unable to read this csv file?

I was unable to find this problem among the numerous similar Stack Overflow questions of the form "how to read csv into a pyspark dataframe?" (see the list of similar-sounding but different questions at the end).
The CSV file in question resides in the /tmp directory of the cluster's driver node. Note that this CSV file is intentionally NOT in Databricks DBFS cloud storage; using DBFS will not work for the use case that led to this question.
Note that I am trying to get this working on Databricks Runtime 10.3 with Spark 3.2.1 and Scala 2.12. First, create a small test CSV file on the driver with plain Python:
y_header = ['fruit', 'color', 'size', 'note']
y = [('apple', 'red', 'medium', 'juicy')]
y.append(('grape', 'purple', 'small', 'fresh'))

import csv
with open('/tmp/test.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow(y_header)
    w.writerows(y)
Then use Python's os module to verify the file was created:
import os
list(filter(lambda f: f == 'test.csv',os.listdir('/tmp/')))
Now verify that the Databricks dbutils API can see the file (note that you have to use the file:/// prefix):
dbutils.fs.ls('file:///tmp/test.csv')
Now, optional step, specify a dataframe schema for Spark to apply to the csv file:
from pyspark.sql.types import StructType, StructField, StringType
csv_schema = StructType([
    StructField('fruit', StringType()),
    StructField('color', StringType()),
    StructField('size', StringType()),
    StructField('note', StringType()),
])
Now define the PySpark dataframe:
x = spark.read.csv('file:///tmp/test.csv',header=True,schema=csv_schema)
The above line runs with no errors, but remember that, due to lazy evaluation, the Spark engine still has not read the file. So next we give Spark a command that forces it to actually evaluate the dataframe:
display(x)
And the error is:
FileReadException: Error while reading file file:/tmp/test.csv. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. If Delta cache is stale or the underlying files have been removed, you can invalidate Delta cache manually by restarting the cluster.
Caused by: FileNotFoundException: File file:/tmp/test.csv does not exist. . .
Digging into the error I found this: java.io.FileNotFoundException: File file:/tmp/test.csv does not exist. I already tried restarting the cluster; the restart did not clear the error.
But I can prove the file does exist; it is only Spark and Java that for some reason cannot access it, because I can read the same file with pandas with no problem:
import pandas as p
p.read_csv('/tmp/test.csv')
So how do I get Spark to read this CSV file?
appendix - list of similar spark read csv questions I searched through that did not answer my question: 1 2 3 4 5 6 7 8

I guess the Databricks file loader doesn't recognize the absolute path /tmp/.
You can try the following workaround:
Read the file into a pandas DataFrame using its local path.
Pass the pandas DataFrame to Spark using the createDataFrame function.
Code:
import pandas as pd

df_pd = pd.read_csv('/tmp/test.csv')
sparkDF = spark.createDataFrame(df_pd)
sparkDF.display()
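If letting Spark infer the column types from pandas is a concern, the StructType defined earlier in the question could presumably be passed in as well. A minimal variation of the same sketch, assuming df_pd and csv_schema from above are still in scope:
# Variation: supply the schema explicitly so Spark does not infer types
# from the pandas DataFrame (csv_schema is the StructType from the question).
sparkDF = spark.createDataFrame(df_pd, schema=csv_schema)
sparkDF.display()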

I made email contact with a Databricks architect, who confirmed that Databricks can only read local (driver) files in a single-node setup.
So DBFS is the only option for arbitrary writing/reading of text data files on a typical cluster that contains more than one node.
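For completeness, a minimal sketch of that DBFS route on a multi-node cluster: copy the driver-local file into DBFS with dbutils, then point Spark at the DBFS copy. The dbfs:/tmp/test.csv destination is an assumed example path, and csv_schema is the StructType defined in the question.
# Copy the driver-local file into DBFS, then read it with Spark.
dbutils.fs.cp('file:/tmp/test.csv', 'dbfs:/tmp/test.csv')
x = spark.read.csv('dbfs:/tmp/test.csv', header=True, schema=csv_schema)
display(x)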

Related

Best Approach to read large number of JSON Files (1 JSON per file) in Databricks

Hope everyone is doing well.
I am trying to read a large number of JSON files, i.e. around 150,000 files in a folder, using Azure Databricks. Each file contains a single JSON object, i.e. 1 record per file. Currently the read process is taking over an hour just to read all the files, despite having a huge cluster. The files are read using a pattern as shown below.
val schema_variable = <schema>
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json
val df = spark.read
  .schema(schema_variable)
  .json(file_path)
  .withColumn("file_name", input_file_name())
Is there any approach or option we can try to make the reads faster?
We have already considered copying the file contents into a single file and then reading it, but then we lose the lineage of the file contents, i.e. which record came from which file.
I have also gone through various links on SO, but most of them seem to be about one or a few files of huge size, say 10 GB to 50 GB.
Environment - Azure Databricks 10.4 Runtime.
Thank you for all the help.
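One hedged idea, not from the original thread: since input_file_name() already records which file each record came from, the small files could be read once and consolidated into a few larger Parquet files that keep the file_name column, so later reads stay fast without losing lineage. A minimal PySpark sketch (the question uses Scala, but the idea carries over), with assumed example paths and an assumed schema object:
from pyspark.sql.functions import input_file_name

# schema_variable and the paths below are assumed examples.
raw = (spark.read
    .schema(schema_variable)
    .json("src_folder/2022/09/01/10/*/*.json")
    .withColumn("file_name", input_file_name()))

# One-time consolidation; lineage is preserved in the file_name column.
raw.write.mode("overwrite").parquet("dst_folder/2022/09/01/10")

# Subsequent reads hit a few large files instead of ~150,000 tiny ones.
df = spark.read.parquet("dst_folder/2022/09/01/10")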

Read Array Of Jsons From File to Spark Dataframe

I have a gzipped JSON file that contains an array of JSON objects, something like this:
[{"Product":{"id":1,"image":"/img.jpg"},"Color":"black"},{"Product":{"id":2,"image":"/img1.jpg"},"Color":"green"}.....]
I know this is not the ideal data format to read into Scala; however, there is no alternative but to process the feed in this manner.
I have tried :
spark.read.json("file-path")
which seems to take a long time (it processes very quickly if you have data in MBs, but takes far too long for GBs worth of data), probably because Spark is not able to split the file and distribute it across the other executors.
I wanted to see if there is any way to preprocess this data and load it into the Spark context as a dataframe.
The functionality I want seems to be similar to: Create pandas dataframe from json objects. But I wanted to see if there is any Scala alternative that could do something similar and convert the data to a Spark RDD / dataframe.
You can read the gzip file using spark.read().text("gzip-file-path"). Since the Spark APIs are built on top of the HDFS API, Spark can read the gzip file and decompress it on the fly.
https://github.com/mesos/spark/blob/baa30fcd99aec83b1b704d7918be6bb78b45fbb5/core/src/main/scala/spark/SparkContext.scala#L239
However, gzip is not splittable, so Spark creates an RDD with a single partition. Hence, reading large gzip files with Spark does not make much sense.
You may decompress the gzip file and read the decompressed files to get the most out of the distributed processing architecture.
It appeared to be a problem with the data format being given to Spark for processing. I had to pre-process the data into a Spark-friendly format and run the Spark jobs over that. This is the preprocessing I ended up doing: https://github.com/dipayan90/bigjsonprocessor/blob/master/src/main/java/com/kajjoy/bigjsonprocessor/Application.java
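As a rough illustration of that kind of preprocessing, here is a hedged Python sketch that decompresses the gzip and rewrites the top-level array as newline-delimited JSON, which Spark can split across executors. The file paths are assumed examples, and a streaming JSON parser would be needed if the feed does not fit in driver memory:
import gzip
import json

src = '/data/feed.json.gz'   # assumed example path
dst = '/data/feed.ndjson'    # assumed example path

# Decompress and rewrite the top-level array as one JSON object per line.
with gzip.open(src, 'rt') as fin, open(dst, 'w') as fout:
    for record in json.load(fin):
        fout.write(json.dumps(record) + '\n')

# Newline-delimited JSON is splittable, so this read can be parallelized.
df = spark.read.json(dst)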

Spark CSV Handle Corrupt GZip Files

I have a Spark 2.0 Java application that uses Spark's CSV reading utilities to read a CSV file into a dataframe. The problem is that sometimes 1 out of 100 input files may be invalid (corrupt gzip), which causes the job to fail with:
java.lang.IllegalStateException: Error reading from input
When I used to read the files as text files and manually parse the CSV, I was able to write a custom TextInputFormat to handle exceptions. I can't figure out how to specify a custom TextInputFormat when using Spark's CSV reader. Any help would be appreciated.
Current code for reading CSV:
Dataset<Row> csv = sparkSession.read()
    .option("delimiter", parseSettings.getDelimiter().toString())
    .option("quote", parseSettings.getQuote())
    .option("parserLib", "UNIVOCITY")
    .csv(paths);
Thanks,
Nathan

Spark S3 CSV read returns org.apache.hadoop.mapred.InvalidInputException

I see several posts here and in a Google search for org.apache.hadoop.mapred.InvalidInputException, but most deal with HDFS files or trapping errors. My issue is that while I can read a CSV file from spark-shell, running the same read from a compiled JAR consistently returns an org.apache.hadoop.mapred.InvalidInputException error.
The rough process of the jar:
1. read from JSON documents in S3 (this works)
2. read from parquet files in S3 (this also succeeds)
3. write a result of a query against #1 and #2 to a parquet file in S3 (also succeeds)
4. read a configuration CSV file from the same bucket that #3 is written to (this fails)
These are the various approaches that I have tried in code:
1. val osRDD = spark.read.option("header","true").csv("s3://bucket/path/")
2. val osRDD = spark.read.format("com.databricks.spark.csv").option("header", "true").load("s3://bucket/path/")
All variations of the two above with s3, s3a and s3n prefixes work fine from the REPL but inside a JAR they return this:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/path/eventsByOS.csv
So, it found the file but can't read it.
Thinking this was a permissions issue, I have tried:
a. export AWS_ACCESS_KEY_ID=<access key> and export AWS_SECRET_ACCESS_KEY=<secret> from the Linux prompt. With Spark 2 this has been sufficient to provide us access to the S3 folders up until now.
b. .config("fs.s3.access.key", <access>)
.config("fs.s3.secret.key", <secret>)
.config("fs.s3n.access.key", <access>)
.config("fs.s3n.secret.key", <secret>)
.config("fs.s3a.access.key", <access>)
.config("fs.s3a.secret.key", <secret>)
Before this failure, the code reads from parquet files located in the same bucket and writes parquet files to the same bucket. The CSV file is only 4.8 KB in size.
Any ideas why this is failing?
Thanks!
Adding stack trace:
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:253)
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
scala.Option.getOrElse(Option.scala:121)
org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1332)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.take(RDD.scala:1326)
org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1367)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.first(RDD.scala:1366)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.findFirstLine(CSVFileFormat.scala:206)
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
scala.Option.orElse(Option.scala:289)
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:352)
Nothing springs out when I paste that stack into the IDE, but I'm looking at a later version of Hadoop and can't currently switch to an older one.
Have a look at these instructions.
That landsat .gz file is actually a CSV file you can try to read in; it's the one we generally use for testing because it's there and free to use. Start by seeing if you can work with it.
If using Spark 2.0, use Spark's own CSV package.
Do use S3a, not the others.
I solved this problem by adding the specific Hadoop configuration for the appropriate scheme (s3 in the example here). The odd thing is that the security setup above works for everything in Spark 2.0 EXCEPT reading the CSV.
This code solved my problem using S3.
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", p.aws_accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey",p.aws_secretKey)

How to load jar dependencies in IPython Notebook

This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts, such as this one, describing how to use spark-csv.
But I am not able to initialize the IPython instance by including either the .jar file or the package extension at start-up, the way it can be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
These properties can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:
import os

packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
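For reference, a minimal sketch of actually using spark-csv once it has been loaded this way, with the Spark 1.x SQLContext API that the package targets; 'cars.csv' is an assumed example path:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# 'cars.csv' is an assumed example file.
df = (sqlContext.read.format('com.databricks.spark.csv')
    .options(header='true', inferschema='true')
    .load('cars.csv'))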