I am unable to load a CSV file directly from Azure Blob Storage into a RDD by using PySpark in a Jupyter Notebook.
I have read through just about all of the other answers to similar problems but I haven't found specific instructions for what I am trying to do. I know I could also load the data into the Notebook by using Pandas, but then I would need to convert the Panda DF into an RDD afterwards.
My ideal solution would look something like this, but this specific code give me the error that it can't infer a schema for CSV.
#Load Data
source = <Blob SAS URL>
elog = spark.read.format("csv").option("inferSchema", "true").option("url",source).load()
I have also taken a look at this answer: reading a csv file from azure blob storage with PySpark
but I am having trouble defining the correct path.
Thank you very much for your help!
Here is my sample code with Pandas to read a blob url with SAS token and convert a dataframe of Pandas to a PySpark one.
First, to get a Pandas dataframe object via read a blob url.
import pandas as pd
source = '<a csv blob url with SAS token>'
df = pd.read_csv(source)
print(df)
Then, you can convert it to a PySpark one.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("testDataFrame").getOrCreate()
spark_df = spark.createDataFrame(df)
spark_df.show()
Or, the same result with the code below.
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContest = SQLContext(sc)
spark_df = sqlContest.createDataFrame(df)
spark_df.show()
Hope it helps.
Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution could be convert the Spark dataframe to a pandas dataframe and save it as csv:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka or snark suggest, this works for small dataframes that fits into driver. It works for real cases that you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the python standard library this is an easy function that will write to a single file. You don't have to mess with tempfiles or going through another dir.
import csv
def spark_to_csv(df, file_path):
""" Converts spark dataframe to CSV file """
with open(file_path, "w") as f:
writer = csv.DictWriter(f, fieldnames=df.columns)
writer.writerow(dict(zip(fieldnames, fieldnames)))
for row in df.toLocalIterator():
writer.writerow(row.asDict())
If the result size is comparable to spark driver node's free memory, you may have problems with converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.
A more databricks'y' solution is here:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
temporary_csv = os.path.join(TEMPORARY_TARGET, dbutils.fs.ls(TEMPORARY_TARGET)[3][1])
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note if you are working from Koalas data frame you can replace spark_df with koalas_df.to_spark()
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no dataframe spark API which writes/creates a single file instead of directory as a result of write operation.
Below both options will create one single file inside directory along with standard files (_SUCCESS , _committed , _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header",
"true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and take advantage of sparks parallelism for writing files then it will create multiple data files inside directory.
You need to write function in driver which will combine all data file parts to single file(cat part-00000* singlefilename ) once write operation is done.
I had the same problem and used python's NamedTemporaryFile library to solve this.
from tempfile import NamedTemporaryFile
s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()
Create temp folder inside output folder. Copy file part-00000* with the file name to output folder. Delete the temp folder. Python code snippet to do the same in Databricks.
fpath=output+'/'+'temp'
def file_exists(path):
try:
dbutils.fs.ls(path)
return True
except Exception as e:
if 'java.io.FileNotFoundException' in str(e):
return False
else:
raise
if file_exists(fpath):
dbutils.fs.rm(fpath)
df.coalesce(1).write.option("header", "true").csv(fpath)
else:
df.coalesce(1).write.option("header", "true").csv(fpath)
fname=([x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')])
dbutils.fs.cp(fpath+"/"+fname[0], output+"/"+"name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides file pointer for hdfs file system. You can write your content to file pointer as a usual file writing. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initiate pyarrow you should define environment CLASSPATH properly, set the output of hadoop classpath --glob to it
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
you can use this and if you don't want to give the name of CSV everytime you can write UDF or create an array of the CSV file name and give it to this it will work
I have an app that receives a CSV formatted data by 256 bytes chunks.
My goal is to have the file imported to MySQL. As for now, I am saving the entire CSV to a temp file and then import it to MySQL using IMPORT DATA INFILE ....
It works as expected my only doubt is that maybe there is a better solution to import the data in a more efficient way.
How to convert json to parquet in streaming with Spark?
Acutually i have to ssh from a server, recieve a big json file, convert it to parquet, and upload it on hadoop.
I there a way to do this in a pipelined way?
They are backup files so I have a directory with a predefined amount of files that don't change in size in time
Something like:
scp host /dev/stdout | spark-submit myprogram.py | hadoop /dir/
edit:
Actually I'm working on this:
sc = SparkContext(appName="Test")
sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec.", "gzip")
#Since i couldn't get the stdio, went for a pipe:
with open("mypipe", "r") as o:
while True:
line = o.readline()
print "Processing: " + line
lineRDD = sc.parallelize([line])
df = sqlContext.jsonRDD(lineRDD)
#Create and append
df.write.parquet("file:///home/user/spark/test", mode="append")
print "Done."
This is working fine, but the resulting parquet is very large (280kb for 4 lines 2 columns json). Any improvements?
If anyone is interested, I managed to resolve this using the .pipe() method.
https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe
I am trying to create a DataFrame from a CSV source that is on S3 on an EMR Spark cluster, using the Databricks spark-csv package and the flights dataset:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://h2o-airlines-unpacked/allyears.csv')
df.first()
This does not terminate on a cluster of 4 m3.xlarges. I am looking for suggestions to create a DataFrame from a CSV file on S3 in PySpark. Alternatively, I have tried putting the file on HDFS and reading from HFDS as well, but that also does not terminate. The file is not overly large (12 GB).
For reading a well-behaved csv file that is only 12GB, you can copy it onto all of your workers and the driver machines, and then manually split on ",". This may not parse any RFC4180 csv, but it parsed what I had.
Add at least 12GB extra space for worker disk space for each worker when you requisition the cluster.
Use a machine type that has at least 12GB RAM, such as c3.2xlarge. Go bigger if you don't intend to keep the cluster around idle and can afford the larger charges. Bigger machines means less disk file copying to get started. I regularly see c3.8xlarge under $0.50/hour on the spot market.
copy the file to each of your workers, in the same directory on each worker. This should be a physically attached drive, i.e. different physical drives on each machine.
Make sure you have that same file and directory on the driver machine as well.
raw = sc.textFile("/data.csv")
print "Counted %d lines in /data.csv" % raw.count()
raw_fields = raw.first()
# this regular expression is for quoted fields. i.e. "23","38","blue",...
matchre = r'^"(.*)"$'
pmatchre = re.compile(matchre)
def uncsv_line(line):
return [pmatchre.match(s).group(1) for s in line.split(',')]
fields = uncsv_line(raw_fields)
def raw_to_dict(raw_line):
return dict(zip(fields, uncsv_line(raw_line)))
parsedData = (raw
.map(raw_to_dict)
.cache()
)
print "Counted %d parsed lines" % parsedData.count()
parsedData will be a RDD of dicts, where the keys of the dicts are the CSV field names from the first row, and the values are the CSV values of the current row. If you don't have a header row in the CSV data, this may not be right for you, but it should be clear that you could override the code reading the first line here and set up the fields manually.
Note that this is not immediately useful for creating data frames or registering a spark SQL table. But for anything else, it is OK, and you can further extract and transform it into a better format if you need to dump it into spark SQL.
I use this on a 7GB file with no issues, except I've removed some filter logic to detect valid data that has as a side effect the removal of the header from the parsed data. You might need to reimplement some filtering.