Spark: Streaming JSON to Parquet

How can I convert JSON to Parquet in a streaming fashion with Spark?
Actually I have to ssh to a server, receive a big JSON file, convert it to Parquet, and upload it to Hadoop.
Is there a way to do this in a pipelined way?
They are backup files, so I have a directory with a predefined number of files whose sizes don't change over time.
Something like:
scp host /dev/stdout | spark-submit myprogram.py | hadoop /dir/
edit:
Actually I'm working on this:
sc = SparkContext(appName="Test")
sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

# Since I couldn't get stdin, I went for a named pipe:
with open("mypipe", "r") as o:
    while True:
        line = o.readline()
        print "Processing: " + line
        lineRDD = sc.parallelize([line])
        df = sqlContext.jsonRDD(lineRDD)
        # Create and append
        df.write.parquet("file:///home/user/spark/test", mode="append")
        print "Done."
This is working fine, but the resulting Parquet output is very large (280 kB for a 4-line, 2-column JSON). Any improvements?
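One likely cause of the size blow-up is that every write.parquet call of a single line produces a separate Parquet file with its own footer and metadata, so per-file overhead dominates. Buffering lines and writing them in batches amortizes that overhead. The batching logic itself is Spark-independent; here is a minimal sketch (the batch size is an illustrative choice, not from the original post):

```python
def batched(lines, batch_size):
    """Group an iterable of lines into lists of at most batch_size."""
    batch = []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the remainder
        yield batch

# In the loop above, each yielded batch would be parallelized and
# written once, e.g. sqlContext.jsonRDD(sc.parallelize(batch)).
for batch in batched(['{"a": 1}', '{"a": 2}', '{"a": 3}'], 2):
    print(len(batch))  # 2, then 1
```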

If anyone is interested, I managed to resolve this using the .pipe() method.
https://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe
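For context, RDD.pipe hands each partition's elements to an external command over stdin, one element per line, and returns the command's stdout lines as a new RDD. The per-partition mechanics can be sketched without a cluster; the uppercasing command below is only a stand-in for a real converter:

```python
import subprocess
import sys

def pipe_partition(elements, command):
    """Mimic what RDD.pipe does for one partition: feed the elements
    to the command's stdin, one per line, and collect stdout lines."""
    result = subprocess.run(
        command,
        input="\n".join(elements) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()

# Illustrative external command: uppercase each input line.
cmd = [sys.executable, "-c",
       "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"]
print(pipe_partition(["foo", "bar"], cmd))  # ['FOO', 'BAR']
```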

Related

Multiline header for CSV in Spark

I need Spark to write a compressed CSV file to HDFS, but I need it to start with a few lines of version information.
Example of the file content:
version=2
date=2020-01-31
id,name,age
1,Alice,21
2,Bob,23
Three ideas for ways to do this:

1. First write out to hdfs://data/tmp/file1.csv.gz, then use hadoop fs -cat to stream it to hdfs://data/real/file1.csv.gz
2. Convert the output dataframe to text format / RDD[String] and union the real file with the extra header lines
3. Change the first column name to be multi-line
So for approach 3:
column1 = "version=2\ndate=2020-01-31\n\nid"
Let me know if you know a more elegant way to do this.
I tried all the approaches. Here is simplified code:
Approach 1
Approach 1 uses Hadoop commands from a bash script.
This works, but requires a double HDFS write and a cleanup, and it doesn't fit well in a Scala Spark project.
(echo -e "version=2\ndate=2020-01-31\n\nid,name,age" | gzip -vc ; hadoop fs -cat "$INPUT_DIR/*" ) | hadoop fs -put - "$OUTPUT_PATH"
What is happening here is that it will:

1. echo the multi-line header to stdout
2. pipe that into gzip and on to stdout
3. cat the other HDFS dir to stdout
4. pipe everything into hadoop fs -put, which combines it all
Approach 2
The code is a little more complicated. There are no bad quote characters in the header, but the header lines sometimes come after the CSV part, most likely because repartition(1) involves a shuffle and does not guarantee the order of the unioned partitions (coalesce(1) would preserve partition order).
import org.apache.hadoop.io.compress.GzipCodec
val heading = """version=2
date=2020-01-31
id,name,age""".split("\n", -1).toSeq
val headingRdd: RDD[String] = sc.parallelize(heading)
val mediamathRdd: RDD[String] = df.rdd.map(row => row.mkString(","))
val combinedResult: RDD[String] = (headingRdd union mediamathRdd)
combinedResult.repartition(1).saveAsTextFile(path, classOf[GzipCodec])
Approach 3
The simplest approach, but the output is slightly off
df.repartition(1)
  .withColumnRenamed("id", "version=2\ndate=2020-01-31\n\nid")
  .write
  .option("header", true)
  .option("delimiter", ",")
  .option("quoteMode", "NONE")
  .option("quote", " ")
  .option("codec", "gzip")
  .csv(path)
Result will look like this, which might or might not be acceptable
version=2
date=2020-01-31
id ,name,age
1,Alice,21
2,Bob,23
I also tried with:
.option("quote", "\u0000")
It actually prints the ASCII character zero, and while it did not show up in my HDFS viewer, it was not part of the spec.
Best Approach
None of them are perfect for what seems like a very simple task. Maybe there is a small fix to make approach 2 work perfectly.

Is there a simple way to load parquet files directly into Cassandra?

I have got a parquet file / folder (about 1GB) that I would like to load into my local Cassandra DB. Unfortunately I could not find any way (except via SPARK (in Scala)) to directly load this file into CDB. If I blow out the parquet file into CSV it'll just get way too huge for my laptop.
I am setting up a Cassandra DB for a big data analytics case (I've got about 25TB in raw data that we need to make searchable fast). Right now I am running some local tests on how to optimally design the keyspaces, indices and tables before moving to Cassandra as a Service on a hyperscaler. Converting the data to CSV is not an option as it blows up too much.
COPY firmographics.company (col1,col2,col3.....) FROM 'C:\Users\Public\Downloads\companies.csv' WITH DELIMITER='\t' AND HEADER=TRUE;
Turns out, like Alex Ott said, it's easy enough to just write this up in Spark. Below is my code:
import time

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Spark Exploration App")\
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.3.2')\
    .getOrCreate()

df = spark.read.parquet("/PATH/TO/FILE/")

start = time.time()
df.drop('filename').write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="few_com", keyspace="bmbr")\
    .save()
end = time.time()
print(end - start)

Unable To Read local JSON File using spark submit

I am reading a JSON file in Scala Spark using
val df = spark.read.json(properties.getProperty("jsonFilePath"))
This code works when I run it from my IDE (IntelliJ).
But when I try to execute it using the spark-submit command, it gives the below message:
INFO FileScanRDD: Reading File path: file:///Users/path/to/json/file.json, range: 0-8822, partition values: [empty row]
I am not able to process the JSON data due to this. Any idea what could be happening here?
Here is my spark submit command:
${SPARK_HOME_DIR}/bin/spark-submit --master local ${SCALA_JAR_LOC}/<JARNAME>.jar
I tried providing it as part of spark-submit using --files option as well. Need help
spark.read.json supports reading data from a filesystem supported by Hadoop. If the JSON file is part of the jar that contains your code, you can use the getClass.getResourceAsStream Java API to read it from the job's classpath.
To read the JSON file from your classpath
def read(file: String): String = {
  val stream = getClass.getResourceAsStream(s"/$file")
  scala.io.Source.fromInputStream(stream)
    .getLines
    .toList
    .mkString(" ")
    .trim
    .replaceAll("\\s+", " ")
}
Since you want to read that JSON as a dataframe in your code, you might have to convert the String to an RDD and then to a single record dataframe.
val inputAsRDD = sparkCtxt.parallelize(List(read("/path/within/resources/folder")))
val df = sparkSession.read.json(inputAsRDD)

Spark: load file being written

In a Spark batch I load a CSV file in the usual way:
val offerDf = spark.read
.option("header", "true")
.option("delimiter", ";")
.csv("myfile.csv")
In another Linux batch (that is not under my control), the file may be written at the same time as I read it, as both are periodic tasks.
So is there a way to be sure that the CSV file won't be modified while I read it (other than scheduling the tasks, since their durations are not known)?
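One common fix, assuming the writing batch can be changed, is to have it write to a temporary name and then rename into place: a rename within one filesystem is atomic on POSIX, so the reader sees either the old file or the complete new one, never a half-written file. A minimal sketch (the path is illustrative):

```python
import os
import tempfile

def write_atomically(path, data):
    """Write data so that readers of path never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        os.remove(tmp_path)
        raise

write_atomically("myfile.csv", "id;name\n1;Alice\n")
```

The temp file must live in the same directory (and thus the same filesystem) as the target, otherwise os.replace falls back to a non-atomic copy-and-delete on some platforms.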

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution could be to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark suggest, this works for small dataframes that fit into the driver. It works for real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the Python standard library, this is an easy function that will write to a single file. You don't have to mess with temp files or go through another directory.
import csv

def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a single CSV file."""
    with open(file_path, "w") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
If the result size is comparable to the Spark driver node's free memory, you may have problems converting the dataframe to pandas.
I would tell spark to save to some temporary location, and then copy the individual csv files into desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with databricks, spark operates with files like dbfs:/mnt/..., and to use python's file operations on them, you need to change the path into /dbfs/mnt/... or (more native to databricks) replace shutil.copyfile with dbutils.fs.cp.
A more Databricks-native solution is:
TEMPORARY_TARGET = "dbfs:/my_folder/filename"
DESIRED_TARGET = "dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_name = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith("part-"))
dbutils.fs.cp(TEMPORARY_TARGET + "/" + part_name, DESIRED_TARGET)
Note that if you are working with a Koalas dataframe, you can replace spark_df with koalas_df.to_spark().
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no Spark dataframe API that writes/creates a single file instead of a directory as the result of a write operation.
Both options below will create one single data file inside a directory, along with standard files (_SUCCESS, _committed, _started):
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1), and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
You then need a function in the driver that combines all the part files into a single file (cat part-00000* > singlefilename) once the write operation is done.
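For a local filesystem, such a combine step can be sketched with plain Python file operations; the directory name and output path below are illustrative, and the standard part-* naming of Spark output files is assumed:

```python
import glob
import os
import shutil

def merge_parts(directory, output_file):
    """Concatenate all part-* files in directory into output_file,
    in lexicographic (i.e. partition) order."""
    parts = sorted(glob.glob(os.path.join(directory, "part-*")))
    with open(output_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                shutil.copyfileobj(f, out)

# Usage after a Spark write, e.g.:
# merge_parts("PATH/FOLDER_NAME/x.csv", "PATH/x.csv")
```

On HDFS, the same effect can be had with hadoop fs -getmerge <dir> <file>.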
I had the same problem and used Python's NamedTemporaryFile to solve this.
import boto3
from tempfile import NamedTemporaryFile

s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().
Create a temp folder inside the output folder, copy the part-00000* file into the output folder under the desired name, then delete the temp folder. Python code snippet to do the same in Databricks:
fpath = output + '/temp'

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

if file_exists(fpath):
    dbutils.fs.rm(fpath, True)  # recursive delete of the old temp folder

df.coalesce(1).write.option("header", "true").csv(fpath)
fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
dbutils.fs.cp(fpath + "/" + fname[0], output + "/name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides a file pointer for the HDFS file system. You can write your content to the file pointer as with usual file writing. Code example:
import pyarrow.fs as fs

HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'

hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initialize pyarrow you should define the CLASSPATH environment variable properly: set it to the output of hadoop classpath --glob.
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this, and if you don't want to specify the CSV name every time, you can write a helper function or build an array of CSV file names and pass them to it.