Moving a json file from databricks to blob storage - json

I have created a mount in databricks which connects to my blob storage and I am able to read files from blob to databricks using a notebook.
I then converted a .txt file to JSON format using pyspark, and now I would like to load it back to blob storage. Does anyone know how I would do that?
Here are a few things I have tried:
my_json.write.option("header", "true").json("mnt/my_mount/file_name.json")
write.json(my_json, mnt/my_mount)
Neither works. I can load a csv file from databricks to blob using:
my_data_frame.write.option("header", "true").csv("mnt/my_mount_name/file name.csv")
This works fine but I can't find a solution for moving a json.
Any ideas?

Disclaimer: I am new to PySpark, but this is what I did after referencing the docs for pyspark.sql.DataFrameWriter.json:
# JSON
my_dataframe.write.json("/mnt/my_mount/my_json_file_name.json")
# For a single JSON file
my_dataframe.repartition(1).write.json("/mnt/my_mount/my_json_file_name.json")
# Parquet
my_dataframe.write.mode("overwrite").partitionBy("myCol").parquet("/mnt/my_mount/my_parquet_file_name.parquet")

Related

How would i save a doc/docx/docm file into directory or S3 bucket using Pyspark

I am trying to save a data frame as a document, but it returns the error below:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html
My code is below:
#f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)
Note that I could save files in CSV, text, and JSON format, but is there any way to save a docx file using pyspark?
My question: is there support for saving data in doc/docx format?
If not, is there any way to store the file, such as writing a file stream object to a particular folder/S3 bucket?
In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. as a pandas dataframe) and work from there.
Long answer:
A document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale and does not support the DOCX format out of the box.
If you want to write DOCX files programmatically, you can:
Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
Import python package to create the DOCX document and save it into a stream. See question: Writing a Python Pandas DataFrame to Word document
Upload the stream to an S3 bucket using, for example, boto: Can you upload to S3 using a stream rather than a local file?
Note: if your data has more than one hundred rows, ask the receivers how they are going to use the data. Use docx for reporting, not as a file transfer format.
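The three steps above can be sketched as follows. Since python-docx and boto3 may not be installed, this runnable sketch substitutes a plain CSV writer for the DOCX step to show the shape of the pattern (collect rows, build the document in an in-memory stream, upload the bytes); the rows, bucket, and key names are made up for illustration:

```python
import csv
import io

# Rows as they would come back from f_data.toPandas() or f_data.collect();
# these column names and values are placeholders.
rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]

# Build the document in an in-memory stream instead of a local file.
# With python-docx you would build a Document() and save it into io.BytesIO()
# in the same way.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "value"])
writer.writeheader()
writer.writerows(rows)
payload = buf.getvalue().encode("utf-8")

# Upload the bytes with boto3 (hypothetical bucket/key; needs credentials):
# boto3.client("s3").put_object(Bucket="my-bucket", Key="report.docx", Body=payload)
```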

React Native: Read CSV File

I want to upload records of students of a university using a CSV file. I have uploaded the CSV file using react-native-document-picker. The problem is that I am unable to read the CSV data. My main goal is to upload the CSV data to Firebase. How do I read CSV data in React Native, or convert CSV to JSON?
You need to convert the CSV to JSON before pushing the data to Firebase. There are numerous utility libraries for that; you can try https://www.npmjs.com/package/csvtojson
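For reference, the transformation such a library performs is simple: each CSV row becomes a JSON object keyed by the header line. A sketch in Python (the other examples in this page use Python; in React Native the csvtojson package does the equivalent):

```python
import csv
import io
import json

# A tiny stand-in for the picked CSV file's contents.
csv_text = "id,name\n1,Alice\n2,Bob\n"

# Each data row becomes an object keyed by the header -- the shape
# csvtojson produces and the shape you would push to Firebase.
records = list(csv.DictReader(io.StringIO(csv_text)))
as_json = json.dumps(records)
```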

Dump dictionary as json file to Google Cloud storage from Jupyter Notebook on Data Proc

I am using Spark on a Google Dataproc cluster. I have created a dictionary in a Jupyter notebook which I want to dump to my GCS bucket. However, the usual way of dumping to JSON using fopen() does not work with GCS. So, how can I write my dictionary as a .json file to GCS? Or is there any other way to get the dictionary out?
It's funny: I could write a Spark dataframe to GCS without any hassle, but apparently I can't dump JSON to GCS unless I have it on my local system!
Please help!
Thank you.
The file in GCS is not in your local file system, which is why you cannot call "fopen" on it. You can either save to GCS directly using a GCS client (for example, this tutorial), or treat the GCS location as an HDFS destination (for example, saveAsTextFile("gs://...")).
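A minimal sketch of both options: serialize the dictionary with json.dumps, then either hand the string to the GCS client or let Spark write it. The dictionary, bucket, and object names are placeholders, and the upload calls are left as comments because they need credentials and a cluster:

```python
import json

my_dict = {"model": "example", "accuracy": 0.9}  # stand-in for your dictionary
payload = json.dumps(my_dict)

# Option 1: GCS client library (pip install google-cloud-storage):
# from google.cloud import storage
# storage.Client().bucket("my-bucket").blob("out/my_dict.json").upload_from_string(
#     payload, content_type="application/json")

# Option 2: treat gs:// as an HDFS destination and let Spark write it
# (produces a directory of part files, not a single .json file):
# sc.parallelize([payload]).saveAsTextFile("gs://my-bucket/out/my_dict_dir")
```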

Read Array Of Jsons From File to Spark Dataframe

I have a gzipped JSON file that contains Array of JSON, something like this:
[{"Product":{"id":1,"image":"/img.jpg"},"Color":"black"},{"Product":{"id":2,"image":"/img1.jpg"},"Color":"green"}, ...]
I know this is not the ideal data format to read into scala, however there is no other alternative but to process the feed in this manner.
I have tried :
spark.read.json("file-path")
which seems to take a long time (it processes very quickly for data in the MBs, but takes far longer for GBs of data), probably because Spark is not able to split the file and distribute it across the other executors.
I wanted to see if there is any way to preprocess this data and load it into the Spark context as a dataframe.
The functionality I want seems similar to: Create pandas dataframe from json objects. But I wanted to see if there is any Scala alternative which could do something similar and convert the data to a Spark RDD/dataframe.
You can read the gzip file using spark.read.text("gzip-file-path"). Since Spark's APIs are built on top of the HDFS API, Spark can read the gzip file and decompress it.
https://github.com/mesos/spark/blob/baa30fcd99aec83b1b704d7918be6bb78b45fbb5/core/src/main/scala/spark/SparkContext.scala#L239
However, gzip is non-splittable, so Spark creates an RDD with a single partition. Hence, reading large gzip files with Spark does not make sense.
You may decompress the gzip file and read the decompressed files to get most out of the distributed processing architecture.
It appeared to be a problem with the data format being given to Spark for processing. I had to pre-process the data into a Spark-friendly format and run the Spark processes over that. This is the preprocessing I ended up doing: https://github.com/dipayan90/bigjsonprocessor/blob/master/src/main/java/com/kajjoy/bigjsonprocessor/Application.java
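One way to do that preprocessing is to decompress the file once and rewrite the array as newline-delimited JSON, which spark.read.json can then split across executors. A runnable sketch in Python (the linked repo does the equivalent in Java); the file names are placeholders, and for feeds too large to fit in memory a streaming JSON parser would be needed instead of json.load:

```python
import gzip
import json
import os
import tempfile

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "feed.json.gz")   # stand-in for the gzipped feed
dst = os.path.join(tmp, "feed.ndjson")

# Create a small stand-in for the gzipped array-of-JSON feed.
data = [{"Product": {"id": 1, "image": "/img.jpg"}, "Color": "black"},
        {"Product": {"id": 2, "image": "/img1.jpg"}, "Color": "green"}]
with gzip.open(src, "wt") as f:
    json.dump(data, f)

# Preprocess: load the array once, emit one JSON object per line.
with gzip.open(src, "rt") as f:
    records = json.load(f)
with open(dst, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# spark.read.json(dst) can now split the file across partitions by line.
```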

Apache nifi issue with saving data from json to orc

I am using the NiFi flow ConvertJSONToAvro -> ConvertAvroToORC -> PutHDFS, but I am facing the following issues:
1) A single ORC file is being saved on HDFS. I am not using any compression.
2) When I try to access these files, I get buffer-memory errors.
Thanks for help in advance.
You should be merging together many Avro records before ConvertAvroToORC.
You could do this by using MergeContent with the mode set to Avro right before ConvertAvroToORC.
You could also do this by merging your JSON together using MergeContent, and then sending the merged JSON to ConvertJsonToAvro.
Using PutHDFS to append to ORC files that are already in HDFS will not work. The HDFS processor does not know anything about the format of the data; it just writes additional raw bytes onto the file and will likely create an invalid ORC file.