How to write multiline JSON using PySpark

We can easily read multiline JSON using the command below:
df = spark.read.option("multiline", "True").json("any multiline.json")
but there is no equally easy way to write multiline JSON with the write command.
Example of multiline JSON:
[{
"RecordNumber": 2,
"Zipcode": 704,
"ZipCodeType": "STANDARD",
"City": "PASEO COSTA DEL SUR",
"State": "PR"
},
{
"RecordNumber": 10,
"Zipcode": 709,
"ZipCodeType": "STANDARD",
"City": "BDA SAN LUIS",
"State": "PR"
}]
I tried the solution below, but it aggregates all values and saves the result in text format:
How to save a dataframe into a json file with multiline option in pyspark
Could you please suggest another solution that avoids the aggregation and saves directly with a .json extension?

Since Spark does not have an option to prettify its JSON output, you could convert the result to JSON strings using toJSON and then use the Python json library to save a properly indented JSON file.
For example:
import json

# collect() brings every row to the driver, so the result must fit in memory
jsonResults = resultDF.toJSON().collect()
jsonResultList = [json.loads(x) for x in jsonResults]
with open('result.json', 'w', encoding='utf-8') as f:
    json.dump(jsonResultList, f, ensure_ascii=False, indent=4)
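The answer above calls collect(), which holds every row in driver memory at once. A minimal sketch of a streaming variant follows, assuming the rows arrive as an iterable of JSON strings; the helper name write_pretty_json_array is hypothetical, not part of Spark. It writes each row as soon as it is read, so only one row is held in memory at a time.

```python
import json
from typing import Iterable


def write_pretty_json_array(json_rows: Iterable[str], path: str, indent: int = 4) -> int:
    """Write JSON strings (one object per string) as a single indented JSON array.

    Hypothetical helper: rows are written one at a time instead of being
    gathered into a list first. Returns the number of rows written.
    """
    pad = " " * indent
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        f.write("[")
        for row in json_rows:
            # Re-serialize the row with indentation, then shift it one level in
            pretty = json.dumps(json.loads(row), ensure_ascii=False, indent=indent)
            body = "\n".join(pad + line for line in pretty.splitlines())
            f.write((",\n" if count else "\n") + body)
            count += 1
        f.write("\n]" if count else "]")
    return count
```

With Spark, the iterable could be resultDF.toJSON().toLocalIterator(), which fetches partitions one by one. The file is still written on the driver as a single file; only the memory footprint changes.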

Related

Import JSON content from CSV file using spark

Currently, I'm working with the following architecture.
I have a DocumentDB database whose data is exported to S3 using DMS (CDC task). Once this data lands on S3, I need to load it into Databricks.
I'm already able to read the CSV content (which contains a lot of JSON documents), but I don't know how to parse it and insert it into a Databricks table.
Below is the JSON payload that is exported to S3.
{
"_id": {
"$oid": "12332334"
},
"processed": false,
"col1": "0000000122",
"startedAt": {
"$date": 1635667097000
},
"endedAt": {
"$date": 1635667710000
},
"col2": "JFHFGALJF-DADAD",
"col3": 2.75,
"created_at": {
"$date": 1635726018693
},
"updated_at": {
"$date": 1635726018693
},
"__v": 0
}
To extract the data into a DataFrame, I'm using the following Spark command:
df = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "false") \
    .option("lineSep", "\n") \
    .option("encoding", "ISO-8859-1") \
    .option("quote", '"') \
    .option("escape", "\"") \
    .csv("dbfs:/mnt/s3-data-2/folder_name/LOAD00000001.csv")
Thank you, Alex Ott. As per your suggestion and this document, you can use from_json to parse the JSON strings that were read from the CSV file.
To read a JSON string from a CSV file, first read the CSV file into a Spark DataFrame using spark.read.csv("path"), then parse the JSON string column and convert it to columns using the from_json() function. This function takes the JSON column as its first argument and the JSON schema as its second.
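The two steps (read the delimited file, then parse the JSON column) can be sketched with Python's standard library instead of Spark, so that the example runs without a cluster. The column names Op and payload, the sample row, and the flatten helper are all hypothetical; in Spark the second step would be from_json() with a schema that declares the same fields.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical sample: pipe-delimited, JSON in a quoted column, inner quotes doubled
sample = (
    'Op|payload\n'
    'I|"{""_id"": {""$oid"": ""12332334""}, ""processed"": false, '
    '""col3"": 2.75, ""startedAt"": {""$date"": 1635667097000}}"\n'
)


def flatten(doc: dict) -> dict:
    """Unwrap the {"$oid": ...} and {"$date": ...} wrappers seen in the payload."""
    out = {}
    for key, value in doc.items():
        if isinstance(value, dict) and "$oid" in value:
            out[key] = value["$oid"]
        elif isinstance(value, dict) and "$date" in value:
            # $date is treated as milliseconds since the Unix epoch
            out[key] = datetime.fromtimestamp(value["$date"] / 1000, tz=timezone.utc)
        else:
            out[key] = value
    return out


# Step 1: read the delimited text; step 2: parse the JSON string column
reader = csv.DictReader(io.StringIO(sample), delimiter="|", quotechar='"')
rows = [flatten(json.loads(record["payload"])) for record in reader]
print(rows[0]["_id"], rows[0]["startedAt"].isoformat())
```

Flattening the wrappers up front means the target table gets a plain string id and real timestamps rather than nested single-field structs.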

Json data as dict for cerberus validations

I am trying to read a JSON file as a string and use the data as a validation schema for Cerberus. I am using a custom function with check_with. The schema works fine if I define it within my Python test script.
abc.json
{
"rows": {
"type": "list",
"schema": {
"type": "dict",
"schema": {
"amt": {"type": "integer"},
"amt": {"check_with": util_cls.amt_gt_than}
}
}
}
}
Python test code
with open("abc.json") as f:
    s = f.read()
# s = ast.literal_eval(s)
v = Validator()
r = v.validate(json_data, s)
Cerberus requires the variable s to be a dict, and I can't convert the contents of abc.json with json.load because the file is not valid JSON. Any ideas on how to convert the string to a dict?
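One possible way around this, sketched under assumptions: keep abc.json valid by storing the name of the check function as a string, load it with the json module, and swap the name for the real callable before handing the schema to Cerberus. The CHECKS registry, the resolve_checks helper, and the body of amt_gt_than are hypothetical. The two "amt" entries are merged into one because a parsed JSON object keeps only the last duplicate key.

```python
import json

# abc.json rewritten as valid JSON: the function is referenced by name
schema_text = """
{
  "rows": {
    "type": "list",
    "schema": {
      "type": "dict",
      "schema": {
        "amt": {"type": "integer", "check_with": "amt_gt_than"}
      }
    }
  }
}
"""


def amt_gt_than(field, value, error):
    # Hypothetical rule body; Cerberus calls check_with functions
    # with (field, value, error)
    if value <= 0:
        error(field, "must be greater than 0")


# Hypothetical registry mapping names used in the JSON file to callables
CHECKS = {"amt_gt_than": amt_gt_than}


def resolve_checks(node):
    """Recursively replace check_with names with the callables they refer to."""
    if isinstance(node, dict):
        return {
            key: CHECKS[value] if key == "check_with" and isinstance(value, str)
            else resolve_checks(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [resolve_checks(item) for item in node]
    return node


schema = resolve_checks(json.loads(schema_text))
```

The resulting schema is a plain dict, so it could then be passed as v.validate(json_data, schema). A name that is missing from the registry raises KeyError at load time rather than failing later during validation.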

Read JSON files from multiple line file in spark scala

I'm learning Spark in Scala. I have a JSON file as follows:
[
{
"name": "ali",
"age": "13",
"phone": "09123455737",
"sex": "m"
},{
"name": "amir",
"age": "24",
"phone": "09123475737",
"sex": "m"
}
]
and there is just this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonFile = sqlContext.read.json("path-to-json-file")
I only get _corrupt_record: String and nothing else.
But when I put every person (object) on a single line, the code works fine.
How can I read a JSON file whose records span multiple lines with sqlContext in Spark?
You will have to read it into an RDD yourself and then convert it to a Dataset:
spark.read.json(sparkContext.wholeTextFiles(...).values)
This problem occurs because your JSON records span multiple lines. By default, spark.read.json expects each record to be on a single line, but this is configurable:
You can set the multiLine option before calling json: spark.read.option("multiLine", true).json("path-to-json-file")
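To see why the default mode fails on this file, here is a small sketch in plain Python (not Spark) that treats every line as a separate JSON document, roughly what a line-delimited reader does, and compares that with parsing the file as one document. The function parse_line_by_line is a hypothetical stand-in, not Spark's actual implementation.

```python
import json

multiline_doc = """[
{
"name": "ali",
"age": "13"
},{
"name": "amir",
"age": "24"
}
]"""


def parse_line_by_line(text):
    """Treat each line as its own JSON document, as a line-delimited reader would."""
    good, bad = [], []
    for line in text.splitlines():
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line)  # comparable to a row landing in the corrupt-record column
    return good, bad


good, bad = parse_line_by_line(multiline_doc)

# Multi-line style: the whole file is parsed as a single JSON document
whole = json.loads(multiline_doc)
print(len(good), len(bad), len(whole))
```

No single line of the file is a complete JSON value, so the line-by-line pass recovers nothing, while the whole-document parse returns both records.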

Format JSON file from 2 columns

I would like to create a JSON file for a Python script to parse.
My data is currently in a text file in the format of:
url1,string1
url2,string2
url3,string3
url4,string4
I would like to manually create a JSON file that I could input against a Python script to scrape for a string.
Thank you, I used your example to build something like it and it worked!
{"url": "url1", "string": "string1"} {"url": "url2", "string": "string2"} {"url": "url3", "string": "string3"}
Thanks
Something like the following should work:
import csv
import json

field_names = ("url", "string")
# The with statement closes both files even if an error occurs
with open('file.csv', 'r', newline='') as csv_file, open('file.json', 'w') as json_file:
    reader = csv.DictReader(csv_file, field_names)
    for row in reader:
        # Write one JSON object per line
        json.dump(row, json_file)
        json_file.write('\n')
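The file produced this way holds one JSON object per line rather than a single JSON array, so json.load on the whole file would fail. A minimal sketch of how the parsing script could read it back follows; the helper name read_json_lines is hypothetical.

```python
import json


def read_json_lines(path):
    """Yield one dict per non-empty line of a one-object-per-line JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, a loop such as for record in read_json_lines("file.json") would give access to record["url"] and record["string"] for each row in turn.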
I may have misunderstood your question. If the goal is to convert this CSV into JSON manually, it would be:
[
[
"url1",
"string1"
],
[
"url2",
"string2"
],
[
"url3",
"string3"
],
[
"url4",
"string4"
]
]
If you prefer, you can use an online CSV-to-JSON converter.

Jquery parse xhr.responseText

var data = xhr.responseText;
When I run console.log(xhr.responseText), this is my output:
["{id:1,name\":\"JOHN\",\"city\":\"null\"}"
,"{\"id\":2,\"name\":\"MICHEAL\,\"city\":\"null\"}"]
How do I get id and name? I tried data.id, but I get this error:
jquery JSON.parse: unexpected end of data.
Update
I am using CodeIgniter with DataMapper, so my DataMapper is producing that JSON response. Do you know how I can resolve it?
You've already been told what the problem is in the comments: the JSON generated by the server is invalid. You are probably not using a library to encode your JSON; never encode it by hand.
Your JSON should probably look like the following (when pretty-printed): http://jsfiddle.net/7FKWr/
[
{"id": 1, "name": "JOHN", "city": null},
{"id": 2, "name": "MICHEAL", "city": null}
]