Nested JSON to dataframe in Scala

I am using Spark/Scala to make an API request and parse the response into a dataframe. The following is the sample JSON response I am using for testing purposes:
API Request/Response
However, I tried to use the following answer from StackOverflow to convert the JSON, but the nested fields are not being processed. Is there any way to convert the JSON string to a dataframe with columns?

I think the problem is that the JSON you have attached, when read as a DataFrame, produces a single row (and it is very large), and hence Spark might be truncating the result.
If this is what you want, you can try setting the Spark property spark.debug.maxToStringFields to a higher value (the default is 25):
spark.conf.set("spark.debug.maxToStringFields", 100)
However, if you want to process the Results from the JSON, it would be better to get it as a data frame and then do the processing. Here is how you can do it:
import com.google.gson.JsonParser

val results = JsonParser.parseString(<json content>).getAsJsonObject().get("Results").getAsJsonArray.toString

import spark.implicits._
val df = spark.read.json(Seq(results).toDS)
df.show(false)
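If the elements of Results are themselves nested (structs or arrays inside each element), a hedged follow-up sketch for pulling them out into flat columns could look like the following. The field names ("items", "details.status", "item.name") are placeholders, since the actual schema of Results is not shown in the question.

import org.apache.spark.sql.functions.{col, explode}

// Sketch only: "items" is assumed to be an array column and "details" a struct column.
val flattened = df
  .withColumn("item", explode(col("items")))          // one row per element of the nested array
  .select(col("details.status"), col("item.name"))    // promote nested struct fields to top-level columns

flattened.show(false)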

Related

Converting a JSON Dictionary (currently a String) to a Pandas Dataframe

I am using Python's request library to retrieve data from a web API. When viewing my data using requests.text, it returns a string of a large JSON object, e.g.,
'{"Pennsylvania": {"Sales": [{"date":"2021-12-01", "id": "Metric67", ... '
Naturally, the type of this object is currently a string. What is the best way to convert this string/JSON to a Pandas Dataframe?
r.text returns the JSON as text.
You can use r.json() to get the JSON as a dictionary from requests:
import requests
r = requests.get(YOUR_URL)
res = r.json()

How structured streaming dynamically parses kafka's json data

I am trying to read data from Kafka using structured streaming. The data received from Kafka is in JSON format.
My code is as follows:
In the code I use the from_json function to convert the JSON to a dataframe for further processing.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val schema: StructType = new StructType()
  .add("time", LongType)
  .add("id", LongType)
  .add("properties", new StructType()
    .add("$app_version", StringType)
    // ... more nested fields ...
  )

val df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()
  .selectExpr("CAST(value AS STRING) as value")
  .select(from_json(col("value"), schema))
My problem is that when new fields are added, I can't keep stopping the Spark program to add these fields to the schema manually. How can I parse these fields dynamically? I tried schema_of_json(), but it only infers the field types from the first line, so it is not suitable for multi-level nested JSON data.
My problem is that when new fields are added, I can't keep stopping the Spark program to add these fields to the schema manually. How can I parse these fields dynamically?
It is not possible in Spark Structured Streaming (or even Spark SQL) out of the box. There are a couple of solutions though.
Changing Schema in Code and Resuming Streaming Query
You simply have to stop your streaming query, change the code to match the current schema, and resume it. It is possible in Spark Structured Streaming with data sources that support resuming from a checkpoint, and the Kafka data source does support it.
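For illustration, the piece that makes resuming possible is the checkpoint location on the sink. A minimal sketch, where the sink format and the paths are placeholders rather than anything from the question:

// Hedged sketch: resuming after a code/schema change relies on the query's checkpoint.
// The output format, output path and checkpoint path below are placeholders.
val query = df.writeStream
  .format("parquet")
  .option("path", "/tmp/out/json-events")
  .option("checkpointLocation", "/tmp/checkpoints/json-events") // reuse the same path across restarts
  .start()

// After stopping the query and redeploying with the updated schema, starting it again
// with the same checkpointLocation resumes from the Kafka offsets stored in the checkpoint.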
User-Defined Function (UDF)
You could write a user-defined function (UDF) that would do this dynamic JSON parsing for you. That's also among the easiest options.
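As one hedged sketch of that idea, a UDF can turn each Kafka value into a map of its top-level fields, so new fields simply show up as new map keys without a schema change. json4s is bundled with Spark, but the exact flattening and the name raw (the DataFrame holding the CAST(value AS STRING) column from the question) are assumptions:

import org.apache.spark.sql.functions.{col, udf}
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

// Sketch only: flatten the top-level fields of each JSON message into a Map[String, String].
// Nested objects are kept as JSON strings inside the map values.
val jsonToMap = udf { (value: String) =>
  parse(value) match {
    case JObject(fields) => fields.map { case (name, v) => name -> compact(render(v)) }.toMap
    case _               => Map.empty[String, String]
  }
}

// `raw` stands for the stream right after .selectExpr("CAST(value AS STRING) as value").
val parsed = raw.select(jsonToMap(col("value")).as("fields"))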
New Data Source (MicroBatchReader)
Another option is to create an extension to the built-in Kafka data source that would do the dynamic JSON parsing (similarly to Kafka deserializers). That requires a bit more development, but is certainly doable.

Parsing Complex JSON in Scala

I have the below JSON field in my data.
val myjsonString = """[{"A":[{"myName":"Sheldon""age":30"Qualification":"Btech"}]"B":"UnitedStates"},{"A":[{"myName":"Raj""age":35"Qualification":"BSC"}]"B":"UnitedKIngDom"},{"A":[{"myName":"Howard""age":40"Qualification":"MTECH"}]"B":"Australia"}] """
The parse method gives the following structure:
scala> val json = parse(myjsonString)
json: org.json4s.JValue = JArray(List(JObject(List((A,JArray(List(JObject(List((myName,JString(Sheldon)), (age,JInt(30)), (Qualification,JString(Btech))))))), (B,JString(UnitedStates)))), JObject(List((A,JArray(List(JObject(List((myName,JString(Raj)), (age,JInt(35)), (Qualification,JString(BSC))))))), (B,JString(UnitedKIngDom)))), JObject(List((A,JArray(List(JObject(List((myName,JString(Howard)), (age,JInt(40)), (Qualification,JString(MTECH))))))), (B,JString(Australia))))))
I am trying to parse it using Scala json4s. I have visited almost all the previously asked questions related to this, but could not get a proper solution. The output should be something like this:
UnitedStates 30
UnitedKIngDom 35
Australia 40
or only the age in 30#35#45 format.
The JSON you posted is invalid; there are missing commas between your object fields.
In order to get the output that you want you will need to extract the data from the parsed AST that Json4s will create upon successful parsing of the data. Json4s provides a number of ways with which you can manipulate and extract data from a parsed AST.
You could map over the list of objects inside the JArray and extract the country and age from each object. I don't wish to provide code to do this, as you haven't provided an example of what you have tried to do other than simply parsing the JSON string.
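That said, a minimal illustrative sketch of the mapping described above, assuming the missing commas have been added (fixedJsonString is a placeholder for the corrected string):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse // or org.json4s.native.JsonMethods.parse

// Sketch only: walk the parsed AST and collect (country, age) pairs.
val json = parse(fixedJsonString)

val pairs = for {
  JObject(entry)                <- json    // each top-level object in the array
  JField("B", JString(country)) <- entry
  JField("A", JArray(people))   <- entry
  JObject(person)               <- people
  JField("age", JInt(age))      <- person
} yield (country, age)

pairs.foreach { case (country, age) => println(s"$country $age") }
println(pairs.map(_._2).mkString("#"))     // 30#35#40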

how to convert nested json file into csv in scala

I want to convert my nested JSON into CSV. I used
df.write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
but it works for flat JSON, not for nested JSON. Is there any way I can convert my nested JSON to CSV? Help will be appreciated, thanks!
When you ask Spark to convert a JSON structure to a CSV, Spark can only map the first level of the JSON.
This happens because of the simplicity of the CSV format: it just assigns a value to a name. That is why {"name1":"value1", "name2":"value2", ...} can be represented as a CSV with this structure:
name1,name2, ...
value1,value2,...
In your case, you are converting a JSON with several levels, so the Spark exception is saying that it cannot figure out how to convert such a complex structure into a CSV.
If you add only a second level to your JSON, it will work, but be careful: it will drop the names of the second level and include only the values in an array.
You can have a look at this link for an example with JSON datasets.
As I have no information about the nature of the data, I can't say much more about it. But if you need to write the information as a CSV, you will need to simplify the structure of your data.
Read the JSON file in Spark and create a dataframe:
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
Save the dataframe using spark-csv
people.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv")
Sources: read json, save to csv
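For the nested case the question asks about, a hedged sketch of flattening one level before writing the CSV. The input path and the column names ("address", "orders", "name") are placeholders, not details from the question:

import org.apache.spark.sql.functions.{col, explode}

// Sketch only: flatten one nested level so every selected column is a plain scalar.
val nested = sqlContext.read.json("examples/src/main/resources/nested.json")

val flat = nested
  .withColumn("order", explode(col("orders")))   // one row per element of the nested array
  .select(
    col("name"),
    col("address.city").as("city"),              // pull struct fields up to the top level
    col("order.id").as("order_id")
  )

flat.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("flattened.csv")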

Spark exception handling for json

I am trying to catch/ignore a parsing error when I'm reading a json file
val DF = sqlContext.jsonFile("file")
There are a couple of lines that aren't valid json objects, but the data is too large to go through individually (~1TB)
I've come across exception handling for mapping using import scala.util.Try and in.map(a => Try(a.toInt)), referencing:
how to handle the Exception in spark map() function?
How would I catch an exception when reading a json file with the function sqlContext.jsonFile?
Thanks!
Unfortunately you are out of luck here. DataFrameReader.json, which is used under the hood, is pretty much all-or-nothing. If your input contains malformed lines you have to filter them out manually. A basic solution could look like this:
import scala.util.parsing.json._
val df = sqlContext.read.json(
  sc.textFile("file").filter(JSON.parseFull(_).isDefined)
)
Since the above validation is rather expensive, you may prefer to drop jsonFile / read.json completely and use the parsed JSON lines directly.
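As a hedged sketch of that last suggestion, the same JSON.parseFull call can both filter out the bad lines and hand you the parsed values, so the text is only parsed once (what you do with the resulting maps is left open):

import scala.util.parsing.json.JSON

// Sketch only: keep the lines that parse and work with the resulting maps directly,
// instead of re-parsing the text with read.json afterwards.
val parsed = sc.textFile("file")
  .flatMap(line => JSON.parseFull(line))                  // drops lines that fail to parse
  .collect { case m: Map[String, Any] @unchecked => m }   // keep only JSON objects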