Reading JSON Array with Apache Spark - json

I have a json array file, which looks like this:
["{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}",{"meta":{"headers":{"app":"music"},"customerId":"2"}}]
I am trying to read this file in scala through the spark-shell.
val s1 = spark.read.json("path/to/file/file.json")
However, this results in a corrupt_record error:
org.apache.spark.sql.DataFrame = [_corrupt_record: string]
I have also tried reading it like this:
val df = spark.read.json(spark.sparkContext.wholeTextFiles("path.json").values)
val df = spark.read.option("multiline", "true").json("<file>")
But still the same error.
Since the JSON array contains both strings and JSON objects, maybe that is why I am not able to read it.
Can anyone shed some light on this error? How can we read such a file, perhaps via a Spark UDF?

Yes, the reason is the mix of strings and actual JSON objects. To me it looks as though the two entries belong together, so why not change the schema to something like this:
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
A new line also means a new record, so for multiple events your file would look like this:
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"1\",\"events\":[{\"event_type\":\"ON\"}]}"}
{"meta":{"headers": {"app": "music"},"customerId": "2"},"data": "{\"timestamp\":1616549396892,\"id\":\"2\",\"events\":[{\"event_type\":\"ON\"}]}"}

Related

Is there a way to remove a sub-attribute from a json column through spark sql

I have a table with a column containing nested JSON data, and I need to remove an attribute from that JSON through Spark SQL.
I checked the basic Spark JSON functions but could not find a way to do it.
Assuming you read in a JSON file and print the schema you are showing us like this:
val df = sqlContext.read.json("/path/to/file")
df.registerTempTable("df")
df.printSchema()
Then you can select nested objects inside a struct type like so...
val app = df.select("app")
app.registerTempTable("app")
app.printSchema()
app.show()
val appName = app.select("element.appName")
appName.registerTempTable("appName")
appName.printSchema()
appName.show()
Finally, drop the attribute you don't want:
val trimmedDF = appName.drop("firstname")
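An alternative is to rewrite the JSON value itself rather than navigate the struct. Here is a plain-Python sketch of what a per-row UDF could do; the helper name and the sample row are hypothetical.

```python
import json

def drop_nested_key(json_str, path):
    """Remove the attribute at `path` (a list of keys) from a JSON string."""
    obj = json.loads(json_str)
    node = obj
    for key in path[:-1]:
        node = node[key]
    node.pop(path[-1], None)   # ignore the key if it is already absent
    return json.dumps(obj)

row = '{"app": {"appName": "music", "firstname": "Romin"}}'
cleaned = drop_nested_key(row, ["app", "firstname"])
print(cleaned)
```

The same logic, registered as a UDF in Scala or PySpark, would apply the removal to every row of the column.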

Write JSON to a file in pretty print format in Django

I need to create some JSON files for exporting data from a Django system to Google Big Query.
The problem is that Google BQ imposes some characteristics in the JSON file, for example, that each object must be in a different line.
json.dumps writes a stringified version of the JSON, so it is not useful for me.
Django's serializers write better JSON, but they put everything on one line. All the information I found about pretty-printing concerns json.dumps, which I cannot use.
I would like to know if anyone knows a way to create a JSON file in the format required by Big Query.
Example:
JSONSerializer = serializers.get_serializer("json")
json_serializer = JSONSerializer()
data_objects = DataObject.objects.all()
with open("dataobjects.json", "w") as out:
json_serializer.serialize(data_objects, stream=out)
json.dumps is OK; you just have to use the indent parameter, like this:
import json
myjson = '{"latitude":48.858093,"longitude":2.294694}'
mydata = json.loads(myjson)
print(json.dumps(mydata, indent=4, sort_keys=True))
Output:
{
"latitude": 48.858093,
"longitude": 2.294694
}
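Note that indent=4 gives a human-readable file, but the question's actual requirement is the opposite: Big Query loads newline-delimited JSON, one compact object per line. A sketch with hypothetical records:

```python
import json

records = [
    {"latitude": 48.858093, "longitude": 2.294694},
    {"latitude": 40.689247, "longitude": -74.044502},
]

# One compact JSON object per line, as Big Query expects.
ndjson = "\n".join(json.dumps(rec) for rec in records)
print(ndjson)
```

Writing `ndjson` to a file (with a trailing newline) produces an import-ready dataset without going through Django's serializers at all.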

Convert Array[Byte] to JSON format using Spark Scala

I'm reading an .avro file where the data of a particular column is in binary format. I'm currently converting the binary to string format with the help of a UDF for readability, and then I need to convert it into JSON format to parse the data further. Is there a way to convert a string object to JSON format using Spark Scala code?
Any help would be much appreciated.
val avroDF = spark.read.format("com.databricks.spark.avro").
load("file:///C:/46.avro")
import org.apache.spark.sql.functions.udf
// Convert byte object to String format
val toStringDF = udf((x: Array[Byte]) => new String(x))
val newDF = avroDF.withColumn("BODY",
toStringDF(avroDF("body"))).select("BODY")
Output of newDF is shown below:
BODY |
+---------------------------------------------------------------------------------------------------------------+
|{"VIN":"FU74HZ501740XXXXX","MSG_TYPE":"SIGNAL","TT":0,"RPM":[{"E":1566800008672,"V":1073.75},{"E":1566800002538,"V":1003.625},{"E":1566800004084,"V":1121.75}
My desired output should be like below:
I do not know if you want a generic solution but in your particular case, you can code something like this:
spark.read.json(newDF.as[String])
.withColumn("RPM", explode(col("RPM")))
.withColumn("E", col("RPM.E"))
.withColumn("V", col("RPM.V"))
.drop("RPM")
.show()
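The same pipeline can be sketched without Spark, which makes the individual steps explicit: decode the bytes to a string, parse the JSON, and "explode" the RPM array into one row per element. The payload below is shortened from the output shown above.

```python
import json

body_bytes = b'{"VIN":"FU74HZ501740XXXXX","MSG_TYPE":"SIGNAL","TT":0,"RPM":[{"E":1566800008672,"V":1073.75},{"E":1566800002538,"V":1003.625}]}'

body = json.loads(body_bytes.decode("utf-8"))   # bytes -> str -> dict
# "Explode" the RPM array: one flat row per array element.
rows = [{"VIN": body["VIN"], "E": rpm["E"], "V": rpm["V"]} for rpm in body["RPM"]]
for r in rows:
    print(r)
```

Spark's `explode` plus the `RPM.E` / `RPM.V` column selections do exactly this row-by-row, at scale.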

Cut off the field from JSON string

I have a JSON string like this:
{"id":"111","name":"abc","ids":["740"],"data":"abc"}
I want to cut off the field "ids", however I don't know a priori the values like ["740"]. So, it might be e.g. ["888,222"] or whatever. The goal is to get the JSON string without the field "ids".
How do I do it? Should I use JackMapper?
EDIT:
I tried to use JacksMapper as `JacksMapper.readValue[Map[String, String]](jsonString)` to get only the fields that I need. But the problem is that `"ids":["740"]` throws a parsing error because it's an array. So, I decided to cut off this field before parsing, though it's an ugly solution; ideally I just want to parse the JSON string into a Map.
Not sure what JackMapper is, but if other libraries are allowed, my personal favourites would be:
Play-JSON:
val jsonString = """{"id":"111","name":"abc","ids":["740"],"data":"abc"}"""
val json = Json.parse(jsonString).as[JsObject]
val newJson = json - "ids"
Circe:
import io.circe.parser._
val jsonString = """{"id":"111","name":"abc","ids":["740"],"data":"abc"}"""
val json = parse(jsonString).right.get.asObject.get // not handling errors
val newJson = json.remove("ids")
Note that this is the minimal example to get you going which doesn't handle bad input etc.
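For comparison, note where the original parse error came from: forcing every value into a Map[String, String]. Parsing into a generic structure keeps the array intact, and the key can then simply be dropped. A quick sketch (in Python, standing in for the Scala libraries above):

```python
import json

json_string = '{"id":"111","name":"abc","ids":["740"],"data":"abc"}'
obj = json.loads(json_string)   # values keep their own types ("ids" stays a list)
obj.pop("ids", None)            # drop the unwanted field
new_json = json.dumps(obj)
print(new_json)
```

The Play-JSON and Circe snippets above do the same thing: parse to a generic object, remove the key, re-serialise.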

Loading json data into hive using spark sql

I am unable to push JSON data into Hive. Below is the sample JSON data and my work. Please suggest what I am missing.
json Data
{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani#gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani#gmail.com"
},
{
"userId":"thanks",
"jobTitleName":"Program Directory",
"firstName":"Tom",
"lastName":"Hanks",
"preferredFullName":"Tom Hanks",
"employeeCode":"E3",
"region":"CA",
"phoneNumber":"408-2222222",
"emailAddress":"tomhanks#gmail.com"
}
]
}
I tried to use SQLContext's jsonFile method to load it, which fails to parse the JSON:
val f = sqlc.jsonFile("file:///home/vm/Downloads/emp.json")
f.show
error is : java.lang.RuntimeException: Failed to parse a value for data type StructType() (current token: VALUE_STRING)
I tried a different way and was able to get the schema:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
sqlc.jsonRDD(jsonData).registerTempTable("employee")
val emp= sqlc.sql("select Employees[1].userId as ID,Employees[1].jobTitleName as Title,Employees[1].firstName as FirstName,Employees[1].lastName as LastName,Employees[1].preferredFullName as PeferedName,Employees[1].employeeCode as empCode,Employees[1].region as Region,Employees[1].phoneNumber as Phone,Employees[1].emailAddress as email from employee")
emp.show // displays all the values
I am able to get the data and schema separately for each record, but I am missing how to get all the data and load it into Hive.
Any help or suggestion is much appreciated.
Here is the answer I cracked:
val files = sc.wholeTextFiles("file:///home/vm/Downloads/emp.json")
val jsonData = files.map(x => x._2)
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.HiveContext
val hc=new HiveContext(sc)
hc.jsonRDD(jsonData).registerTempTable("employee")
val fuldf=hc.jsonRDD(jsonData)
val dfemp=fuldf.select(explode(col("Employees")))
dfemp.saveAsTable("empdummy")
val df=sql("select * from empdummy")
df.select ("_c0.userId","_c0.jobTitleName","_c0.firstName","_c0.lastName","_c0.preferredFullName","_c0.employeeCode","_c0.region","_c0.phoneNumber","_c0.emailAddress").saveAsTable("dummytab")
Any suggestions for optimising the above code are welcome.
SparkSQL only supports reading JSON files when the file contains one JSON object per line.
SQLContext.scala
/**
 * Loads a JSON file (one object per line), returning the result as a [[DataFrame]].
 * It goes through the entire dataset once to determine the schema.
 *
 * @group specificdata
 * @deprecated As of 1.4.0, replaced by `read().json()`. This will be removed in Spark 2.0.
 */
@deprecated("Use read.json(). This will be removed in Spark 2.0.", "1.4.0")
def jsonFile(path: String): DataFrame = {
  read.json(path)
}
Your file should look like this (strictly speaking, it's not a proper JSON file)
{"userId":"rirani","jobTitleName":"Developer","firstName":"Romin","lastName":"Irani","preferredFullName":"Romin Irani","employeeCode":"E1","region":"CA","phoneNumber":"408-1234567","emailAddress":"romin.k.irani#gmail.com"}
{"userId":"nirani","jobTitleName":"Developer","firstName":"Neil","lastName":"Irani","preferredFullName":"Neil Irani","employeeCode":"E2","region":"CA","phoneNumber":"408-1111111","emailAddress":"neilrirani#gmail.com"}
{"userId":"thanks","jobTitleName":"Program Directory","firstName":"Tom","lastName":"Hanks","preferredFullName":"Tom Hanks","employeeCode":"E3","region":"CA","phoneNumber":"408-2222222","emailAddress":"tomhanks#gmail.com"}
Please have a look at the outstanding JIRA issue. I don't think it is much of a priority, but it's here for the record.
You have two options:
Convert your JSON data to the supported format, one object per line
Have one file per JSON object - this will result in too many files.
Note that SQLContext.jsonFile is deprecated, use SQLContext.read.json.
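The first option can be scripted. A minimal sketch using shortened employee records, assuming the wrapper layout from the question:

```python
import json

# Hypothetical shortened version of the emp.json wrapper object.
wrapped = {
    "Employees": [
        {"userId": "rirani", "jobTitleName": "Developer"},
        {"userId": "nirani", "jobTitleName": "Developer"},
    ]
}

# Unwrap the array and emit one compact JSON object per line,
# the layout Spark's read.json expects.
lines = [json.dumps(emp) for emp in wrapped["Employees"]]
print("\n".join(lines))
```

Running this over the real file and writing the result out gives a file Spark can load directly, without wholeTextFiles or explode.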
Examples from spark documentation