I am trying to convert a Spark Dataset into JSON. I tried the .toJSON() method, but it is not much help.
I have a dataset which looks like this:
+--------------------+-----+
|          ord_status|count|
+--------------------+-----+
|             Fallout| 3374|
|         Flowthrough|12083|
|         In-Progress| 3804|
+--------------------+-----+
I am trying to convert it to JSON like this:
"overallCounts": {
"flowthrough": 2148,
"fallout": 4233,
"inprogress": 1300
}
My question is: is there any way to read the column values pair-wise and render them as JSON like that?
Update: I produced the desired JSON format by collecting the dataset into a list, then parsing each value and appending it to a string. That is a lot of manual work, though. Are there any built-in methods that can convert datasets into such a JSON format?
Please find the solution below. The dataset is iterated with mapPartitions to generate the string fragments, which are then assembled into the final JSON.
import spark.implicits._   // needed for toDS()

val list = List(("Fallout", 3374), ("Flowthrough", 12083), ("In-Progress", 3804))
val ds = list.toDS()
ds.show()

// Each partition renders its (status, count) pairs as `"key" : value` fragments.
val rows = ds.mapPartitions(itr => {
  val template = """"%s" : %d"""
  val pairs = itr.map(ele => template.format(ele._1, ele._2)).mkString(",\n")
  Iterator(pairs)
})

// Join the per-partition fragments (skipping empty partitions) and wrap them.
val text = rows.collect().filter(_.nonEmpty).mkString(",\n")
val finalJson =
  """|"overallCounts": {
     |  %s
     |}""".stripMargin
println(finalJson.format(text))
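As an aside, newer Spark versions can build this shape with built-in functions instead of manual string formatting. This is only a minimal sketch, assuming Spark 2.4+ (for map_from_entries) and spark.implicits._ in scope, reusing the same ds as above:

import org.apache.spark.sql.functions._

// Collapse all rows into one map column, then render it as JSON.
val json = ds
  .groupBy()   // single aggregate row
  .agg(map_from_entries(collect_list(struct($"_1", $"_2"))).as("overallCounts"))
  .select(to_json(struct($"overallCounts")).as("json"))
  .first()
  .getString(0)

println(json)  // e.g. {"overallCounts":{"Fallout":3374,"Flowthrough":12083,"In-Progress":3804}}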
I have a table with a column containing nested JSON data, and I need to remove an attribute from that JSON through Spark SQL.
I checked the basic Spark JSON functions but have not found a way to do it.
Assuming you read in a JSON file and printed the schema you are showing, like this:
val df = sqlContext.read.json("/path/to/file").toDF()
df.registerTempTable("df")
df.printSchema()
Then you can select nested objects inside a struct type like so...
val app = df.select("app")
app.registerTempTable("app")
app.printSchema()
app.show()
val appName = app.select("element.appName")
appName.registerTempTable("appName")
appName.printSchema()
appName.show()
val trimmedDF = appName.drop("firstname")
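As a side note, if the goal is to remove the attribute while keeping the rest of the nested structure in place (rather than selecting it out), recent Spark versions have Column.dropFields. A rough sketch, assuming Spark 3.1+ and a struct column named app containing a firstname field:

import org.apache.spark.sql.functions.col

// Rewrite the `app` struct column without its `firstname` field.
val withoutFirstname = df.withColumn("app", col("app").dropFields("firstname"))
withoutFirstname.printSchema()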
I have some data that needs to be written as a JSON string after some transformations in a Spark (+ Scala) job.
I'm using the to_json function along with the struct and/or array functions to build the final JSON that is requested.
One piece of the JSON looks like this:
"field":[
"foo",
{
"inner_field":"bar"
}
]
I'm not an expert in JSON, so I don't know whether this structure is usual or not; all I know is that it is valid JSON.
I'm having trouble creating a DataFrame column with this format, and I want to know the best way to create this type of column.
Thanks in advance
If you have a dataframe with a bunch of columns you want to turn into a json string column, you can make use of the to_json and the struct functions. Something like this:
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._   // for toDF and the $ column syntax
val df = Seq(
(1, "string1", Seq("string2", "string3")),
(2, "string4", Seq("string5", "string6"))
).toDF("colA", "colB", "colC")
df.show
+----+-------+------------------+
|colA| colB| colC|
+----+-------+------------------+
| 1|string1|[string2, string3]|
| 2|string4|[string5, string6]|
+----+-------+------------------+
val newDf = df.withColumn("jsonString", to_json(struct($"colA", $"colB", $"colC")))
newDf.show(false)
+----+-------+------------------+--------------------------------------------------------+
|colA|colB |colC |jsonString |
+----+-------+------------------+--------------------------------------------------------+
|1 |string1|[string2, string3]|{"colA":1,"colB":"string1","colC":["string2","string3"]}|
|2 |string4|[string5, string6]|{"colA":2,"colB":"string4","colC":["string5","string6"]}|
+----+-------+------------------+--------------------------------------------------------+
struct makes a single StructType column from multiple columns and to_json turns them into a json string.
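For the mixed array from the question (a plain string next to an object), there is no single built-in column type that carries both, so one possible workaround is to assemble that fragment by concatenating the literal pieces around to_json. A hedged sketch, with inner_field as a stand-in column name and spark.implicits._ in scope as above:

import org.apache.spark.sql.functions.{concat, lit, struct, to_json}

val src = Seq("bar").toDF("inner_field")

// Concatenate the literal "foo" element with the object rendered by to_json.
val out = src.withColumn(
  "field",
  concat(lit("""["foo","""), to_json(struct($"inner_field")), lit("]"))
)

out.show(false)
// field ends up as: ["foo",{"inner_field":"bar"}]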
Hope this helps!
I am working in the Scala programming language. I have a big JSON string and want to extract some fields from it based on their paths. I want to extract a bunch of fields based on an input collection Seq[Field], so I need to call the function get_json_object multiple times. How can I do that with get_json_object? Is there any other way to achieve this?
When I do
var json: JSONObject = null
val fields = Seq[Field]() // this has field name and path
var fieldMap = Map[String, String]()
for (field<- fields) {
fieldMap += (field.Field -> get_json_object(data, field.Path))
}
I get Type mismatch, expected: Column, actual: String
And when I use lit(data) in the above code, e.g.
get_json_object(lit(data), field.Path))
then I get
found : org.apache.spark.sql.Column
[ERROR] required: String
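The mismatch comes from get_json_object being a Column function: it returns a Column and is meant to run inside a DataFrame transformation, not against a plain String. A minimal sketch of one way to apply it per field; the Field case class, field names, and paths below are hypothetical, and data is assumed to be the JSON String from the question wrapped in a one-column DataFrame:

import org.apache.spark.sql.functions.{col, get_json_object}
import spark.implicits._

case class Field(Field: String, Path: String)
val fields = Seq(Field("name", "$.name"), Field("age", "$.age"))  // example paths

val jsonDf = Seq(data).toDF("json")  // data: the JSON string
val extracted = jsonDf.select(
  fields.map(f => get_json_object(col("json"), f.Path).as(f.Field)): _*
)
extracted.show(false)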
I am trying to read data from Kafka using Structured Streaming. The data received from Kafka is in JSON format. I use a sample JSON to create the schema, and later in the code I use the from_json function to convert the JSON to a DataFrame for further processing. The problem I am facing is with the nested schema and multi-values: the sample schema defines a tag (say a) as a struct, but the JSON data read from Kafka can have either one or multiple values for the same tag (in two different messages).
val df0= spark.read.format("json").load("contactSchema0.json")
val schema0 = df0.schema
val df1 = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "node1:9092").option("subscribe", "my_first_topic").load()
val df2 = df1.selectExpr("CAST(value as STRING)").toDF()
val df3 = df2.select(from_json($"value",schema0).alias("value"))
contactSchema0.json has a sample tag as follows:
"contactList": {
"contact": [{
"id": 1001
},
{
"id": 1002
}]
}
Thus contact is inferred as a struct. But the JSON data read from Kafka can also have data as follows:
"contactList": {
"contact": {
"id": 1001
}
}
So if I define the schema as a struct, Spark's JSON parsing cannot handle the single-value case, and if I define the schema as a string, it cannot handle the multi-value case.
I can't find such a feature in the Spark JSON options, but Jackson has DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY, as described in this answer.
So we can work around it with something like this:
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import spark.implicits._

case class MyModel(contactList: ContactList)
case class ContactList(contact: Array[Contact])
case class Contact(id: Int)

val txt =
  """|{"contactList": {"contact": [{"id": 1001}]}}
     |{"contactList": {"contact": {"id": 1002}}}"""
    .stripMargin.lines.toSeq.toDS()
txt
.mapPartitions[MyModel] { it: Iterator[String] =>
val reader = new ObjectMapper()
.registerModule(DefaultScalaModule)
.enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY)
.readerFor(classOf[MyModel])
it.map(reader.readValue[MyModel])
}
.show()
Output:
+-----------+
|contactList|
+-----------+
| [[[1001]]]|
| [[[1002]]]|
+-----------+
Note that to get a Dataset in your code, you could use
val df2 = df1.selectExpr("CAST(value as STRING)").as[String]
instead and then call mapPartitions for df2 like before.
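Put together, the streaming side of the question could be adapted roughly like this; just a sketch, reusing MyModel and the same Jackson reader as above:

val df2 = df1.selectExpr("CAST(value AS STRING)").as[String]

val parsed = df2.mapPartitions { it: Iterator[String] =>
  val reader = new ObjectMapper()
    .registerModule(DefaultScalaModule)
    .enable(DeserializationFeature.ACCEPT_SINGLE_VALUE_AS_ARRAY)
    .readerFor(classOf[MyModel])
  it.map(reader.readValue[MyModel])
}
// parsed is a Dataset[MyModel] that can be processed further downstream.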
I have an RDD containing a String and a JSON object (as a String). I extracted the required values from the JSON object. How can I use the values to create a new RDD that stores each value in its own column?
RDD
(1234,{"id"->1,"name"->"abc","age"->21,"class"->5})
From which a map was generated as shown below.
"id"->1,
"name"->"abc",
"age"->21
"id"->2,
"name"->"def",
"age"->31
How to convert this to RDD[(String, String, String)], which stores data like:
1 abc 21
2 def 31
Not in front of a compiler right now, but something like this should work:
def parse(row: (String, JValue)): Seq[(String, String, String)] = {
  // Here goes your code to parse a Json into a sequence of tuples; it seems you already have this well in hand.
  ???
}
val rdd1 = ??? // Initialize your RDD[(String, JValue)]
val rdd2: RDD[(String, String, String)] = rdd1.flatMap(parse)
flatMap does the trick, as your extraction function can extract multiple rows from each Json input (or none) and they will be seamlessly integrated into the final RDD.
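If helpful, here is one hypothetical way to fill in parse with json4s, assuming the JValue holds an object with id, name, and age fields as in the example above:

import org.json4s._

def parse(row: (String, JValue)): Seq[(String, String, String)] = {
  implicit val formats: Formats = DefaultFormats
  val json = row._2
  // Extract each field as an Option and yield a row only when all are present.
  val extracted = for {
    id   <- (json \ "id").extractOpt[Int]
    name <- (json \ "name").extractOpt[String]
    age  <- (json \ "age").extractOpt[Int]
  } yield (id.toString, name, age.toString)
  extracted.toSeq  // empty when any field is missing
}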