I have the code and program output below, but I could not produce the list of JSON objects shown under the desired output. What kind of changes do I need to make to the existing code?
case class Uiresult(AccountNo: String, Name: String)
val json = parse(jsonString)
val elements = (json \\ "_source").children
for (acct <- elements) {
val m = acct.extract[Source]
val res = write(Uiresult(m.accountNo, m.firstName + m.lastName))
println(res)
}
Output from current program:
{"AccountNo":"1234","Name":"Augustin John"}
{"AccountNo":"1235","Name":"Juliet Paul"}
{"AccountNo":"1236","Name":"Sebastin Arul"}
Desired output:
[
{"AccountNo":"1234","Name":"Augustin John"},
{"AccountNo":"1235","Name":"Juliet Paul"},
{"AccountNo":"1236","Name":"Sebastin Arul"}
]
To build a list from the for comprehension, use the yield keyword. It returns the value of each iteration and collects them into a list, which you can then assign to a val.
val list = for (acct <- elements) yield {
val m = acct.extract[Source]
val res = write(Uiresult(m.accountNo, m.firstName + m.lastName))
res
}
This can be written even shorter:
val list = for (acct <- elements) yield {
val m = acct.extract[Source]
write(Uiresult(m.accountNo, m.firstName + m.lastName))
}
The type of list (Array, List, Seq, etc.) is determined by the type of elements. Other collections, such as sets and maps, can be used the same way.
To print the output in exactly the format shown under "desired output" above, use mkString.
println(list.mkString("[\n", ",\n", "\n]"))
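For reference, here is a minimal self-contained sketch of the yield plus mkString pattern that uses only the standard library. The Account case class and the hand-rolled toJson helper are assumptions standing in for Source and for json4s's write, so that the snippet runs without dependencies; they are not part of the original code.

```scala
object JsonListSketch {
  // Hypothetical stand-in for the asker's Source class.
  final case class Account(accountNo: String, firstName: String, lastName: String)

  // Hand-rolled stand-in for json4s `write`; it does no escaping, illustration only.
  def toJson(a: Account): String =
    s"""{"AccountNo":"${a.accountNo}","Name":"${a.firstName} ${a.lastName}"}"""

  // `yield` collects one JSON string per element;
  // mkString then adds the brackets and the separators.
  def render(accounts: List[Account]): String = {
    val list = for (acct <- accounts) yield toJson(acct)
    list.mkString("[\n", ",\n", "\n]")
  }

  def main(args: Array[String]): Unit =
    println(render(List(
      Account("1234", "Augustin", "John"),
      Account("1235", "Juliet", "Paul"))))
}
```

With a real JSON library it is usually safer to serialize the whole list in one call than to join strings by hand, since the library then takes care of escaping.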
My requirement is to convert two strings into a JSON file (using spray-json) and save it in a resource directory.
One input string contains the ID, and the other contains the scores and topics.
id = "alpha1"
inputstring = "science 30 math 24"
Expected output JSON is
{"ContentID": "alpha1",
"Topics": [
{"Score" : 30, "TopicID" : "Science" },
{ "Score" : 24, "TopicID" : "math"}
]
}
Below is the approach I have taken; I am stuck at the last step.
Define the case class
case class Topic(Score: String, TopicID: String)
case class Model(contentID: String, topic: Array[Topic])
implicit val topicJsonFormat: RootJsonFormat[Topic] = jsonFormat2(Topic)
implicit val modelJsonFormat: RootJsonFormat[Model] = jsonFormat2(Model)
Parsing the input string
val a = input.split(" ").zipWithIndex.collect{case(v,i) if (i % 2 == 0) =>
(v,i)}.map(_._1)
val b = input.split(" ").zipWithIndex.collect{case(v,i) if (i % 2 != 0) =>
(v,i)}.map(_._1)
val paired = a.zip(b)
And finally traversing paired:
paired foreach {case (x,y) =>
val tClass = Topic(x, y)
val mClassJsonString = Topic(x, y).toJson.prettyPrint
out1.write(mClassJsonString.toString)
}
And the file is generated as
{"Score" : 30, "TopicID" : "Science" }
{ "Score" : 24, "TopicID" : "math"}
The problem is that I am not able to add the contentID as needed above.
Adding contentID inside foreach causes it to be added multiple times.
You're calling toJson inside foreach, which creates a separate JSON string per topic, and then appending each string to the buffer.
What you probably want is to build the class (ADT) hierarchy first and then serialize it once:
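// paired holds (topic, score) tuples, so map each one onto Topic(Score, TopicID)
val topics = paired.map { case (topicId, score) => Topic(score, topicId) }
// toArray might not be necessary if topics is already an array
val model = Model("alpha1", topics.toArray)
val json = model.toJson.prettyPrint
out1.write(json)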
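As a rough sketch of the "build the hierarchy first" step, the snippet below parses the input string and assembles a single model using only the standard library. It makes some assumptions that are not in the original: Score is an Int so that it would serialize as a number, the fields are named after the expected JSON keys, and grouped(2) replaces the zipWithIndex pairing. The spray-json serialization step is left out on purpose.

```scala
object TopicModelSketch {
  // Score is an Int here (an assumption) so that it would serialize as a number.
  final case class Topic(Score: Int, TopicID: String)
  final case class Model(ContentID: String, Topics: List[Topic])

  // Splits "science 30 math 24" into (name, score) pairs and builds one Model.
  // score.toInt throws NumberFormatException if a score is not numeric.
  def build(id: String, input: String): Model = {
    val topics = input.trim.split("\\s+").grouped(2).collect {
      case Array(name, score) => Topic(score.toInt, name)
    }.toList
    Model(id, topics)
  }

  def main(args: Array[String]): Unit =
    println(build("alpha1", "science 30 math 24"))
}
```

Because the content ID is attached once to the model rather than once per topic, serializing the model produces a single object with one ID and an array of topics.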
I had a requirement to convert a JSON response into a csv file. I was able to successfully use Tim Yates' excellent code from here: Groovy code to convert json to CSV file
I now need to include the JSON's nested submap in the csv as well. The relationship between the map and submap is 1:1.
I've been unable to get the correct syntax for a collect statement that will retrieve both the parsed map and submap key/values.
Sample JSON
{items=
[
{
created_at=2019-03-27
, entity_id=1
, extension_attributes=[]
},
{
created_at=2019-03-27
, entity_id=2
, extension_attributes= { employee_id=Emp1, employee_type=CSR}//nested submap
}
]}
Groovy
import groovy.json.*
def data = new JsonSlurper().parseText( json ); //"json" is from GET request
def columns = ["created_at","entity_id","employee_id","employee_type"]
def encode = { e -> e ? /"$e"/ : "$e"}
requestFile.append( columns.collect { c -> encode( c ) }.join( ',' ) + '\n');
requestFile.append( data.items.collect { row ->columns.collect { colName -> encode( row[ colName ] ).replaceAll("null","") }.join( ',' )}.join( '\n' ) );//unsure how to use data.items.collect to fetch submap
I would like to either
1) Convert the JSON as follows to collect each key/value easily:
...
{
created_at=2019-03-27
, entity_id=2
, employee_id=Emp1
, employee_type=CSR
}
...
or 2) Find out if there's a way to use Groovy's collect method to retrieve the map/submap as a flat map.
I am unfortunately not a programmer by trade, and any help would be appreciated!
Here is the flatten closure which flattens an item recursively:
def flatten
flatten = { row ->
def flattened = [:]
row.each { k, v ->
if (v instanceof Map) {
flattened << flatten(v)
} else {
flattened[k] = v
}
}
flattened
}
You should just replace row with flatten(row) at your last line so it looks like this:
requestFile.append(data.items.collect { row ->
columns.collect {
colName -> encode(flatten(row)[colName]).replaceAll("null", "")
}.join(',')
}.join('\n'))
The result will be as follows:
"created_at","entity_id","employee_id","employee_type"
"2019-03-27","1",,
"2019-03-27","2","Emp1","CSR"
I also found that the following lets the collect method fetch the nested elements directly:
def m = data.items.collect{[/"${it?.created_at?:''}"/,/"${it?.entity_id?:''}"/,/"${it?.extension_attributes?.employee_id?:''}"/,/"${it?.extension_attributes?.employee_type?:''}"/]}
m.each{requestFile.append(it.join(',')+'\n')}
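For comparison, the recursive flatten closure from the answer above can be sketched in Scala roughly as follows. This is an illustration under assumptions, not a translation of the Groovy program: rows are modeled as Map[String, Any] with string keys, and csvRow is a hypothetical helper that quotes present values and leaves missing ones empty.

```scala
object FlattenSketch {
  // Recursively lifts the entries of nested maps to the top level.
  // Non-map values (including empty lists) stay under their own key.
  // Keys are assumed to be strings, as they are in parsed JSON.
  def flatten(row: Map[String, Any]): Map[String, Any] =
    row.foldLeft(Map.empty[String, Any]) {
      case (acc, (_, nested: Map[String, Any] @unchecked)) => acc ++ flatten(nested)
      case (acc, (key, value))                             => acc + (key -> value)
    }

  // Hypothetical helper: one CSV line, quoting values that are present.
  def csvRow(columns: List[String], row: Map[String, Any]): String = {
    val flat = flatten(row)
    columns.map(c => flat.get(c).map(v => "\"" + v + "\"").getOrElse("")).mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val columns = List("created_at", "entity_id", "employee_id", "employee_type")
    val row = Map[String, Any](
      "created_at" -> "2019-03-27",
      "entity_id" -> 2,
      "extension_attributes" -> Map("employee_id" -> "Emp1", "employee_type" -> "CSR"))
    println(csvRow(columns, row))
  }
}
```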
I have an object like this:
val aa = parse(""" { "vals" : [[1,2,3,4], [4,5,6,7], [8,9,6,3]] } """)
I want to access the value '1' in the first JArray.
println(aa.values ???)
How is this done?
Thanks
One way would be:
val n = (aa \ "vals")(0)(0).extract[Int]
println(n)
Another way is to parse the whole JSON into a case class:
implicit val formats = DefaultFormats
case class Numbers(vals: List[List[Int]])
val numbers = aa.extract[Numbers]
This way you can access the first value of the first list however you like:
for { list <- numbers.vals.headOption; hd <- list.headOption } println(hd)
// or
println(numbers.vals.head.head)
// or ...
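Note that head.head throws a NoSuchElementException when either list is empty, while the headOption form does not. A small sketch of the safe variant on a plain List[List[Int]] (standing in for numbers.vals, so no JSON library is needed):

```scala
object FirstValueSketch {
  // Returns the first element of the first inner list,
  // or None if the outer list or its first inner list is empty.
  def first(vals: List[List[Int]]): Option[Int] =
    for {
      list <- vals.headOption
      hd   <- list.headOption
    } yield hd

  def main(args: Array[String]): Unit =
    println(first(List(List(1, 2, 3, 4), List(4, 5, 6, 7), List(8, 9, 6, 3))))
}
```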
I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, so to know the column names I would need to union all the keys. Is there a way to avoid this collect operation for finding all the keys, and call rdd.saveAsTextFile(..) just once to get the CSV?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction also. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have 0 value in all maps. So the problem becomes, I know that the keys are ("a", "b", "c") and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If your statement is "every new element in my RDD may add a new column name I have not seen so far", then you obviously can't avoid a full scan. But you don't need to collect all elements on the driver.
You could use aggregate to only collect column names. This method takes two functions, one is to insert a single element into the resulting collection, and another one to merge results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
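To make the two-function shape of aggregate concrete, here is a local sketch that needs no Spark. Each inner Seq plays the role of one partition (an assumption made for illustration); seqOp and combOp are the same two functions as in the call above, and toCsv is a hypothetical second pass that fills missing keys with 0.

```scala
object ColumnUnionSketch {
  type Row = Map[String, Int]

  // seqOp: fold one row's keys into the partial set of a partition.
  val seqOp: (Set[String], Row) => Set[String] = (s, m) => s union m.keySet
  // combOp: merge the partial sets of two partitions.
  val combOp: (Set[String], Set[String]) => Set[String] = (s1, s2) => s1 union s2

  // Local stand-in for rdd.aggregate: fold each "partition", then merge the results.
  def columns(partitions: Seq[Seq[Row]]): List[String] =
    partitions
      .map(_.foldLeft(Set.empty[String])(seqOp))
      .foldLeft(Set.empty[String])(combOp)
      .toList
      .sorted

  // Second pass: a header line, then one CSV line per row with 0 for missing keys.
  def toCsv(partitions: Seq[Seq[Row]]): String = {
    val cols = columns(partitions)
    val lines = partitions.flatten.map(row => cols.map(c => row.getOrElse(c, 0)).mkString(","))
    (cols.mkString(",") +: lines).mkString("\n")
  }

  def main(args: Array[String]): Unit =
    println(toCsv(Seq(Seq(Map("a" -> 1, "b" -> 2)), Seq(Map("b" -> 1, "c" -> 3)))))
}
```

Only the set of column names travels back to the driver in the first pass; the rows themselves stay distributed until they are written out.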
Scala and any other supported language
You can use spark-csv
First, let's find all the columns present:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
import org.apache.spark.sql.Row
val rows = rdd.map {
row => { Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })}
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
cols.value.map(field => StructField(field, IntegerType, true)))
Convert RDD[Row] to Data Frame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and the final data fits in driver memory, you can use Pandas through the toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())
df = sqlContext.createDataFrame(
rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().to_csv('mycsv.csv', index=False)
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)
Edit
One possible way to avoid the second collect is to use accumulators, either to build a set of all column names or to count the rows in which you found zeros, and then use this information to map over the rows and remove unnecessary columns or add zeros.
It is possible but inefficient and feels like cheating. The only situation in which it makes some sense is when the number of zeros is very low, but I guess that is not the case here.
object ColsSetParam extends AccumulatorParam[Set[String]] {
def zero(initialValue: Set[String]): Set[String] = {
Set.empty[String]
}
def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = {
s1 ++ s2
}
}
val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { colSetAccum += _.keys.toSet }
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))
object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
def zero(initialValue: Map[String, Int]): Map[String, Int] = {
Map.empty[String, Int]
}
def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
val keys = m1.keys ++ m2.keys
keys.map(
(k: String) => (k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))).toMap
}
}
val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)
rdd.foreach { row =>
// If allColnames.value -- row.keys.toSet is empty we can avoid this part
accum += (allColnames.value -- row.keys.toSet).map(x => (x -> 1)).toMap
}
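The counting idea can be tried out locally without Spark. The sketch below is a variation under stated assumptions rather than the accumulator code itself: it works on an in-memory Seq, and it counts a column as zero in a row when the key is missing or its value is 0. A column is dropped only when that count equals the number of rows.

```scala
object DropZeroColumnsSketch {
  type Row = Map[String, Int]

  // Counts, per column, the rows in which the value is missing or zero.
  def zeroCounts(allCols: Set[String], rows: Seq[Row]): Map[String, Int] =
    rows.foldLeft(Map.empty[String, Int]) { (acc, row) =>
      allCols
        .filter(c => row.getOrElse(c, 0) == 0)
        .foldLeft(acc)((m, c) => m.updated(c, m.getOrElse(c, 0) + 1))
    }

  // Keeps only the columns that are non-zero in at least one row.
  def keptColumns(allCols: Set[String], rows: Seq[Row]): List[String] = {
    val zeros = zeroCounts(allCols, rows)
    allCols.filter(c => zeros.getOrElse(c, 0) < rows.size).toList.sorted
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Map("a" -> 1, "b" -> 2, "c" -> 0), Map("a" -> 3, "b" -> 1, "c" -> 0))
    println(keptColumns(Set("a", "b", "c"), rows).mkString(","))
  }
}
```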
I think I may be missing something fundamental about the lift-json XPath-style architecture. The smoothest way I've been able to extract and traverse a list is shown below. Can someone please show me a better technique?
class Example {
@Test
def traverseJsonArray() {
def myOperation(kid:JObject) = println("kid="+kid)
val json = JsonParser.parse("""
{ "kids":[
{"name":"bob","age":3},
{"name":"angie","age":5}
]}
""")
val list = ( json \\ "kids" ).children(0).children
for ( kid <- list ) myOperation(kid.asInstanceOf[JObject])
}
}
If at all possible you should upgrade to Lift JSON 2.3-M1 (http://www.scala-tools.org/repo-releases/net/liftweb/lift-json_2.8.1/2.3-M1/). It contains two important improvements, one of which affects the path expressions.
With 2.3 the path expressions never return JFields, instead the values of JFields are returned directly. After that your example would look like:
val list = (json \ "kids").children
for ( kid <- list ) myOperation(kid.asInstanceOf[JObject])
Lift JSON provides several styles for parsing values from JSON: path expressions, query comprehensions and case class extractions. It is possible to mix and match these styles, and to get the best results we often do. For completeness' sake I'll give you some variations of the above example to give a better intuition for these different styles.
// Collect all JObjects from 'kids' array and iterate
val JArray(kids) = json \ "kids"
kids collect { case kid: JObject => kid } foreach myOperation
// Yield the JObjects from 'kids' array and iterate over yielded list
(for (kid @ JObject(_) <- json \ "kids") yield kid) foreach myOperation
// Extract the values of 'kids' array as JObjects
implicit val formats = DefaultFormats
(json \ "kids").extract[List[JObject]] foreach myOperation
// Extract the values of 'kids' array as case classes
case class Kid(name: String, age: Int)
(json \ "kids").extract[List[Kid]] foreach println
// Query the JSON with a query comprehension
val ks = for {
JArray(kids) <- json
kid @ JObject(_) <- kids
} yield kid
ks foreach myOperation