Groovy collect from map and submap - json

I had a requirement to convert a JSON response into a csv file. I was able to successfully use Tim Yates' excellent code from here: Groovy code to convert json to CSV file
I now need to include the JSON's nested submap in the csv as well. The relationship between the map and submap is 1:1.
I've been unable to get the correct syntax for a collect statement that will retrieve both the parsed map and submap key/values.
Sample JSON
{items=[
  {
    created_at=2019-03-27,
    entity_id=1,
    extension_attributes=[]
  },
  {
    created_at=2019-03-27,
    entity_id=2,
    extension_attributes={ employee_id=Emp1, employee_type=CSR }   // nested submap
  }
]}
Groovy
import groovy.json.*
def data = new JsonSlurper().parseText( json ); //"json" is from GET request
def columns = ["created_at","entity_id","employee_id","employee_type"]
def encode = { e -> e ? /"$e"/ : "$e"}
requestFile.append( columns.collect { c -> encode( c ) }.join( ',' ) + '\n');
requestFile.append( data.items.collect { row ->
    columns.collect { colName -> encode( row[ colName ] ).replaceAll("null","") }.join( ',' )
}.join( '\n' ) )   // unsure how to use data.items.collect to fetch the submap values
I would like to either
1) Convert the JSON as follows to collect each key/value easily:
...
{
created_at=2019-03-27
, entity_id=2
, employee_id=Emp1
, employee_type=CSR
}
...
or 2) Find out if there's a way to use Groovy's collect method to retrieve the map/submap as a flat map.
I am unfortunately not a programmer by trade, and any help would be appreciated!

Here is the flatten closure which flattens an item recursively:
def flatten
flatten = { row ->
    def flattened = [:]
    row.each { k, v ->
        if (v instanceof Map) {
            flattened << flatten(v)
        } else {
            flattened[k] = v
        }
    }
    flattened
}
Then replace row with flatten(row) in your last line (flattening each row once, rather than once per column), so it looks like this:
requestFile.append(data.items.collect { row ->
    def flat = flatten(row)
    columns.collect { colName ->
        encode(flat[colName]).replaceAll("null", "")
    }.join(',')
}.join('\n'))
The result will be as follows:
"created_at","entity_id","employee_id","employee_type"
"2019-03-27","1",,
"2019-03-27","2","Emp1","CSR"

Also, I found that the following allows the collect method to fetch the nested elements directly (entity_id is a top-level key in the sample, so it is read from the item itself rather than from extension_attributes):
def m = data.items.collect { [
    /"${it?.created_at ?: ''}"/,
    /"${it?.entity_id ?: ''}"/,
    /"${it?.extension_attributes?.employee_id ?: ''}"/,
    /"${it?.extension_attributes?.employee_type ?: ''}"/
] }
m.each { requestFile.append(it.join(',') + '\n') }

Related

Scala Json - List of Json

I have the code below and the output from my program. However, I could not create a list of JSON (desired output given below). What kind of changes do I need to make in the existing code?
case class Uiresult(AccountNo: String, Name: String)

val json = parse(jsonString)
val elements = (json \\ "_source").children
for (acct <- elements) {
  val m = acct.extract[Source]
  val res = write(Uiresult(m.accountNo, m.firstName + " " + m.lastName))
  println(res)
}
Output from current program:
{"AccountNo":"1234","Name":"Augustin John"}
{"AccountNo":"1235","Name":"Juliet Paul"}
{"AccountNo":"1236","Name":"Sebastin Arul"}
Desired output:
[
{"AccountNo":"1234","Name":"Augustin John"},
{"AccountNo":"1235","Name":"Juliet Paul"},
{"AccountNo":"1236","Name":"Sebastin Arul"}
]
To build a list from the for comprehension, use the yield keyword. It collects the value produced by each iteration into a new collection, which you can then assign to a val.
val list = for (acct <- elements) yield {
  val m = acct.extract[Source]
  val res = write(Uiresult(m.accountNo, m.firstName + " " + m.lastName))
  res
}
This can be written even more concisely:
val list = for (acct <- elements) yield {
  val m = acct.extract[Source]
  write(Uiresult(m.accountNo, m.firstName + " " + m.lastName))
}
The type of list (Array, List, Seq, etc.) is determined by the type of elements. Other data structures, such as sets or maps, can be built the same way.
To print the output into the exact format as in the "desired output" above, use mkString.
println(list.mkString("[\n", ",\n", "\n]"))
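Putting the pieces together, here is a minimal self-contained sketch of the yield + mkString pattern; the JSON extraction is replaced by stub data, and the Account case class and accounts list are made up purely for illustration:
// Stub data standing in for the extracted JSON (Account is a hypothetical placeholder)
case class Account(accountNo: String, firstName: String, lastName: String)
val accounts = List(
  Account("1234", "Augustin", "John"),
  Account("1235", "Juliet", "Paul"),
  Account("1236", "Sebastin", "Arul")
)
// yield builds a new collection from the loop body
val list = for (a <- accounts) yield {
  s"""{"AccountNo":"${a.accountNo}","Name":"${a.firstName} ${a.lastName}"}"""
}
// mkString adds the surrounding brackets and the separators between elements
println(list.mkString("[\n", ",\n", "\n]"))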

Converting csv RDD to map

I have a large CSV (> 500 MB) which I load into a Spark RDD, and I want to store it in a large Map[String, Array[Long]].
The CSV has multiple columns, but I only need two for the time being: the first and the second. The data is of the form:
A 12312 [some_value] ....
B 123123[some_value] ....
A 1222 [some_value] ....
C 1231 [some_value] ....
I want my map to basically group by the string and store an array of longs, so for the above case my map would be:
{"A": [12312, 1222], "B": [123123], "C": [1231]}
But since this map would be huge, I can't simply do this directly.
I read the CSV into a sql.DataFrame.
My code so far (it looks incorrect, though):
def getMap(df: sql.DataFrame, sc: SparkContext): RDD[Map[String, Array[Long]]] = {
  var records = sc.emptyRDD[Map[String, Array[Long]]]
  val rows: RDD[Row] = df.rdd
  rows.foreachPartition( iter => {
    iter.foreach(x =>
      if (records.contains(x.get(0).toString)) {
        val arr = temp_map.getOrElse()
        records = records + (x.get(0).toString -> (temp_map.getOrElse(x.get(0).toString) :+ x.get(1).toString.toLong))
      }
      else {
        val arr = new Array[Long](1)
        arr(0) = x.get(1).toString.toLong
        records = records + (x.get(0).toString -> arr)
      }
    )
  })
}
Thanks in advance!
If I understood your question correctly, then you could groupBy the first column and collect_list the second column:
import org.apache.spark.sql.functions._
val newDF = df.groupBy("column1").agg(collect_list("column2"))
newDF.show(false)
val rdd = newDF.rdd.map(r => (r.getString(0), r.getAs[Seq[Long]](1)))
This gives you an RDD[(String, Seq[Long])] where each string key is unique.
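If the grouped data is small enough to fit in driver memory, a minimal follow-up sketch (not from the original answer, and assuming the rdd from above) to obtain the Map[String, Array[Long]] from the question could look like this:
// Only collect to the driver if the grouped data fits in memory
val resultMap: Map[String, Array[Long]] =
  rdd.mapValues(_.toArray)   // Seq[Long] -> Array[Long]
     .collectAsMap()
     .toMap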

Retrieve data by Ignoring null values and header row from csv file

I am working on a Groovy script in SoapUI 5.3.0 and facing the issue below while extracting values from a file into a list.
The purpose of the code below is to build a list that will then be compared against another list containing only valid values.
Attached are the code snippet and the sample CSV file for reference.
Code to retrieve the values:
def DBvalue = context["csvfile"]      //csv file containing the data
def count = context["dbrowcount"]     //here the rowcount is 23
for (i = 0; i < count; i++) {
    def lines = DBvalue.text.split('\n')
    List<String> rows = lines.collect { it.split(';') }
    log.info "list is " + rows
}
The sample CSV file I am working on contains 600 columns of data and 23 rows:
abc;null;1;2;3;5;8;null
cdf;null;2;3;6;null;5;6
hgf;null;null;null;jr;null;II
Currently my code is fetching the below output:
[[abc,null,1,2,3,5,8,null]]
[[abc,null,1,2,3,5,8,null]]
[[abc,null,1,2,3,5,8,null]]
Desired output:
[1,2,3,5,8]
[2,3,6,5,6]
[jr,II]
You should be able to achieve it with the code below; follow the in-line comments.
//Provide your file path; change if needed
def file = new File('/tmp/test.csv')

//To hold all the rows
def list = []

//Change delimiter if needed
def delimiter = ';'

file.readLines().eachWithIndex { line, index ->
    //Skip the header row (index 0)
    if (index) {
        //Get the row data by split, then filter out 'null' and empty values
        def lineData = line.split(delimiter).findAll { 'null' != it && it }
        log.info lineData
        list << lineData
    }
}

//Print all the row data
log.info list

Spark RDD to CSV - Add empty columns

I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, and to know all the column names I would need to union all the keys. Is there a way to avoid this collect operation (needed to know all the keys) and use just one rdd.saveAsTextFile(..) call to get the CSV?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction also. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have 0 value in all maps. So the problem becomes, I know that the keys are ("a", "b", "c") and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If your statement is "every new element in my RDD may add a new column name I have not seen so far", the answer is that you obviously can't avoid a full scan. But you don't need to collect all the elements on the driver.
You could use aggregate to collect only the column names. This method takes two functions: one inserts a single element into the resulting collection, and the other merges results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
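As an illustration (not part of the original answer), that second scan could render the CSV roughly like this, assuming columns is the Set[String] returned by the aggregate call above and that missing keys should become 0:
// Sketch only: `columns` is the Set[String] from rdd.aggregate above
val header = columns.toSeq.sorted                         // fix a stable column order
val body = rdd.map(m => header.map(c => m.getOrElse(c, 0)).mkString(","))
// Prepend the header row and write everything out as text (output path is hypothetical)
val headerRdd = rdd.sparkContext.parallelize(Seq(header.mkString(",")))
(headerRdd union body).saveAsTextFile("mycsv_output")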
Scala and any other supported language
You can use spark-csv
First, let's find all the present columns:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
import org.apache.spark.sql.Row

val rows = rdd.map { row =>
  Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
  cols.value.map(field => StructField(field, IntegerType, true)))
Convert RDD[Row] to Data Frame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and the final data fits in driver memory, you can use Pandas via the toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())
df = sqlContext.createDataFrame(
    rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().to_csv('mycsv.csv')
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)
Edit
One possible way to avoid the second collect is to use accumulators, either to build a set of all column names or to count the columns where you found zeros, and then use this information to map over the rows and remove the unnecessary columns or add the zeros.
It is possible but inefficient and feels like cheating. The only situation in which it makes some sense is when the number of zeros is very low, but I guess that is not the case here.
object ColsSetParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = {
    Set.empty[String]
  }
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = {
    s1 ++ s2
  }
}

val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { row => colSetAccum += row.keys.toSet }
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))

object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
  def zero(initialValue: Map[String, Int]): Map[String, Int] = {
    Map.empty[String, Int]
  }
  def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
    val keys = m1.keys ++ m2.keys
    keys.map(
      (k: String) => (k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))).toMap
  }
}

val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)

rdd.foreach { row =>
  // If allColnames.value -- row.keys.toSet is empty we can avoid this part
  accum += (allColnames.value -- row.keys.toSet).map(x => (x -> 1)).toMap
}
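As a hedged follow-up (not shown in the original answer), one way the accumulated counts could be used afterwards, assuming the accum, allColnames, and rdd defined above, is to drop the columns that turned out to be missing from every row and then emit the remaining ones:
// Sketch only: drop columns whose missing-row count equals the total row count
val rowCount = rdd.count()
val dropCols = accum.value.collect { case (col, missing) if missing == rowCount => col }.toSet
val header   = (allColnames.value -- dropCols).toSeq.sorted
val csvLines = rdd.map(row => header.map(c => row.getOrElse(c, 0)).mkString(","))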

Using lift-json, is there an easy way to extract and traverse a list?

I think I may be missing something fundamental about the lift-json XPath architecture. The smoothest way I've been able to extract and traverse a list is shown below. Can someone please show me a better technique?
class Example {
  @Test
  def traverseJsonArray() {
    def myOperation(kid: JObject) = println("kid=" + kid)
    val json = JsonParser.parse("""
      { "kids":[
        {"name":"bob","age":3},
        {"name":"angie","age":5}
      ]}
    """)
    val list = (json \\ "kids").children(0).children
    for (kid <- list) myOperation(kid.asInstanceOf[JObject])
  }
}
If at all possible you should upgrade to Lift JSON 2.3-M1 (http://www.scala-tools.org/repo-releases/net/liftweb/lift-json_2.8.1/2.3-M1/). It contains two important improvements, one of which affects the path expressions.
With 2.3, path expressions never return JFields; instead, the values of the JFields are returned directly. After that, your example would look like:
val list = (json \ "kids").children
for ( kid <- list ) myOperation(kid.asInstanceOf[JObject])
Lift JSON provides several styles for parsing values from JSON: path expressions, query comprehensions, and case class extractions. It is possible to mix and match these styles, and to get the best results we often do. For completeness' sake, I'll give you some variations of the above example to build a better intuition for these different styles.
// Collect all JObjects from 'kids' array and iterate
val JArray(kids) = json \ "kids"
kids collect { case kid: JObject => kid } foreach myOperation
// Yield the JObjects from 'kids' array and iterate over yielded list
(for (kid @ JObject(_) <- json \ "kids") yield kid) foreach myOperation
// Extract the values of 'kids' array as JObjects
implicit val formats = DefaultFormats
(json \ "kids").extract[List[JObject]] foreach myOperation
// Extract the values of 'kids' array as case classes
case class Kid(name: String, age: Int)
(json \ "kids").extract[List[Kid]] foreach println
// Query the JSON with a query comprehension
val ks = for {
  JArray(kids) <- json
  kid @ JObject(_) <- kids
} yield kid
ks foreach myOperation