How to evaluate a String to identify duplicate keys - JSON

I have an input string in Groovy which is not strictly JSON.
String str = "['OS_Node':['eth0':'1310','eth0':'1312']]"
My issue is to identify the duplicate "eth0". I tried to convert this into a map using Eval.me(), but that automatically removes the duplicate key "eth0" and gives me a Map.
What is the best way for me to identify the presence of a duplicate key?
Note: there could be multiple OS_Node1/2/3 entries; I need to identify duplicates in each of them.
Is there any JSON API that can be used, or do I need to use logic based on substring()?

One way to solve this could be to cheat a little: replace the colons with commas, which turns the maps into lists, and then do a recursive search for duplicates:
def str = "['OS_Node':['eth0':'1310','eth0':'1312'], 'OS_Node':['eth1':'1310','eth1':'1312']]"

// replacing ':' with ',' turns the map literals into list literals,
// so Eval.me can no longer collapse the duplicate keys
def tree = Eval.me(str.replaceAll(":", ","))
def dupes = findDuplicates(tree)

dupes.each { println it }

def findDuplicates(t, path = [], dupes = []) {
    def seen = [] as Set
    // the flattened list alternates key, value; collate(2) restores the pairs
    t.collate(2).each { k, v ->
        if (k in seen) dupes << [path: path + k]
        seen << k
        if (v instanceof List) findDuplicates(v, path + k, dupes)
    }
    dupes
}
When run, this prints:
─➤ groovy solution.groovy
[path:[OS_Node, eth0]]
[path:[OS_Node]]
[path:[OS_Node, eth1]]
i.e. the method finds all paths to duplicated keys, where "path" is the key sequence required to navigate to the duplicate key.
The function returns a list of maps which you can then process however you wish. Note that with this logic the top-level "OS_Node" key is itself treated as a duplicate, but you could easily filter that out after the function call, as in the sketch below.
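For instance, a minimal sketch of such a filter (assuming you only care about duplicates nested below the top level):

def leafDupes = dupes.findAll { it.path.size() > 1 }
leafDupes.each { println it }
// → [path:[OS_Node, eth0]] and [path:[OS_Node, eth1]]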

First of all, the string you have there is not JSON - not only due to the duplicate keys, but also because of the use of [] for maps. This looks a lot more like a Groovy map literal. So if this is your custom format and you cannot do anything about it, I'd write a small parser for it, because sooner or later edge cases or quoting problems will come around the corner.
@Grab("com.github.petitparser:petitparser-core:2.3.1")
import org.petitparser.tools.GrammarDefinition
import org.petitparser.tools.GrammarParser
import org.petitparser.parser.primitive.CharacterParser as CP
import org.petitparser.parser.primitive.StringParser as SP

class MappishGrammarDefinition extends GrammarDefinition {
    MappishGrammarDefinition() {
        define("start", ref("map"))
        define("map",
            CP.of("[" as Character)
                .seq(ref("kv-pairs"))
                .seq(CP.of("]" as Character))
                .map { it[1] })
        define("kv-pairs",
            ref("kv-pair")
                .plus()
                .separatedBy(CP.of("," as Character))
                .map { it.collate(2)*.first()*.first() })
        define("kv-pair",
            ref("key")
                .seq(CP.of(":" as Character))
                .seq(ref("val"))
                .map { [it[0], it[2]] })
        define("key",
            ref("quoted"))
        define("val",
            ref("quoted")
                .or(ref("map")))
        define("quoted",
            CP.anyOf("'")
                .seq(SP.of("\\'").or(CP.pattern("^'")).star().flatten())
                .seq(CP.anyOf("'"))
                .map { it[1].replace("\\'", "'") })
    }

    // Helper for `def`, which is a keyword in Groovy
    void define(s, p) { super.def(s, p) }
}

println(new GrammarParser(new MappishGrammarDefinition()).parse("['OS_Node':['eth0':'1310','eth0':'1312'],'OS_Node':['eth0':'42']]").get())
// → [[OS_Node, [[eth0, 1310], [eth0, 1312]]], [OS_Node, [[eth0, 42]]]]
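Since the parser keeps the duplicates as nested key/value pair lists, spotting them is just a matter of counting keys per level. A minimal sketch on top of the parse result (reportDupes is an illustrative helper, not part of petitparser):

def pairs = new GrammarParser(new MappishGrammarDefinition()).parse("['OS_Node':['eth0':'1310','eth0':'1312']]").get()

def reportDupes(kvs, path = []) {
    kvs.countBy { it[0] }.findAll { it.value > 1 }.each { k, n ->
        println "duplicate key ${path + k} ($n times)"
    }
    kvs.each { k, v -> if (v instanceof List) reportDupes(v, path + k) }
}

reportDupes(pairs)
// → duplicate key [OS_Node, eth0] (2 times)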

Related

How to use #CsvFileSource with records of arbitrary length

I want to achieve the following: in a csv file there are records (lines) with comma-separated values of arbitrary length, and I want to pass the first N (say, 3, but whatever) values to the parameterized test method as Strings, and the rest as some collection. That said, I want to achieve something like this:
class Tests {
    @DisplayName("Data Test")
    @ParameterizedTest(name = "{0} → {1}; {2} → {3}")
    @CsvFileSource(resources = ["/data.csv"], numLinesToSkip = 1)
    fun runTests(spec0: String, spec1: String, input: String, outputs: List<String>) {
        assertData(spec0, spec1, input, outputs)
    }
}
However, I actually don't know what is the best way to do it. The current workaround I'm using is to store the dynamic-length values as a single string with some separator and postprocess the last argument:
class Tests {
    @DisplayName("Data Test")
    @ParameterizedTest(name = "{0} → {1}; {2} → {3}")
    @CsvFileSource(resources = ["/data.csv"], numLinesToSkip = 1)
    fun runTests(spec0: String, spec1: String, input: String, outputs: String) {
        assertData(spec0, spec1, input, outputs.split('␞'))
    }
}
What would be the best (most idiomatic) way to achieve this? I just don't want to have the data in the csv file with this additional separator.
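One possibility (a sketch, not from the original thread; assuming JUnit Jupiter 5.x) is a custom ArgumentsAggregator: it receives the whole csv row through an ArgumentsAccessor, so it can collect every cell after the first three into a list and no extra separator is needed. TailAggregator is an illustrative name:

import org.junit.jupiter.api.DisplayName
import org.junit.jupiter.api.extension.ParameterContext
import org.junit.jupiter.params.ParameterizedTest
import org.junit.jupiter.params.aggregator.AggregateWith
import org.junit.jupiter.params.aggregator.ArgumentsAccessor
import org.junit.jupiter.params.aggregator.ArgumentsAggregator
import org.junit.jupiter.params.provider.CsvFileSource

// Collects every cell from index 3 onwards into a List<String>
class TailAggregator : ArgumentsAggregator {
    override fun aggregateArguments(accessor: ArgumentsAccessor, context: ParameterContext): Any =
        (3 until accessor.size()).map { accessor.getString(it) }
}

class Tests {
    @DisplayName("Data Test")
    @ParameterizedTest(name = "{0} → {1}; {2} → {3}")
    @CsvFileSource(resources = ["/data.csv"], numLinesToSkip = 1)
    fun runTests(
        spec0: String, spec1: String, input: String,
        @AggregateWith(TailAggregator::class) outputs: List<String>
    ) {
        assertData(spec0, spec1, input, outputs)
    }
}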

How do I search for a string in this JSON with Python

My JSON file looks something like:
{
    "generator": {
        "name": "Xfer Records Serum",
        ....
    },
    "generator": {
        "name": "Lennar Digital Sylenth1",
        ....
    }
}
I ask the user for a search term, and the input is searched for in the name key only; all matching results are returned. This means that if I input just 's', both of the above would be returned. Also, please explain how to return all the object names which are generators. The simpler the method, the better it will be for me. I use the json library, but if another library is required that's not a problem.
Before switching to JSON I tried XML but it did not work.
If your goal is just to search all name properties, this will do the trick:
import re

def search_names(term, lines):
    # match: optional whitespace, "name": "<anything><term><anything>", optional trailing comma
    name_search = re.compile(r'\s*"name"\s*:\s*"(.*' + term + r'.*)",?$', re.I)
    return [x.group(1) for x in [name_search.search(y) for y in lines] if x]

with open('path/to/your.json') as f:
    lines = f.readlines()

print(search_names('s', lines))
which would return both names you listed in your example.
The way the search_names() function works is that it builds a regular expression matching any line that starts with "name": (with a varying amount of whitespace), followed by your search term with any other characters around it, then terminated with " and an optional , at the end of the line. It applies that to each line from the file, filters out the non-matching lines, and returns the value of the name property (the capture-group contents) for each match.
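As an aside, duplicate keys are tolerated by Python's json parser, so if the file is otherwise syntactically valid JSON you can avoid the line-based regex entirely. A minimal sketch using the json library's object_pairs_hook (find_names is an illustrative name):

import json

def find_names(term, text):
    # Collect every "name" value containing term, even under duplicated keys.
    names = []

    def keep_pairs(pairs):
        # called for every JSON object, duplicates included
        names.extend(v for k, v in pairs
                     if k == "name" and term.lower() in str(v).lower())
        return dict(pairs)

    json.loads(text, object_pairs_hook=keep_pairs)
    return names

with open("path/to/your.json") as f:
    print(find_names("s", f.read()))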

(Python)How to process complicated json data

Please excuse my poor English.
Hello everyone, I am doing a project about a Facebook comment spider. I found the Facebook Graph GUI; it returns a JSON file that is very complicated for me. The JSON file includes many parts. I use json.loads to parse the JSON code, and it finally returns a dict, but I don't know how to access the values. For example, I want to get all the ids or comments, but I can only get the two top-level keys of the dict, "data" and "paging". So how can I get to the nested keys like "id" or "comment", and how do I process this complicated data?
Thank you very much.
Two ways I can think of: either you know what you're looking for and access it directly, or you loop over the keys, look at each key's value, and nest another loop until you reach the end of the tree.
You can do this using a self-calling function and with the appropriate usage of jQuery.
Here is an example:
function get_the_stuff(url)
{
    $.getJSON(url, function ( data ) {
        my_parser(data);
    });
}

function my_parser(node)
{
    $.each(node, function(key, val) {
        if ( val && typeof val == "object" ) { my_parser(val); }
        else { console.log("key=" + key + ", val=" + val); }
    });
}
I omitted all the error checking. Also make sure the typeof check is appropriate; you might need some additional else-ifs to treat numbers, strings, null or booleans in different ways. This is just an example.
EDIT: I might have slightly missed that the topic said "Python"... sorry. Feel free to translate to python, same principles apply.
EDIT2: Now let's try it in Python. I'm assuming your JSON has already been imported into a variable.
def my_parser(node, depth=0):
    if isinstance(node, list):
        for val in node:
            my_parser(val, depth + 1)
    elif isinstance(node, dict):
        for key in node:
            print("level=%i key=%s" % (depth, key))
            my_parser(node[key], depth + 1)
    elif isinstance(node, str):
        print("level=%i value=%s" % (depth, node))
    elif isinstance(node, int):
        print("level=%i value=%i" % (depth, node))
    else:
        print("level=%i value_unknown_type" % depth)

Groovy csv to string

I am using Dell Boomi to map data from one system to another. I can use Groovy in the maps but have no experience with it. I tried to do this with the other Boomi tools, but have been told that I'll need to use Groovy in a script. My inbound data is:
132265,Brown
132265,Gold
132265,Gray
132265,Green
I would like to output:
132265,"Brown,Gold,Gray,Green"
Hopefully this makes sense! Any ideas on the Groovy code to make this work?
It can be elegantly solved with groupBy and the spread operator:
@Grapes(
    @Grab(group='org.apache.commons', module='commons-csv', version='1.2')
)
import org.apache.commons.csv.*

def csv = '''132265,Brown
132265,Gold
132265,Gray
132265,Green'''

def parsed = CSVParser.parse(csv, CSVFormat.DEFAULT.withHeader('code', 'color'))
// group the records by code, then join each group's colours via the spread operator
parsed.records.groupBy { it.code }.each { k, v ->
    println "$k,\"${v*.color.join(',')}\""
}
The above prints:
132265,"Brown,Gold,Gray,Green"
Well, I don't know how you are getting your data, but here is a general way to achieve your goal. You can use a library such as the one below to parse the csv:
https://github.com/xlson/groovycsv
The example for your data would be:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv

def csv = '''132265,Brown
132265,Gold
132265,Gray
132265,Green'''

// readFirstLine: true makes groovycsv treat the first line as data rather than as a header
def data = parseCsv(csv, readFirstLine: true)
I believe you want to associate the number with the various colour values. So for each line you can add the number (column 0) and the colour (column 1) to a map of lists:
map = [:]
for (line in data) {
    number = line[0]
    colour = line[1]
    if (!map[number])
        map[number] = []
    map[number].add(colour)
}
println map
So map should contain:
[132265:["Brown","Gold","Gray","Green"]]
Well, if this is not exactly what you want, you can extract the general idea.
Assuming your data is coming in as a comma separated string of data like this:
"132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"
The following Groovy script code should do the trick.
def csvString = "132265,Brown 132265,Gold 132265,Gray 132265,Green 122222,Red 122222,White"

// add a multiPut method to LinkedHashMap (the type behind [:] literals)
// that appends values to a per-key list
LinkedHashMap.metaClass.multiPut << { key, value ->
    delegate[key] = delegate[key] ?: []
    delegate[key] += value
}

def map = [:]
def csv = csvString.split().collect { entry -> entry.split(",") }
csv.each { entry -> map.multiPut(entry[0], entry[1]) }
def result = map.collect { k, v -> k + ',"' + v.join(",") + '"' }.join("\n")

println result
Would print:
132265,"Brown,Gold,Gray,Green"
122222,"Red,White"
Do you HAVE to use scripting for some reason? This can be easily accomplished with out-of-the-box Boomi functionality.
Create a map function that prepends the ID field to a string of your choice (i.e. 222_concat_fields), then use that value to set a dynamic process property.
The value of the process property will then contain the result of concatenating the name fields. Simply adding this function to your map should take care of it; then use the final value to populate your result.
Well, it depends on how the data is coming in.
If the data which you have posted in the question is coming in a single document, then you can easily handle this in a map with Groovy scripting.
If the data is coming in as multiple documents, i.e.
doc1: 132265,Brown
doc2: 132265,Gold
doc3: 132265,Gray
doc4: 132265,Green
then it cannot be handled in a map; you will need to use a Data Process step with custom scripting.
The Groovy code you are asking for depends on the input profile in which you are getting the data. Please provide more information, i.e. input profile, fields etc.

Spark RDD to CSV - Add empty columns

I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, and to know all the column names I would need to union all the keys. Is there a way to avoid this collect operation and use just one rdd.saveAsTextFile(..) to get the csv?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction as well. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have a 0 value in all maps. So the problem becomes: I know that the keys are ("a", "b", "c"), and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If your statement is "every new element in my RDD may add a new column name I have not seen so far", the answer is that you obviously can't avoid a full scan. But you don't need to collect all the elements on the driver.
You could use aggregate to collect only the column names. This method takes two functions: one inserts a single element into the resulting collection, and the other merges results from two different partitions.
rdd.aggregate(Set.empty[String])( {(s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
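A minimal sketch of that second scan (assuming the set returned by aggregate is in colNames and that missing keys count as 0):

val cols = colNames.toSeq.sorted                     // stable column order
val lines = rdd.map(m => cols.map(c => m.getOrElse(c, 0)).mkString(","))
lines.saveAsTextFile("mycsv")
// prepend the header cols.mkString(",") when assembling the final file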
Scala and any other supported language
You can use spark-csv
First let's find all the present columns:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
import org.apache.spark.sql.Row

val rows = rdd.map { row =>
    Row.fromSeq(cols.value.map { row.getOrElse(_, 0) })
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val schema = StructType(
    cols.value.map(field => StructField(field, IntegerType, true)))
Convert RDD[Row] to Data Frame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and the final data fits in driver memory, you can use Pandas through the toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())

df = sqlContext.createDataFrame(
    rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().to_csv('mycsv.csv', index=False)
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)
Edit
One possible way to avoid the second collect is to use accumulators: either to build a set of all the column names, or to count the rows where you found zeros, and then use this information to map over the rows and remove unnecessary columns or add zeros.
It is possible but inefficient and feels like cheating. The only situation when it makes some sense is when the number of zeros is very low, but I guess that is not the case here.
import org.apache.spark.AccumulatorParam

object ColsSetParam extends AccumulatorParam[Set[String]] {
    def zero(initialValue: Set[String]): Set[String] = Set.empty[String]
    def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}

val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { m => colSetAccum += m.keys.toSet }
or
// We assume you know the column names upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))

object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
    def zero(initialValue: Map[String, Int]): Map[String, Int] = Map.empty[String, Int]
    def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
        val keys = m1.keys ++ m2.keys
        keys.map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap
    }
}

val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)

rdd.foreach { row =>
    // If allColnames.value -- row.keys.toSet is empty we can avoid this part
    accum += (allColnames.value -- row.keys.toSet).map(x => x -> 1).toMap
}
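Back on the driver, the columns whose miss-count equals the number of rows are the all-zero ones (under the assumption that a zero only ever appears as a missing key):

val n = rdd.count()
val zeroCols = allColnames.value.filter(c => accum.value.getOrElse(c, 0) == n)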