Compare two Json files using Apache Spark - json

I am new to Apache Spark and I am trying to compare two json files.
My requirement is to find out which keys/values have been added, removed, or modified, and what their paths are.
To explain my problem, I am sharing the code I have tried with a small JSON sample here.
Sample Json 1 is:
{
  "employee": {
    "name": "sonoo",
    "salary": 57000,
    "married": true
  }
}
Sample Json 2 is:
{
  "employee": {
    "name": "sonoo",
    "salary": 58000,
    "married": true
  }
}
My code is:
//Compare two multiline json files
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Load first json file
val jsonData_1 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_1.json").values)
//Load second json file
val jsonData_2 = sqlContext.read.json(sc.wholeTextFiles("D:\\File_2.json").values)
//Compare both json files
jsonData_2.except(jsonData_1).show(false)
The output which I get on executing this code is:
+--------------------+
|employee |
+--------------------+
|{true, sonoo, 58000}|
+--------------------+
But only one field (salary) was modified here, so the output should contain just the updated field together with its path.
Below is the expected output:
[
  {
    "op" : "replace",
    "path" : "/employee/salary",
    "value" : 58000
  }
]
Can anyone point me in the right direction?

Assuming each JSON has an identifier, and that you have two JSON groups (e.g. folders), you need to compare between the JSONs in the two groups:
Load the JSONs from each group into a DataFrame, providing a schema matching the structure of the JSON. After this, you have two DataFrames.
Compare the JSONs (now rows in a DataFrame) by joining on the identifiers and looking for mismatched values.
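For illustration, here is a minimal sketch of that idea, written in PySpark for brevity (the equivalent DataFrame operations exist in the Scala API you are using). The folder paths and the "id" column name are assumptions about your data, and the schema is inferred here rather than supplied explicitly:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Each folder holds one group of (multiline) JSON documents.
df1 = spark.read.option("multiLine", True).json("D:/json_group_1/")
df2 = spark.read.option("multiLine", True).json("D:/json_group_2/")

# Join the two groups on the (assumed) identifier column so matching documents line up.
compare_cols = [c for c in df1.columns if c != "id"]
joined = df1.alias("a").join(df2.alias("b"), on="id")

# Keep only the documents where at least one field differs
# (<=> is Spark SQL's null-safe equality).
changed = joined.where(" OR ".join(f"NOT (a.{c} <=> b.{c})" for c in compare_cols))

# Show old and new values side by side.
changed.select(
    "id",
    *[F.col(f"a.{c}").alias(f"{c}_old") for c in compare_cols],
    *[F.col(f"b.{c}").alias(f"{c}_new") for c in compare_cols],
).show(truncate=False)

From there you can reshape the mismatching columns into the "op"/"path"/"value" layout shown in your expected output.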

Related

How to traverse through all elements of json using robotframework and fetch only those that match the query

I am trying to extract a node from a json file for a json element that matches another node in the same element.
To be more specific, I want the names of all students in the sample json below who have "certified":"false".
Example JSON
{
  "Students": [
    {
      "name": "John",
      "Rank": "1",
      "certified": "false"
    },
    {
      "name": "Ashley",
      "Rank": "5",
      "certified": "true"
    }
  ]
}
The code I am using is (it gives me empty output):
Library    JSONLibrary

JSON_Verification
    [Documentation]    Testing JSON load logic
    ${metadataJson_object}=    Load JSON From File    ../TestData/sample.json
    Log    ${metadataJson_object}
    @{studentName}=    Get Value From Json    ${metadataJson_object}    "$..[?(@.certified=='false')]@.name"
    Log    @{studentName}
Indeed you were very close to it; the @ notation is not needed after the filter. Just change the JSON path to:
$.Students[?(@.certified=='false')].name
Here:
$ -> root element
. -> child operator, used to access a property
?() -> filter expression
@ -> current node
${json}=    Convert String to JSON    ${Getjson}
${name}=    Get Value From Json    ${json}    $.Students[?(@.certified=='false')].name

Map nested JSON in Azure Data Factory to raw object

Since ADF (Azure Data Factory) isn't able to handle complex/nested JSON objects, I'm using OPENJSON in SQL to parse the objects. But, I can't get the 'raw' JSON from the following object:
{
  "rows": [
    {
      "name": "Name1",
      "attribute1": "attribute1",
      "attribute2": "attribute2"
    },
    {
      "name": "Name2",
      "attribute1": "attribute1",
      "attribute2": "attribute2"
    },
    {
      "name": "Name3",
      "attribute1": "attribute1",
      "attribute2": "attribute2"
    }
  ]
}
Config 1
When I use this config:
I get all the names listed
Name1
Name2
Name3
Result:
Config 2
When I use this config:
I get the whole JSON in one record:
[ {{full JSON}} ]
Result:
Needed config
But, what I want, is this result:
{ "name":"Name1", "attribute1":"attribute1", "attribute2":"attribute2 }
{ "name":"Name2", "attribute1":"attribute1", "attribute2":"attribute2 }
{ "name":"Name3", "attribute1":"attribute1", "attribute2":"attribute2 }
Result:
So, I need the iteration of Config 1, but with the raw JSON per row. Every time I use $['rows'] or $['rows'][0], it seems to 'forget' to iterate.
Anyone?
Have you tried Data Flows to handle JSON structures? We have that feature built-in with data flow transformations like derived column, flatten, and sink mapping.
The Copy activity can help us achieve it.
For example, I copy B.json from the container "backup" to another blob container, "testcontainer".
This is my B.json source dataset:
Source:
Sink:
Mapping:
Pipeline executed successfully:
Check the data in testcontainer:
Hope this helps.
Update:
Copy the nested json to SQL.
Source is the same B.json in blob.
Sink dataset:
Sink:
Mapping:
Run pipeline:
Check the data in SQL database:

How to write a splittable DoFn in python - convert json to ndjson in apache beam

I have a large dataset in GCS in json format that I need to load into BigQuery.
The problem is that the json data is not stored as NDJSON but rather in a few large json files, where each top-level key in the JSON should really become a field of its own record.
For example - the following Json:
{
  "johnny": {
    "type": "student"
  },
  "jeff": {
    "type": "teacher"
  }
}
should be converted into
[
  {
    "name": "johnny",
    "type": "student"
  },
  {
    "name": "jeff",
    "type": "teacher"
  }
]
I am trying to solve it via Google Dataflow and Apache Beam, but the performance is terrible since each "worker" has to do a lot of work:
import json
import apache_beam as beam

class JsonToNdJsonDoFn(beam.DoFn):
    def __init__(self, pk_field_name):
        self.__pk_field_name = pk_field_name

    def process(self, line):
        for key, record in json.loads(line).items():
            record[self.__pk_field_name] = key
            yield record
I know that this can be solved somehow by implementing it as a Splittable DoFn, but the implementation example in Python is not really clear. How should I build this DoFn as splittable, and how will it be used as part of the pipeline?
You need a way to specify a partial range of the json file to process. It could be a byte range, for example.
The Avro example in the blog post is a good one. Something like:
class MyJsonReader(DoFn):
    def process(self, filename, tracker=DoFn.RestrictionTrackerParam):
        with fileio.ChannelFactory.open(filename) as file:
            start, stop = tracker.current_restriction()
            # Seek to the first record starting at or after the start offset.
            file.seek(start)
            next_record_start = find_next_record(file, start)
            while next_record_start is not None:
                # Claim the position of the current record.
                if not tracker.try_claim(next_record_start):
                    # Out of range of the current restriction - we're done.
                    return
                # start will point to the end of the record that was read.
                record, start = read_record(file, next_record_start)
                yield record
                # Look for the start of the next record.
                next_record_start = find_next_record(file, start)

    def get_initial_restriction(self, filename):
        return (0, fileio.ChannelFactory.size_in_bytes(filename))
However, json doesn't have clear record boundaries, so if your work has to start at byte 548, there's no clear way of telling how far to shift. If the file is literally what you have there, then you can skip bytes until you see the pattern "<string>": {, and then read the json object starting at the {.
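As a rough illustration of that last idea, the hypothetical find_next_record helper used in the sketch above could scan forward in chunks for the next "key": { boundary. The regular expression and chunk size below are assumptions, not a tested implementation:

import re

# Matches a quoted key followed by ':' and '{', i.e. the start of one record
# in a file shaped like the example in the question (only record keys have
# object values there, so this pattern only fires at record boundaries).
RECORD_START = re.compile(rb'"(?:[^"\\]|\\.)*"\s*:\s*\{')

def find_next_record(file, start, chunk_size=64 * 1024):
    """Return the byte offset of the first record starting at or after `start`,
    or None if there is no further record."""
    file.seek(start)
    buf = b""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            return None
        buf += chunk
        match = RECORD_START.search(buf)
        if match:
            # buf begins at `start`, so the match offset is relative to it.
            return start + match.start()

read_record would then parse one balanced {...} object beginning at that offset (for example by tracking brace depth) and report where it ended.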

Dynamically build json using groovy

I am trying to dynamically build some JSON based on data I retrieve from a database. Everything up to the opening '[' is the "root", I guess you could say. The next parts, with name and value, are dynamic and will be based on the number of results I get from the db. I query the db and then the idea was to iterate through the result, adding to the JSON. Can I use JsonBuilder for the root section and then loop with JsonSlurper to add each additional section? Most of the examples I have seen deal with a root and then a one-time "slurp" and then joining the two, so I wasn't sure if I should try a different method for looping and appending multiple sections.
Any tips would be greatly appreciated. Thanks.
{
  "hostname": "$hostname",
  "path": "$path",
  "extPath": "$extPath",
  "appName": "$appName",
  "update": {"parameter": [
    {
      "name": "$name",
      "value": "$value"
    },
    {
      "name": "$name",
      "value": "$value"
    }
  ]}
}
EDIT: So what I ended up doing was just using StringBuilder to create the initial block and then appending the subsequent sections. Maybe not the most graceful way to do it, but it works!
//Create the json string
StringBuilder json = new StringBuilder("""{
    "hostname": "$hostname",
    "path": "$path",
    "extPath": "$extPath",
    "appName": "$appName",
    "update": {"parameter": ["""
)
//Append one section per row returned by the query
sql.eachRow("""<query>""",
    { params ->
        json.append("""{ "name": "${params.name}", "value": "${params.value}" },""")
    }
)
//Add closing json tags
json.append("""]}}""")
If I got your explanation correctly and if the data is not very big (it can live in memory), I'd build a Map object (which is very easy to work with in groovy) and convert it to JSON afterwards. Something like this:
import groovy.json.JsonOutput

def data = [
    hostname: hostname,
    path: path,
    extPath: extPath,
    appName: appName,
    update: [parameter: []]
]
sql.eachRow(sqlStr) { row ->
    data.update.parameter << [name: row.name, value: row.value]
}
println JsonOutput.toJson(data)
If you're using Grails and Groovy you can utilize grails.converters.JSON.
First, define a JSON named config:
JSON.createNamedConfig('person') {
    it.registerObjectMarshaller(Person) { Person person ->
        def output = [:]
        output['name'] = person.name
        output['address'] = person.address
        output['age'] = person.age
        output
    }
}
This will result in a statically defined named configuration for the Object type of person. Now, you can simply call:
JSON.use('person') {
    Person.findAll() as JSON
}
This will return every person in the database with their name, address and age all in one JSON request. I don't know if you're using grails as well in this situation though, for pure Groovy go with another answer here.

Groovy compare two json with unknown nodes names and values

I have a REST API to test and I have to compare two json responses. Below you can find the structure of the file. Both files to compare should contain the same elements, but the order might be different. Unfortunately the names, the type (simple, array) and the number of keys (root, nodeXYZ) are also not known.
{"root": [{
"node1": "value1",
"node2": "value1",
"node3": [
{
"node311": "value311",
"node312": "value312"
},
{
"node321": "value321",
"node322": "value322"
}
],
"node4": [
{
"node411": "value411",
"node412": "value413",
"node413": [ {
"node4131": "value4131",
"node4132": "value4131"
}],
"node414": []
}
{
"node421": "value421",
"node422": "value422",
"node423": [ {
"node4231": "value4231",
"node4232": "value4231"
}],
"node424": []
}]
"node5": [
{"node51": "value51"},
{"node52": "value52"},
]
}]}
I have found some useful information in
Groovy - compare two JSON objects (same structure) and return ArrayList containing differences
Getting node from Json Response
Groovy : how do i search json with key's value and find its children in groovy
but I could not combine them into a solution.
I thought the solution might look like this:
take root
get root children names
check if child has children and get their names
do it down to the lowest level child
With all names in place comparing should be easy (I guess)
Unfortunately, I did not manage to get the keys under root.
Just compare the slurped maps:
import groovy.json.JsonSlurper

def map1 = new JsonSlurper().parseText(document1)
def map2 = new JsonSlurper().parseText(document2)
assert map1 == map2
Try the JSONAssert library: https://github.com/skyscreamer/JSONassert. Then you can use:
JSONAssert.assertEquals(expectedJson, actualJson, JSONCompareMode.STRICT)
And you will get nicely formatted deltas like:
java.lang.AssertionError: Resources.DbRdsLiferayInstance.Properties.KmsKeyId
Expected: kms-key-2
got: kms-key