I have a problem handling JSON data from different sources. My plan is to use JSON-LD and store the data from each source as RDF so that I can analyze it. But I don't know how to turn regular JSON into JSON-LD correctly. For example, I don't know how to choose the correct context for the JSON-LD object.
In my project, each source contains information about infrastructure configuration. This information can be extracted in JSON format, but each source has a different structure.
In the following example you can see how I try to use "pyld" and "rdflib" to turn a JSON object into a Graph, but the output is not as expected:
Side note: my question was reported as spam by Stack Overflow when I used a URL as the IRI, even the example URL used in most examples. So if you want to run this example, you have to replace <unique_iri> with a real URL to make it work.
Example
import json
from pyld import jsonld
from rdflib import Graph

# JSON data
nodes = [
    {
        "sysid": "vm_remote",
        "type": "vm",
        "name": "remote",
        "config": {
            "id": "worker_1",
            "cpu": "2",
        },
        "connect": [
            "db_users"
        ]
    },
    {
        "sysid": "db_users",
        "type": "db",
        "name": "users",
        "config": {
            "id": "database_1",
            "location": "eu_west",
        }
    }
]

# Define the context for the JSON-LD object
context = {
    "@version": 1.1,
    "@base": "<unique_iri>/team_name/",
    "@vocab": "<unique_iri>/resources/onprem/",
    "sysid": "@id",
    "type": "@type",
    "config": {
        "@id": "config",
        "@context": {
            "@base": "<unique_iri>/team_name/config/"
        }
    },
    "connect": {"@id": "relation#connect", "@type": "@id", "@container": "@set"}
}

doc = {
    "@context": context,
    "@graph": nodes,
    "@id": "graph",
    "@type": "graph"
}

print("\nInput JSON-LD:\n" + json.dumps(doc, indent=2))

# Expand the document, then load the expanded form into an rdflib graph
expended_data = jsonld.expand(doc)
print("\n\Expanded JSON-LD:\n" + json.dumps(expended_data, indent=2))

graph = Graph().parse(data=json.dumps(expended_data), format='json-ld')
print("\nRDF Graph:\n" + graph.serialize(format='json-ld'))

# Find the type of each entry (a resource)
q = """
PREFIX resources: <<unique_iri>/resources/onprem/>
SELECT DISTINCT ?type
WHERE
{
    ?s resources:type ?type .
}
"""
print()
for row in graph.query(q):
    print("Type: %s" % row)
Output
Input JSON-LD:
{
  "@context": {
    "@version": 1.1,
    "@base": "<unique_iri>/team_name/",
    "@vocab": "<unique_iri>/resources/onprem/",
    "sysid": "@id",
    "type": "@type",
    "config": {
      "@id": "config",
      "@context": {
        "@base": "<unique_iri>/team_name/config/"
      }
    },
    "connect": {
      "@id": "relation#connect",
      "@type": "@id",
      "@container": "@set"
    }
  },
  "@graph": [
    {
      "sysid": "vm_remote",
      "type": "vm",
      "name": "remote",
      "config": {
        "id": "worker_1",
        "cpu": "2"
      },
      "connect": [
        "db_users"
      ]
    },
    {
      "sysid": "db_users",
      "type": "db",
      "name": "users",
      "config": {
        "id": "database_1",
        "location": "eu_west"
      }
    }
  ],
  "@id": "graph",
  "@type": "graph"
}
\Expanded JSON-LD:
[
  {
    "@graph": [
      {
        "<unique_iri>/resources/onprem/config": [
          {
            "<unique_iri>/resources/onprem/cpu": [
              {
                "@value": "2"
              }
            ],
            "<unique_iri>/resources/onprem/id": [
              {
                "@value": "worker_1"
              }
            ]
          }
        ],
        "<unique_iri>/resources/onprem/relation#connect": [
          {
            "@id": "db_users"
          }
        ],
        "<unique_iri>/resources/onprem/name": [
          {
            "@value": "remote"
          }
        ],
        "@id": "vm_remote",
        "@type": [
          "<unique_iri>/resources/onprem/vm"
        ]
      },
      {
        "<unique_iri>/resources/onprem/config": [
          {
            "<unique_iri>/resources/onprem/id": [
              {
                "@value": "database_1"
              }
            ],
            "<unique_iri>/resources/onprem/location": [
              {
                "@value": "eu_west"
              }
            ]
          }
        ],
        "<unique_iri>/resources/onprem/name": [
          {
            "@value": "users"
          }
        ],
        "@id": "db_users",
        "@type": [
          "<unique_iri>/resources/onprem/db"
        ]
      }
    ],
    "@id": "graph",
    "@type": [
      "<unique_iri>/resources/onprem/graph"
    ]
  }
]
RDF Graph:
[
  {
    "@id": "file:///C:...",
    "@type": [
      "<unique_iri>/resources/onprem/graph"
    ]
  }
]
Type: <unique_iri>/resources/onprem/graph
The graph is missing the nodes, and I don't understand what I am doing wrong.
I am also not sure how to deal with the config nodes. These should be nodes with their own unique identifiers, since other data sources will point to them too.
Also, these Python libraries give me different results from the online JSON-LD playground.
Can somebody please help me?
I have a Kafka message like the one below, and I'm trying to read data from it with JSONPath. However, I'm having trouble reading some of the attributes. Here is the sample message.
sample1:
{
"header": {
"bu": "google",
"id": "12345",
"bum": "google",
"originTimestamp": "2021-10-09T15:17:09.842+00:00",
"batchSize": "0",
"jobType": "Batch"
},
"payload": {
"derivationdetails": {
"Id": "6783jhvvh897u31y283y",
"itemid": "1234567",
"batchid": 107,
"attributes": {
"itemid": "1234567",
"lineNbr": "1498",
"cat": "5929",
"Id": "6783jhvvh897u31y283y",
"indicator": "false",
"subcat": "3514"
},
"Exception": {
"values": [
{
"type": "PICK",
"value": "blocked",
"Reason": [
"RULE"
],
"rules": [
"439"
]
}
],
"rulesBagInfo": [
{
"Idtype": "XXXX",
"uniqueid": "7889423rbhevfhjaufdyeuiryeukjbdafvjd",
"rulesMatch": [
"439"
]
}
]
}
}
}
}
sample2: the same message, but note the difference in "payload":
{
"header": {
"bu": "google",
"id": "12345",
"bum": "google",
"originTimestamp": "2021-10-09T15:17:09.842+00:00",
"batchSize": "0",
"jobType": "Batch"
},
"payload": {
"Id": "6783jhvvh897u31y283y",
"itemid": "1234567",
"batchid": 107,
"attributes": {
"itemid": "1234567",
"lineNbr": "1498",
"cat": "5929",
"Id": "6783jhvvh897u31y283y",
"indicator": "false",
"subcat": "3514"
},
"Exception": {
"values": [
{
"type": "PICK",
"value": "blocked",
"Reason": [
"RULE"
],
"rules": [
"439"
]
}
],
"rulesBagInfo": [
{
"Idtype": "XXXX",
"uniqueid": "7889423rbhevfhjaufdyeuiryeukjbdafvjd",
"rulesMatch": [
"439"
]
}
]
}
}
}
As you can see, the message sometimes has "derivationdetails" and sometimes doesn't. Regardless of whether it exists, I need to read the values of id, itemid and batchid. I tried using
$.payload[*].id
$.payload[*].itemid
$.payload[*].batchid
But I see that batchid returns null even though it has a value in the message, and the fields under "attributes" return null when I use the expressions above. For fields under "attributes" I'm using this (example):
$.payload.attributes.itemId
And I have no idea how to read the part below.
"Exception": {
"values": [
{
"type": "PICK",
"value": "blocked",
"Reason": [
"RULE"
],
"rules": [
"439"
]
}
],
"rulesBagInfo": [
{
"Idtype": "XXXX",
"uniqueid": "7889423rbhevfhjaufdyeuiryeukjbdafvjd",
"rulesMatch": [
"439"
]
I'm new to this and need some suggestions on how to read the attributes properly. Any help would be much appreciated. Thanks
Use .. (recursive descent, also called deep scan; JSONPath borrows this syntax from E4X) to get the values. Note that it returns a list, which will contain several items if the same key occurs more than once at different depths.
The JSONPath expressions below each return a list with one item for both sample1 and sample2:
$.payload..attributes.Id
$.payload..attributes.itemid
$.payload..batchid
$.payload..Exception
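To make the behaviour of the deep scan concrete, here is a minimal sketch in plain Python. It does not use a JSONPath library; `deep_find` is a hypothetical helper that only imitates what `..key` does, and the two samples are trimmed-down versions of the messages from the question.

```python
def deep_find(node, key):
    """Yield every value stored under `key` at any depth (imitates JSONPath `..key`)."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from deep_find(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from deep_find(item, key)


# Trimmed-down payload body shared by both samples
body = {
    "Id": "6783jhvvh897u31y283y",
    "itemid": "1234567",
    "batchid": 107,
    "attributes": {"itemid": "1234567", "lineNbr": "1498"},
}
sample1 = {"payload": {"derivationdetails": body}}  # nested variant
sample2 = {"payload": body}                         # flat variant

for sample in (sample1, sample2):
    payload = sample["payload"]
    # Like $.payload..batchid
    batch_ids = list(deep_find(payload, "batchid"))
    # Like $.payload..attributes.itemid: locate "attributes" first, then read the key
    attr_item_ids = [a["itemid"] for a in deep_find(payload, "attributes")]
    # A bare deep scan for "itemid" matches both the outer and the nested key
    all_item_ids = list(deep_find(payload, "itemid"))
    print(batch_ids, attr_item_ids, all_item_ids)
```

Both samples give the same results, which is the point of the deep scan: the optional "derivationdetails" level no longer matters. The last list has two entries because "itemid" appears twice, which is why the expressions above go through "attributes" when a single value is wanted.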
I am writing a JSON schema to validate tree data.
The schema consists of a top-level root with blocks below it.
A block may contain further blocks.
Schema for validation:
schema = {
"$schema": "http://json-schema.org/draft-04/schema",
"$ref": "#/definitions/root",
"definitions":{
"root": {
"properties": {
"name": {
"type": "string"
},
"children": {
"type": "array",
"items": [
{"$ref":"#/definitions/block"}
]
}
},
"required": ["name", "children"]
},
"block": {
"properties": {
"name": {
"type": "string"
},
"children": {
"type": "array",
"items": [
{"$ref":"#/definitions/block"}
]
}
},
"required": ["name"]
}
}
}
Below is incorrect data for testing. The innermost node has no name property.
{
"name": "group8",
"children": [
{
"name": "group7",
"children": [
{
"name": "group6",
"children": [
{
"name": "group5",
"children": [
{ ###### wrong
"children": []
}
]
}
]
}
]
}
]
}
Validation correctly rejects this data, but it does not work on a slightly more complex tree.
# Error: ValidationError: file /home/gulliver/.local/lib/python2.7/site-packages/jsonschema/validators.py line 934: 'name' is a required property #
{
"name": "group8",
"children": [
{
"name": "group7",
"children": [
{
"name": "group6",
"children": [
{
"name": "group12",
"children": [
{
"name": "group11",
"children": [
{
"name": "group10",
"children": []
}
]
}
]
},
{
"name": "group9",
"children": [
{
"name": "group5",
"children": [
{ ####### wrong
"children": []
}
]
}
]
}
]
}
]
},
{
"name": "group13",
"children": [
{
"name": "null1",
"children": []
}
]
}
]
}
Validation does not fail here, even though the data at the bottom of the tree is invalid.
My guess is that this happens because the branch splits. Does anyone know why, or how to fix it?
I tested using Python and jsonschema.
When items is an array, each subschema in it is applied to the element at the same index in the instance array.
For example, where you define...
"items": [
{"$ref":"#/definitions/block"}
]
only the first item in the array will be tested. It has nothing to do with deep nesting. For example, the following data is valid according to your schema...
{
"name": "group8",
"children": [
{
"name": "group7"
},
{
"something": "else",
"Not": "name"
}
]
}
(Live demo: https://jsonschema.dev/s/etFGE)
If you modify your use of items, then it will work like you expect:
"items": {"$ref":"#/definitions/block"}
(do this for both uses)
Live demo: https://jsonschema.dev/s/rk1OD
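The difference between the two forms of items can be sketched in plain Python. This is not the jsonschema library; `missing_names` is a hypothetical helper that only checks the required "name" property and imitates how children are selected for validation under each form. The tree is a shortened version of the complex example from the question.

```python
def missing_names(node, items_is_array, path="$"):
    """Return the paths of all checked nodes that lack a 'name' property."""
    errors = []
    if "name" not in node:
        errors.append(path)
    for i, child in enumerate(node.get("children", [])):
        # Array form: only index 0 has a subschema, so later siblings are never checked
        if items_is_array and i > 0:
            continue
        errors += missing_names(child, items_is_array, f"{path}.children[{i}]")
    return errors


bad = {"children": []}  # node without "name"
tree = {"name": "group8", "children": [
    {"name": "group7", "children": [
        {"name": "group6", "children": [
            {"name": "group12", "children": []},
            {"name": "group9", "children": [
                {"name": "group5", "children": [bad]},
            ]},
        ]},
    ]},
    {"name": "group13", "children": []},
]}

print(missing_names(tree, items_is_array=True))   # "items": [ {...} ]
print(missing_names(tree, items_is_array=False))  # "items": {...}
```

In the array form, "group9" is the second child of "group6", so nothing below it is ever looked at and the invalid node goes unnoticed. In the single-schema form every child is checked and the invalid node is reported.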
I have a single table in a database, like this: database table. I want to look up a child in the database and return hierarchical JSON to the front end in order to build a tree. How can I do that in Flask?
My expected JSON format is like this: expected JSON
Since you have tagged your question with flask, this post assumes you are using Python as well. To format your database values as a JSON string, you can query the database and then use recursion:
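import sqlite3, collections, json

# Load all rows; each row is assumed to hold one path through the tree, one level per column
d = list(sqlite3.connect('file.db').cursor().execute("select * from values"))

def get_tree(vals):
    # Group the remaining columns of each row by the row's first column
    _d = collections.defaultdict(list)
    for a, *b in vals:
        _d[a].append(b)
    # Add 'children' only when a non-empty remainder exists (the walrus operator needs Python 3.8+)
    return [{'name':a, **({} if not (c:=list(filter(None, b))) else {'children':get_tree(b)})} for a, b in _d.items()]

print(json.dumps(get_tree(d), indent=4))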
Output:
[
{
"name": "AA",
"children": [
{
"name": "BB",
"children": [
{
"name": "EE",
"children": [
{
"name": "JJ",
"children": [
{
"name": "EEV"
},
{
"name": "FFW"
}
]
},
{
"name": "KK",
"children": [
{
"name": "HHX"
}
]
}
]
}
]
},
{
"name": "CC",
"children": [
{
"name": "FF",
"children": [
{
"name": "LL",
"children": [
{
"name": "QQY"
}
]
},
{
"name": "MM",
"children": [
{
"name": "RRV"
}
]
}
]
},
{
"name": "GG",
"children": [
{
"name": "NN",
"children": [
{
"name": "SSW"
}
]
}
]
}
]
},
{
"name": "DD",
"children": [
{
"name": "HH",
"children": [
{
"name": "OO",
"children": [
{
"name": "TTZ"
}
]
}
]
},
{
"name": "II",
"children": [
{
"name": "PP",
"children": [
{
"name": "UUW"
}
]
}
]
}
]
}
]
}
]
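For readers who find the one-line comprehension hard to follow, here is a more explicit sketch of the same recursion. It runs on a small in-memory list instead of a database; the rows and the name `build_tree` are made up for illustration, and it assumes, as the code above does, that each row lists one path from the root down, one level per column.

```python
import json
from collections import defaultdict


def build_tree(rows):
    """Group rows by their first column and recurse on the remaining columns."""
    groups = defaultdict(list)
    for first, *rest in rows:
        groups[first].append(rest)

    tree = []
    for name, tails in groups.items():
        node = {"name": name}
        tails = [t for t in tails if t]  # drop paths that have no columns left
        if tails:
            node["children"] = build_tree(tails)
        tree.append(node)
    return tree


# Hypothetical rows standing in for the result of a database query
rows = [
    ("AA", "BB", "EE"),
    ("AA", "BB", "FF"),
    ("AA", "CC", "GG"),
]
tree = build_tree(rows)
print(json.dumps(tree, indent=4))
```

In a Flask view, the resulting list could be returned to the front end with flask.jsonify instead of being printed.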
I am trying to create EMR 5.30.1 clusters with applications such as Hadoop, Livy, Spark, ZooKeeper, and Hive using a CloudFormation template. The issue is that with this template I am only able to create the cluster with one application from the above list.
Below is the CloudFormation template:
{
"AWSTemplateFormatVersion": "2010-09-09",
"Description": "Best Practice EMR Cluster for Spark or S3 backed Hbase",
"Parameters": {
"EMRClusterName": {
"Description": "Name of the cluster",
"Type": "String",
"Default": "emrcluster"
},
"KeyName": {
"Description": "Must be an existing Keyname",
"Type": "String",
"Default": "keyfilename"
},
"MasterInstanceType": {
"Description": "Instance type to be used for the master instance.",
"Type": "String",
"Default": "m5.xlarge"
},
"CoreInstanceType": {
"Description": "Instance type to be used for core instances.",
"Type": "String",
"Default": "m5.xlarge"
},
"NumberOfCoreInstances": {
"Description": "Must be a valid number",
"Type": "Number",
"Default": 1
},
"SubnetID": {
"Description": "Must be Valid public subnet ID",
"Default": "subnet-ee15b3e0",
"Type": "String"
},
"LogUri": {
"Description": "Must be a valid S3 URL",
"Default": "s3://aws/elasticmapreduce/",
"Type": "String"
},
"S3DataUri": {
"Description": "Must be a valid S3 bucket URL ",
"Default": "s3://aws/elasticmapreduce/",
"Type": "String"
},
"ReleaseLabel": {
"Description": "Must be a valid EMR release version",
"Default": "emr-5.30.1",
"Type": "String"
},
"Applications": {
"Description": "Please select which application will be installed on the cluster this would be either Ganglia and spark, or Ganglia and s3 backed Hbase",
"Type": "String",
"AllowedValues": [
"Spark",
"Hbase",
"Hive",
"Livy",
"ZooKeeper"
]
}
},
"Mappings": {},
"Conditions": {
"Spark": {
"Fn::Equals": [
{
"Ref": "Applications"
},
"Spark"
]
},
"Hbase": {
"Fn::Equals": [
{
"Ref": "Applications"
},
"Hbase"
]
},
"Hive": {
"Fn::Equals": [
{
"Ref": "Applications"
},
"Hive"
]
},
"Livy": {
"Fn::Equals": [
{
"Ref": "Applications"
},
"Livy"
]
},
"ZooKeeper": {
"Fn::Equals": [
{
"Ref": "Applications"
},
"ZooKeeper"
]
}
},
"Resources": {
"EMRCluster": {
"DependsOn": [
"EMRClusterServiceRole",
"EMRClusterinstanceProfileRole",
"EMRClusterinstanceProfile"
],
"Type": "AWS::EMR::Cluster",
"Properties": {
"Applications": [
{
"Name": "Ganglia"
},
{
"Fn::If": [
"Spark",
{
"Name": "Spark"
},
{
"Ref": "AWS::NoValue"
}
]
},
{
"Fn::If": [
"Hbase",
{
"Name": "Hbase"
},
{
"Ref": "AWS::NoValue"
}
]
},
{
"Fn::If": [
"Hive",
{
"Name": "Hive"
},
{
"Ref": "AWS::NoValue"
}
]
},
{
"Fn::If": [
"Livy",
{
"Name": "Livy"
},
{
"Ref": "AWS::NoValue"
}
]
},
{
"Fn::If": [
"ZooKeeper",
{
"Name": "ZooKeeper"
},
{
"Ref": "AWS::NoValue"
}
]
}
],
"Configurations": [
{
"Classification": "hbase-site",
"ConfigurationProperties": {
"hbase.rootdir":{"Ref":"S3DataUri"}
}
},
{
"Classification": "hbase",
"ConfigurationProperties": {
"hbase.emr.storageMode": "s3"
}
}
],
"Instances": {
"Ec2KeyName": {
"Ref": "KeyName"
},
"Ec2SubnetId": {
"Ref": "SubnetID"
},
"MasterInstanceGroup": {
"InstanceCount": 1,
"InstanceType": {
"Ref": "MasterInstanceType"
},
"Market": "ON_DEMAND",
"Name": "Master"
},
"CoreInstanceGroup": {
"InstanceCount": {
"Ref": "NumberOfCoreInstances"
},
"InstanceType": {
"Ref": "CoreInstanceType"
},
"Market": "ON_DEMAND",
"Name": "Core"
},
"TerminationProtected": false
},
"VisibleToAllUsers": true,
"JobFlowRole": {
"Ref": "EMRClusterinstanceProfile"
},
"ReleaseLabel": {
"Ref": "ReleaseLabel"
},
"LogUri": {
"Ref": "LogUri"
},
"Name": {
"Ref": "EMRClusterName"
},
"AutoScalingRole": "EMR_AutoScaling_DefaultRole",
"ServiceRole": {
"Ref": "EMRClusterServiceRole"
}
}
},
"EMRClusterServiceRole": {
"Type": "AWS::IAM::Role",
"Properties": {
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"elasticmapreduce.amazonaws.com"
]
},
"Action": [
"sts:AssumeRole"
]
}
]
},
"ManagedPolicyArns": [
"arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
],
"Path": "/"
}
},
"EMRClusterinstanceProfileRole": {
"Type": "AWS::IAM::Role",
"Properties": {
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"ec2.amazonaws.com"
]
},
"Action": [
"sts:AssumeRole"
]
}
]
},
"ManagedPolicyArns": [
"arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
],
"Path": "/"
}
},
"EMRClusterinstanceProfile": {
"Type": "AWS::IAM::InstanceProfile",
"Properties": {
"Path": "/",
"Roles": [
{
"Ref": "EMRClusterinstanceProfileRole"
}
]
}
}
},
"Outputs": {}
}
Also, I want to add a bootstrap script to this template as well. Can anyone please help me with this issue?
As per my knowledge and understanding, Applications in your case should be an array like the one below, as mentioned in the documentation
"Applications" : [ Application, ... ],
In your case, you can list the applications like this:
"Applications" : [
{"Name" : "Spark"},
{"Name" : "Hbase"},
{"Name" : "Hive"},
{"Name" : "Livy"},
{"Name" : "Zookeeper"}
]
For arguments other than Name in an individual application dictionary, see the details here; you can pass Args, AdditionalInfo, etc.
You can do it the following way:
If you set "ReleaseLabel", there is no need to specify the versions of the applications.
"Applications": [{
"Name": "Hive"
},
{
"Name": "Presto"
},
{
"Name": "Spark"
}
]
For bootstrap:
"BootstrapActions": [{
"Name": "setup",
"ScriptBootstrapAction": {
"Path": "s3://bucket/key/Bootstrap.sh"
}
}]
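If the list of applications changes often, the two fragments above could also be generated rather than written by hand. The following is only a sketch: `emr_properties` is a hypothetical helper, the application names come from the question, and the S3 path is the placeholder used in this answer. It produces JSON to paste into the Properties of the AWS::EMR::Cluster resource and does not call any AWS API.

```python
import json


def emr_properties(app_names, bootstrap_path=None):
    """Build the Applications (and optional BootstrapActions) part of the
    Properties block of an AWS::EMR::Cluster resource."""
    props = {"Applications": [{"Name": name} for name in app_names]}
    if bootstrap_path:
        props["BootstrapActions"] = [{
            "Name": "setup",
            "ScriptBootstrapAction": {"Path": bootstrap_path},
        }]
    return props


props = emr_properties(
    ["Hadoop", "Livy", "Spark", "ZooKeeper", "Hive"],
    bootstrap_path="s3://bucket/key/Bootstrap.sh",  # placeholder path, replace with your own
)
print(json.dumps(props, indent=2))
```

Listing every application as its own entry replaces the single-valued "Applications" parameter from the question's template, which can only ever select one name.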
Define it like this to create all applications at once.
{
"Type": "AWS::EMR::Cluster",
"Properties": {
"Applications": [
{
"Name": "Ganglia"
},
{
"Name": "Spark"
},
{
"Name": "Livy"
},
{
"Name": "ZooKeeper"
},
{
"Name": "JupyterHub"
}
]
}
}