Creating nodes and relations from JSON (dynamically) - json

I've got a couple hundred JSONs in a structure like the following example:
{
"JsonExport": [
{
"entities": [
{
"identity": "ENTITY_001",
"surname": "SMIT",
"entityLocationRelation": [
{
"parentIdentification": "PARENT_ENTITY_001",
"typeRelation": "SEEN_AT",
"locationIdentity": "LOCATION_001"
},
{
"parentIdentification": "PARENT_ENTITY_001",
"typeRelation": "SEEN_AT",
"locationIdentity": "LOCATION_002"
}
],
"entityEntityRelation": [
{
"parentIdentification": "PARENT_ENTITY_001",
"typeRelation": "FRIENDS_WITH",
"childIdentification": "ENTITY_002"
}
]
},
{
"identity": "ENTITY_002",
"surname": "JACKSON",
"entityLocationRelation": [
{
"parentIdentification": "PARENT_ENTITY_002",
"typeRelation": "SEEN_AT",
"locationIdentity": "LOCATION_001"
}
]
},
{
"identity": "ENTITY_003",
"surname": "JOHNSON"
}
],
"identification": "REGISTRATION_001",
"locations": [
{
"city": "LONDON",
"identity": "LOCATION_001"
},
{
"city": "PARIS",
"identity": "LOCATION_002"
}
]
}
]
}
With these JSON files, I want to build a graph consisting of the following nodes: Registration, Entity and Location. I've figured this part out and come up with the following:
WITH "file:///example.json" AS json_file
CALL apoc.load.json(json_file,"$.JsonExport.*" ) YIELD value AS data
MERGE(r:Registration {id:data.identification})
WITH json_file
CALL apoc.load.json(json_file,"$.JsonExport..locations.*" ) YIELD value AS locations
MERGE(l:Locations{identity:locations.identity, name:locations.city})
WITH json_file
CALL apoc.load.json(json_file,"$.JsonExport..entities.*" ) YIELD value AS entities
MERGE(e:Entities {name:entities.surname, identity:entities.identity})
All the entities and locations should have a relation with the registration. I thought I could do this by using the following code:
MERGE (e)-[:REGISTERED_ON]->(r)
MERGE (l)-[:REGISTERED_ON]->(r)
However, this code doesn't give the desired output. It creates extra "empty" nodes and doesn't connect them to the registration node. So the first question is: how do I connect the location and entity nodes to the registration node? And, with the other JSON files in mind, the entities and locations should only be linked to their own registration.
Furthermore, I would like to create the entity -> location and entity -> entity relations and use the given relation type (SEEN_AT or FRIENDS_WITH) as the type of each relation. How can this be done? I'm kind of lost at this point and don't see how to solve this. If someone could point me in the right direction, I would be much obliged.

Variable names (like e and r) are not stored in the DB, and are bound to values only within individual queries. MERGE on a pattern with an unbound variable will just create the entire pattern (including creating an empty node for unbound node variables).
When you MERGE a node, you should only specify the unique identifying property for that node, to avoid duplicates. Any other properties you want to set at the time of creation should be set using ON CREATE SET.
It is inefficient to parse the JSON data 3 times to get different parts of it. And the way your query did it is especially inefficient, since each subsequent CALL/MERGE group of clauses is executed multiple times (every previous CALL produces multiple rows, and the number of rows increases multiplicatively). You can use aggregation to get around that, but it is unnecessary in your case, since you can do the entire query in a single pass through the JSON data.
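To see why the original three-CALL query does redundant work, here is a small Python sketch of the row counts for the example file. It is only an analogy for how rows are carried from one clause to the next, not Neo4j code; the counts are taken from the example JSON in the question.

```python
from itertools import product

# Row counts taken from the example JSON: 1 registration, 2 locations, 3 entities.
registrations = ["REGISTRATION_001"]
locations = ["LOCATION_001", "LOCATION_002"]
entities = ["ENTITY_001", "ENTITY_002", "ENTITY_003"]

# Each CALL runs once per incoming row, so the row sets multiply.
rows_after_second_call = list(product(registrations, locations))
rows_after_third_call = list(product(registrations, locations, entities))

print(len(rows_after_second_call))  # 2
print(len(rows_after_third_call))   # 6: every entity MERGE runs twice
```

With a couple hundred files and larger arrays, the repeated work grows with the product of the array sizes, which is what a single pass avoids.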
This may work for you:
WITH "file:///example.json" AS json_file
CALL apoc.load.json(json_file, "$.JsonExport.*") YIELD value AS data
MERGE (r:Registration {id: data.identification})
FOREACH (ent IN data.entities |
  MERGE (e:Entities {identity: ent.identity})
    ON CREATE SET e.name = ent.surname
  MERGE (e)-[:REGISTERED_ON]->(r)
  FOREACH (loc1 IN ent.entityLocationRelation |
    MERGE (l1:Locations {identity: loc1.locationIdentity})
    MERGE (e)-[:SEEN_AT]->(l1))
  FOREACH (ent2 IN ent.entityEntityRelation |
    MERGE (e2:Entities {identity: ent2.childIdentification})
    MERGE (e)-[:FRIENDS_WITH]->(e2))
)
FOREACH (loc IN data.locations |
  MERGE (l:Locations {identity: loc.identity})
    ON CREATE SET l.name = loc.city
  MERGE (l)-[:REGISTERED_ON]->(r)
)
For simplicity, it hard-codes the SEEN_AT and FRIENDS_WITH relationship types instead of reading typeRelation from the data, as MERGE only supports hard-coded relationship types.
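As a rough illustration of the single-pass idea, the following Python sketch walks a trimmed copy of the example JSON once and collects (start, type, end) triples, taking the relationship type from typeRelation rather than hard-coding it. The function name extract_edges and the trimmed data are assumptions made for this sketch; it does not talk to Neo4j.

```python
import json

# Trimmed copy of the example file from the question.
doc = json.loads("""
{"JsonExport": [{
  "identification": "REGISTRATION_001",
  "entities": [
    {"identity": "ENTITY_001", "surname": "SMIT",
     "entityLocationRelation": [
       {"typeRelation": "SEEN_AT", "locationIdentity": "LOCATION_001"}],
     "entityEntityRelation": [
       {"typeRelation": "FRIENDS_WITH", "childIdentification": "ENTITY_002"}]},
    {"identity": "ENTITY_003", "surname": "JOHNSON"}
  ],
  "locations": [{"city": "LONDON", "identity": "LOCATION_001"}]
}]}
""")


def extract_edges(export):
    """Return (start, relationship_type, end) triples for one JsonExport item."""
    reg = export["identification"]
    edges = []
    for ent in export.get("entities", []):
        edges.append((ent["identity"], "REGISTERED_ON", reg))
        # Missing relation arrays are treated as empty lists.
        for rel in ent.get("entityLocationRelation", []):
            edges.append((ent["identity"], rel["typeRelation"], rel["locationIdentity"]))
        for rel in ent.get("entityEntityRelation", []):
            edges.append((ent["identity"], rel["typeRelation"], rel["childIdentification"]))
    for loc in export.get("locations", []):
        edges.append((loc["identity"], "REGISTERED_ON", reg))
    return edges


edges = [e for export in doc["JsonExport"] for e in extract_edges(export)]
for edge in edges:
    print(edge)
```

Because every triple is built inside the loop over one JsonExport item, entities and locations are only ever linked to their own registration.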

So, playing with Neo4j/Cypher, I've learned some new stuff and came up with another solution to the problem. Based on the given example data, the following creates the nodes and edges dynamically.
WITH "file:///example.json" AS json_file
CALL apoc.load.json(json_file,"$.JsonExport.*" ) YIELD value AS data
CALL apoc.merge.node(['Registration'], {id:data.identification}, {},{}) YIELD node AS vReg
UNWIND data.entities AS ent
CALL apoc.merge.node(['Person'], {id:ent.identity}, {}, {id:ent.identity, surname:ent.surname}) YIELD node AS vPer1
UNWIND ent.entityEntityRelation AS entRel
CALL apoc.merge.node(['Person'],{id:entRel.childIdentification},{id:entRel.childIdentification},{}) YIELD node AS vPer2
CALL apoc.merge.relationship(vPer1, entRel.typeRelation, {},{},vPer2) YIELD rel AS ePer
UNWIND data.locations AS loc
CALL apoc.merge.node(['Location'], {id:loc.identity}, {name:loc.city}) YIELD node AS vLoc
UNWIND ent.entityLocationRelation AS locRel
CALL apoc.merge.relationship(vPer1, locRel.typeRelation, {},{},vLoc) YIELD rel AS eLoc
CALL apoc.merge.relationship(vLoc, "REGISTERED_ON", {},{},vReg) YIELD rel AS eReg1
CALL apoc.merge.relationship(vPer1, "REGISTERED_ON", {},{},vReg) YIELD rel AS eReg2
CALL apoc.merge.relationship(vPer2, "REGISTERED_ON", {},{},vReg) YIELD rel AS eReg3
RETURN vPer1,vPer2, vReg, vLoc, eLoc, eReg1, eReg2, eReg3

Related

Importing json in neo4J

[PROBLEM - My final solution below]
I'd like to import a json file containing my data into Neo4J.
However, it is super slow.
The JSON file is structured as follows:
{
"graph": {
"nodes": [
{ "id": 3510982, "labels": ["XXX"], "properties": { ... } },
{ "id": 3510983, "labels": ["XYY"], "properties": { ... } },
{ "id": 3510984, "labels": ["XZZ"], "properties": { ... } },
...
],
"relationships": [
{ "type": "bla", "startNode": 3510983, "endNode": 3510982, "properties": {} },
{ "type": "bla", "startNode": 3510984, "endNode": 3510982, "properties": {} },
....
]
}
}
It is similar to the one proposed here: How can I restore data from a previous result in the browser?
Looking at the answer there, I discovered that I can use
CALL apoc.load.json("file:///test.json") YIELD value AS row
WITH row, row.graph.nodes AS nodes
UNWIND nodes AS node
CALL apoc.create.node(node.labels, node.properties) YIELD node AS n
SET n.id = node.id
and then
CALL apoc.load.json("file:///test.json") YIELD value AS row
with row
UNWIND row.graph.relationships AS rel
MATCH (a) WHERE a.id = rel.endNode
MATCH (b) WHERE b.id = rel.startNode
CALL apoc.create.relationship(a, rel.type, rel.properties, b) YIELD rel AS r
return *
(I have to do it in two steps because otherwise relationships are duplicated due to the two UNWINDs.)
But this is super slow because I have a lot of entities, and I suspect the query scans all of them for each relationship.
At the same time, I know that "startNode": 3510983 refers to a node.
So the question: is there any way to speed up the import process, using the ids as an index or something else?
Note that my nodes have different types, so I did not find a way to create an index for all of them, and I suppose that would be too big (memory).
[MY SOLUTION]
CALL apoc.load.json('file:///test.json') YIELD value
WITH value.graph.nodes AS nodes, value.graph.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, COLLECT({id: n.id, node: node, labels:labels(node)}) AS nMap
UNWIND rels AS r
MATCH (w{id:r.startNode})
MATCH (y{id:r.endNode})
CALL apoc.create.relationship(w, r.type, r.properties, y) YIELD rel
RETURN rel
[EDITED]
This approach may work more efficiently:
CALL apoc.load.json("file:///test.json") YIELD value
WITH value.graph.nodes AS nodes, value.graph.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, apoc.map.fromPairs(COLLECT([toString(n.id), node])) AS nMap
UNWIND rels AS r
CALL apoc.create.relationship(nMap[toString(r.startNode)], r.type, r.properties, nMap[toString(r.endNode)]) YIELD rel
RETURN rel
This query does not use MATCH at all (and does not need indexing), since it just relies on an in-memory mapping from the imported node ids to the created nodes. However, this query could run out of memory if there are a lot of imported nodes.
It also avoids invoking SET by using apoc.map.setKey to add the id property to n.properties.
The 2 UNWINDs do not cause a cartesian product, since this query uses the aggregating function COLLECT (before the second UNWIND) to condense all the preceding rows into one (because the grouping key, rels, is a singleton).
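The in-memory mapping idea can be sketched in plain Python as follows. This is only an illustration of the lookup strategy on a trimmed stand-in for the file's "graph" object (the empty property maps are placeholders); it does not create anything in Neo4j.

```python
# Trimmed stand-in for the file's "graph" object.
graph = {
    "nodes": [
        {"id": 3510982, "labels": ["XXX"], "properties": {}},
        {"id": 3510983, "labels": ["XYY"], "properties": {}},
        {"id": 3510984, "labels": ["XZZ"], "properties": {}},
    ],
    "relationships": [
        {"type": "bla", "startNode": 3510983, "endNode": 3510982, "properties": {}},
        {"type": "bla", "startNode": 3510984, "endNode": 3510982, "properties": {}},
    ],
}

# One pass over the nodes builds the id -> node lookup table.
node_by_id = {n["id"]: n for n in graph["nodes"]}

# Each relationship is then resolved with two dictionary lookups, no scan.
resolved = [
    (node_by_id[r["startNode"]]["labels"], r["type"], node_by_id[r["endNode"]]["labels"])
    for r in graph["relationships"]
]
print(resolved)
```

The trade-off is the same as in the query: the whole table has to fit in memory at once.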
Have you tried indexing the nodes before loading the JSON? This may not be tenable since you have multiple node labels, but if they are limited in number you can create a placeholder node, create an index, and then delete the placeholder. After this, run the JSON load.
CREATE (n:YourLabel {indx: 'xxx'})
CREATE INDEX ON :YourLabel(indx)
MATCH (n:YourLabel) DELETE n
The index will speed up the matching or merging.

Get last element of array by parsing JSON with Neo4j APOC

Short task description: I need to get the last element of an array/list of one of the fields in nested JSON, here the input JSON file:
{
"origin": [{
"label": "Alcohol drinks",
"tag": [],
"type": "string",
"xpath": []
},
{
"label": "Wine",
"tag": ["red", "white"],
"type": "string",
"xpath": ["Alcohol drinks"]
},
{
"label": "Port wine",
"tag": ["Portugal", "sweet", "strong"],
"type": "string",
"xpath": ["Alcohol drinks", "Wine"]
},
{
"label": "Sandeman Cask 33",
"tag": ["red", "expensive"],
"type": "string",
"xpath": ["Alcohol drinks", "Wine", "Port wine"]
}
]
}
I need to get the last element of the "xpath" field in order to create a relationship with the appropriate "label". Here is the code, which creates connections to all the elements mentioned in "xpath"; I need a connection to the last one only:
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
MERGE(label:concept{name:or.label})
ON CREATE SET label.type = or.type
FOREACH(tagName IN or.tag | MERGE(tag:concept{name:tagName})
MERGE (tag)-[r:link]-(label)
ON CREATE SET r.Weight=1
ON MATCH SET r.Weight=r.Weight+1)
FOREACH(xpathName IN or.xpath | MERGE (xpath:concept{name:xpathName})
MERGE (label)-[r:link]-(xpath))
Perhaps there is something like:
apoc.agg.last(or.xpath)
which just returns an array of arrays of all the "xpath" values from all 4 records of "origin".
I would appreciate any help; there may be workarounds (not necessarily along the lines I proposed) to solve this issue. Thank you in advance!
N.B. All this should be done from an app, not from within Neo4j browser.
Probably the easiest way would be to split this query into two queries if you want to only take the xpath array of the last element in the origin object.
Query 1:
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
MERGE(label:concept{name:or.label})
ON CREATE SET label.type = or.type
FOREACH(tagName IN or.tag | MERGE(tag:concept{name:tagName})
MERGE (tag)-[r:link]-(label)
ON CREATE SET r.Weight=1
ON MATCH SET r.Weight=r.Weight+1)
Query 2:
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
WITH value.origin[-1] as or
MATCH(label:concept{name:or.label})
FOREACH(xpathName IN or.xpath | MERGE (xpath:concept{name:xpathName})
MERGE (label)-[r:link]-(xpath))
Combining these two queries into a single one feels hacky anyway and I would avoid it, but I guess you can do the following.
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
MERGE(label:concept{name:or.label})
ON CREATE SET label.type = or.type
FOREACH(tagName IN or.tag | MERGE(tag:concept{name:tagName})
MERGE (tag)-[r:link]-(label)
ON CREATE SET r.Weight=1
ON MATCH SET r.Weight=r.Weight+1)
// Any aggregation function will break the UNWIND loop
// and return a single row as we want to write it only once
WITH value.origin[-1] as last, count(*) as agg
FOREACH(xpathName IN last.xpath |
MERGE(label:concept{name:last.label})
MERGE (xpath:concept{name:xpathName})
MERGE (label)-[r:link]-(xpath))
Sounds like you're looking for the last() function? This will return the last element of a list.
In this case, since you UNWIND the origin to 4 rows, you'll get the last element of the list for each of those rows.
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
RETURN last(or.xpath) as last
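For comparison, here is what taking the last "xpath" element per record amounts to, sketched in Python on the question's data (tags and types omitted). It only illustrates the per-row result and is not a replacement for the Cypher query.

```python
origin = [
    {"label": "Alcohol drinks", "xpath": []},
    {"label": "Wine", "xpath": ["Alcohol drinks"]},
    {"label": "Port wine", "xpath": ["Alcohol drinks", "Wine"]},
    {"label": "Sandeman Cask 33", "xpath": ["Alcohol drinks", "Wine", "Port wine"]},
]

# Pair each label with the last xpath entry; a record with an empty xpath
# has no parent and is skipped.
links = [(item["label"], item["xpath"][-1]) for item in origin if item["xpath"]]
print(links)
```

Each remaining pair is one (label)-[:link]-(xpath) relationship to create, so the first record, whose "xpath" is empty, gets none.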

Possible to chain results in N1QL?

I'm currently trying to write a fairly complex N1QL query for a project I'm working on. In theory I could do all of this processing in multiple N1QL calls, parsing the results each time, but if possible I'd like it to be contained in one call.
What I would like to do is:
Filter all documents that contain a "dataSync.test.id" field with more than 1 id
Read back all the other ids in that list
Use that list to get the other documents containing those ids
Get the "dataSync.test._channels" field for those documents (optionally, a filter by docType might help parsing)
This would probably return a list of "dataSync.test._channels"
Is this possible in N1QL? It looks like it might be, but I can't get the syntax right.
My data structures look a little like
{
"dataSync": {
"test": {
"_channels": [
"RP"
],
"id": [
"dataSync_user_1015",
"dataSync_user_1010",
"dataSync_user_1005"
],
"_lastUpdatedBy": "TEST"
}
},
...
}
{
"dataSync": {
"test": {
"_channels": [
"RSD"
],
"id": [
"dataSync_user_1010"
],
"_lastUpdatedBy": "TEST"
}
},
...
}
Yes, I think you can do all of this.
The initial set of IDs can be retrieved, with filtering, in a subquery, and you can then get the subsequent documents with joins.
SELECT fulldoc
FROM (select meta().id as dockey from doc where a=1) as mydoc
INNER JOIN doc fulldoc ON KEYS mydoc.dockey;
There are optimizations that can be done here. Try the sequencing first to make sure it gets the job done.
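To check the sequencing before writing the N1QL, the steps from the question can be sketched in plain Python over the two sample documents. The document keys doc_a and doc_b and the helper ids_of are hypothetical, and this runs on an in-memory dict rather than against Couchbase.

```python
# The two sample documents from the question, under made-up document keys.
docs = {
    "doc_a": {"dataSync": {"test": {
        "_channels": ["RP"],
        "id": ["dataSync_user_1015", "dataSync_user_1010", "dataSync_user_1005"],
    }}},
    "doc_b": {"dataSync": {"test": {
        "_channels": ["RSD"],
        "id": ["dataSync_user_1010"],
    }}},
}


def ids_of(doc):
    """Return the dataSync.test.id list of a document."""
    return doc["dataSync"]["test"]["id"]


# Steps 1 and 2: documents with more than one id, and all ids they list.
shared_ids = {i for d in docs.values() if len(ids_of(d)) > 1 for i in ids_of(d)}

# Steps 3 and 4: documents containing any of those ids, and their channels.
channels = {
    key: d["dataSync"]["test"]["_channels"]
    for key, d in docs.items()
    if shared_ids & set(ids_of(d))
}
print(channels)
```

The first comprehension plays the role of the subquery and the second the role of the join.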

Parsing Google Custom Search API for Elasticsearch Documents

After retrieving results from the Google Custom Search API and writing them to JSON, I want to parse that JSON into valid Elasticsearch documents. You can configure a parent-child relationship for nested results. However, this relationship does not seem to be inferred from the data structure itself. I've tried loading the data automatically, but got no results.
Below is some example input that doesn't include things like id or index. I'm trying to focus on creating the correct data structure. I've tried modifying graph algorithms like depth-first-search but am running into problems with the different data structures.
Here's some example input:
# mock data structure
google = {"content": "foo",
"results": {"result_one": {"persona": "phone",
"personb": "phone",
"personc": "phone"
},
"result_two": ["thing1",
"thing2",
"thing3"
],
"result_three": "none"
},
"query": ["Taylor Swift", "Bob Dole", "Rocketman"]
}
# correctly formatted documents for _source of elasticsearch entry
correct_documents = [
{"content":"foo"},
{"results": ["result_one", "result_two", "result_three"]},
{"result_one": ["persona", "personb", "personc"]},
{"persona": "phone"},
{"personb": "phone"},
{"personc": "phone"},
{"result_two":["thing1","thing2","thing3"]},
{"result_three": "none"},
{"query": ["Taylor Swift", "Bob Dole", "Rocketman"]}
]
Here is my current approach; it is still a work in progress:
def recursive_dfs(graph, start, path=[]):
    '''recursive depth first search from start'''
    path=path+[start]
    for node in graph[start]:
        if not node in path:
            path=recursive_dfs(graph, node, path)
    return path

def branching(google):
    """ Get branches as a starting point for dfs"""
    branch = 0
    while branch < len(google):
        if google[google.keys()[branch]] is dict:
            #recursive_dfs(google, google[google.keys()[branch]])
            pass
        else:
            print("branch {}: result {}\n".format(branch, google[google.keys()[branch]]))
        branch += 1

branching(google)
You can see that recursive_dfs() still needs to be modified to handle string, and list data structures.
I'll keep going at this but if you have thoughts, suggestions, or solutions then I would very much appreciate it. Thanks for your time.
Here is a possible answer to your problem.
def myfunk(inHole, outHole):
    """Flatten a nested dict into a list of single-key documents."""
    for key, value in inHole.items():
        if isinstance(value, dict):
            # A dict becomes {key: [child keys]}; its children are then
            # flattened into documents of their own.
            outHole.append({key: list(value.keys())})
            myfunk(value, outHole)
        else:
            # Lists and scalar values are stored unchanged.
            outHole.append({key: value})
    # Return the list itself (list.sort() would return None).
    return outHole

How to enter multiple table data in mongoDB using json

I am trying to learn MongoDB. Suppose there are two tables and they are related, for example like this:
1st table has
First name- Fred, last name- Zhang, age- 20, id- s1234
2nd table has
id- s1234, course- COSC2406, semester- 1
id- s1234, course- COSC1127, semester- 1
id- s1234, course- COSC2110, semester- 1
How do I insert this data into MongoDB? I wrote it like this, but I'm not sure whether it is correct:
db.users.insert({
given_name: 'Fred',
family_name: 'Zhang',
Age: 20,
student_number: 's1234',
Course: ['COSC2406', 'COSC1127', 'COSC2110'],
Semester: 1
});
Thank you in advance
This assumes that what you want to model uses "student_number" and "Semester" together as what is basically a unique identifier for the entries. But there is a way to do this without accumulating the array contents in code.
You can make use of the upsert functionality in the .update() method, with the help of a few other operators in the statement.
I am going to assume you are doing this inside a loop of sorts, so every value on the right-hand side is actually a variable:
db.users.update(
{
"student_number": student_number,
"Semester": semester
},
{
"$setOnInsert": {
"given_name": given_name,
"family_name": family_name,
"Age": age
},
"$addToSet": { "courses": course }
},
{ "upsert": true }
)
What an "upsert" operation does is first look for a document in your collection that matches the given query criteria. In this case, that is a "student_number" with the current "Semester" value.
When that match is found, the document is merely "updated". So what is being done here is using the $addToSet operator in order to "update" only unique values into the "courses" array element. This would seem to make sense to have unique courses but if that is not your case then of course you can simply use the $push operator instead. So that is the operation you want to happen every time, whether the document was "matched" or not.
In the case where no "matching" document is found, a new document will then be inserted into the collection. This is where the $setOnInsert operator comes in.
So the point of that section is that it will only be called when a new document is created as there is no need to update those fields with the same information every time. In addition to this, the fields you specified in the query criteria have explicit values, so the behavior of the "upsert" is to automatically create those fields with those values in the newly created document.
After a new document is created, then the next "upsert" statement that uses the same criteria will of course only "update" the now existing document, and as such only your new course information would be added.
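The behaviour described above can be emulated in a few lines of Python. This is a sketch of the semantics only, assuming an in-memory dict keyed by (student_number, Semester) in place of the collection; the function upsert_course is hypothetical and is not a MongoDB driver call.

```python
def upsert_course(collection, student, semester, course):
    """Emulate one update(..., {"upsert": True}) call on an in-memory dict."""
    key = (student["student_number"], semester)  # the query criteria
    doc = collection.get(key)
    if doc is None:
        # No match: the new document gets the query fields plus the
        # $setOnInsert fields.
        doc = {
            "student_number": student["student_number"],
            "Semester": semester,
            "given_name": student["given_name"],
            "family_name": student["family_name"],
            "Age": student["age"],
            "courses": [],
        }
        collection[key] = doc
    # $addToSet: runs on insert and on match, and skips duplicates.
    if course not in doc["courses"]:
        doc["courses"].append(course)
    return doc


fred = {"student_number": "s1234", "given_name": "Fred",
        "family_name": "Zhang", "age": 20}
# The "pre-joined" rows from the two source tables, one per course.
rows = [(fred, 1, "COSC2406"), (fred, 1, "COSC1127"),
        (fred, 1, "COSC2110"), (fred, 1, "COSC2110")]

users = {}
for student, semester, course in rows:
    upsert_course(users, student, semester, course)

print(users[("s1234", 1)]["courses"])
```

The repeated COSC2110 row leaves the array unchanged, which is the difference between $addToSet and $push.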
Overall working like this allows you to "pre-join" the two tables from your source with an appropriate query. Then you are just looping the results without needing to write code for trying to group the correct entries together and simply letting MongoDB do the accumulation work for you.
Of course you can always just write the code to do this yourself and it would result in fewer "trips" to the database in order to insert your already accumulated records if that would suit your needs.
As a final note, though it does require some additional complexity, you can get better performance out of the operation shown by using the newly introduced "batch updates" functionality. For this, your MongoDB server version will need to be 2.6 or higher. But that is one way of still reducing the logic while making fewer actual "over the wire" writes to the database.
You can either have two separate collections, one with the student details and the other with the courses, and link them with "id".
Or you can have a single document with the courses as an array of embedded documents, as below:
{
"FirstName": "Fred",
"LastName": "Zhang",
"age": 20,
"id": "s1234",
"Courses": [
{
"courseId": "COSC2406",
"semester": 1
},
{
"courseId": "COSC1127",
"semester": 1
},
{
"courseId": "COSC2110",
"semester": 1
},
{
"courseId": "COSC2110",
"semester": 2
}
]
}