Importing JSON into Neo4j

[PROBLEM - My final solution below]
I'd like to import a JSON file containing my data into Neo4j. However, it is super slow.
The JSON file is structured as follows:
{
"graph": {
"nodes": [
{ "id": 3510982, "labels": ["XXX"], "properties": { ... } },
{ "id": 3510983, "labels": ["XYY"], "properties": { ... } },
{ "id": 3510984, "labels": ["XZZ"], "properties": { ... } },
...
],
"relationships": [
{ "type": "bla", "startNode": 3510983, "endNode": 3510982, "properties": {} },
{ "type": "bla", "startNode": 3510984, "endNode": 3510982, "properties": {} },
....
]
}
}
It is similar to the one proposed here: How can I restore data from a previous result in the browser?
By looking at the answer, I discovered that I can use:
CALL apoc.load.json("file:///test.json") YIELD value AS row
WITH row, row.graph.nodes AS nodes
UNWIND nodes AS node
CALL apoc.create.node(node.labels, node.properties) YIELD node AS n
SET n.id = node.id
and then
CALL apoc.load.json("file:///test.json") YIELD value AS row
with row
UNWIND row.graph.relationships AS rel
MATCH (a) WHERE a.id = rel.endNode
MATCH (b) WHERE b.id = rel.startNode
CALL apoc.create.relationship(a, rel.type, rel.properties, b) YIELD rel AS r
return *
(I have to do it in two passes, because otherwise the two UNWINDs cause duplicate relationships.)
But this is super slow, because I have a lot of entities and I suspect the query scans all of them for each relationship.
At the same time, I know "startNode": 3510983 refers to a node.
So the question: is there any way to speed up the import process by using the ids as an index, or something else?
Note that my nodes have different labels, so I did not find a way to create a single index for all of them, and I suppose such an index would be too huge (memory).
[MY SOLUTION]
CALL apoc.load.json('file:///test.json') YIELD value
WITH value.graph.nodes AS nodes, value.graph.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
// COLLECT condenses all node rows into one, so the UNWIND below does not multiply rows (nMap itself is not used again)
WITH rels, COLLECT({id: n.id, node: node, labels: labels(node)}) AS nMap
UNWIND rels AS r
MATCH (w{id:r.startNode})
MATCH (y{id:r.endNode})
CALL apoc.create.relationship(w, r.type, r.properties, y) YIELD rel
RETURN rel
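One caveat worth noting: schema indexes in Neo4j are per-label, and the two MATCH clauses above carry no label, so they cannot use an index and still scan every node. A sketch of what index-backed lookups could look like, assuming Neo4j 3.x index syntax and, for illustration, the labels XXX, XYY, and XZZ from the example file:
CREATE INDEX ON :XXX(id)
CREATE INDEX ON :XYY(id)
CREATE INDEX ON :XZZ(id)
// A MATCH only uses such an index when a label is present, e.g.:
MATCH (w:XXX {id: r.startNode})
Since the relationships in the file do not say which label each endpoint has, this only helps if that can be inferred; the map-based approach below sidesteps the problem entirely.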

[EDITED]
This approach may work more efficiently:
CALL apoc.load.json("file:///test.json") YIELD value
WITH value.graph.nodes AS nodes, value.graph.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, apoc.map.fromPairs(COLLECT([toString(n.id), node])) AS nMap
UNWIND rels AS r
CALL apoc.create.relationship(nMap[toString(r.startNode)], r.type, r.properties, nMap[toString(r.endNode)]) YIELD rel
RETURN rel
This query does not use MATCH at all (and does not need indexing), since it just relies on an in-memory mapping from the imported node ids to the created nodes. However, this query could run out of memory if there are a lot of imported nodes.
It also avoids invoking SET by using apoc.map.setKey to add the id property to n.properties.
The two UNWINDs do not cause a Cartesian product, since this query uses the aggregating function COLLECT (before the second UNWIND) to condense all the preceding rows into one (the grouping key, rels, is a singleton).
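If holding the whole map in memory is a concern, the work can also be batched with APOC. This is only a sketch, assuming apoc.periodic.iterate is available and that the id lookups in the second pass are backed by per-label indexes (otherwise each MATCH still scans):
CALL apoc.periodic.iterate(
  "CALL apoc.load.json('file:///test.json') YIELD value
   UNWIND value.graph.relationships AS r RETURN r",
  "MATCH (a {id: r.startNode})
   MATCH (b {id: r.endNode})
   CALL apoc.create.relationship(a, r.type, r.properties, b) YIELD rel
   RETURN count(*)",
  {batchSize: 10000})
The same pattern works for the node-creation pass; batching commits every batchSize rows instead of building one giant transaction.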

Have you tried indexing the nodes before the load? This may not be tenable since you have multiple node labels, but if they are limited, you can create a placeholder node, create an index, and then delete the placeholder. After this, run the apoc.load.json import:
CREATE (n:YourLabel {indx: 'xxx'})
CREATE INDEX ON :YourLabel(indx)
MATCH (n:YourLabel) DELETE n
The index will speed up the matching or merging.

Related

Creating nodes and relations from JSON (dynamically)

I've got a couple hundred JSONs in a structure like the following example:
{
"JsonExport": [
{
"entities": [
{
"identity": "ENTITY_001",
"surname": "SMIT",
"entityLocationRelation": [
{
"parentIdentification": "PARENT_ENTITY_001",
"typeRelation": "SEEN_AT",
"locationIdentity": "LOCATION_001"
},
{
"parentIdentification": "PARENT_ENTITY_001",
"typeRelation": "SEEN_AT",
"locationIdentity": "LOCATION_002"
}
],
"entityEntityRelation": [
{
"parentIdentification": "PARENT_ENTITY_001",
"typeRelation": "FRIENDS_WITH",
"childIdentification": "ENTITY_002"
}
]
},
{
"identity": "ENTITY_002",
"surname": "JACKSON",
"entityLocationRelation": [
{
"parentIdentification": "PARENT_ENTITY_002",
"typeRelation": "SEEN_AT",
"locationIdentity": "LOCATION_001"
}
]
},
{
"identity": "ENTITY_003",
"surname": "JOHNSON"
}
],
"identification": "REGISTRATION_001",
"locations": [
{
"city": "LONDON",
"identity": "LOCATION_001"
},
{
"city": "PARIS",
"identity": "LOCATION_002"
}
]
}
]
}
With these JSONs, I want to make a graph consisting of the following nodes: Registration, Entity and Location. This part I've figured out; I made the following:
WITH "file:///example.json" AS json_file
CALL apoc.load.json(json_file,"$.JsonExport.*" ) YIELD value AS data
MERGE(r:Registration {id:data.identification})
WITH json_file
CALL apoc.load.json(json_file,"$.JsonExport..locations.*" ) YIELD value AS locations
MERGE(l:Locations{identity:locations.identity, name:locations.city})
WITH json_file
CALL apoc.load.json(json_file,"$.JsonExport..entities.*" ) YIELD value AS entities
MERGE(e:Entities {name:entities.surname, identity:entities.identity})
All the entities and locations should have a relation with the registration. I thought I could do this by using the following code:
MERGE (e)-[:REGISTERED_ON]->(r)
MERGE (l)-[:REGISTERED_ON]->(r)
However, this code doesn't give the desired output. It creates extra "empty" nodes and doesn't connect to the registration node. So the first question is: how do I connect the location and entity nodes to the registration node? And in light of the other JSONs, the entities and locations should only be linked to their specific registration.
Furthermore, I would like to make the entity -> location relation and the entity -> entity relation, and use the given type of relation (SEEN_AT or FRIENDS_WITH) as the type of the created relationship. How can this be done? I'm kind of lost at this point and don't see how to solve this. If someone could guide me in the right direction I would be much obliged.
Variable names (like e and r) are not stored in the DB, and are bound to values only within individual queries. MERGE on a pattern with an unbound variable will just create the entire pattern (including creating an empty node for unbound node variables).
When you MERGE a node, you should only specify the unique identifying property for that node, to avoid duplicates. Any other properties you want to set at the time of creation should be set using ON CREATE SET.
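As a minimal illustration of both points, using values from the example JSON (the labels and property names follow your query):
// Wrong: e and r are unbound here, so this MERGE creates a brand-new
// empty node for each of them instead of connecting existing ones.
MERGE (e)-[:REGISTERED_ON]->(r)
// Right: bind both endpoints first, then MERGE the relationship;
// set non-identifying properties only on creation.
MERGE (e:Entities {identity: 'ENTITY_001'})
ON CREATE SET e.name = 'SMIT'
MERGE (r:Registration {id: 'REGISTRATION_001'})
MERGE (e)-[:REGISTERED_ON]->(r)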
It is inefficient to parse through the JSON data 3 times to get different areas of the data. And it is especially inefficient the way your query was doing it, since each subsequent CALL/MERGE group of clauses would be executed multiple times (every previous CALL produces multiple rows, so the number of rows grows multiplicatively). You can use aggregation to get around that, but it is unnecessary in your case, since you can do the entire query in a single pass through the JSON data.
This may work for you:
CALL apoc.load.json(json_file,"$.JsonExport.*" ) YIELD value AS data
MERGE(r:Registration {id:data.identification})
FOREACH(ent IN data.entities |
MERGE (e:Entities {identity: ent.identity})
ON CREATE SET e.name = ent.surname
MERGE (e)-[:REGISTERED_ON]->(r)
FOREACH(loc1 IN ent.entityLocationRelation |
MERGE (l1:Locations {identity: loc1.locationIdentity})
MERGE (e)-[:SEEN_AT]->(l1))
FOREACH(ent2 IN ent.entityEntityRelation |
MERGE (e2:Entities {identity: ent2.childIdentification})
MERGE (e)-[:FRIENDS_WITH]->(e2))
)
FOREACH(loc IN data.locations |
MERGE (l:Locations{identity:loc.identity})
ON CREATE SET l.name = loc.city
MERGE (l)-[:REGISTERED_ON]->(r)
)
For simplicity, it hard-codes the SEEN_AT, FRIENDS_WITH, and REGISTERED_ON relationship types, since MERGE only supports hard-coded relationship types.
So, playing with neo4j/cypher, I've learned some new stuff and came to another solution for the problem. Based on the given example data, the following can create the nodes and edges dynamically.
WITH "file:///example.json" AS json_file
CALL apoc.load.json(json_file,"$.JsonExport.*" ) YIELD value AS data
CALL apoc.merge.node(['Registration'], {id:data.identification}, {},{}) YIELD node AS vReg
UNWIND data.entities AS ent
CALL apoc.merge.node(['Person'], {id:ent.identity}, {}, {id:ent.identity, surname:ent.surname}) YIELD node AS vPer1
UNWIND ent.entityEntityRelation AS entRel
CALL apoc.merge.node(['Person'],{id:entRel.childIdentification},{id:entRel.childIdentification},{}) YIELD node AS vPer2
CALL apoc.merge.relationship(vPer1, entRel.typeRelation, {},{},vPer2) YIELD rel AS ePer
UNWIND data.locations AS loc
CALL apoc.merge.node(['Location'], {id:loc.identity}, {name:loc.city}) YIELD node AS vLoc
UNWIND ent.entityLocationRelation AS locRel
CALL apoc.merge.relationship(vPer1, locRel.typeRelation, {},{},vLoc) YIELD rel AS eLoc
CALL apoc.merge.relationship(vLoc, "REGISTERED_ON", {},{},vReg) YIELD rel AS eReg1
CALL apoc.merge.relationship(vPer1, "REGISTERED_ON", {},{},vReg) YIELD rel AS eReg2
CALL apoc.merge.relationship(vPer2, "REGISTERED_ON", {},{},vReg) YIELD rel AS eReg3
RETURN vPer1,vPer2, vReg, vLoc, eLoc, eReg1, eReg2, eReg3
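A caveat on this approach: UNWIND of a missing or empty list produces zero rows, so an entity like ENTITY_003 (which has neither entityEntityRelation nor entityLocationRelation) would silently drop out before the later relationship calls. A common guard, shown here only as a sketch, is to substitute a single null element and tolerate it downstream:
UNWIND (CASE WHEN size(ent.entityEntityRelation) > 0
        THEN ent.entityEntityRelation ELSE [null] END) AS entRel
The subsequent apoc.merge.* calls then need to skip the null rows (for example via apoc.do.when), which the FOREACH-based answer above avoids altogether.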

Get last element of array by parsing JSON with Neo4j APOC

Short task description: I need to get the last element of an array/list in one of the fields of a nested JSON. Here is the input JSON file:
{
"origin": [{
"label": "Alcohol drinks",
"tag": [],
"type": "string",
"xpath": []
},
{
"label": "Wine",
"tag": ["red", "white"],
"type": "string",
"xpath": ["Alcohol drinks"]
},
{
"label": "Port wine",
"tag": ["Portugal", "sweet", "strong"],
"type": "string",
"xpath": ["Alcohol drinks", "Wine"]
},
{
"label": "Sandeman Cask 33",
"tag": ["red", "expensive"],
"type": "string",
"xpath": ["Alcohol drinks", "Wine", "Port wine"]
}
]
}
I need to get the last element of the "xpath" field in order to create a relationship with the appropriate "label". Here is the code, which creates connections to all elements mentioned in "xpath"; I need just the connection to the last one:
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
MERGE(label:concept{name:or.label})
ON CREATE SET label.type = or.type
FOREACH(tagName IN or.tag | MERGE(tag:concept{name:tagName})
MERGE (tag)-[r:link]-(label)
ON CREATE SET r.Weight=1
ON MATCH SET r.Weight=r.Weight+1)
FOREACH(xpathName IN or.xpath | MERGE (xpath:concept{name:xpathName})
MERGE (label)-[r:link]-(xpath))
Probably there is something like:
apoc.agg.last(or.xpath)
Right now I just get an array of arrays of all the "xpath" values from all 4 records of "origin".
I will appreciate any help; there are probably some workarounds (not necessarily what I proposed) to solve this issue. Thank you in advance!
N.B. All this should be done from an app, not from within Neo4j browser.
Probably the easiest way would be to split this query into two queries if you want to only take the xpath array of the last element in the origin object.
Query 1:
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
MERGE(label:concept{name:or.label})
ON CREATE SET label.type = or.type
FOREACH(tagName IN or.tag | MERGE(tag:concept{name:tagName})
MERGE (tag)-[r:link]-(label)
ON CREATE SET r.Weight=1
ON MATCH SET r.Weight=r.Weight+1)
Query 2:
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
WITH value.origin[-1] as or
MATCH(label:concept{name:or.label})
FOREACH(xpathName IN or.xpath | MERGE (xpath:concept{name:xpathName})
MERGE (label)-[r:link]-(xpath))
Combining these two queries into a single one feels hacky anyway and I would avoid it, but I guess you can do the following.
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
MERGE(label:concept{name:or.label})
ON CREATE SET label.type = or.type
FOREACH(tagName IN or.tag | MERGE(tag:concept{name:tagName})
MERGE (tag)-[r:link]-(label)
ON CREATE SET r.Weight=1
ON MATCH SET r.Weight=r.Weight+1)
// Any aggregation function will break the UNWIND loop
// and return a single row as we want to write it only once
WITH value.origin[-1] as last, count(*) as agg
FOREACH(xpathName IN last.xpath |
MERGE(label:concept{name:last.label})
MERGE (xpath:concept{name:xpathName})
MERGE (label)-[r:link]-(xpath))
Sounds like you're looking for the last() function? This will return the last element of a list.
In this case, since you UNWIND the origin to 4 rows, you'll get the last element of the list for each of those rows.
WITH "file:///D:/project/neo_proj/input.json" AS url
CALL apoc.load.json(url) YIELD value
UNWIND value.origin as or
RETURN last(or.xpath) as last
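Against the JSON above, this returns one row per unwound origin entry; note that last([]) yields null for the first entry, whose xpath list is empty:
last
----------------
null
"Alcohol drinks"
"Wine"
"Port wine"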

N1QL Distinct Query on Nested Arrays

(Couchbase 4.5) Suppose I have the following object stored in my couchbase instance:
{
parentArray : [
{
childArray: [{value: 'v1'}, {value:'v2'}]
},
{
childArray: [{value: 'v1'}, {value: 'v3'}]
}
]
}
Now I want to select the distinct elements from childArray, which should return an array equal to ['v1', 'v2', 'v3'].
I have a couple solutions to this. My first thought was to go ahead and use the UNNEST operation:
SELECT DISTINCT ca.value FROM `my-bucket` AS b UNNEST b.parentArray AS pa UNNEST pa.childArray AS ca WHERE _class="someclass" AND dataType="someDataType";
With this approach I get a polynomial explosion in the number of scanned elements (due to the UNNESTing of two arrays), and the query takes a while to complete (for my real data, on the order of 24 seconds). When I remove the UNNESTs and simply query for distinct elements on the top-level fields (those adjacent to parentArray), it takes on the order of milliseconds.
Another solution is to handle this in the application code, by simply iterating through the returned values and finding the distinct values myself. This approach is bad, because it brings too much data into the application space.
Any help please!
Thank you!
UPDATE: It looks like that without the "WHERE" clause, the "UNNEST" queries are fast. So do I need array indexes here?
UPDATE: Never mind the previous update, since there are no indexed elements in the WHERE clause. I do notice, though, that if I remove either the UNNEST or the WHERE, the query is fast. Moreover, looking at the EXPLAIN after adding a GSI for the compound index (_class, dataType), I can see "IndexScan" on the provided index.
INSERT INTO default values("3",{ "parentArray" : [ { "childArray": [{"value": 'v1'}, {"value":'v2'}] }, { "childArray": [{"value": 'v1'}, {"value": 'v3'}] } ] });
SELECT ARRAY_DISTINCT(ARRAY v.`value` FOR v WITHIN parentArray END) FROM default;
OR
SELECT ARRAY_DISTINCT(ARRAY_FLATTEN(
ARRAY ARRAY v.`value` FOR v IN ca.childArray END FOR ca IN parentArray END,
2)) FROM default;
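For the inserted document, either form should produce the three distinct values (element order within the array is not significant):
[
  { "$1": ["v1", "v2", "v3"] }
]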
You can add a WHERE clause. If this needs to work across documents, use the following.
INSERT INTO default values("4",{ "parentArray" : [ { "childArray": [{"value": 'v5'}, {"value":'v2'}] }, { "childArray": [{"value": 'v1'}, {"value": 'v3'}] } ] });
SELECT ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(ARRAY v.`value` FOR v WITHIN parentArray END),2)) FROM default;
SELECT ARRAY_DISTINCT(ARRAY_FLATTEN(ARRAY_AGG(ARRAY_FLATTEN(ARRAY ARRAY v.`value` FOR v IN ca.childArray END FOR ca IN parentArray END,2)),2)) FROM default;
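With both example documents ("3" and "4") in the bucket, either aggregated form should return the distinct values across all documents, e.g.:
[
  { "$1": ["v1", "v2", "v3", "v5"] }
]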

N1QL nested json, query on field inside object inside array

I have json documents in my Couchbase cluster that looks like this
{
"giata_properties": {
"propertyCodes": {
"provider": [
{
"code": [
{
"value": [
{
"name": "Country Code",
"value": "EG"
},
{
"name": "City Code",
"value": "HRG"
},
{
"name": "Hotel Code",
"value": "91U"
}
]
}
],
"providerCode": "gta",
"providerType": "gds"
},
{
"code": [
{
"value": [
{
"value": "071801"
}
]
},
{
"value": [
{
"value": "766344"
}
]
}
],
"providerCode": "restel",
"providerType": "gds"
},
{
"code": [
{
"value": [
{
"value": "HRG03Z"
}
]
},
{
"value": [
{
"value": "HRG04Z"
}
]
}
],
"providerCode": "5VF",
"providerType": "tourOperator"
}
]
}
}
}
I'm trying to create a query that fetches a single document based on the value of giata_properties.propertyCodes.provider.code.value.value and a specific providerType.
So for example, my input is 071801 and restel, I want a query that will fetch me the document I pasted above (because it contains these values).
I'm pretty new to N1QL so what I tried so far is (without the providerType input)
SELECT * FROM giata_properties AS gp
WHERE ANY `field` IN `gp.propertyCodes.provider.code.value` SATISFIES `field.value` = '071801' END;
This returns an empty result set. I'm probably doing all of this wrong.
edit1:
According to geraldss answer I was able to achieve my goal via 2 different queries
1st (More general) ~2m50.9903732s
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp SATISFIES v.`value` = '071801' END;
2nd (More specific) ~2m31.3660388s
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp.propertyCodes.provider[*].code SATISFIES v.`value` = '071801' END;
The bucket has around 550K documents. Currently there are no indexes but the primary.
Question part 2
When I do either of the above queries, I get a result streamed to my shell very quickly, then spend the rest of the query time waiting for the engine to finish iterating over all documents. I'm sure that I'll only be getting 1 result from future queries, so I thought I could use LIMIT 1 so the engine stops searching at the first result. I tried something like:
SELECT * FROM giata_properties AS gp WHERE ANY v WITHIN gp SATISFIES v.`value` = '071801' END LIMIT 1;
But that made no difference, I get a document written to my shell and then keep waiting until the query finishes completely. How can this be configured correctly?
edit2:
I've upgraded to the latest Enterprise 4.5.1-2844. I have only the primary index created on the giata_properties bucket; when I execute the query with the LIMIT 1 keyword, it still takes the same time, it doesn't stop any sooner.
I've also tried creating the array index you suggested, but the query does not use it and keeps insisting on the #primary index (even if I use a USE INDEX clause).
I tried removing SELF from the index you suggested; it took much longer to build, and now the query can use this new index, but I'm honestly not sure what I'm doing here.
So 3 questions:
1) Why doesn't LIMIT 1 using the primary index make the query stop at the first result?
2) What's the difference between the index you suggested with and without SELF? I tried to look for SELF keyword documentation but I couldn't find anything.
This is how both indexes look in Web ui
Index 1 (Your original suggestion) - Not working
CREATE INDEX `gp_idx1` ON `giata_properties`((distinct (array (`v`.`value`) for `v` within (array_star((((self.`giata_properties`).`propertyCodes`).`provider`)).`code`) end)))
Index 2 (Without SELF)
CREATE INDEX `gp_idx2` ON `giata_properties`((distinct (array (`v`.`value`) for `v` within (array_star(((self.`propertyCodes`).`provider`)).`code`) end)))
3) What would be the query for a specific giata_properties.propertyCodes.provider.code.value.value and a specific providerCode? I managed to do both separately but I wasn't successful in merging them.
Thanks for all your help dear
Here is a query without the providerType.
EXPLAIN SELECT *
FROM giata_properties AS gp
WHERE ANY v WITHIN gp.giata_properties.propertyCodes.provider[*].code SATISFIES v.`value` = '071801' END;
You can also index this in Couchbase 4.5.0 and above.
CREATE INDEX idx1 ON giata_properties( DISTINCT ARRAY v.`value` FOR v WITHIN SELF.giata_properties.propertyCodes.provider[*].code END );
Edit to answer question edits
The performance has been addressed in 4.5.x. You should try the following on Couchbase 4.5.1 and post the execution times here:
1. Test on 4.5.1.
2. Create the index.
3. Use the LIMIT. In 4.5.1, the limit is pushed down to the index.
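For question 3, a sketch of combining both conditions by nesting the quantifiers, so that the code value and the providerCode are checked on the same provider entry (untested; same bucket layout as the query above):
SELECT *
FROM giata_properties AS gp
WHERE ANY p IN gp.giata_properties.propertyCodes.provider SATISFIES
        p.providerCode = 'restel'
        AND ANY v WITHIN p.code SATISFIES v.`value` = '071801' END
      END;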

Possible to chain results in N1ql?

I'm currently trying to do a bit of complex N1QL for a project I'm working on. Theoretically I could do all of this processing in multiple N1QL calls, parsing the results each time; however, if possible, I'd like for this to be contained in one call.
What I would like to do is:
filter all documents that contain a "dataSync.test.id" field with more than 1 id
Read back all other ids in that list
Use that list to get other documents containing those ids
Get the "dataSync.test._channels" field for those documents (optionally a filter by docType might help parsing)
This would probably return a list of "dataSync.test._channels"
Is this possible in N1QL? It appears like it might be but I can't get the syntax right.
My data structures look a little like
{
"dataSync": {
"test": {
"_channels": [
"RP"
],
"id": [
"dataSync_user_1015",
"dataSync_user_1010",
"dataSync_user_1005"
],
"_lastUpdatedBy": "TEST"
}
},
...
}
{
"dataSync": {
"test": {
"_channels": [
"RSD"
],
"id": [
"dataSync_user_1010"
],
"_lastUpdatedBy": "TEST"
}
},
...
}
Yes, I think you can do all of these.
The initial set of IDs with filtering can be retrieved as a subquery, and then you can get the subsequent documents by joins.
SELECT fulldoc
FROM (select meta().id as dockey from doc where a=1) as mydoc
INNER JOIN doc fulldoc ON KEYS mydoc.dockey;
There are optimizations that can be done here. Try the sequencing first to ensure it gets the job done.
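For the concrete steps listed in the question, here is a sketch of one possible single call (assuming a bucket named mybucket and Couchbase 4.5+; the optional docType filter is omitted): find documents whose dataSync.test.id holds more than one id, flatten those ids, and return the _channels of every document that contains any of them.
SELECT d2.dataSync.test._channels
FROM mybucket AS d2
WHERE ANY i IN d2.dataSync.test.id SATISFIES i IN (
        SELECT RAW ARRAY_FLATTEN(ARRAY_AGG(d1.dataSync.test.id), 1)
        FROM mybucket AS d1
        WHERE ARRAY_LENGTH(d1.dataSync.test.id) > 1
      )[0] END;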