Adding a frequency counter between nodes in Neo4j during CSV import

I've got a CSV file with ManufacturerPartNumbers and Manufacturers. Both values can potentially be duplicated across rows one or more times, meaning I could have ManufacturerPartNumber|Manufacturer pairs like: A|X, A|Y, A|Y, B|X, C|X
In this case, I'd like to create ManufacturerPartNumber nodes (A), (B), (C) and Manufacturer nodes (X), (Y)
I also want to create relationships of
(A)-[MADE_BY]->(X)
(A)-[MADE_BY]->(Y)
And I also want to apply a weighting value to the relationship between A and Y, since that pair appears twice in my dataset, so that I know there's a more frequent relationship between A|Y than there is between A|X.
Is there a more efficient way of doing this? I'm dealing with 10M rows of csv data and it is crashing during import.
:param UploadFile => 'http://localhost:11001/project-f64568ab-67b6-4560-ae89-8aea882892b0/file.csv';
//open the CSV file
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
//create nodes
MERGE (mfgr:Manufacturer {name: COALESCE(trim(toUpper(csvLine.Manufacturer)),'NULL')})
MERGE (mpn:MPN {name: COALESCE(trim(toUpper(csvLine.MPN)),'NULL')})
//set relationships
MERGE (mfgr)-[a:MAKES]->(mpn)
SET a += {appearances: (CASE WHEN a.appearances is NULL THEN 0 ELSE a.appearances END) + 1,
          refid: (CASE WHEN a.refid is NULL THEN csvLine.id ELSE a.refid + ' ~ ' + csvLine.id END)}
;
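As an aside, the CASE WHEN ... IS NULL counter above can be written more compactly with coalesce; a sketch with identical behavior:

```Cypher
MERGE (mfgr)-[a:MAKES]->(mpn)
SET a.appearances = coalesce(a.appearances, 0) + 1,
    a.refid = CASE WHEN a.refid IS NULL THEN csvLine.id ELSE a.refid + ' ~ ' + csvLine.id END
```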

Separating the node creation from the relationship creation, and then setting the values, helped a bit.
Ultimately what had the most impact was that I spun up an AuraDB at max size, imported all of the data, and then resized it back down. Probably not an ideal way to handle it, but it worked better than all the other optimizations and only cost me a few bucks!
//QUERY ONE: var2 and var1 nodes
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
MERGE (var2:VAR2 {name: COALESCE(trim(toUpper(csvLine.VAR2)),'NULL')})
MERGE (var1:VAR1 {name: COALESCE(trim(toUpper(csvLine.VAR1)),'NULL')})
;
//QUERY TWO: var2 -> var1 relationships
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
MERGE (var2:VAR2 {name: COALESCE(trim(toUpper(csvLine.VAR2)),'NULL')})
MERGE (var1:VAR1 {name: COALESCE(trim(toUpper(csvLine.VAR1)),'NULL')})
MERGE (var2)-[a:RELATES_TO]->(var1)
SET a += {appearances: (CASE WHEN a.appearances is NULL THEN 0 ELSE a.appearances END) + 1}
;
//QUERY THREE: handle descriptors
//open the CSV file
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
UNWIND split(trim(toUpper(csvLine.Descriptor)), ' ') AS DescriptionSep1
UNWIND split(trim(toUpper(DescriptionSep1)), ',') AS DescriptionSep2
UNWIND split(trim(toUpper(DescriptionSep2)), '|') AS DescriptionSep3
UNWIND split(trim(toUpper(DescriptionSep3)), ';') AS DescriptionSep4
MERGE (var2:VAR2 {name: COALESCE(trim(toUpper(csvLine.VAR2)),'NULL')})
MERGE (var1:VAR1 {name: COALESCE(trim(toUpper(csvLine.VAR1)),'NULL')})
MERGE (descriptor:Descriptor {name: COALESCE(trim(toUpper(DescriptionSep4)),'NULL')})
SET descriptor += {appearances: (CASE WHEN descriptor.appearances is NULL THEN 0 ELSE descriptor.appearances END) + 1}
MERGE (descriptor)-[d:DESCRIBES]->(var1)
SET d += {appearances: (CASE WHEN d.appearances is NULL THEN 0 ELSE d.appearances END) + 1}
;
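One more thing that helped (a hedged note: the index names here are illustrative, and the syntax assumes Neo4j 4.x or later): creating indexes on the merged name properties before running the queries above makes each MERGE lookup much faster. This is the same advice given for the Customer phone property in the answer further down.

```Cypher
CREATE INDEX var1_name IF NOT EXISTS FOR (n:VAR1) ON (n.name);
CREATE INDEX var2_name IF NOT EXISTS FOR (n:VAR2) ON (n.name);
CREATE INDEX descriptor_name IF NOT EXISTS FOR (n:Descriptor) ON (n.name);
```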

Related

Loading CSV in Neo4j is time consuming

I want to load a CDR CSV file with 648,000 records into Neo4j (4.4.10), but it has been running for about 4 days and is not yet complete.
My CSV has 648,000 records with 7 columns, and the file is about 48 MB.
My computer has 100 GB of RAM and an Intel Xeon E5 CPU.
The columns of the CSV are:
OP_Name
TP_Name
Called_Number
OP_ANI
Setup_Time
Duration
OP_Price
The code I use to load the CSV into Neo4j is:
```Cypher
:auto LOAD CSV WITH HEADERS FROM 'file:///cdr.csv' AS line FIELDTERMINATOR ','
WITH line
WHERE line['Called_Number'] IS NOT NULL AND line['OP_ANI'] IS NOT NULL
WITH line['OP_ANI'] AS OP_Phone,
     (CASE line['OP_Name'] WHEN 'TIC' THEN 'IRAN' ELSE 'Foreign' END) AS OP_country,
     line['Called_Number'] AS Called_Phone,
     (CASE line['TP_Name'] WHEN 'TIC' THEN 'IRAN' ELSE 'Foreign' END) AS TP_country,
     line['Setup_Time'] AS Setup_Time,
     line['Duration'] AS Duration,
     line['OP_Price'] AS OP_Price
CALL {
    WITH OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price
    MERGE (c:Customer {phone: toInteger(Called_Phone)})
    ON CREATE SET c.country = TP_country
    WITH c, OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price
    CALL apoc.create.addLabels(c, [c.country]) YIELD node
    MERGE (c2:Customer {phone: toInteger(OP_Phone)})
    ON CREATE SET c2.country = OP_country
    WITH c2, OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price, c
    CALL apoc.create.addLabels(c2, [c2.country]) YIELD node
    MERGE (c2)-[r:CALLED {setupTime: Setup_Time, duration: Duration, OP_Price: OP_Price}]->(c)
} IN TRANSACTIONS
```
How can I speed up the load operation?
MERGE acts as an upsert in Neo4j. So the statement:
MERGE (c:Customer{phone: toInteger(Called_Phone)})
checks whether a Customer node with the given phone number already exists. If it does, it performs the update; otherwise it creates the node. When there is a large number of nodes, this lookup can be very slow, and the CSV import will be slow overall. Creating an index on the phone property of Customer should do the trick. You can create the index like this:
CREATE INDEX phone IF NOT EXISTS FOR (n:Customer) ON (n.phone)
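A small follow-up sketch (assuming Neo4j 4.x, where both this index syntax and db.awaitIndexes are available): create the index first, wait for it to come online, and only then run the import, so every MERGE benefits from it.

```Cypher
CREATE INDEX phone IF NOT EXISTS FOR (n:Customer) ON (n.phone);
CALL db.awaitIndexes();
// now run the :auto LOAD CSV ... CALL { ... } IN TRANSACTIONS import
```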

How to fix a never-ending execution of a Cypher query - Neo4j Graph Database?

I'm dealing with the import of the Common Weakness Enumeration Catalog (.json file) into the Neo4j Graph Database, using the Cypher query language and the APOC library. Although I import the fields Weaknesses, Views, and External_References properly, I have an execution problem (without any error) with the import of the Categories field, which executes without ever ending. Below I present the structure of the .json file and my Cypher code.
"Weakness_Catalog": {
"Weaknesses": {"Weakness":[...]}
"Categories": {"Category":[...]}
"Views": {"View":[...]}
"External_References": {"External_Reference":[...]}
}
Cypher Query
After several tests I think the logic error is between the last 2 parts [with value....(catRef)]; without them, the query executes pretty well, in normal time. I've also changed a setting parameter in the db configuration file due to an error (cypher.lenient_create_relationship = true). And I tested different import sequences with the same bad results (weaknesses, categories, views, ext. references, etc.).
call apoc.load.json(files) yield value
unwind value.Weakness_Catalog.Weaknesses.Weakness as weakness
merge (i:GeneralInfo_CWE {Name:value.Weakness_Catalog.Name, Version:value.Weakness_Catalog.Version,
Date:value.Weakness_Catalog.Date, Schema:'https://cwe.mitre.org/data/xsd/cwe_schema_v6.4.xsd'})
merge(w:CWE {Name:'CWE-' + weakness.ID})
set w.Extended_Name=weakness.Name, w.Abstraction=weakness.Abstraction,
w.Structure=weakness.Structure, w.Status=weakness.Status, w.Description=weakness.Description,
w.Extended_Description= apoc.convert.toString(weakness.Extended_Description),
w.Likelihood_Of_Exploit=weakness.Likelihood_Of_Exploit,
w.Background_Details=apoc.convert.toString(weakness.Background_Details.Background_Detail),
w.Modes_Of_Introduction=[value in weakness.Modes_Of_Introduction.Introduction | value.Phase],
w.Submission_Date=weakness.Content_History.Submission.Submission_Date,
w.Submission_Name=weakness.Content_History.Submission.Submission_Name,
w.Submission_Organization=weakness.Content_History.Submission.Submission_Organization,
w.Modifications=[value in weakness.Content_History.Modification | apoc.convert.toString(value)],
w.Alternate_Terms=apoc.convert.toString(weakness.Alternate_Terms),
w.Notes=[value in weakness.Notes.Note | apoc.convert.toString(value)],
w.Affected_Resources=[value in weakness.Affected_Resources.Affected_Resource | value],
w.Functional_Areas=[value in weakness.Functional_Areas.Functional_Area | value]
merge (w)-[:belongsTo]->(i)
with w, weakness, value
unwind weakness.Related_Weaknesses.Related_Weakness as Rel_Weakness
match (cwe:CWE) where cwe.Name='CWE-' + Rel_Weakness.CWE_ID
merge (w)-[:Related_Weakness{Nature:Rel_Weakness.Nature}]->(cwe)
with w, weakness, value
unwind weakness.Applicable_Platforms as appPl
foreach (lg in appPl.Language |
merge(ap:Applicable_Platform{Type:'Language', Prevalence:lg.Prevalence,
Name:coalesce(lg.Name, 'NOT SET'), Class:coalesce(lg.Class, 'NOT SET')})
merge(w)-[:Applicable_Platform]->(ap))
with w, weakness, value, appPl
foreach (tch in appPl.Technology |
merge(ap:Applicable_Platform{Type:'Technology', Prevalence:tch.Prevalence,
Name:coalesce(tch.Name, 'NOT SET'), Class:coalesce(tch.Class, 'NOT SET')})
merge(w)-[:Applicable_Platform]->(ap))
with w, weakness, value, appPl
foreach (arc in appPl.Architecture |
merge(ap:Applicable_Platform{Type:'Architecture', Prevalence:arc.Prevalence,
Name:coalesce(arc.Name, 'NOT SET'), Class:coalesce(arc.Class, 'NOT SET')})
merge(w)-[:Applicable_Platform]->(ap))
with w, weakness, value, appPl
foreach (os in appPl.Operating_System |
merge(ap:Applicable_Platform{Type:'Operating System', Prevalence:os.Prevalence,
Name:coalesce(os.Name, 'NOT SET'), Class:coalesce(os.Class, 'NOT SET')})
merge(w)-[:Applicable_Platform]->(ap))
with w, weakness, value
foreach (example in weakness.Demonstrative_Examples.Demonstrative_Example |
merge(ex:Demonstrative_Example {Intro_Text:apoc.convert.toString(example.Intro_Text)})
set ex.Body_Text=[value in example.Body_Text | apoc.convert.toString(value)],
ex.Example_Code=[value in example.Example_Code | apoc.convert.toString(value)]
merge (w)-[:hasExample]->(ex))
with w, weakness, value
foreach (consequence in weakness.Common_Consequences.Consequence |
merge (con:Consequence{CWE:w.Name, Scope:[value in consequence.Scope | value]})
set con.Impact=[value in consequence.Impact | value],
con.Note=consequence.Note, con.Likelihood=consequence.Likelihood
merge(w)-[:hasConsequence]->(con))
with w, weakness, value
foreach (dec in weakness.Detection_Methods.Detection_Method |
merge(d:Detection_Method {Method:dec.Method})
merge(w)-[wd:canBeDetected{Description:apoc.convert.toString(dec.Description)}]->(d)
set wd.Effectiveness=dec.Effectiveness, wd.Effectiveness_Notes=dec.Effectiveness_Notes,
wd.Detection_Method_ID=dec.Detection_Method_ID)
with w, weakness, value
foreach (mit in weakness.Potential_Mitigations.Mitigation |
merge(m:Mitigation {Description:apoc.convert.toString(mit.Description)})
set m.Phase=[value in mit.Phase | value], m.Strategy=mit.Strategy,
m.Effectiveness=mit.Effectiveness, m.Effectiveness_Notes=mit.Effectiveness_Notes,
m.Mitigation_ID=mit.Mitigation_ID
merge(w)-[:hasMitigation]->(m))
with w, weakness, value
foreach (rap in weakness.Related_Attack_Patterns.Related_Attack_Pattern |
merge(cp:CAPEC {Name:rap.CAPEC_ID})
merge(w)-[:RelatedAttackPattern]->(cp))
with w, weakness, value
foreach (reference in value.Weakness_Catalog.External_References.External_Reference |
merge(r:External_Reference{Reference_ID:reference.Reference_ID})
set r.Author=[value in reference.Author | value], r.Title=reference.Title,
r.Edition=reference.Edition, r.URL=reference.URL,
r.Publication_Year=reference.Publication_Year, r.Publisher=reference.Publisher)
with w, weakness, value
unwind weakness.References.Reference as exReference
match (ref:External_Reference) where ref.Reference_ID=exReference.External_Reference_ID
merge(w)-[:hasExternal_Reference]->(ref)
with value
unwind value.Weakness_Catalog.Views.View as view
merge (v:CWE_VIEW{ViewID:view.ID})
set v.Name=view.Name, v.Type=view.Type, v.Status=view.Status,
v.Objective=apoc.convert.toString(view.Objective), v.Filter=view.Filter,
v.Notes=apoc.convert.toString(view.Notes),
v.Submission_Name=view.Content_History.Submission.Submission_Name,
v.Submission_Date=view.Content_History.Submission.Submission_Date,
v.Submission_Organization=view.Content_History.Submission.Submission_Organization,
v.Modification=[value in view.Content_History.Modification | apoc.convert.toString(value)]
foreach (value in view.Audience.Stakeholder |
merge (st:Stakeholder{Type:value.Type})
merge (v)-[rel:usefulFor]->(st)
set rel.Description=value.Description)
with v, view, value
unwind (case view.Members.Has_Member when [] then [null] else view.Members.Has_Member end) as members
optional match (MemberWeak:CWE{Name:'CWE-' + members.CWE_ID})
merge (v)-[:hasMember{ViewID:members.View_ID}]->(MemberWeak)
with v, view, value
unwind (case view.References.Reference when [] then [null] else view.References.Reference end) as viewExReference
optional match (viewRef:External_Reference{Reference_ID:viewExReference.External_Reference_ID})
merge (v)-[:hasExternal_Reference{ViewID:v.ViewID}]->(viewRef)
with value
unwind value.Weakness_Catalog.Categories.Category as category
merge (c:CWE_Category{CategoryID:category.ID})
set c.Name=category.Name, c.Status=category.Status, c.Summary=apoc.convert.toString(category.Summary),
c.Notes=apoc.convert.toString(category.Notes), c.Submission_Name=category.Content_History.Submission.Submission_Name,
c.Submission_Date=category.Content_History.Submission.Submission_Date,
c.Submission_Organization=category.Content_History.Submission.Submission_Organization,
c.Modification=[value in category.Content_History.Modification | apoc.convert.toString(value)]
with c, category
unwind (case category.References.Reference when [] then [null] else category.References.Reference end) as categoryExReference
optional match (catRef:External_Reference{Reference_ID:categoryExReference.External_Reference_ID})
merge (c)-[:hasExternal_Reference{CategoryID:c.CategoryID}]->(catRef)
So, the problem was that every time I use WITH, I'm effectively working in nested loops. The more nested loops, the slower the query will be. A good way to speed things up is to create simpler queries where possible.
For example, in the JSON file:
"Weakness_Catalog": {
"Weaknesses": {"Weakness":[...]}
"Categories": {"Category":[...]}
"Views": {"View":[...]}
"External_References": {"External_Reference":[...]}
}
I will execute one query for Weaknesses, one for Categories, one for Views, and one for External_References.
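For instance, the Categories part alone becomes a short standalone query. A sketch, reusing the category handling from the big query above (files is the same parameter passed to apoc.load.json; the FOREACH guard is an assumption on my part to skip unmatched references, which also removes the need for cypher.lenient_create_relationship):

```Cypher
CALL apoc.load.json(files) YIELD value
UNWIND value.Weakness_Catalog.Categories.Category AS category
MERGE (c:CWE_Category {CategoryID: category.ID})
SET c.Name = category.Name, c.Status = category.Status,
    c.Summary = apoc.convert.toString(category.Summary),
    c.Notes = apoc.convert.toString(category.Notes)
WITH c, category
UNWIND (CASE category.References.Reference WHEN [] THEN [null] ELSE category.References.Reference END) AS categoryExReference
OPTIONAL MATCH (catRef:External_Reference {Reference_ID: categoryExReference.External_Reference_ID})
// only create the relationship when the reference actually matched
FOREACH (_ IN CASE WHEN catRef IS NULL THEN [] ELSE [1] END |
    MERGE (c)-[:hasExternal_Reference {CategoryID: c.CategoryID}]->(catRef))
```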

Read CSV to Neo4j creating one node per column and relations

I am stuck on the Neo4j command (I am a newbie) to create a database based on a CSV like this:
Country,Name1,Name2,Name3,Influence
France,John,Pete,Josh,2
Italy,Pete,Bepe,Juan,3
USA,Josh,Juan,Pete,1
Spain,Juan,John,,2
When I try to create one node per person (NameX), setting the relationships between the name columns and adding the Influence and Country tags, it fails because there are empty names.
How can I achieve this?
Thanks
UPDATE:
LOAD CSV WITH HEADERS FROM 'file:///diag.csv' AS row FIELDTERMINATOR ';'
MERGE (c:Country{name:row.Country})
WITH CASE row.name1 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name1] END as name1List, c
WITH CASE row.name2 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name2] END as name2List, c
WITH CASE row.name3 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name3] END as name3List, c
FOREACH (x IN name1List | MERGE (n:Node{name : x} )
MERGE (n)-[:REL_TYPE]->(c)
)
FOREACH (x IN name2List | MERGE (n:Node{name : x} )
MERGE (n)-[:REL_TYPE]->(c)
)
FOREACH (x IN name3List | MERGE (n:Node{name : x} )
MERGE (n)-[:REL_TYPE]->(c)
)
RETURN SUM(1)
Getting error:
Variable row not defined (line 4, column 11 (offset: 209))
"WITH CASE row.name2 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name2] END as >name2List ,c"
The last line has an empty Name3 field. Try adding a Name3 value to the last line in your data set:
Spain,Juan,John, {empty - fill this},2
LOAD CSV WITH HEADERS FROM 'file:///diag.csv' AS row FIELDTERMINATOR ';'
MERGE (c:Country{name:row.Country})
WITH CASE row.name1 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name1] END as name1List ,
CASE row.name2 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name2] END as name2List ,
CASE row.name3 WHEN NULL THEN [] WHEN '' THEN [] ELSE [row.name3] END as name3List ,c,row
FOREACH (x IN name1List | MERGE (n:Node{name : x} ) MERGE (n)-[:REL_TYPE]->(c) )
FOREACH (x IN name2List | MERGE (n:Node{name : x} ) MERGE (n)-[:REL_TYPE]->(c) )
FOREACH (x IN name3List | MERGE (n:Node{name : x} ) MERGE (n)-[:REL_TYPE]->(c) )
RETURN SUM(1)
Here, with the help of Cypher's CASE expression, we create either an empty list (when the value is null or empty) or a list with one value, i.e. [row.name3]. After the CASE check, we can use this list to iterate and create a node with the name property; when the value is null or empty, you iterate zero times, so you won't get the error.
Finally, SUM(1) will give you the number of rows you processed, so you can cross-check whether you have processed all the rows in the CSV file.
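Distilled, the pattern looks like this (a generic sketch; row.SomeColumn stands in for whichever column may be empty):

```Cypher
// Wrap a possibly-empty value in a 0- or 1-element list,
// so FOREACH iterates zero or one times and never MERGEs on null.
FOREACH (x IN CASE WHEN row.SomeColumn IS NULL OR row.SomeColumn = '' THEN [] ELSE [row.SomeColumn] END |
    MERGE (n:Node {name: x})
    MERGE (n)-[:REL_TYPE]->(c))
```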

Formatting data in a CSV file (calculating average) in Python

import csv

with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print(parts[4])

f = open('Class1scores.csv')
csv_f = csv.reader(f)

newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])

averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
Here the code is meant to read the CSV file; for each row (column 0 being the user's name) it should add the three scores and divide by three, but it doesn't calculate a proper average, it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        scores = [int(s) for s in row[1:]]  # convert the string fields to ints
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX': max(scores),
            'MIN': min(scores),
            'AVE': 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether specific steps are locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row doesn't achieve anything, because newlist.append(row[0:4]) keeps only the first four elements, so the appended maximum and average are thrown away.
row[1:3] only gives the second and the third element; row[1:4] gives what you want, as does row[1:]. Slicing in Python is end-exclusive.
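A quick illustration of the end-exclusive slicing (hypothetical row values):

```python
row = ['John', 10, 7, 9]
print(row[1:3])  # [10, 7]    - stops before index 3
print(row[1:4])  # [10, 7, 9]
print(row[1:])   # [10, 7, 9]
```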
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
Here is one way to do it. See both parts. First, we create a dictionary with names as the keys and lists of results as the values.
import csv

fileLineList = []
averageScoreDict = {}

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter

fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list to John which is 3 integers:
    # highest, lowest and average (which is a float we round).
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

averageScoreList = []

# Here we "unpack" the dictionary we have created and build a list of keys,
# which are the names, and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])

# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)

print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]

Creating multiple labels with CSV

I am trying to load a CSV file to create nodes and labels. Is there a way to add more than one label at the same time? (I am using Neo4j 2.1.1.)
this is my csv:
1,Test1,hardkey,button
2,Test2,touch,button
3,Test3,,screen
I tried this:
LOAD CSV FROM 'file:/Users/Claudia/Documents/nodes.csv' AS csvLine
FOREACH (n IN (CASE WHEN csvLine[2]='hardkey' THEN [1] ELSE[] END) |
MERGE (p:hardkey {name: csvLine[1]})
)
FOREACH (n IN (CASE WHEN csvLine[2]='touch' THEN [1] ELSE[] END) |
MERGE (p:touch {name: csvLine[1]})
)
This works, but how do I get the other column's values ("button" and "screen") included?
Thanks a lot.
Like this?
See the MERGE documentation.
LOAD CSV FROM 'file:/Users/Claudia/Documents/nodes.csv' AS csvLine
FOREACH (n IN (CASE WHEN csvLine[2]='hardkey' THEN [1] ELSE[] END) |
MERGE (p:hardkey {name: csvLine[1]}) ON CREATE SET p.what = csvLine[3]
)
FOREACH (n IN (CASE WHEN csvLine[2]='touch' THEN [1] ELSE[] END) |
MERGE (p:touch {name: csvLine[1]}) ON CREATE SET p.what = csvLine[3]
)
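On newer Neo4j versions with APOC installed, the label can also be taken straight from the CSV column instead of writing one FOREACH per value. A hedged sketch (the Element label is illustrative, and apoc.create.addLabels is the same procedure used in the CALLED question above):

```Cypher
LOAD CSV FROM 'file:/Users/Claudia/Documents/nodes.csv' AS csvLine
MERGE (p:Element {name: csvLine[1]})
SET p.what = csvLine[3]
WITH p, [l IN [csvLine[2]] WHERE l IS NOT NULL AND l <> ''] AS labels
CALL apoc.create.addLabels(p, labels) YIELD node
RETURN count(node)
```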