Import big csv file partially into Neo4j - csv

A big csv file (ca. 7 GB), which is my result of some pairwise computation between rows of objects 1 and objects 2 using 3 different methods looks like this:
obj 1 , obj 2 , method1 , method2 , method3
obj1 & obj2 are strings and method1,method2,method3 are float values.
I don't want to import the whole csv into Neo4j, but when values of method1 is above certain threshold, I want that specific row be imported and between object1 and object2 of that row an edge to be defined and the same for method2 and method3, what I mean is, that when values of method2 is above certain threshold, I want that specific row be imported and between object1 and object2 of that row an edge to be defined. thanks #cybersam and #Dave Bennett, who wroted me here original queries, I changed them a bit and run 3 following queries seperately. After that when i write a for example simple query like:
match (n)-[r:similar_on_method2]-(m) return n,r,m
I get not only demanded relation but other relations included in result graph, I dont know, what is wrong ?!
Using periodic commit
LOAD CSV WITH HEADERS
FROM 'file:///objects.csv'
AS line
WITH line
WHERE toFloat(line.method1) >= $x
MERGE (obj1:Object {name: line.obj1})
MERGE (obj2:Object {name: line.obj2})
MERGE (obj1)-[:similar_on_method1]->(obj2)
Using periodic commit
LOAD CSV WITH HEADERS
FROM 'file:///objects.csv'
AS line
WITH line
WHERE toFloat(line.method2) >= $x
MERGE (obj1:Object {name: line.obj1})
MERGE (obj2:Object {name: line.obj2})
MERGE (obj1)-[:similar_on_method2]->(obj2)
Using periodic commit
LOAD CSV WITH HEADERS
FROM 'file:///objects.csv'
AS line
WITH line
WHERE toFloat(line.method3) >= $x
MERGE (obj1:Object {name: line.obj1})
MERGE (obj2:Object {name: line.obj2})
MERGE (obj1)-[:similar_on_method3]->(obj2)

I guess how big is big? But something along these lines should get you started. If the obj1 and obj2 values repeat in your data or already exist in your database, you will want to create some indexes on the obj1 and obj2 values.
LOAD CSV WITH HEADERS
FROM 'file:///objects.csv'
AS line
WITH line
WHERE toInteger(line.method1) >= $x
AND toInteger(line.method2) >= $y
AND toInteger(line.method3) >= $z
MERGE (obj1:Object {name: line.obj1})
MERGE (obj2:Object {name: line.obj2})
MERGE (obj1)-[:LINK]->(obj2)

Related

Create nodes from CSV in Neo4j

I have a csv file and I want to make 2 nodes with relation (node country-reported_on->node report_date). I have tried this code but it returns empty nodes with numbers instead of country name.
Here is what my dataset looks like:
PEOPLE_POSITIVE_CASES_COUNT;REPORT_DATE;COUNTRY_SHORT_NAME;PEOPLE_DEATH_COUNT;LIFE_EXPECTANCY;GDP;DENSITY_POPULATION;WORKFORCE
0;22.01.2020;Lesotho;0;54.836;875.353432963926;70.5616600790514
134;09.07.2020;Lesotho;1;54.836;875.353432963926;70.5616600790514
79557;02.03.2021;Zambia;1104;64.194;985.132436038869;94.4781600309238
106470;02.03.2021;Kenya;1863;66.991;1878.58070251348;94.4781600309238
Here is the code that I used:
LOAD CSV WITH HEADERS FROM "file:///dataset.csv"
as row WITH row WHERE row.COUNTRY_SHORT_NAME IS NOT NULL
MERGE (c:Country {name: row.COUNTRY_SHORT_NAME,
life_exp: row.LIFE_EXPECTANCY,
gdp: row.GDP,
density_population: row.DENSITY_POPULATION,
worforce: row.WORKFORCE } )
MERGE ( d:Report_date { date: row.REPORT_DATE } )
MERGE (c)-[:reported_on {cases_count: row.PEOPLE_POSITIVE_CASES_COUNT,
death_count: row.PEOPLE_DEATH_COUNT}]->(d)
EDIT
I changed the delimiter to ';' because that is what we had in our dataset however we still get bad results here is how it looks like in neo4j after running this code:
LOAD CSV WITH HEADERS FROM "file:///dataset.csv"
as row FIELDTERMINATOR ';' WITH row WHERE row.COUNTRY_SHORT_NAME IS NOT NULL
MERGE (c:Country {name: row.COUNTRY_SHORT_NAME,
life_exp: row.LIFE_EXPECTANCY,
gdp: row.GDP,
density_population: row.DENSITY_POPULATION,
worforce: row.WORKFORCE } )
MERGE ( d:Report_date { date: row.REPORT_DATE } )
MERGE (c)-[:reported_on {cases_count: row.PEOPLE_POSITIVE_CASES_COUNT,
death_count: row.PEOPLE_DEATH_COUNT}]->(d)
I think you got confused with node caption in Neo4j browser, all nodes get assigned an node id by default and nodes must be showing that. You can change it to country name property by clicking on node label. Screen shot for reference.

Is there a way to import a csv into Neo4j using foreach or unwind?

I am using the following .csv file for Neo4j import. There are 202 rackets. The numbers below racketX are the rating the user has given that racket.
I want to create the relationships among the users and the rating they have given to each racket. This is my current approach:
LOAD CSV WITH HEADERS FROM 'http://spreding.online/racket-recommendation-system/data/formattedFiles/formattedUsers.csv' AS row
WITH row
WHERE row.username IS NOT NULL
MERGE (u:User {
username: row.username,
height_m: toInteger(row.height),
weight_kg: toInteger(row.weight)
})
WITH row, u, range(3, 204) as indexes
MATCH (r:Racket)
UNWIND r as racket
UNWIND indexes as i
MERGE (u)-[:RATES {rating:toInteger(row[i])}]->(racket)
I get a "cannot access a map" error. Can you help me?
I would break down the load into multiple steps.
Load the users.
LOAD CSV WITH HEADERS FROM 'http://spreding.online/racket-recommendation-system/data/formattedFiles/formattedUsers.csv' AS row
WITH row
WHERE row.username IS NOT NULL
MERGE (u:User {
username: row.username,
height_m: toInteger(row.height),
weight_kg: toInteger(row.weight)
})
Load the rackets.
UNWIND RANGE(1,202) as idx
CREATE (:Racket {racketNumber:"racket"+idx})
Load the relationships.
LOAD CSV WITH HEADERS FROM 'http://spreding.online/racket-recommendation-system/data/formattedFiles/formattedUsers.csv' AS row
UNWIND RANGE (1,202) as idx
MATCH (u:User {username:row.username})
MATCH (r:Racket {racketNumber:"racket"+idx})
MERGE (u)-[:RATES {rating:toInteger(row["racket"+idx])}]->(r)

Create nodes conditionally when loading nodes from csv in Neo4j

I have a data set in csv format. One of the fields is "elem_type". Based on this type I need to create different types nodes and give to the "columns" of my csv a different name based on the "elem_type" when loading the data using csv load, any way to do that?
My csv has no header and the data look like this:
0, 123, Marco, Ciao
1, 345, Merceds, Car, Yellow
2, 987, Boat, 150cm
Based on the first colmuns that is my "elem_type" i want to load the data and define 3 types of nodes (Person, Car, Boat) and also based on the elem_type define different header
I highly recommend to pre-parse the csv file into separate files for each label. It will make the cypher for import so much easier. In the following I use a little trick by wrapping the CASE command inside a FOREACH:
load csv from "file:///test.csv" as line
foreach (i in case when line[0] = '0' then [1] else [] end |
merge (p:Person {id: line[1]}) set p.name = line[2] )
foreach (i in case when line[0] = '1' then [1] else [] end |
merge (c:Car {id: line[1]}) set c.name = line[2], c.color = line[4] )
foreach (i in case when line[0] = '2' then [1] else [] end |
merge (b:Boat {id: line[1]}) set b.name = line[2] )
Also, don't forget to add indexes on the properties you are merging on.

How to specify relationship type in CSV?

I have a CSV file with data like:
ID,Name,Role,Project
1,James,Owner,TST
2,Ed,Assistant,TST
3,Jack,Manager,TST
and want to create people whose relationships to the project are therein specified. I attempted to do it like this:
load csv from 'file:/../x.csv' as line
match (p:Project {code: line[3]})
create (n:Individual {name: line[1]})-[r:line[2]]->(p);
but it barfs with:
Invalid input '[': expected an identifier character, whitespace, '|',
a length specification, a property map or ']' (line 1, column 159
(offset: 158))
as it can't seem to dereference line in the relationship creation. if I hard-code that it works:
load csv from 'file:/../x.csv' as line
match (p:Project {code: line[3]})
create (n:Individual {name: line[1]})-[r:WORKSFOR]->(p);
so how do I make the reference?
Right now you can't as this is structural information.
Either use neo4j-import tool for that.
Or specify it manually as you did, or use this workaround:
load csv with headers from 'file:/../x.csv' as line
match (p:Project {code: line.Project})
create (n:Individual {name: lineName})
foreach (x in case line.Role when "Owner" then [1] else [] end |
create (n)-[r:Owner]->(p)
)
foreach (x in case line.Role when "Assistant" then [1] else [] end |
create (n)-[Assistant]->(p)
)
foreach (x in case line.Role when "Manager" then [1] else [] end |
create (n)-[r:Manager]->(p)
)
Michael's answer stands, however, I've found that what I can do is specify an attribute to the relationship, like this:
load csv from 'file:/.../x.csv' as line
match (p:Project {code: line[3]})
create (i:Individual {name: line[1]})-[r:Role { type: line[2] }]->(p)
and I can make Neo4j display the type attribute of the relationship instead of the label
this question is old, but there is a post by Mark Needham
that provide a great and easy solution using APOC
as follow
load csv with headers from "file:///people.csv" AS row
MERGE (p1:Person {name: row.node1})
MERGE (p2:Person {name: row.node2})
WITH p1, p2, row
CALL apoc.create.relationship(p1, row.relationship, {}, p2) YIELD rel
RETURN rel
note: the "YIELD rel" is essential and so for the return part

Parsing numerical data using Prolog?

I am new to prolog and am considering using it for a small data analysis application. Here is what I am seeking to accomplish:
I have a CSV file with some data of the following from:
a,b,c
d,e,f
g,h,i
...
The data is purely numerical and I need to do the following: 1st, I need to group rows according to the following scheme:
So what's going on above?
I start at the 1st row, which has value 'a' in column one. Then, I keep going down the rows until I hit a row whose value in column one differs from 'a' by a certain amount, 'z'. The process is then repeated, and many "groups" are formed after the process is complete.
For each of these groups, I want to find the mean of columns two and three (as an example, for the 1st group in the picture above, the mean of column two would be: (b+e+h)/3).
I am pretty sure this can be done in prolog. However, I have 50,000+ rows of data and since prolog is declarative, I am not sure how efficient prolog would be at accomplishing the above task?
Is it feasible to work out a prolog program to accomplish the above task, so that efficiency of the program is not significantly lower than a procedural analog?
this snippet could be a starting point for your task
:- [library(dcg/basics)].
rownum(Z, AveList) :- phrase_from_file(row_scan(Z, [], [], AveList), 'numbers.txt').
row_scan(Z, Group, AveSoFar, AveList) -->
number(A),",",number(B),",",number(C),"\n",
{ row_match(Z, A,B,C, Group,AveSoFar, Group1,AveUpdated) },
row_scan(Z, Group1, AveUpdated, AveList).
row_scan(_Z, _Group, AveList, AveList) --> "\n";[].
% row_match(Z, A,B,C, Group,Ave, Group1,Ave1)
row_match(_, A,B,C, [],Ave, [(A,B,C)],Ave).
row_match(Z, A,B,C, [H|T],Ave, Group1,Ave1) :-
H = (F,_,_),
( A - F =:= Z
-> aggregate_all(agg(count,sum(C2),sum(C3)),
member((_,C2,C3), [(A,B,C), H|T]), agg(Count,T2,T3)),
A2 is T2/Count, A3 is T3/Count,
Group1 = [], Ave1 = [(A2,A3)|Ave]
; Group1 = [H,(A,B,C)|T], Ave1 = Ave
).
with this input
1,2,3
4,5,6
7,8,9
10,2,3
40,5,6
70,8,9
16,0,0
yields
?- rownum(6,L).
L = [ (3.75, 4.5), (5, 6)]