Load from CSV to create a binary tree

I need to create a binary tree in Neo4j. I started by creating two CSV files, one for vertices and one for edges, and then I ran two queries to create the entire tree.
I thought that I could create the entire tree with only one query.
The CSV I start from is this:
"parent","child_1","child_1_attr1","child_1_attr2","edge_1_attr1","edge_1_attr2","child_2","child_2_attr1","child_2_attr2","edge_2_attr1","edge_2_attr2"
"vertex_1","vertex_2","2","5","4","1","vertex_3","5","3","2","2"
"vertex_2","vertex_4","3","5","2","3","vertex_5","4","4","4","3"
"vertex_3","vertex_6","2","1","2","4","vertex_7","2","2","5","5"
"vertex_4","vertex_8","4","4","4","5","vertex_9","2","3","2","5"
"vertex_5","vertex_10","1","1","3","3","vertex_11","1","3","2","3"
"vertex_6","vertex_12","3","1","1","1","vertex_13","1","2","5","1"
"vertex_7","vertex_14","4","2","2","1","vertex_15","2","5","4","3"
Then I tried this query:
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MATCH (p:Vertex {name: line.parent})
CREATE (c1:Vertex {name: line.child_1, attr1: line.child_1_attr1, attr2: line.child_1_attr2})
CREATE (c2:Vertex {name: line.child_2, attr1: line.child_2_attr1, attr2: line.child_2_attr2})
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1)
CREATE (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)
Before running this query I manually create the first vertex; then I run the query, but the only result I get is the creation of vertices 1, 2 and 3.
It should match the parent (which is always already created), then create the two children, and then connect those two children to their parent.
Can anyone help me?

Your mental model of execution is likely: for each line/row, execute all the Cypher code, then repeat for the next line/row until finished. This is incorrect.
Instead, each individual Cypher operation executes for all rows, then the next Cypher operation executes for all rows, and so on.
This means your MATCH operation:
MATCH (p:Vertex {name: line.parent})
is performed across all lines in your CSV at the same time, and only then will it proceed to the next operation (your CREATE, acting on all lines), and so on.
Since you stated that you manually created the first vertex, that is the only vertex the MATCH will find; the MATCH fails for every other line in your CSV, because the CREATE statements haven't executed yet, so those nodes don't exist. As a result, only two vertices are created: the children of that single matched node.
It's usually good practice when importing CSV data to create all nodes first, and then make a separate pass over the CSV to match the already-created nodes and create the relevant relationships.
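For example, such a two-pass import might look like this (a sketch, reusing the same Prova1.csv as above):

// pass 1: create all vertices
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MERGE (p:Vertex {name: line.parent})
MERGE (c1:Vertex {name: line.child_1})
SET c1.attr1 = line.child_1_attr1, c1.attr2 = line.child_1_attr2
MERGE (c2:Vertex {name: line.child_2})
SET c2.attr1 = line.child_2_attr1, c2.attr2 = line.child_2_attr2

// pass 2: all vertices now exist, so create the edges
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MATCH (p:Vertex {name: line.parent})
MATCH (c1:Vertex {name: line.child_1})
MATCH (c2:Vertex {name: line.child_2})
MERGE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1)
MERGE (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)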
However, if you do want to create everything in one go, you'll likely want to use MERGE in various places, but this is also tricky if you don't fully understand the behavior of MERGE (it's essentially an attempt to MATCH, and if no match is found, a CREATE) or how Cypher is executed (as in this case).
You'll also want to MERGE on a unique node property instead of all properties, and SET the remaining properties. It's also especially helpful to have either unique constraints or indexes (whichever is appropriate) on the relevant label/properties for faster execution, especially as your graph grows.
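For example, a uniqueness constraint on the name property could be created like this (a sketch; this is the older Neo4j syntax, and newer versions use CREATE CONSTRAINT ... FOR (v:Vertex) REQUIRE v.name IS UNIQUE instead):

// older syntax; check your Neo4j version's documentation
CREATE CONSTRAINT ON (v:Vertex) ASSERT v.name IS UNIQUE;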
This query may work.
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MERGE (p:Vertex {name: line.parent})
MERGE (c1:Vertex {name: line.child_1})
SET c1.attr1 = line.child_1_attr1, c1.attr2 = line.child_1_attr2
MERGE (c2:Vertex {name: line.child_2})
SET c2.attr1 = line.child_2_attr1, c2.attr2 = line.child_2_attr2
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1)
CREATE (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)
The reason this one works is that by the time your very first MERGE has completed across all rows, it will have created ALL parent nodes (or rather, the nodes that will be parents) in your graph.
So when we reach the MERGE for your child nodes, it will MATCH most of those already-created nodes. The only new nodes created at that point will be leaf nodes, which would not have been created by your first MERGE, since they are not parents to any other nodes and do not show up in the parent column of your CSV.

For some reason the import query is not working when you match the parent first and then create your nodes and relationships. I modified the query this way, and it works now:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
CREATE (c1:Vertex {name: line.child_1, attr1: line.child_1_attr1, attr2: line.child_1_attr2}),
       (c2:Vertex {name: line.child_2, attr1: line.child_2_attr1, attr2: line.child_2_attr2})
WITH c1, c2, line
MATCH (p:Vertex {name: line.parent})
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1),
       (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)
So if you first create the nodes and then match the parent and create the relationships, the query works. (Result screenshot omitted.)
I will keep investigating your query, because I do not really understand why it is not working.

FOREACH (num IN range(1, 15) |
  MERGE (parent:Node {number: num})
  MERGE (left:Node {number: num + num})
  MERGE (right:Node {number: num + num + 1})
  MERGE (left)<-[:LEFT]-(parent)-[:RIGHT]->(right)
)
Explanation:
This creates a perfect binary tree structure with 31 nodes. Then you can include the same numbers in the CSV to find and add properties to each correspondingly numbered node.
In a binary tree, if you give the first (root, or topmost) node a number property with value 1 and then increment each subsequent node's number by 1 (left to right, top to bottom), you get a convenient mathematical relationship: each node's left child has a number of number + number (i.e. twice the parent's number), and the right child has number + number + 1.
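A follow-up import along those lines might look like this (a sketch, assuming a hypothetical nodes.csv with number, attr1 and attr2 columns):

// match each node by its tree number (nodes.csv and its columns are assumptions)
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS line
MATCH (n:Node {number: toInteger(line.number)})
SET n.attr1 = line.attr1, n.attr2 = line.attr2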

Related

Creating one to many relations in neo4j

So I'm very new to using a graph database, and I have chosen Neo4j. I'm trying to make a simple recommender system based on the graph nodes.
My original dataset is a CSV (image omitted). Since some of the fields contain semicolons, I separated them and parsed the data into a new CSV, basically making every combination of fields.
The new CSV looks like this (image omitted; shown only for N2, but I have done the same thing for N1 and N3 as well).
Now, I need to create nodes and relations in such a way that each
Name KNOWS Language
Name WORKED_WITH Database.
Hence, I ran the following query:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
CREATE (n:Name {name: row.Name})
CREATE (l: Language {language: row.Language})
CREATE (d: Database {database: row.Database})
CREATE (n)-[:KNOWS]->(l)
CREATE (n)-[:WORKED_WITH]->(d)
This is the output I get (image omitted; only the N2 nodes are shown).
Since I want to build a recommender, my idea was to link the name to the language and database.
Expected output: (image omitted)
I want to link it this way so I can count the total number of incoming relationships on a Language or Database node to recommend it.
Can someone tell me where I'm going wrong?
When you use the CREATE clause, it creates new nodes each time.
If you want to reuse an existing node, and create it only if it doesn't exist, you need to use the MERGE clause instead of CREATE.
Here is your query with MERGE:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (n:Name {name: row.Name})
MERGE (l: Language {Language: row.Language})
MERGE (d: Database {database: row.Database})
MERGE (n)-[:KNOWS]->(l)
MERGE (n)-[:WORKED_WITH]->(d)
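The incoming-relationship counts you mentioned could then be computed with something like this (a sketch against the schema above):

// count how many people know each language; most-known first
MATCH (:Name)-[:KNOWS]->(l:Language)
RETURN l.language AS language, count(*) AS knownBy
ORDER BY knownBy DESC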

Create a node for each column only once while importing csv into Neo4j

I have a CSV file (image omitted). I want to create a database from it in Neo4j. Rows are nodes with the label Gene; columns are also nodes, with the label Cell. I need to write a CREATE query that creates all my Gene and Cell nodes, plus one relationship for each combination of gene and cell. Currently I am stuck with the following code:
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
CREATE (:Gene {id: line.gene_ids, name: line.wikigene_name})
I need to somehow iterate over all columns - starting from index 3 - after creating gene nodes, but I do not know how to do that.
Here are 3 queries that, performed in order, should do what you want.
This query creates a temporary Headers node with a names property that contains the collection of headers from the CSV file. It uses LIMIT 1 to process only the first row of the file. It also creates all the Cell nodes, each with its own name property.
LOAD CSV FROM 'file:///merged_full.csv' AS line
MERGE (h:Headers)
SET h.names = line
WITH line
LIMIT 1
UNWIND line[3..] AS name
MERGE (c:Cell {name: name})
This query uses the APOC function apoc.map.fromNodes to generate a map named cells, which maps each cell name to its cell node. It also gets the Headers node. It then loads the non-header data from the CSV file (using SKIP 1 to skip over the header row), and processes each row as follows. It uses MERGE to get/create a Gene node, g, with the desired id and name. It uses the REDUCE function to generate a collection of the Cell nodes that have a "1" column value in the current row, and the FOREACH clause then creates a (g)-[:HAS]->(x) relationship (if necessary) for every cell, x, in that collection.
WITH apoc.map.fromNodes('Cell', 'name') AS cells
MATCH (h:Headers)
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH h, cells, line
SKIP 1
MERGE (g:Gene {id: line[1], name: line[2]})
FOREACH(x IN REDUCE(s = [], i IN RANGE(3, SIZE(line)-1) |
          CASE line[i] WHEN "1" THEN s + cells[h.names[i]] ELSE s END) |
  MERGE (g)-[:HAS]->(x))
This query just deletes the temporary Headers node (if you wish):
MATCH (h:Headers)
DELETE h;
If the columns correspond with cell nodes, then you should know all the cell nodes you need just by looking at the CSV header.
I'd recommend writing a small query just to create each of the cell nodes you need, then create an index or unique constraint on :Cell(id) (or name, or whatever the property is that is meant to identify a :Cell).
At that point the problem becomes getting and processing each relevant column (I assume only the ones with 1 as the value). APOC Procedures may help here.
apoc.map.sortedProperties() can be used to take your line map and give you a list of key/value list pairs, which you can filter down to those where the key begins with 'V', and where the value is 1, then use what's remaining to match on the relevant :Cell node and create the relationship.
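A sketch of that approach might look like this (the 'V' column prefix and the :Cell nodes being keyed by those header names are assumptions that depend on your actual file):

// filter the header/value pairs down to 'V…' columns whose value is "1"
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
MERGE (g:Gene {id: line.gene_ids, name: line.wikigene_name})
WITH g, [p IN apoc.map.sortedProperties(line)
         WHERE p[0] STARTS WITH 'V' AND p[1] = "1"] AS hits
UNWIND hits AS hit
MATCH (c:Cell {name: hit[0]})
MERGE (g)-[:HAS]->(c)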

Reshape the dataset into more relational format (Transpose SOME rows and assign them to a data subset)

I have a spreadsheet/csv:
Code:,101,Course Description:,"Introduction to Rocket Science",
Student Name,Lecture Hours,Labs Hours,Test Score,Status
John Galt,48,120,4.7,Passed
James Taggart,50,120,4.9,Passed
...
I need to reshape it to the following view:
Code:,Course Description:,Students,Lecture Hours,Labs Hours,Average Test Score,Teaching Staff
101,"Introduction to Rocket Science",John Galt,48,120,4.7,Passed
101,"Introduction to Rocket Science",James Taggart,50,120,4.9,Passed
...
Believe it or not, I cannot work out how to do this, even though it seems to be a very primitive transformation. Is there a silver bullet for it?
The original records (CSV) have a somewhat JSON-like structure, so my first approach was to represent the original data as a vector and then transpose it (but in this case my resulting table looks like a sparse matrix: the rows I have transposed are blank in the rest of their values).
Another way I'm considering is to serialize it into JSONs and then de-serialize into a new spreadsheet (jsonize()); in this case, I'm having problems with merging them properly.
Both ways I have it only "half-working".
Can anyone suggest a simple and reliable algorithm for this? Any language, regex, tools, or code snippets are very much appreciated.
Assuming that the pattern you've described here is consistent throughout, there are quite a few different approaches you could take, but in all cases you can use the fact that the 'Course' rows start with "Code:", which is never going to be a student name.
You can take advantage of this either with a regular-expression find/replace or within OpenRefine.
Example:
1. Open the file in a text editor that supports regular expressions in find/replace.
2. Search for lines starting with 'Code:' and add additional commas to the start of the row to shift the course data columns to the right, e.g. search for ^Code: and replace with ,,,,,Code:
3. If you now import the file into OpenRefine, you'll have a project with 10 columns (the 10th column is caused by the trailing comma at the end of the course data row).
4. You can now use Transpose (or just rename) on the right-most columns, which contain the course data, while leaving the left-most columns, which contain the student details.
5. Isolate the rows that contain the phrase 'Student Name' in the first column and remove them (via a filter or facet).
6. Move the Course Code/Description columns to the beginning of the project, and use the 'Edit Cells->Fill Down' option on each column to get the values repeated on all the relevant lines.
7. Finally, rename the columns as you want and remove any extraneous columns.

How to create an SSIS package which creates three text files, using the same variables, where each text file is only created when the correct data is found?

There are only 3 files that can be created: "File_1", "File_2" and "File_3". The same variable names are used in each instance (User::FileDirectory and User::File_name), but because the actual value of the variable changes, a new file is created. However, the files are only created if there is data to go into them; i.e. if there are no records to populate a file, it will not be created at all. When the files are created, the date the file was created should also be added to the filename, e.g. File1_22102011.txt.
OK, if the above was a little confusing, here is how it works.
All the files use the same variable, but it is reset before each file is created.
• So it populates a result set in memory with the first SQL selection (ID number, First_Name and Main_Name). It sets the file variable to "File_1". If there are records in the result set, it creates and writes to this file.
• Then it creates a new result set with the second selection (Contract No). It sets the variable to "File_2". If there are records in this new result set, a new file will be created from the variable (which now has a new value).
• Finally, a third result set is created (Contract_no, ExperianNo, Entity_ID_Number, First_Name, Main_Name), and the file variable is set to "File_3". Again, if there are records in the result set, this file will be created and written to.
I have tried a few methods to achieve this, but they have all failed, so a little help would be greatly appreciated.
While what you have works, I think it'd be rather painful to maintain.
I would approach it as 3 sequence containers running in parallel. Each container would have a data flow and two file tasks hanging off it, based on the success of the parent and the value of a row-count variable. If the row-count variable is 0, delete the file. If it's greater than 0, rename it to File_n.
I have a container for the first file (screenshot omitted). The data flow creates an output a.txt file. Based on the value of the variable @RowCount1, it will either delete the empty file or rename it to File_1.
Each data flow would look like a source query, a row count transformation and a file destination with a temporary name (a.txt, b.txt, c.txt). As a file is always created, even if it's empty, we will need to delete or rename it afterwards, which is accomplished by the file operation tasks.
In my opinion, this approach is cleaner, as it allows you to test and debug each item separately rather than dealing with an in-memory dataset.

How to export a flat file with different rows using SSIS?

I have three tables, Customer, Invoice and InvoiceRow, with the standard relations.
I have to export these into one fixed-field-length file, with the first two characters of each row identifying the row type. The row types have different specifications.
I could probably do it with a nested loop in a script block, but this is my first ever SSIS package and that solution feels wrong.
edit:
The output has to have:
Customer
Invoice
Rows
Customer
Invoice
Rows
and so on
Your gut feeling about doing this with a Script Destination component is correct. Unfortunately, this scenario doesn't fit well with SSIS; I don't consider this a beginner package. If you must use SSIS, I'd start by inner joining all the data so there is one row for each InvoiceRow, containing the data needed from all three tables:
CustomerCols, InvoiceCols, RowCols
Then, in the script destination component, you'll need to keep track of the customer and invoice values; as they change, you'll need to write extra rows to the output.
See Creating a Destination with the Script Component for more information on script destinations.
My experience shows that script destinations can have good performance.
I would avoid writing a Script Destination and use just a Script Transform + Flat File Destination. That way you concentrate on the logical output (strings of data) while letting SSIS do the actual writing to the file (it might be a bit more efficient, and you concentrate on your business logic, not on writing to files).
First, you'll need to get denormalized data. You can do the joins and sorts in the DBMS, but if you don't want to put too much pressure on the DBMS, just get sorted data out of it and merge it using two SSIS Merge Join transforms.
Then do the script: keep running values of the current Customer and Invoice, output them when they change, and output an InvoiceRow for every input row. Something like this:
// Runs once per input row (e.g. in Input0_ProcessInputRow).
// Emit a Customer header row whenever the customer changes.
if (this.CustomerID != InputBuffer.CustomerID) {
    this.CustomerID = InputBuffer.CustomerID;
    OutputBuffer.AddRow();
    OutputBuffer.OutputColumn = "Customer: " + InputBuffer.CustomerID + " " + InputBuffer.CustomerName;
}
// ... repeat the same pattern for Invoice ...
// Always emit the InvoiceRow detail line.
OutputBuffer.AddRow();
OutputBuffer.OutputColumn = "InvoiceRow: " + InputBuffer.InvoiceRowPrice;
Finally, add a Flat File Destination with a single column (OutputColumn created by the script) to write this to the file.
Process your three tables so that the outputs are all appropriate for your output file (including the row type designator). You'll have to do this in three separate flow paths in your data flow, then bring the rows together in a Union All data flow element. From there, process them as needed to create your output file.