Create a node for each column only once while importing CSV into Neo4j

I have a CSV file from which I want to create a database in Neo4j. Rows are nodes with the label Gene, and columns are also nodes, with the label Cell. I need to write a CREATE query that creates all my Gene and Cell nodes and one relationship for each combination of gene and cell. Currently I am stuck with the following code:
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
CREATE (:Gene {id: line.gene_ids, name: line.wikigene_name})
I need to somehow iterate over all columns, starting from index 3, after creating the gene nodes, but I do not know how to do that.

Here are 3 queries that, performed in order, should do what you want.
This query creates a temporary Headers node with a names property that contains the collection of headers from the CSV file. It uses LIMIT 1 to only process the first row of the file. It also creates all the Cell nodes, each with its own name property.
LOAD CSV FROM 'file:///merged_full.csv' AS line
MERGE (h:Headers)
SET h.names = line
WITH line
LIMIT 1
UNWIND line[3..] AS name
MERGE (c:Cell {name: name})
This query uses the APOC function apoc.map.fromNodes to generate a map named cells, which maps each cell name to its cell node. It also gets the Headers node. It then loads the non-header data from the CSV file (using SKIP 1 to skip over the header row), and processes each row as follows. It uses MERGE to get/create a Gene node, g, with the desired id and name. It uses the REDUCE function to generate a collection of the Cell nodes that have a "1" column value in the current row, and the FOREACH clause then creates a (g)-[:HAS]->(x) relationship (if necessary) for every cell, x, in that collection.
WITH apoc.map.fromNodes('Cell', 'name') AS cells
MATCH (h:Headers)
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH h, cells, line
SKIP 1
MERGE (g:Gene {id: line[1], name: line[2]})
FOREACH(
x IN REDUCE(s = [], i IN RANGE(3, SIZE(line)-1) |
CASE line[i] WHEN "1" THEN s + cells[h.names[i]] ELSE s END) |
MERGE (g)-[:HAS]->(x))
This query just deletes the temporary Headers node (if you wish):
MATCH (h:Headers)
DELETE h;

If the columns correspond with cell nodes, then you should know all the cell nodes you need just by looking at the CSV header.
I'd recommend writing a small query just to create each of the cell nodes you need, then create an index or unique constraint on :Cell(id) (or name, or whatever the property is that is meant to identify a :Cell).
At that point the problem becomes getting and processing each relevant column (I assume only the ones with 1 as the value). APOC Procedures may help here.
apoc.map.sortedProperties() can be used to take your line map and give you a list of key/value list pairs, which you can filter down to those where the key begins with 'V' and the value is 1, then use what remains to match the relevant :Cell node and create the relationship. A rough sketch of that approach follows.
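For illustration, a minimal sketch of that approach, assuming the cell columns are the headers after the first three, that :Cell nodes are identified by a name property beginning with 'V', and that a "1" value marks an existing relationship (the file name and gene columns are taken from the question; the rest is illustrative):
// 1) Create one :Cell node per header column; a uniqueness constraint
//    on :Cell(name) also gives the MERGEs below an index to use
CREATE CONSTRAINT ON (c:Cell) ASSERT c.name IS UNIQUE;
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH line LIMIT 1
UNWIND line[3..] AS cellName
MERGE (:Cell {name: cellName});
// 2) For each data row, keep only the key/value pairs whose key looks
//    like a cell column and whose value is "1", then connect
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
MERGE (g:Gene {id: line.gene_ids, name: line.wikigene_name})
WITH g, apoc.map.sortedProperties(line) AS pairs
UNWIND [p IN pairs WHERE p[0] STARTS WITH 'V' AND p[1] = "1"] AS p
MATCH (c:Cell {name: p[0]})
MERGE (g)-[:HAS]->(c);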

Related

Neo4j integrate relationship matrix from csv with millions of relationships fast

I am integrating a huge CSV file into my Neo4j database.
This CSV is a matrix of relationships: the first row represents one type of node (cells, "model_id") and the first column represents the other type (genes, "gene_id"). I want to create a relationship for each matrix cell so that the value in that cell becomes a property of the relationship in Neo4j.
The csv has this structure:
gene_id, cell1, cell2, cell3
gene1, value, value, value
gene2, value, value, value
gene3, value, value, value
For each gene/cell pair, e.g. gene1 and cell1, I want to create a relationship like (gene1:Gene {gene_id:"gene1"})-[r:READCOUNT {value:value}]->(cell1:Cell {model_id:"cell1"}).
In my case I have already created the cell and gene nodes beforehand, and creating the relationships takes very long (37,000 genes and 1,200 cells, so 37,000 × 1,200 = 44,400,000 relationships). Is there a way to speed up this insertion?
My Query looks like this right now:
CALL apoc.periodic.iterate('
  CALL apoc.load.csv("rna.csv", {sep:";"}) YIELD map AS line
  RETURN line
','
  WITH line
  MATCH (c:CellLine)
  WITH c, line
  SKIP 1
  MATCH (g:Gene) WHERE g.gene_id = line["gene_id"]
  UNWIND c AS x
  CREATE (x)-[e:READCOUNT {value: line[x.model_id]}]->(g)
', {batchSize:1, iterateList:true, parallel:true});
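One direction that may help, sketched here rather than tested: derive the cell names from each row's keys instead of matching every CellLine per row, use a much larger batch size, and keep parallel off, since parallel relationship creation against shared nodes tends to deadlock. This sketch assumes indexes exist on :Gene(gene_id) and :CellLine(model_id).
CALL apoc.periodic.iterate('
  CALL apoc.load.csv("rna.csv", {sep:";"}) YIELD map AS line
  RETURN line
','
  // one index lookup per row for the gene
  MATCH (g:Gene {gene_id: line["gene_id"]})
  // every key other than gene_id is a cell column
  UNWIND [k IN keys(line) WHERE k <> "gene_id"] AS model
  MATCH (c:CellLine {model_id: model})
  CREATE (c)-[:READCOUNT {value: line[model]}]->(g)
', {batchSize:1000, parallel:false});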

Dynamically merge two CSV files using Dataweave in Mule

I get CSV files of different lengths from different sources. The columns within the CSVs differ, with one exception: each CSV file will always have an Id column that can be used to tie the records in the different CSV files together. Two such CSV files need to be processed at a time. The process is to take the Id column from the first file, match the rows within the second CSV file, and create a third file that contains content from the first and second files. The Id column can be repeated in the first file. An example is given below. Please note that the first file might have 18 or 19 combinations of different data columns, so I cannot hardcode the transformation within DataWeave, and there is a chance that a new file will be added at any time as well. A dynamic approach is what I want to accomplish, so that once written, the logic works even if a new file is added. These files get pretty big as well.
The sample files are given below.
CSV1.csv
--------
id,col1,col2,col3,col4
1,dat1,data2,data3,data4
2,data5,data6,data6,data6
2,data9,data10,data11,data12
2,data13,data14,data15,data16
3,data17,data18,data19,data20
3,data21,data22,data23,data24
CSV2.csv
--------
id,obectId,resid,remarks
1,obj1,res1,rem1
2,obj2,res2,rem2
3,obj3,res3,rem3
Expected file output -CSV3.csv
---------------------
id,col1,col2,col3,col4,objectid,resid,remarks
1,dat1,data2,data3,data4,obj1,res1,rem1
2,data5,data6,data6,data6,obj2,res2,rem2
2,data9,data10,data11,data12,obj2,res2,rem2
2,data13,data14,data15,data16,obj2,res2,rem2
3,data17,data18,data19,data20,obj3,res3,rem3
3,data21,data22,data23,data24,obj3,res3,rem3
I was thinking of using pluck to get the column values from the first file. The idea was to get the columns in the transformation without hardcoding them, but I am getting some errors. After this I have the task of searching for the id and getting the values from the second file.
{(
    using(keys = payload pluck $$)
    (
        payload map ((value, index) ->
            {
                (keys[index]): value
            }
        )
    )
)}
I am getting the following error when using pluck
Type mismatch for 'pluck' operator
found :array, :function
required :object, :function
I am thinking of using groupBy on id on the second file to facilitate better searching, but I need suggestions on how to append the contents in one transformation to form the third file.
Since you want to combine both CSVs without renaming the column names, you can try something like below
%dw 2.0
output application/csv
// file1 and file2 are assumed to be the two parsed CSV payloads
var file2Grouped = file2 groupBy ((item) -> item.id)
---
file1 map ((item) -> item ++ ((file2Grouped[item.id])[0] default {}) - 'id')
Output:
id,col1,col2,col3,col4,obectId,resid,remarks
1,dat1,data2,data3,data4,obj1,res1,rem1
2,data5,data6,data6,data6,obj2,res2,rem2
2,data9,data10,data11,data12,obj2,res2,rem2
2,data13,data14,data15,data16,obj2,res2,rem2
3,data17,data18,data19,data20,obj3,res3,rem3
3,data21,data22,data23,data24,obj3,res3,rem3
The working expression is given below. Removing the id should happen before applying the default:
%dw 2.0
output application/csv
var file2Grouped = file2 groupBy ((item) -> item.id)
---
file1 map ((item) -> item ++ ((file2Grouped[item.id])[0] - 'id' default {}))

Load from CSV to create a binary tree

I need to create a binary tree in Neo4j. I started by creating two CSVs, one for vertices and one for edges, and then I ran two queries to create the entire tree.
I thought that I could create the entire tree with only one query.
The CSV from where I start is this:
"parent","child_1","child_1_attr1","child_1_attr2","edge_1_attr1","edge_1_attr2","child_2","child_2_attr1","child_2_attr2","edge_2_attr1","edge_2_attr2"
"vertex_1","vertex_2","2","5","4","1","vertex_3","5","3","2","2"
"vertex_2","vertex_4","3","5","2","3","vertex_5","4","4","4","3"
"vertex_3","vertex_6","2","1","2","4","vertex_7","2","2","5","5"
"vertex_4","vertex_8","4","4","4","5","vertex_9","2","3","2","5"
"vertex_5","vertex_10","1","1","3","3","vertex_11","1","3","2","3"
"vertex_6","vertex_12","3","1","1","1","vertex_13","1","2","5","1"
"vertex_7","vertex_14","4","2","2","1","vertex_15","2","5","4","3"
Then I tried this query:
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MATCH (p:Vertex {name: line.parent})
CREATE (c1:Vertex {name: line.child_1, attr1: line.child_1_attr1, attr2: line.child_1_attr2})
CREATE (c2:Vertex {name: line.child_2, attr1: line.child_2_attr1, attr2: line.child_2_attr2})
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1)
CREATE (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)
Before this query I manually create the first vertex, and then I run the query, but the only result I get is the creation of vertices 1, 2 and 3.
It should match the parent (which is always already created), then create the two children and connect them to their parent.
Can anyone help me?
Likely your view of execution is: for each line/row, execute all the Cypher code, then repeat this for the next line/row until finished. This is incorrect.
Instead, each individual Cypher operation executes for all rows, then the next Cypher operation executes for all rows, and so on.
This means your MATCH operation:
MATCH (p:Vertex {name: line.parent})
is performed across all lines in your CSV at the same time, and only then will it proceed to the next operation (your CREATE, acting on all lines), and so on.
Since you stated that you manually created the first vertex, that vertex is the only one that will be matched upon; the MATCH will fail for all other lines in your CSV because the CREATE statements haven't executed yet, so those nodes don't exist. This means only two vertices will be created: the children of that one matched node.
It's usually good practice when importing CSV data to create all nodes first, and then run a separate CSV pass that matches the already created nodes and creates the relevant relationships; a sketch of that pattern is below.
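Using the question's own file and columns, that two-pass pattern might look like this (a sketch, not a tested import):
// Pass 1: create every vertex, parents and children alike
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MERGE (:Vertex {name: line.parent})
MERGE (c1:Vertex {name: line.child_1})
SET c1.attr1 = line.child_1_attr1, c1.attr2 = line.child_1_attr2
MERGE (c2:Vertex {name: line.child_2})
SET c2.attr1 = line.child_2_attr1, c2.attr2 = line.child_2_attr2;
// Pass 2: match the existing vertices and create only the edges
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MATCH (p:Vertex {name: line.parent})
MATCH (c1:Vertex {name: line.child_1})
MATCH (c2:Vertex {name: line.child_2})
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1)
CREATE (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2);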
However, if you do want to create everything in one go, you'll likely want to use MERGE in various places. This is also tricky if you don't fully understand the behavior of MERGE (it's an attempt to MATCH, and if no match is found, a CREATE) or how Cypher is executed (as in this case).
You'll also want to MERGE on unique node values instead of all properties, and SET the remaining properties. It's also especially helpful to have unique constraints or indexes (whichever is appropriate) on the relevant label/properties for faster execution, especially as your graph grows.
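For example, a uniqueness constraint on :Vertex(name) could be created like this (Neo4j 3.x syntax; newer versions use CREATE CONSTRAINT FOR (v:Vertex) REQUIRE v.name IS UNIQUE):
// Ensures vertex names are unique and backs the MERGE lookups with an index
CREATE CONSTRAINT ON (v:Vertex) ASSERT v.name IS UNIQUE;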
This query may work.
LOAD CSV WITH HEADERS FROM 'file:///Prova1.csv' AS line
MERGE (p:Vertex {name: line.parent})
MERGE (c1:Vertex {name: line.child_1})
SET c1.attr1 = line.child_1_attr1, c1.attr2 = line.child_1_attr2
MERGE (c2:Vertex {name: line.child_2})
SET c2.attr1 = line.child_2_attr1, c2.attr2 = line.child_2_attr2
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1)
CREATE (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)
The reason this one works is that by the time your very first MERGE has completed for all rows, it will have created ALL parent nodes (or rather, all nodes that will be parents) in your graph.
So when we reach the MERGE for your child nodes, it will MATCH most of those already created nodes; the only new nodes created at that point will be leaf nodes, which would not have been created by your first MERGE, since they aren't parents to any other nodes and do not show up in the parent column of your CSV.
The import query is not working, seemingly because you match the parent first and then create your nodes and relationships. I modified the query this way and it works now:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
CREATE (c1:Vertex {name: line.child_1, attr1: line.child_1_attr1, attr2: line.child_1_attr2}),
       (c2:Vertex {name: line.child_2, attr1: line.child_2_attr1, attr2: line.child_2_attr2})
WITH c1, c2, line
MATCH (p:Vertex {name: line.parent})
CREATE (p)<-[:EDGE {attr1: line.edge_1_attr1, attr2: line.edge_1_attr2}]-(c1),
       (p)<-[:EDGE {attr1: line.edge_2_attr1, attr2: line.edge_2_attr2}]-(c2)
So if you first create the nodes and then match the parent and create the relationships, the query works.
I will keep investigating your query to find out why the original ordering fails, because I do not really understand why it is not working.
FOREACH (num IN range(1,15) |
  MERGE (parent:Node {number: num})
  MERGE (left:Node {number: num + num})
  MERGE (right:Node {number: num + num + 1})
  MERGE (left)<-[:LEFT]-(parent)-[:RIGHT]->(right)
)
Explanation:
This creates a perfect binary tree structure with 31 nodes. You can then include the same numbers in the CSV to find and add properties to each correspondingly numbered node.
In a binary tree, if you give the first (root, or top-most) node a number property with value 1 and increment each subsequent node's number by 1 (left to right, top to bottom), you get a convenient mathematical relationship: each node's left child has a number value of number + number (i.e. 2n) and its right child has number + number + 1 (i.e. 2n + 1). A sketch of the follow-up CSV step is below.
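That follow-up step might look like this (a sketch only; the file name and attribute columns are hypothetical):
// Assumes a CSV with columns: number, attr1, attr2
LOAD CSV WITH HEADERS FROM 'file:///tree_properties.csv' AS line
MATCH (n:Node {number: toInteger(line.number)})
SET n.attr1 = line.attr1, n.attr2 = line.attr2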

SSIS - Process a flat file with varying data

I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp, etc.). However, there are different types of data groups and, even though each is fixed-length, the number of their fields varies depending on the data group type. The first three numbers of a data group define its type. The number of data groups in each record varies as well.
My idea is to have a staging table into which I would insert all the data groups. So two records like this:
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
would produce three records in the staging table:
ID Timestamp Data
123 01/01/2016 12323456KKSD3467
123 01/01/2016 456SSGFED43520160101173802
987 02/01/2016 456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some would have their data broken down further). My question is how to break the flat file down into the staging table. I can split the entire record on the pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.
The solution I came up with myself doesn't try to split at the Flat File Source, but rather in a script. My Data Flow is simply Flat File Source → Script Component → OLE DB Destination.
The Flat File Source output is just a single column containing the entire line. The Script Component contains output columns for each column in the staging table. The script looks like this:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // splits[0] is the header; splits[1..] are the data groups.
    // RemoveEmptyEntries guards against the trailing pipe yielding an empty group.
    var splits = Row.Line.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 1; i < splits.Length; i++)
    {
        // Emit one staging row per data group, repeating the header fields
        Output0Buffer.AddRow();
        Output0Buffer.ID = splits[0].Substring(0, 11);
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14), "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None; otherwise Output0Buffer won't be available in your script. Finally, the OLE DB Destination just maps the script output columns to the staging table columns. This solves the problem I had with creating multiple output records from a single input record.

SSIS: Flat File Source to SQL without Duplicate Rows

I have a somewhat large flat file (CSV) which I am trying to import into my SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.
E.g. Data:
Item Number | Item Name | Update Date
ITEM-01 | First Item | 1-Jan-2013
ITEM-01 | First Item | 5-Jan-2013
ITEM-24 | Another Item | 12-Mar-2012
ITEM-24 | Another Item | 13-Mar-2012
ITEM-24 | Another Item | 14-Mar-2012
Now I need to create my master item record table using this data. As you can see, the data is duplicated due to the Update Date. It is guaranteed that the file will always be sorted by Item Number, so all I need to do is check: if the next item number equals the previous item number, then do NOT import that line.
I used Sort with Remove Duplicates in the SSIS package, but it actually tries to sort all the lines, which is pointless because the lines are already sorted. It also takes forever to sort that many lines.
So is there any other way?
There are a couple of approaches you can take to do this.
1. Aggregate Transformation
Group by Item Number and Item Name and then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. In order to use the Minimum operation, you'll need to convert the Update Date column to a date (you can't take the Minimum of a string); that conversion can be done in a Data Conversion Transformation.
2. Script Component Transformation
Essentially, you could implement the logic you mentioned above:
if next item number = previous item number then do NOT import this line
First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):
Select Transformation as the Script Component type.
Add the Script Component after the Flat File Source in your Data Flow.
Double-click the Script Component to open the Script Transformation Editor.
Under Input Columns, select all columns.
Under Inputs and Outputs, select Output 0 and set the SynchronousInputID property to None.
Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types).
Finally, edit the script. There will be a method named Input0_ProcessInputRow. Modify it as shown below, and add a private field named previousItemNumber, also shown below:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Emit the row only when its item number differs from the previous row's;
    // this relies on the file being sorted by Item Number
    if (!Row.ItemNumber.Equals(previousItemNumber))
    {
        Output0Buffer.AddRow();
        Output0Buffer.ItemName = Row.ItemName;
        Output0Buffer.ItemNumber = Row.ItemNumber;
        Output0Buffer.UpdateDate = Row.UpdateDate;
    }
    previousItemNumber = Row.ItemNumber;
}

private string previousItemNumber = string.Empty;
If performance is a big concern for you, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT * to get the desired values.