Creating one-to-many relations in Neo4j - CSV

So I'm very new to using a graph database, and I have chosen Neo4j. I'm trying to make a simple recommendation system based on the graph nodes.
So I have my original dataset, a CSV that looks like this:
Since some of the fields have semicolons, I separated them and parsed the data into a new CSV (basically making every combination of fields).
The new CSV looks like this:
The above image only shows N2; I have done the same thing for N1 and N3 as well.
Now, I need to create nodes and relations in such a way that each
Name KNOWS Language
Name WORKED_WITH Database.
Hence, I ran the following query:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
CREATE (n:Name {name: row.Name})
CREATE (l: Language {language: row.Language})
CREATE (d: Database {database: row.Database})
CREATE (n)-[:KNOWS]->(l)
CREATE (n)-[:WORKED_WITH]->(d)
This is the output I get:
Only shown for N2 nodes
Since I want to build a recommender, my idea was to link the name to language and database.
Expected output:
I want to link it in this way so I can count the total number of incoming relationships on a Language or Database node to recommend it.
Can someone tell me where I'm going wrong?

When you use the CREATE clause, it creates new nodes each time.
If you want to reuse an existing node, creating it only if it doesn't exist, then you need to use the MERGE clause instead of CREATE.
Here is your query with MERGE:
LOAD CSV WITH HEADERS FROM "file:///data.csv" AS row
MERGE (n:Name {name: row.Name})
MERGE (l:Language {language: row.Language})
MERGE (d:Database {database: row.Database})
MERGE (n)-[:KNOWS]->(l)
MERGE (n)-[:WORKED_WITH]->(d)
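With the data merged this way, the recommendation idea from the question (counting the incoming relationships on each Language or Database node) can be expressed with a query along these lines. This is just a sketch using the labels, property names, and relationship types above:
MATCH (:Name)-[:KNOWS]->(l:Language)
RETURN l.language AS language, count(*) AS knownBy
ORDER BY knownBy DESC
The same pattern with [:WORKED_WITH] and the Database label gives the counts for databases.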


Can we compare columns of multiple input files to derive a new column in SSIS

I am trying to create a derived column based on columns provided in different input files, but unfortunately I keep getting an error when I try to map my Raw_File_1 to the Derived Column. The error looks like this:
Cannot create a connector.
The destination component does not have any available inputs for use in creating a path.
My goal is to be able to connect both Raw_File_1 and Map_File_1 to the Derived Column and generate a new column.
If anyone can provide any suggestions, that would be great!
I have a source file and a reference file; both are flat files. My source file has column a, column b, and column c, and my reference file has column d, column e, and column f.
If column a = column d and column b = column e, then I want to populate column c with the same value as column f. How can I do this kind of analysis or lookup in SSIS?
Based on your comments that I patched into the question, you're looking to augment the existing data based on matching data from your reference file.
The core of your SSIS package will look like this
In the first data flow, we will source from map_file_1 and load into a "raw" file.
I configure my raw file destination like this
When the package runs, it'll fill that special format file with the reference data. It's important, because you can either use a database or a raw file as your lookup source.
Finally, we get to work! A flat file source to a Lookup component. In the first tab of that lookup, be sure to change the Connection type from the default of "OLE DB connection manager" to "Cache connection manager"
In the Connection tab, click to create a new CCM and use the raw file generated in the preceding step.
Map column A to D and column B to E (assuming the data types match). Check the box on column F and, in the Lookup Operation part, replace C with that value.
Final thoughts
This will be a case-sensitive lookup. If rows don't have a match in the reference file, it's going to blow up. That's probably not what you want, so configure the Lookup transformation not to do that ;)
I blogged about using Excel to populate the cache if you want more detail: http://billfellows.blogspot.com/2011/11/using-excel-in-ssis-lookup.html
Your question is not clear; I will try to give some suggestions:
If you are looking to perform a lookup with a derived column:
You can use the Cache Transform component and the Cache Connection Manager to achieve that:
SSIS - How To Use Flat File Or Excel File In Lookup Transformation [Cache Transformation]
If you are looking to merge both inputs:
Then you need to use the Merge Join or Union All components:
SSIS Union All Transformation
Learn SSIS : MERGE, MERGE JOIN and UNION ALL
SSIS Basics: Using the Merge Join Transformation

How to implement logging at the end of each job in Talend?

I am new to Talend OS.
However, I received a task:
Create file delimited .csv metadata (one for Lead & Opportunity).
Move files to your repository on the AWS server (the etl_process1 login).
Create two tables sfdc_leads_reporting_raw and sfdc_opp_reporting_raw.
Load the data from the files into the tables. Ensure the data types are correctly used when creating metadata schemas & tables.
I am done up to step 4.
Now the problem is:
How do I implement logging at the end of each job to report the number of leads (count of distinct IDs in the leads table) and the number of opportunities created (count of opportunity IDs) by stage (how many converted, qualified, closed won, and dead)?
Help would be appreciated.
You can get this data using global variables, in a subjob at the end of your job. Most components provide a global variable called tComponent_NB_LINE (or _NB_LINE_INSERTED for database components) that gives you the number of lines output by the component.
For instance tFileOutputDelimited_1_NB_LINE or tOracleOutput_1_NB_LINE_INSERTED.
Using these variables, you can log to the console or to a file.
Here is a simple example. If you have a tOracleOutput_1 in your job you can do:
tPostJob -- OnComponentOk -- tFixedFlowInput -- Main -- tLogRow
Inside tFixedFlowInput you retrieve the variable:
(Integer)globalMap.get("tOracleOutput_1_NB_LINE_INSERTED")
If you need to log aggregated info, you can append a tAggregateRow to your output components, and use tSetGlobalVar to store counts by certain criteria.

Connect imported nodes (LOAD CSV) to a general group

I was trying to build a query that will solve these tasks:
Import a CSV with the format "user","group" into Neo4j
Generate for each USER a node - avoid duplicates
Generate for each GROUP a node - avoid duplicates
Connect the node USER to the imported GROUP
Finally connect every imported GROUP to a MAINGROUP
I have written the query like this:
LOAD CSV FROM "file:.....csv" AS csvLine
MERGE (u:User { name: csvLine[0]})
MERGE (g:Group { name: csvLine[1]})
MERGE (u)-[:IS_MEMBER_OF]->(g)
MERGE (g)-[:IS_MEMBER_OF]->(m:Group {name: "MAINGROUP"})
So far this works, as I get every User and every Group, and they are connected.
Problem: my GROUPs do not all have a relationship to a single node (MAINGROUP); instead, each GROUP has a relationship to its own duplicate MAINGROUP. In other words, for every GROUP my query seems to generate a new duplicate MAINGROUP (although I was hoping MERGE would prevent this), so I end up with as many MAINGROUP nodes as I have imported GROUPs.
How do I need to alter the query to get the desired graph?
This is a common gotcha of using MERGE. See the docs here.
When you use MERGE on a pattern, it creates the entire pattern if the whole pattern didn't already exist; it does not merely create the missing portions.
What you should be doing is using MERGE once to find/create (m:Group {name: "MAINGROUP"}) and then MERGE just the new relationship. Because MERGE is matching on the whole pattern (g)-[:IS_MEMBER_OF]->(m:Group {name: "MAINGROUP"}) and it doesn't exist, it's re-creating the main group every time.
So you might want to do this:
LOAD CSV FROM "file:.....csv" AS csvLine
MERGE (u:User { name: csvLine[0]})
MERGE (g:Group { name: csvLine[1]})
MERGE (u)-[:IS_MEMBER_OF]->(g)
MERGE (m:Group {name: "MAINGROUP"})
MERGE (g)-[:IS_MEMBER_OF]->(m)
The last two lines are different.
This way of getting tripped up with MERGE is unfortunately really common. :)
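After re-running the import with the corrected query, a quick sanity check along these lines (just a sketch, using the same label and property name as above) should return a count of 1:
MATCH (m:Group {name: "MAINGROUP"})
RETURN count(m) AS mainGroupCount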

Import CSV column from different file into new file

I have 2 CSV files that are almost identical, with the following differences:
The first has a column, "date".
The second doesn't have "date" and also has 50 fewer rows than the first (both files have an "email" column).
They are lists of subscribers with the date created. The second, however, is the updated list with the subscribers who wanted to be removed taken out, but it no longer has the date created.
Is there any way to import the "date" column from the 1st CSV into the 2nd CSV by referencing the "email" column, so I can get the correct date for each subscriber?
Sorry, there doesn't seem to be a ready-made command-line tool available (building one is probably an evening's worth of effort).
You could look at different approaches; one more involved way is to load the files into database tables, do the merge (using a SELECT with a join on the two tables), and export the result back as CSV.
The simplest I could think of was to use R (given that you have header names in your CSVs):
# read both files (read.csv assumes a header row by default)
csv1_data <- read.csv('/path/to/csv1.csv')
csv2_data <- read.csv('/path/to/csv2.csv')
# merge() joins on the columns the two data frames have in common (here, "email")
merged_csv <- merge(csv1_data, csv2_data)
# write the merged result back out as CSV (column headers are written by default)
write.table(merged_csv, file = "/path/to/merged_csv.csv", sep = ",", row.names = TRUE)
The first two lines load the data into R, the third line merges them using the default S3 method (joining on the shared columns), and the final line exports the result as a CSV file with the headers.
Hope this helps!

How to export a flat file with different rows using SSIS?

I have three tables, Customer, Invoice, and InvoiceRow, with the standard relations.
I have to export these into one fixed-field-length file, with the first two characters of each row identifying the row type. The row types have different specifications.
I could probably do it with a nested loop in a script block, but this is my first ever SSIS package and that solution feels wrong.
edit:
The output has to have:
Customer
Invoice
Rows
Customer
Invoice
Rows
and so on
Your gut feeling on doing this using a Script Destination component is correct. Unfortunately, this scenario doesn't jibe well with SSIS. I don't consider this a beginner package. If you must use SSIS, then I'd start by inner joining all the data so there is one row for each InvoiceRow, containing the data needed from all three tables.
CustomerCols, InvoiceCols, RowCols
Then, in the script destination component, you'll need to keep track of the customer and invoice values; as they change, you'll need to write extra rows to the output.
See Creating a Destination with the Script Component for more information on script destination.
My experience shows that script destinations can have good performance.
I would avoid writing a Script Destination and instead use just a Script Transform + Flat File Destination. This way, you concentrate on the logical output (strings of data) while letting SSIS do the actual writing to the file (it might be a bit more efficient, plus you concentrate on your business logic, not on writing to files).
First, you'll need to get denormalized data. You can do the joins and sorts in the DBMS, but if you don't want to put too much pressure on the DBMS, just get sorted data out of it and merge it using two SSIS Merge Join transforms.
Then do the script: keep running values of the current Customer and Invoice, output them when they change, and output an InvoiceRow line for every input row. Something like this:
// Whenever the Customer changes, emit a "Customer" header row before the detail rows.
if (this.CustomerID != InputBuffer.CustomerID) {
    this.CustomerID = InputBuffer.CustomerID;
    OutputBuffer.AddRow();
    OutputBuffer.OutputColumn = "Customer: " + InputBuffer.CustomerID + " " + InputBuffer.CustomerName;
}
// repeat the same code for Invoice
// then always emit one detail row per InvoiceRow
OutputBuffer.AddRow();
OutputBuffer.OutputColumn = "InvoiceRow: " + InputBuffer.InvoiceRowPrice;
Finally, add a Flat File Destination with a single column (OutputColumn created by the script) to write this to the file.
Process your three tables so that the outputs are all appropriate for your output file (including the row type designator). You'll have to do this in three separate flow paths in your data flow, then bring the rows together in a Union All data flow element. From there, process them as needed to create your output file.