Create Relationship from CSV Import Adding a Relationship Property - csv

I have created a set of nodes from a CSV import and labelled them as 'Argument'.
I have another CSV file containing Connector_ID, Start_Object_ID and End_Object_ID, from which I want to:
Create the relationship (from start object to end object)
Add the value of the Connector_ID to the relationship created
At the moment I've only got as far as failing to create the relationships (valid syntax but does nothing) using:
LOAD CSV WITH HEADERS FROM "file:///Users/argument_has_part_argument.txt" AS row
MATCH (argument1:Argument {object_ID: row.Start_Object_ID})
MATCH (argument2:Argument {object_ID: row.End_Object_ID})
MERGE (argument1)-[:has_part]->(argument2);
but cannot yet see
why it fails to do anything
how to get it to create a relationship
and how to add the Connector_ID to the connector so created.
Any pointers?

from: http://neo4j.com/developer/guide-import-csv/#_csv_data_quality
Cypher
What Cypher sees is what will be imported, so you can use that to your advantage. You can use LOAD CSV without creating any graph structure and just output samples, counts or distributions. This also makes it possible to detect incorrect header column counts, delimiters, quotes, escapes or misspelled header names.
// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);
// check first few raw lines
LOAD CSV FROM "file-url" AS line WITH line
RETURN line
LIMIT 5;
// check first 5 line-sample with header-mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line WITH line
RETURN line
LIMIT 5;
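Applied to the original import, a similar sanity check shows whether the MATCH clauses find anything at all. This is only a sketch reusing the asker's file path and property names; note that LOAD CSV reads every value as a string, so an object_ID stored as an integer will never match the string coming from the CSV.
LOAD CSV WITH HEADERS FROM "file:///Users/argument_has_part_argument.txt" AS row
WITH row
OPTIONAL MATCH (a:Argument {object_ID: row.Start_Object_ID})
OPTIONAL MATCH (b:Argument {object_ID: row.End_Object_ID})
RETURN count(row) AS rows, count(a) AS start_matches, count(b) AS end_matches;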
For your last question:
LOAD CSV WITH HEADERS FROM "file:///Users/argument_has_part_argument.txt" AS row
MATCH (argument1:Argument {object_ID: row.Start_Object_ID})
MATCH (argument2:Argument {object_ID: row.End_Object_ID})
MERGE (argument1)-[r:has_part]->(argument2)
ON CREATE SET r.connector_ID = row.Connector_ID;
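Once that runs, a quick count (just a sketch, using the same label and property names) confirms that the relationships and their connector IDs were actually created:
MATCH (:Argument)-[r:has_part]->(:Argument)
RETURN count(r) AS relationships, count(r.connector_ID) AS with_connector_id;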

Related

Replacing multiple values in CSV

I have a directory full of CSVs. A script I use loads each CSV via a Loop and corrects commonly known errors in several columns prior to being imported into an SQL database. The corrections I want to apply are stored in a JSON file so that a user can freely add/remove any corrections on-the-fly without altering the main script.
My script works fine for one value correction per column, per CSV. However, I have noticed that two or more columns per CSV now contain additional errors, and more than one correction per column is now required.
Here is relevant code:
import glob as gl
import json
import re

import pandas as pd

with open('lookup.json') as f:
    translation_table = json.load(f)

for filename in gl.glob("(Compacted)_*.csv"):
    df = pd.read_csv(filename, dtype=object)
    # ... Some other enrichment ...
    # Extract the file "key" with a regular expression (regex)
    filekey = re.match(r"^\(Compacted\)_([A-Z0-9-]+_[0-9A-z]+)_[0-9]{8}_[0-9]{6}.csv$", filename).group(1)
    # Use the translation tables to apply any error fixes
    if filekey in translation_table["error_lookup"]:
        tablename = translation_table["error_lookup"][filekey]
        df[tablename[0]] = df[tablename[0]].replace({tablename[1]: tablename[2]})
    else:
        pass
And here is the lookup.json file:
{
  "error_lookup": {
    "T7000_08": ["MODCT", "C00", -5555],
    "T7000_17": ["MODCT", "C00", -5555],
    "T7000_20": ["CLLM5", "--", -5555],
    "T700_13": ["CODE", "100T", -5555]
  }
}
For example if a column (in a CSV that includes the key "T7000_20") has a new erroneous value of ";;" in column CLLM5, how can I ensure that values that contain "--" and ";;" are replaced with "-5555"? How do I account for another column in the same CSV too?
Can you change the JSON file? The example below would edit column colA (old1 → new1 and old2 → new2) and would make similar changes to column colB:
{"error_lookup": {"T7000_20": {"colA": ["old1", "new1", "old2", "new2"],
                               "colB": ["old3", "new3", "old4", "new4"]}}}
The JSON parsing gets more complex in order to handle the current use case and the new requirements.
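A minimal sketch of how the loop could consume that nested layout (the column and value names are just the example's placeholders; pandas' replace applies the per-column mapping):
import glob as gl
import json
import re

import pandas as pd

with open('lookup.json') as f:
    translation_table = json.load(f)

for filename in gl.glob("(Compacted)_*.csv"):
    df = pd.read_csv(filename, dtype=object)
    filekey = re.match(r"^\(Compacted\)_([A-Z0-9-]+_[0-9A-z]+)_[0-9]{8}_[0-9]{6}.csv$", filename).group(1)
    # Each file key now maps to {column_name: [old1, new1, old2, new2, ...]}
    for column, pairs in translation_table["error_lookup"].get(filekey, {}).items():
        # Pair up the flat list: even positions are old values, odd positions their replacements
        mapping = dict(zip(pairs[0::2], pairs[1::2]))
        df[column] = df[column].replace(mapping)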

LOAD CSV: multiple MERGE and Eager operator

I have a large CSV file that contains multiple nodes per line. I would like to use LOAD CSV to MERGE the nodes and set some properties. However, I always get the "Eager operator" warning for this query:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (c:Concept {cui: line[0]})
ON CREATE SET c.language = line[1]
MERGE (l:LexicalForm {lui: line[3]})
ON CREATE SET l.status = line[2];
When I remove the ON CREATE part it works but I want to merge on a specific ID, not on the combination of the ID and the other properties.
Is it possible to rephrase this somehow to avoid the Eager operator? I would like to create 6 different nodes from a single line and the alternative would be to iterate the file 6 times.
You have an RRF file, and LOAD CSV expects CSV lines. When you reference a field in the CSV you need to include the data type (a cast) and your tag (line). You also need to give PERIODIC COMMIT the number of rows per transaction. Your file needs to be in the import folder of your database.
For example:
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (c:Concept {cui: toInteger(line[0]), language: toString(line[1])});
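For completeness, the alternative the question already mentions, one pass per node type, avoids the Eager operator because each statement only merges a single label. A sketch using the question's positional fields (run each statement separately):
// pass 1: Concept nodes only
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (c:Concept {cui: line[0]})
ON CREATE SET c.language = line[1];

// pass 2: LexicalForm nodes only
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (l:LexicalForm {lui: line[3]})
ON CREATE SET l.status = line[2];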

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), BigQuery is capable of inferring the column names from the headers, because the headers are all strings, whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with
other rows in the data set. If the first line contains only strings,
and the other lines do not, BigQuery assumes that the first row is a
header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the name of your variables.
I know I can specify the column names manually, but I would prefer not to do that, as I have a lot of tables. Is there another solution?
If your data is all in "string" type and if you have the first row of your CSV file containing the metadata, then I guess it is easy to do a quick script that would parse the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
14/02/2018: Update thanks to Felipe's comment:
The approach above can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
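The load step the answer refers to would then be something like this (the dataset, table and file names are just the placeholders used above):
bq load --source_format=CSV --skip_leading_rows=1 mydataset.myNewTable myData.csv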
It's not possible with the current API. You can file a feature request in the public BigQuery tracker https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use query instead of load and SELECT * EXCEPT the added extra column, but query is not free.
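If you go the query route, a sketch of that last suggestion (the table and column names here are hypothetical):
bq query --use_legacy_sql=false --destination_table=mydataset.cleanTable \
  'SELECT * EXCEPT(extra_column) FROM mydataset.myNewTable'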

Neo4J Create Relationships in Cypher returns no changes, no rows

I have a CSV dataset through which I'm trying to build relationships between two node types (Comment and Person) that already exist in my database.
This is the database information (screenshot omitted).
This is the CSV file of the comment_hasCreator_person relationship that I'm trying to build (screenshot omitted).
The problem is that no matter which Cypher query I try, all of them return the same thing: "no changes, no rows".
Here are the different variations of the query I've tried -
This is the first query -
// comment_hasCreator_person_0_0.csv
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://dl.dropbox.com/s/qb4occggixmaz9g/comment_hasCreator_person_0_0.csv" AS line
MATCH (comment:Comment { id: toInt(line.Comment.id)}),(person:Person { id: toInt(line.Person.id)})
CREATE (comment)-[:hasCreator]->(person)
I assumed this might not have worked because my CSV headers were initially named Comment.id and Person.id. So I removed the . and tried the query again, with the same result -
// comment_hasCreator_person_0_0.csv
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://dl.dropbox.com/s/qb4occggixmaz9g/comment_hasCreator_person_0_0.csv" AS line
MATCH (comment:Comment { id: toInt(line.Commentid)}),(person:Person { id: toInt(line.Personid)})
CREATE (comment)-[:hasCreator]->(person)
When that didn't work, I followed this answer and tried using MERGE instead of CREATE, even though it shouldn't make a difference because the relationships didn't exist in the first place -
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://www.dropbox.com/s/qb4occggixmaz9g/comment_hasCreator_person_0_0.csv?dl=0" AS line
MATCH (comment:Comment { id: toInt(line.Commentid)}),(person:Person { id: toInt(line.Personid)})
MERGE (comment)-[r:hasCreator]->(person)
RETURN comment,r, person
This query just returned "no rows".
I also tried a variation of the query where I didn't use the toInt() function, but that didn't make any difference.
To verify that the nodes exist, I selected random cell values from the CSV file and used a MATCH clause to look up the corresponding Comment and Person nodes in the database, and I did find all of them.
As the last step, I decided to create a relationship manually between the first row values from the CSV file -
MATCH (c:Comment{id:1236950581249}), (p:Person{id:10995116284808})
CREATE (c)-[r:hasCreator]->(p)
RETURN c,r,p
and this worked just fine (screenshot omitted).
I'm totally clueless as to why the relationships won't get created when I import it from the CSV file. I would appreciate any help.
You have a problem in your CSV file. The field terminator character used in it is "|" and not the default ",". You can edit your CSV file and change the field terminator character to ",", or use the FIELDTERMINATOR option available in LOAD CSV.
Try editing your query to something like this:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://www.dropbox.com/s/qb4occggixmaz9g/comment_hasCreator_person_0_0.csv?dl=0" AS line
FIELDTERMINATOR '|'
MATCH (comment:Comment { id: toInt(line.Commentid)}),(person:Person { id: toInt(line.Personid)})
MERGE (comment)-[r:hasCreator]->(person)
RETURN comment,r, person
You are missing the field terminator here, as it is | in your case instead of the default ','.
You can try this out:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "filename" AS LINE FIELDTERMINATOR '|'
MERGE (comment:Comment { id: toInt(LINE.Commentid)})
MERGE (person:Person { id: toInt(line.Personid)})
MERGE (comment) - [r:has_creator] -> (person)
RETURN comment,r,person
Another reason for this kind of error may be whitespace in the CSV file. If a line in the CSV looks like:
2a9b40bc-78f0-4e79-9b2b-441108883448, Pink node - 2, 2, pink
then index 1 of the result will be ' Pink node - 2' (notice the space at the beginning), not 'Pink node - 2'. Editing the CSV file or using the trim() function would be the solution here:
...
WHERE a.id = trim(line[0]) AND b.id = trim(line[1])
...

CSV LOAD and updating existing nodes / creating new ones

I might be on the wrong track, so I could use some helpful input. I receive data from other systems as CSV files which I can import into my DB with LOAD CSV. So far so good.
I got stuck when I needed to reload the CSV again to pick up updates. I cannot delete the former data, as I might already have additional user input attached to it, so I need a query that imports the CSV data, makes a match and, when it finds the node, just uses SET to override the existing properties. That said, I am unsure how to catch the cases where there is no node in the DB (new record) and we need to create one.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MATCH (c:Customer {code:"ABC"})
SET c.name = csvLine[0]
***OPTIONAL MATCH // Here I am unsure how to express when the node is not found***
MERGE (c:Customer { name: csvLine[0], code: csvLine[1]})
So ideally Cypher would check whether the node is there and, if so, update it by SETting the new properties coming from the CSV, or, if the node cannot be found, create a new one from the CSV data.
And, as a side note: how would I find nodes that are in the DB but not in the CSV file, in order to mark them as obsolete? (This might not be possible during the import, but maybe someone has an idea how to solve it in order to keep the DB clean of deleted records, which can only be detected by a comparison with the latest CSV import. Happy to hear any ideas.)
Any idea or hint how to write the query for updating the graph while importing?
You need to use MERGE's ON MATCH and/or ON CREATE handlers; see http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match. I assume the customer code in the second column is the identifier, so the name in column one might change on updates:
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code:csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]
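For the side note about records that disappear from the CSV, one common pattern (a sketch, not part of the original answer) is to flag everything before the import, clear the flag on every row the CSV still contains, and treat whatever stays flagged as obsolete:
// 1. before the import: flag every existing customer
MATCH (c:Customer) SET c.obsolete = true;

// 2. the import itself: upsert and clear the flag
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code: csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]
SET c.obsolete = false;

// 3. after the import: anything still flagged was not in the latest CSV
MATCH (c:Customer {obsolete: true}) RETURN c;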