I might be on the wrong track, so I could use some input. I receive data from other systems as CSV files, which I can import into my DB with LOAD CSV. So far so good.
I got stuck when I needed to reload the CSV to pick up updates. I cannot delete the existing data because users may already have attached additional input to it, so I need a query that imports the CSV data, tries to match each row, and, when it finds the node, uses SET to overwrite the existing properties. That said, I am unsure how to handle the case where there is no node in the DB (a new record) and one needs to be created.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MATCH (c:Customer {code:"ABC"})
SET c.name = name: csvLine[0]
***OPTIONAL MATCH // Here I am unsure how to express when the node is not found***
MERGE (c:Customer { name: csvLine[0], code: csvLine[1]})
So ideally Cypher would check whether the node exists and update it by SETting the new property values from the CSV or, if the node cannot be found, create a new one from the CSV data.
And as a side note: how would I find nodes that are in the DB but not in the CSV file, in order to mark them as obsolete? (This might not be possible during the import, but maybe someone has an idea how to keep the DB clean of deleted records, which can only be detected by comparing against the latest CSV import. I am happy for every idea.)
Any idea or hint on how to write the query for updating the graph while importing?
You need to use MERGE's ON MATCH and/or ON CREATE handlers, see http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match. I assume the customer code in the second column is the identifier, so the name in column one might change on updates:
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code:csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]
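To make the semantics concrete, here is a minimal pure-Python sketch of what the MERGE above does, plus one possible approach to the side question about obsolete records. It is only an in-memory illustration, not Neo4j code: the dict `db`, the function `upsert_customers`, and the `obsolete` flag are hypothetical names, and the sample data is made up. The column order (name first, code second) follows the query above.

```python
import csv
import io

def upsert_customers(db, csv_text):
    """Apply MERGE-like upsert semantics; db maps customer code -> properties."""
    seen = set()
    for name, code in csv.reader(io.StringIO(csv_text)):
        node = db.setdefault(code, {})  # like ON CREATE: add the node if missing
        node["name"] = name             # overwrite only the imported property
        node["obsolete"] = False        # hypothetical flag, not in the original
        seen.add(code)
    # Codes present in the DB but absent from this CSV are candidates for "obsolete".
    for code in db.keys() - seen:
        db[code]["obsolete"] = True
    return db

# Made-up data: ABC carries extra user input, XYZ is missing from the new CSV.
db = {"ABC": {"name": "Old name", "note": "user input"}, "XYZ": {"name": "Gone"}}
upsert_customers(db, "New name,ABC\nFresh,DEF\n")
```

After the call, ABC keeps its user-entered `note` while its name is updated, DEF is created, and XYZ is flagged. In the graph itself, a similar effect could probably be had by stamping each merged node with an import marker and then matching the nodes that lack the latest marker, but that is an assumption about your setup rather than something from the answer above.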
I started pulling a Glue table with pyathena last week. One annoying thing I noticed is that when I run the code shown below, sometimes it works and returns a pandas DataFrame, but other times it creates a CSV file and a CSV metadata file in the S3 folder where the physical data (Parquet) is stored and registered in Glue.
I know that using the pandas cursor may produce these two files, but I wonder whether I can access the data without them, because every time these two files are generated in S3, my read process fails.
Thank you!
import os

import pandas as pd
from pyathena import connect

# Credentials are read from the environment
access_key_id = os.getenv('AWS_ACCESS_KEY_ID')
secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY')
connect1 = connect(s3_staging_dir='s3://xxxxxxxxxxxxx')
df = pd.read_sql("select * from abc.table_name", connect1)
df.head()
Go to Athena.
Click Settings -> workgroup name -> Edit workgroup.
Update "Query result location" so that it points somewhere other than the folder holding the table data.
Tick "Override client-side settings".
Note: If you have not set up any other workgroups for your Athena environment, you should find only one workgroup, named "Primary".
This should resolve your problem. For more information you can read:
https://docs.aws.amazon.com/athena/latest/ug/querying.html
I am using the following LOAD CSV Cypher statement to import a CSV file with about 3.5m records, but it only imports about 3.2m. So about 300,000 records are not imported.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM ("file:///path/to/csvfile.csv") as line
CREATE (ticket:Ticket {id: line.transaction_hash, from_stop: toInt(line.from_stop), to_stop: toInt(line.to_stop), ride_id: toInt(line.ride_id), price: toFloat(line.price)})
MATCH (from_stop:Stop)-[r:RELATES]->(to_stop:Stop) WHERE toInt(line.route_id) in r.routes
CREATE (from_stop)-[:CONNECTS {ticket_id: ID(ticket)}]->(to_stop)
Note that Stop nodes are already created in separate import statement.
When I only created Nodes without creating relationships it was able to import all data. This same import statement works fine with smaller set of same format csv data.
I tried twice just to make sure it wasn't terminated accidentally.
Is there a node-to-relationship limit in Neo4j? Or what else could be the reason?
Neo4J version: 3.0.3 size of database directory is 5.31 GiB.
This is probably because whenever the MATCH does not succeed for a line, the entire query for that line (including the first CREATE) also fails.
On the other hand, the failure of an OPTIONAL MATCH would not abort the entire query for a line. Try this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM ("file:///path/to/csvfile.csv") as line
CREATE (ticket:Ticket {id: line.transaction_hash, from_stop: toInt(line.from_stop), to_stop: toInt(line.to_stop), ride_id: toInt(line.ride_id), price: toFloat(line.price)})
OPTIONAL MATCH (from:Stop)-[r:RELATES]->(to:Stop)
WHERE toInt(line.route_id) in r.routes
FOREACH(x IN CASE WHEN from IS NULL THEN NULL ELSE [1] END |
CREATE (from)-[:CONNECTS {ticket_id: ID(ticket)}]->(to)
);
The FOREACH clause uses a somewhat roundabout technique to only CREATE the relationship if the OPTIONAL MATCH succeeded for a line.
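As a rough analogy only (plain Python, not Cypher), the difference in how many rows survive each clause can be sketched like this. The functions `match` and `optional_match`, the `routes` lookup, and the sample tickets are all hypothetical and made up for illustration:

```python
def match(rows, routes):
    """Like MATCH: a row with no matching route yields nothing at all."""
    return [(row, stops) for row in rows
            for stops in routes.get(row["route_id"], [])]

def optional_match(rows, routes):
    """Like OPTIONAL MATCH: every row survives; unmatched rows pair with None."""
    return [(row, stops) for row in rows
            for stops in (routes.get(row["route_id"]) or [None])]

# Made-up data: route 1 connects stops A and B, route 9 is unknown.
routes = {1: [("A", "B")]}
rows = [{"id": "t1", "route_id": 1}, {"id": "t2", "route_id": 9}]

kept_by_match = match(rows, routes)
kept_by_optional = optional_match(rows, routes)
```

Here `kept_by_match` contains only the row for t1, while `kept_by_optional` keeps both rows, with None standing in for the missing stops of t2. The FOREACH/CASE construct above is what then skips the relationship creation for such None rows.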
I have created a set of nodes from a CSV import and labelled them as 'Argument'.
I have another CSV file which contains Connector_ID, Start_Object_ID, End_Object_ID which I want to:
Create the relationship (from start object to end object)
Add the value of the Connector_ID to the relationship created
At the moment I've only got as far as failing to create the relationships (valid syntax but does nothing) using:
LOAD CSV WITH HEADERS FROM "file:///Users/argument_has_part_argument.txt" AS row
MATCH (argument1:Argument {object_ID: row.Start_Object_ID})
MATCH (argument2:Argument {object_ID: row.End_Object_ID})
MERGE (argument1)-[:has_part]->(argument2);
but cannot yet see
why it fails to do anything
how to get it to create a relationship
and how to add the Connector_ID to the connector so created.
Any pointers?
from: http://neo4j.com/developer/guide-import-csv/#_csv_data_quality
Cypher
What Cypher sees, is what will be imported, so you can use that to your advantage. You can use LOAD CSV without creating graph structure and just output samples, counts or distributions. So it is also possible to detect incorrect header column counts, delimiters, quotes, escapes or spelling of header names.
// assert correct line count
LOAD CSV FROM "file-url" AS line
RETURN count(*);
// check first few raw lines
LOAD CSV FROM "file-url" AS line WITH line
RETURN line
LIMIT 5;
// check first 5 line-sample with header-mapping
LOAD CSV WITH HEADERS FROM "file-url" AS line WITH line
RETURN line
LIMIT 5;
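The same kind of sanity check can also be done outside the database before importing. Below is a minimal sketch using Python's standard csv module; the function name `csv_report` and the sample text are hypothetical, and the header names are borrowed from the question:

```python
import csv
import io

def csv_report(text, sample=5):
    """Summarise a CSV: line count, header, rows with a wrong column count."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0] if rows else []
    data = rows[1:]
    return {
        "line_count": len(rows),
        "header": header,
        # Line numbers (1-based, header is line 1) whose width differs from the header.
        "bad_lines": [n for n, row in enumerate(data, start=2)
                      if len(row) != len(header)],
        "sample": data[:sample],
    }

# Made-up sample: the third line is missing its End_Object_ID column.
report = csv_report("Connector_ID,Start_Object_ID,End_Object_ID\n1,10,20\n2,11\n")
```

A non-empty `bad_lines` list or an unexpected header spelling is a hint that the MATCH clauses may silently find nothing for some rows.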
For your last question:
LOAD CSV WITH HEADERS FROM "file:///Users/argument_has_part_argument.txt" AS row
MATCH (argument1:Argument {object_ID: row.Start_Object_ID})
MATCH (argument2:Argument {object_ID: row.End_Object_ID})
MERGE (argument1)-[r:has_part]->(argument2)
ON CREATE SET r.connector_ID = row.Connector_ID;
I have used a command like this to successfully create named nodes from csv:
load csv with headers from "file:/Users/lwyglend/Developer/flavourGroups.csv" as
flavourGroup
create (fg {name: flavourGroup.flavourGroup})
set fg:flavourGroup
return fg
However I am not having any luck using load from csv to create relationships with a similar command:
load csv with headers from "file:/Users/lwyglend/Developer/flavoursByGroup.csv" as
relationship
match (flavour {name: relationship.flavour}),
(flavourGroup {name: relationship.flavourGroup})
create flavour-[:BELONGS_TO]->flavourGroup
From a headed csv file that looks a bit like this:
flavour, flavourGroup
fish, marine
caviar, marine
There are no errors, the command seems to execute, but no relationships are actually created.
If I do a simple match on name: fish and name: marine and then construct the belongs to relationship between the fish and marine pre-existing nodes with cypher, the relationship sets up fine.
Is there a problem with importing from csv? Is my code wrong somehow? I have played around with a few different things but as a total newb to neo4j would appreciate any advice you have.
Wiggle,
I don't know for sure if this is your problem, but I discovered that if you have spaces after your commas in your CSV file (as you show in your example), they appear to be included as part of the field names and field contents. When I made a CSV file like the one you showed and tried to load it, I found that it failed. When I took out the spaces, I found that it succeeded.
As a test, try this query:
LOAD CSV WITH HEADERS FROM "file:/Users/lwyglend/Developer/flavoursByGroup.csv" AS line
RETURN line.flavourGroup
then try this query:
LOAD CSV WITH HEADERS FROM "file:/Users/lwyglend/Developer/flavoursByGroup.csv" AS line
RETURN line.` flavourGroup`
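The same effect is easy to reproduce outside Neo4j. This small sketch uses Python's csv module on the sample file from the question; it shows how Python's parser treats the spaces, which I am assuming mirrors what LOAD CSV does here, and `skipinitialspace` is a Python option, not a Neo4j one:

```python
import csv
import io

raw = "flavour, flavourGroup\nfish, marine\ncaviar, marine\n"

# Default parsing keeps the space after each comma in both names and values.
naive = list(csv.DictReader(io.StringIO(raw)))

# skipinitialspace=True drops whitespace that directly follows a delimiter.
cleaned = list(csv.DictReader(io.StringIO(raw), skipinitialspace=True))

print(list(naive[0].keys()))
print(list(cleaned[0].keys()))
```

With default parsing the second column is named " flavourGroup" and holds " marine", each with a leading space, so a lookup by the plain name finds nothing. Cleaning the file this way before import is one way around the problem.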
Grace and peace,
Jim
I'm a bit late in answering your question, but I don't think the spaces alone are the culprit. In your example cypher there is no association to the actual nodes in your database, only the csv alias named "relationship".
Try something along this line instead:
load csv with headers from "file:/Users/lwyglend/Developer/flavoursByGroup.csv" as
relationship
match (f:flavour), (fg:flavourGroup)
where f.name = relationship.flavour and
fg.name = relationship.flavourGroup
create (f)-[:BELONGS_TO]->(fg)
I'm importing a csv file of contacts and where one parent has many children it leaves the duplicated values blank. I need to make sure that they are populated when they reach the database however.
Is there a way that I can implement the following when I'm importing a .csv file into Perl and then exporting into MySQL?
if (value is null)
value = value above.
Thanks!
Why don't you keep the individual values you read from the CSV file in an array (e.g. @FIELD_DATA)? Then, when you encounter an empty field while iterating over a row (e.g. column 4), you can write
unless (length($CSV_FIELD[4])) {
$CSV_FIELD[4] = $FIELD_DATA[4]
}
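The same fill-down idea, applied to every column and written as a minimal Python sketch rather than Perl. The function name `fill_down` and the sample contact rows are made up for illustration, and an empty string is assumed to be how a blank field arrives:

```python
def fill_down(rows):
    """Replace empty fields with the value from the same column one row up."""
    previous = []
    filled = []
    for row in rows:
        current = [
            # Keep the value unless it is blank and a value exists above it.
            value if value != "" or i >= len(previous) else previous[i]
            for i, value in enumerate(row)
        ]
        filled.append(current)
        previous = current  # already-filled values carry down through blank runs
    return filled

# Made-up sample: the second child row leaves the parent columns blank.
rows = [
    ["Smith", "12 High St", "Anna"],
    ["", "", "Ben"],
    ["Jones", "3 Low Rd", "Cara"],
]
result = fill_down(rows)
```

Because `previous` is set to the filled row rather than the raw one, several blank rows in a row all inherit the last non-empty value.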
Not with an import statement afaik. You could, however, make use of triggers (http://dev.mysql.com/doc/refman/5.0/en/triggers.html). Keep in mind though, that this will seriously impact the performance of the import statement.
Also: if they are duplicate values you should have a critical look at your database model or your setup overall.