Load CSV file and create nodes and relationships

I have a CSV file like the sample below, and I want to import it into Neo4j and create nodes and relationships.
"N_ID","Name","Relationship","A_ID","Address"
"N_01","John Doe","resident","A_01","1138 Mapleview Drive"
"N_02","Jane Doe","resident","A_01","1138 Mapleview Drive"
"N_03","Randall L Russo","visitor","A_02","866 Sweetwood Drive"
"N_04","Sam B Haley","resident","A_03","152 Point Street"
"N_01","John Doe","mailing address","A_04","3490 Horizon Circle"
I'm able to create nodes using the code below, but I don't know how to create the relationships based on the CSV file.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
'file://contacts.csv' AS line
CREATE (:Person {ID:line.N_ID, name:line.Name})
I tried this, but it doesn't work.
CREATE (:Person {N_ID:line.N_ID, Name:line.Name})-[:line.Relationship]-> (:Address {A_ID:line.A_ID, Address:line.Address})
Please bear with me; I'm new to Neo4j.

Install the APOC plugin and try this query:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file://contacts.csv' AS line
MERGE (p1:Person {N_ID:line.N_ID})
ON CREATE SET p1.Name=line.Name
MERGE (a1:Address {A_ID:line.A_ID})
ON CREATE SET a1.Address=line.Address
WITH a1,p1,line
CALL apoc.merge.relationship(p1,line.Relationship,{},{},a1) YIELD rel
RETURN count(*);
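The plain CREATE in the question fails because, in the Neo4j versions this thread targets, Cypher does not accept an expression such as line.Relationship as a relationship type; that is why the answer uses APOC. If installing APOC is not an option, one possible workaround is to generate one static statement per distinct Relationship value. The Python sketch below only builds the statement text. The helper names (to_rel_type, build_statements) and the upper-snake-case type names are assumptions made for illustration, not part of the original answer.

```python
import csv
import io
import re

# A few rows from the question's sample file.
SAMPLE = '''"N_ID","Name","Relationship","A_ID","Address"
"N_01","John Doe","resident","A_01","1138 Mapleview Drive"
"N_03","Randall L Russo","visitor","A_02","866 Sweetwood Drive"
"N_01","John Doe","mailing address","A_04","3490 Horizon Circle"
'''

def to_rel_type(value):
    """Turn free text such as 'mailing address' into MAILING_ADDRESS."""
    return re.sub(r"\W+", "_", value.strip()).upper()

def build_statements(csv_text, url="file://contacts.csv"):
    """Return one LOAD CSV statement per distinct Relationship value."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = sorted({row["Relationship"] for row in rows})
    template = (
        "LOAD CSV WITH HEADERS FROM '{url}' AS line\n"
        "WITH line WHERE line.Relationship = '{value}'\n"
        "MERGE (p:Person {{N_ID: line.N_ID}})\n"
        "ON CREATE SET p.Name = line.Name\n"
        "MERGE (a:Address {{A_ID: line.A_ID}})\n"
        "ON CREATE SET a.Address = line.Address\n"
        "MERGE (p)-[:{rel_type}]->(a);"
    )
    # Note: values containing a single quote would need escaping first.
    return [
        template.format(url=url, value=v, rel_type=to_rel_type(v))
        for v in values
    ]

statements = build_statements(SAMPLE)
for statement in statements:
    print(statement, end="\n\n")
```

Each generated statement filters the file down to one relationship value, so the type in the MERGE pattern is a literal and no plugin is needed. This only stays practical while the number of distinct values is small.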

Related

Read S3 CSV file and insert into RDS mysql using AWS Glue

I have a CSV file in an S3 bucket that is refreshed every week with new data generated by an ML model. I have created an ETL pipeline in AWS Glue to read the CSV file from the S3 bucket and load it into RDS (MySQL server). I have connected to my RDS instance via SSMS. I was able to load the data into RDS successfully and validate the correct row count of 5000. When I run the job again, the whole table is appended again; that is, the same CSV contents are inserted a second time. Here is the sample code:
datasink5 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice4, database = "<dbname>", table_name = "<table schema name>", transformation_ctx = "datasink5")
Next week, when I run my model, there will be 1000 new rows in that CSV file. So when I run my ETL job in Glue, it should append the 1000 new rows to the 5000 rows loaded previously, and the total row count should be 6000.
Can anyone tell me how to achieve this? Is there any way to truncate or drop the table before inserting all the new data? That way we could avoid duplication.
Note: I will have to run the crawler every week to read the data from the S3 bucket, which contains the new rows along with the existing ones.
Sample code generated by AWS Glue:
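import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)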
Any help would be appreciated.

ETL script in Python to load data from another server .csv file into mysql

I work as a business analyst and am new to Python.
In one of my projects, I want to extract data from a .csv file and load it into my MySQL DB (staging).
Can anyone guide me with sample code and suggest frameworks I should use?
Here is a simple program that creates a SQLite database. You can read the CSV file and use dynamic_data_entry to insert into your desired target table.
import sqlite3
import time
import datetime
import random
conn = sqlite3.connect('test.db')
c = conn.cursor()
def create_table():
    # Create the target table on first run.
    c.execute('create table if not exists stuffToPlot(unix REAL, datestamp TEXT, keyword TEXT, value REAL)')
def data_entry():
    # Insert one hard-coded row; the connection stays open for later calls.
    c.execute("INSERT INTO stuffToPlot VALUES(1452549219,'2016-01-11 13:53:39','Python',6)")
    conn.commit()
def dynamic_data_entry():
    # Insert one row built from runtime values, using ? placeholders.
    unix = time.time()
    date = str(datetime.datetime.fromtimestamp(unix).strftime('%Y-%m-%d %H:%M:%S'))
    keyword = 'python'
    value = random.randrange(0,10)
    c.execute("INSERT INTO stuffToPlot(unix,datestamp,keyword,value) values(?,?,?,?)",
              (unix,date,keyword,value))
    conn.commit()
def read_from_db():
    # Print every row in the table.
    c.execute('select * from stuffToPlot')
    for row in c.fetchall():
        print(row)
create_table()
dynamic_data_entry()
read_from_db()
c.close()
conn.close()
You can iterate through the data in the CSV file and load it into SQLite. Please refer to the link below as well.
Quick easy way to migrate SQLite3 to MySQL?
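The "iterate through the CSV and insert" step described above could look roughly like the following sketch. It uses only the standard library and an in-memory SQLite database; the CSV content, the table name staging and the helper load_csv_into_staging are made up for illustration, and every column is stored as TEXT, which suits a raw staging table but is an assumption.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be
# open("your_file.csv", newline="") instead of io.StringIO.
CSV_TEXT = """id,name,amount
1,alpha,10.5
2,beta,7.25
3,gamma,3.0
"""

def load_csv_into_staging(conn, csv_file, table="staging"):
    """Create a staging table from the CSV header and bulk-insert all rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
    # executemany pulls rows from the reader one at a time,
    # so the file is never held in memory as a whole.
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_csv_into_staging(conn, io.StringIO(CSV_TEXT))
rows = conn.execute("SELECT * FROM staging ORDER BY id").fetchall()
print(rows)
```

The same pattern should carry over to a MySQL driver that follows the Python DB-API, although the placeholder is typically %s rather than ? there.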
If that's a properly formatted CSV file you can use the LOAD DATA INFILE MySQL command and you won't need any python. Then after it is loaded in the staging area (without processing) you can continue transforming it using sql/etl tool of choice.
https://dev.mysql.com/doc/refman/8.0/en/load-data.html
One drawback is that you have to load all of the columns. Still, even if the file contains data you don't need, you might prefer to load everything into staging.

Shapefile to MDB with custom field structure [duplicate]

This question already has an answer here:
How to create table in mdb from dbf query
(1 answer)
Closed 6 years ago.
I have a shapefile with 80,000 polygons that are grouped by a specific field called "OTA".
I wanted to convert each shapefile (its attribute table) to an MDB database (not a Personal Geodatabase) containing one table with the same name as the shapefile and a given field structure.
In the code I used, I had to load two new Python modules:
pypyodbc and adodbapi
The first module is used to create the MDB file for each shapefile, and the second to create the table in the MDB and fill it with the data from the shapefile's attribute table.
The code I came up with is the following:
import arcpy
import pypyodbc
import adodbapi
Folder = ur'C:\TestPO' # Folder to save the mdbs
FD = Folder+ur'\27ALLPO.shp' # Shapefile
Map = u'PO' # Map type
N = u'27' # Prefecture
# Distinct OTA values; one mdb is created per value
OTAList = sorted(set([row[0] for row in arcpy.da.SearchCursor(FD,('OTA'))]))
cnt = 0
for OTAvalue in OTAList:
    cnt += 1
    dbname = N+OTAvalue+Map
    pypyodbc.win_create_mdb(Folder+'\\'+dbname+'.mdb')
    conn_str = (r"Provider=Microsoft.Jet.OLEDB.4.0;Data Source="+Folder+"\\"+dbname+ur".mdb;")
    conn = adodbapi.connect(conn_str)
    crsr = conn.cursor()
    SQL = "CREATE TABLE ["+dbname+"] ([FID] INT,[AREA] FLOAT,[PERIMETER] FLOAT,[KA_PO] VARCHAR(10),[NOMOS] VARCHAR(2),[OTA] VARCHAR(3),[KATHGORPO] VARCHAR(2),[KATHGORAL1] VARCHAR(2),[KATHGORAL2] VARCHAR(2),[LABEL_PO] VARCHAR(8),[PHOTO_45] VARCHAR(14),[PHOTO_60] VARCHAR(10),[PHOTO_PO] VARCHAR(8),[POLY_X_CO] DECIMAL(10,3),[POLY_Y_CO] DECIMAL(10,3),[PINAKOKXE] VARCHAR(11),[LANDTYPE] DECIMAL(2,0));"
    crsr.execute(SQL)
    conn.commit()
    # Copy only the records that belong to the current OTA value
    with arcpy.da.SearchCursor(FD,['FID','AREA','PERIMETER','KA_PO','NOMOS','OTA','KATHGORPO','KATHGORAL1','KATHGORAL2','LABEL_PO','PHOTO_45','PHOTO_60','PHOTO_PO','POLY_X_CO','POLY_Y_CO','PINAKOKXE','LANDTYPE'],'"OTA" = \'{}\''.format(OTAvalue)) as cur:
        for row in cur:
            crsr.execute("INSERT INTO "+dbname+" VALUES ("+str(row[0])+","+str(row[1])+","+str(row[2])+",'"+row[3]+"','"+row[4]+"','"+row[5]+"','"+row[6]+"','"+row[7]+"','"+row[8]+"','"+row[9]+"','"+row[10]+"','"+row[11]+"','"+row[12]+"',"+str(row[13])+","+str(row[14])+",'"+row[15]+"',"+str(row[16])+");")
    conn.commit()
    crsr.close()
    conn.close()
    print (u'«'+OTAvalue+u'» ('+str(cnt)+u'/'+str(len(OTAList))+u')')
Executing this code took about 5 minutes to complete the task for about 140 mdbs.
As you can see, I execute an "INSERT INTO" statement for each record of the shapefile.
Is this the correct way (and probably the fastest) or should I collect all the statements for each "OTA" and execute them all together?
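One thing that may matter more than how the INSERT statements are grouped: the loop above opens a new search cursor for every OTA value, so the shapefile is scanned once per MDB. A possible alternative is to read the attribute table once and group the rows in memory. The sketch below shows only the grouping step, with made-up tuples standing in for the rows an arcpy.da.SearchCursor would yield; group_by_field is a hypothetical helper, and holding all 80,000 rows in memory is assumed to be acceptable.

```python
from collections import defaultdict

# Stand-in for the rows of one arcpy.da.SearchCursor pass: (FID, AREA, OTA).
# The values are invented for illustration.
rows = [
    (0, 12.5, "001"),
    (1, 8.0, "002"),
    (2, 3.25, "001"),
    (3, 9.75, "003"),
]

def group_by_field(records, key_index):
    """Group records by the value at key_index in a single pass."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return dict(groups)

grouped = group_by_field(rows, key_index=2)
for ota_value in sorted(grouped):
    batch = grouped[ota_value]
    # Each batch could then be written to its own table, for example with
    # one parameterized cursor.executemany(...) call if the driver offers it.
    print(ota_value, len(batch))
```

Passing the values as parameters instead of concatenating them into the SQL string would also avoid failures on text fields that contain a single quote.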
I don't think anyone's going to write your code for you, but if you try some VBA yourself, and tell us what happened and what worked and what you're stuck on, you'll get a great response.
Saying that - to start with I don't see any reason to use VB6 when you can use VBA right inside your mdb file.
Use the Dir function, and possibly FileSystemObject, to loop through all DBFs in a given folder, or use the FileDialog object to select multiple files in one go.
Then process each file with the DoCmd.TransferDatabase command:
DoCmd.TransferDatabase _
    TransferType:=acImport, _
    DatabaseType:="dBASE III", _
    DatabaseName:="your-dbf-filepath", _
    ObjectType:=acTable, _
    Source:="Source", _
    Destination:="your-newtbldbf"
Finally process each dbf import with a make table query
Look at results and see what might have to be changed based on field types before and after.
Then .... edit your post and let us know how it went
In theory you could do something like this by searching the directory the DBF files reside in, writing those filenames to a table, then loop through the table and, for each filename, scan the DBF for tables and their fieldnames/datatypes and create those tables in your MDB. You could also bring in all the data from the tables, all within a series of loops.
In theory, you could.
In practice, you can't. And you can't, because DBF and MDB support different data types that aren't compatible.
I suppose you could create a "crosswalk" table such that for each datatype in DBF there is a corresponding, hand-picked datatype in MDB and use that when you're creating the table, but it's probably going to either fail to import some of the data or import corrupted data. And that's assuming you can open a DBF for reading the same way you can open an MDB for reading. Can you run OpenDatabase on a DBF from inside Access? I don't even have the answer to that.
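A "crosswalk" of the kind described above can be sketched as a small lookup table. The Python example below maps the common dBASE field type codes to Access (Jet) SQL column types and builds a CREATE TABLE statement from a field list. The particular type choices, the helper names and the sample fields are hand-picked assumptions; as the answer warns, such a mapping can still lose or distort data and needs checking against the real files.

```python
# Hand-picked crosswalk from dBASE field type codes to Access (Jet) SQL types.
# These choices are assumptions for illustration, not an authoritative mapping.
DBF_TO_ACCESS = {
    "C": "TEXT({length})",  # character
    "N": "DOUBLE",          # numeric with decimals
    "F": "DOUBLE",          # float
    "D": "DATETIME",        # date
    "L": "YESNO",           # logical
    "M": "MEMO",            # memo
}

def access_column(name, dbf_type, length=0, decimals=0):
    """Return one column definition; unknown type codes raise KeyError."""
    if dbf_type == "N" and decimals == 0:
        # Whole numbers of up to 9 digits fit a 32-bit LONG.
        sql_type = "LONG" if length <= 9 else "DOUBLE"
    else:
        sql_type = DBF_TO_ACCESS[dbf_type].format(length=length)
    return "[{}] {}".format(name, sql_type)

def create_table_sql(table, fields):
    """Build a CREATE TABLE statement from (name, type, length, decimals) tuples."""
    columns = ", ".join(access_column(*field) for field in fields)
    return "CREATE TABLE [{}] ({});".format(table, columns)

# Hypothetical field list, as it might be read from a DBF header.
fields = [("NAME", "C", 10, 0), ("QTY", "N", 5, 0), ("PRICE", "N", 10, 2)]
sql = create_table_sql("T", fields)
print(sql)
```

Letting unknown type codes raise an error, rather than silently falling back to text, makes the gaps in the crosswalk visible before any data is imported.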
I wouldn't recommend that you do this. The reason that you're doing it is because you want to keep the structure as similar as possible when migrating from dBase/FoxBase to Access. However, the file structure is different between them.
As you are aware, each .DBF ("Database file") file is a table, and the folder or directory in which the .DBF files reside constitutes the "database". With Access, all the tables in one database are in one .MDB ("Microsoft Database") file.
If you try to put each .DBF file in a separate .MDB file, you will have no end of trouble getting the .MDB files to interact. Access treats different .MDB files as different databases, not different tables in the same database, and you will have to do strange things like link all the separate databases just to have basic relational functionality. (I tried this about 25 years ago with Paradox files, which are also a one-file-per-table structure. It didn't take me long to decide it was easier to get used to the one-file-per-database concept.) Do yourself a favor, and migrate all of the .DBF files in one folder into a single .MDB file.
As for what you ought to do with your code, I'd first suggest that you use ADO rather than DAO. But if you want to stick with DAO because you've been using it, then you need to have one connection to the dBase file and another to the Access database. As far as I can tell, you don't have the dBase connection. I've never tried what you're doing before, but I doubt you can use a SQL statement to select directly from a .dbf file in the way you're doing. (I could be wrong, though; Microsoft has come up with stranger things over the years.)
It would

Neo4j: Import of data from CSV (PostgreSQL) does not complete

I want to move one table with self reference from PostgreSQL to Neo4j.
PostgreSQL:
COPY (SELECT * FROM "public".empbase) TO '/tmp/empbase.csv' WITH CSV header;
Result:
$ cat /tmp/empbase.csv | head
e_id,e_name,e_bossid
1,emp_no_1,
2,emp_no_2,
3,emp_no_3,
4,emp_no_4,
5,emp_no_5,3
6,emp_no_6,2
7,emp_no_7,3
8,emp_no_8,1
9,emp_no_9,4
Size:
$ du -h /tmp/empbase.csv
631M /tmp/empbase.csv
I import data to neo4j with:
neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
> CREATE (:EmpBase:_EmpBase { neo_eb_id: toInt(row.e_id),
> neo_eb_bossID: toInt(row.e_bossid),
> neo_eb_name: row.e_name});
and this works fine:
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 20505764
Properties set: 61517288
Labels added: 41011528
846284 ms
The Neo4j console says:
Location:
/home/neo4j/data/graph.db
Size:
5.54 GiB
But then I want to proceed with the relation that each emp has a boss. So simple emp->bossid SELF reference.
Now I do it like this:
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
But this runs for 5-6 hours and then fails with system failures; it freezes the system.
I think this might be terribly wrong.
1. Am I doing something wrong, or is it a bug in Neo4j?
2. Why does a 631 MB CSV turn into 5.5 GB?
EDIT1:
$ du -h /home/neo4j/data/
20K /home/neo4j/data/graph.db/index
899M /home/neo4j/data/graph.db/schema/index/lucene/1
899M /home/neo4j/data/graph.db/schema/index/lucene
899M /home/neo4j/data/graph.db/schema/index
27M /home/neo4j/data/graph.db/schema/label/lucene
27M /home/neo4j/data/graph.db/schema/label
925M /home/neo4j/data/graph.db/schema
6,5G /home/neo4j/data/graph.db
6,5G /home/neo4j/data/
SOLUTION:
Wait until :schema in the console says ONLINE, not POPULATING
Change the log size in the config file
Add USING PERIODIC COMMIT 1000 to the second CSV import
Index only on the label
Only match on one label: MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
Did you create the index? CREATE INDEX ON :EmpBase(neo_eb_id);
Then wait for the index to come online (check :schema in the browser).
Or, if it is a unique id: CREATE CONSTRAINT ON (e:EmpBase) ASSERT e.neo_eb_id IS UNIQUE;
Otherwise your match will scan all nodes in the database for each MATCH.
For your second question, I think it's the transaction log files,
you can limit their size in conf/neo4j.properties with
keep_logical_logs=100M size
The actual nodes and properties files shouldn't be that large. Also you don't have to store the boss-id in the database. That's actually handled by the relationship :)
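To make the point about the index concrete, here is a toy Python model rather than anything Neo4j-specific: a list stands in for the nodes with a label, a scan walks that list for every lookup, and a dictionary built once plays the role of the index. The node count and function names are invented for illustration.

```python
# Toy model: 10,000 "nodes" carrying the property used in the MATCH.
nodes = [
    {"neo_eb_id": i, "neo_eb_name": "emp_no_%d" % i} for i in range(1, 10001)
]

def match_by_scan(node_list, emp_id):
    """No index: examine nodes one by one until the id matches."""
    steps = 0
    for node in node_list:
        steps += 1
        if node["neo_eb_id"] == emp_id:
            return node, steps
    return None, steps

# Built once up front, which is roughly what creating the index does.
index = {node["neo_eb_id"]: node for node in nodes}

def match_by_index(idx, emp_id):
    """With an index: one lookup, however many nodes exist."""
    return idx.get(emp_id)

node, steps = match_by_scan(nodes, 9000)
print(steps)
print(match_by_index(index, 9000) is node)
```

In the import above, every CSV row performs two such lookups against roughly 20 million nodes, so without an index the work grows with rows times nodes, which is consistent with the multi-hour run described in the question.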

Create Neo4j database using CSV files

I have 2 CSV files which I want to convert into a Neo4j database. They look like this:
first file:
name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
second file
name,motif
Aminomonas paucivorans,GGAGNNNNNGGC
Aminomonas paucivorans,GGAGNNNNNGGC
Bacillus cellulosilyticus,CCCNNNNNCTC
As you can see, the common factor is the name of the organism. Each organism will have a few enzymes, and each enzyme will have one motif. Motifs can be the same between enzymes. I used the following statements to create my database:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file1.csv" AS csvLine
MATCH (o:Organism { name: csvLine.name}),(e:Enzyme { name: csvLine.enzyme})
CREATE (o)-[:has_enzyme]->(e) //or maybe CREATE UNIQUE?
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file2.csv" AS csvLine
MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})
CREATE (o)-[:has_motif]->(m) //or maybe CREATE UNIQUE?
This gives me an error on the very first line, at USING PERIODIC COMMIT, which says Invalid input 'S': expected. If I get rid of it, the next error I get is WITH is required between CREATE and LOAD CSV (line 6, column 1)
"MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})". I googled this issue, which led me to this answer. I tried the answer given there (refreshing the browser cache), but the problem persists. What am I doing wrong here? Is the query correct? Is there another solution to this issue? Any help will be greatly appreciated.
Your queries have two issues at once:
You can't refer to a local file with just "file1.csv", because Neo4j expects a URL.
You're using MATCH in cases where the data may not exist yet; you need to use MERGE there instead, which basically acts like the CREATE UNIQUE you mention in your comments.
I don't know what the source of your specific error message is, but as written it doesn't look like these queries could possibly work. Here are your queries reformulated so that they will work (I tested them on my machine with your CSV samples):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/myuser/tmp/file1.csv" AS csvLine
MERGE (o:Organism { name: coalesce(csvLine.name, "No Name")})
MERGE (e:Enzyme { name: csvLine.enzyme})
MERGE (o)-[:has_enzyme]->(e);
Notice here 3 merge statements (MERGE basically does MATCH + CREATE if it doesn't already exist), and the fact that I've used a file: URL.
The second query gets formulated basically the same way:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/myuser/tmp/file2.csv" AS csvLine
MERGE (o:Organism { name: coalesce(csvLine.name, "No Name")})
MERGE (m:Motif { name: csvLine.motif})
MERGE (o)-[:has_motif]->(m);
EDIT I added coalesce in the Organism's name property. If you have null values for name in the CSV, then the query would otherwise fail. Coalesce guarantees that if csvLine.name is null, then you'll get back "No Name" instead.
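The combined effect of MERGE and coalesce can be imitated in a few lines of Python, using sets as a stand-in for the graph. This is only a model of the behaviour described above, not how Neo4j works internally. The last CSV row (with the missing name) is hypothetical and added for illustration, and the empty field is treated here as null, as an unquoted empty field in LOAD CSV is assumed to be.

```python
import csv
import io

# The question's first file plus one hypothetical row with a missing name.
FILE1 = """name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
,M1.Unknown
"""

def coalesce(*values):
    """Return the first value that is not None, like Cypher's coalesce()."""
    for value in values:
        if value is not None:
            return value
    return None

def merge_file(csv_text, nodes, rels):
    """Get-or-create every node and relationship; sets ignore repeats."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["name"] or None  # empty field modelled as null
        organism = ("Organism", coalesce(name, "No Name"))
        enzyme = ("Enzyme", row["enzyme"])
        nodes.update([organism, enzyme])
        rels.add((organism, "has_enzyme", enzyme))

nodes, rels = set(), set()
merge_file(FILE1, nodes, rels)
first_run = (len(nodes), len(rels))
merge_file(FILE1, nodes, rels)  # importing the same file again
second_run = (len(nodes), len(rels))
print(first_run, second_run)
```

Both runs report the same counts, which mirrors why re-running the MERGE queries does not duplicate organisms, enzymes or relationships, whereas CREATE would.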