I am querying data from a MS SQL Server database and saving it into CSVs. With those CSVs,
I am modelling the data in Neo4j. But the MS SQL database updates regularly, so I also want to update the Neo4j data on a regular basis. Neo4j has two types of nodes: 1. X and 2. Y. The following indexes and query were used to model the data:
CREATE INDEX ON :X(X_Number, X_Description,X_Type)
CREATE INDEX ON :Y(Y_Number, Y_Name)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///CR_n_Feature_new.csv" AS line
MERGE (x:X {
  X_Number: line.X_num,
  X_Description: line.X_txt,
  X_Type: line.X_Type
})
MERGE (y:Y {
  Y_Number: line.Y_number,
  Y_Name: line.Y_name
})
MERGE (y)-[:delivered_by]->(x)
Now there are two kinds of updates:
There may be new X and Y nodes, which can be taken care of by the MERGE command.
But there can also be modifications to the properties of already created X and Y nodes. For example, a node X {X_Number: 1, X_Description: "abc", X_Type: "z"} may, in the updated data, change to X {X_Number: 1, X_Description: "def", X_Type: "y"}.
So I don't want to create a new node for X_Number: 1, but just want to update the existing node's properties, X_Description and X_Type.
You could just re-write your query to support new nodes and changes to existing nodes by merging only on the X_Number or Y_Number attributes.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///CR_n_Feature_new.csv" AS line
MERGE (x:X {X_Number: line.X_num})
SET x.X_Description = line.X_txt, x.X_Type = line.X_Type
MERGE (y:Y {Y_Number: line.Y_number})
SET y.Y_Name = line.Y_name
MERGE (y)-[:delivered_by]->(x)
This way the MERGE statements will always match the existing X and Y nodes based on the X_Number and Y_Number attributes, which presumably are immutable. Then the existing X_Description, X_Type and Y_Name attributes will be updated with the newer values.
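If you want to treat creations and updates differently, MERGE also supports ON CREATE SET and ON MATCH SET. A minimal sketch, assuming the same CSV columns as above:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///CR_n_Feature_new.csv" AS line
MERGE (x:X {X_Number: line.X_num})
  ON CREATE SET x.X_Description = line.X_txt, x.X_Type = line.X_Type
  ON MATCH SET x.X_Description = line.X_txt, x.X_Type = line.X_Type
MERGE (y:Y {Y_Number: line.Y_number})
  ON CREATE SET y.Y_Name = line.Y_name
  ON MATCH SET y.Y_Name = line.Y_name
MERGE (y)-[:delivered_by]->(x)
Here both branches set the same properties, which is equivalent to the plain SET above; the split only becomes useful if you later want to, say, stamp a created or updated timestamp property in one branch but not the other.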
I have an ETL job that runs daily, uses bookmarks and writes the increment to an output S3 bucket. The output bucket is partitioned by one key.
Now, I want to have just one file per partition. I can achieve that on the first run of the job as follows:
datasource = datasource.repartition(1)
glueContext.write_dynamic_frame.from_options(
    connection_type = "s3",
    frame = datasource,
    connection_options = {"path": output_path, "partitionKeys": ["a_key"]},
    format = "glueparquet",
    format_options = {"compression": "gzip"},
    transformation_ctx = "write_dynamic_frame")
What I can't figure out is how to write and compress my increment with the files that are already in my output bucket/partition.
One option would be to read the table from the previous day and merge it with the increment, but that seems like overkill.
Any smarter ideas?
I was running into the same issue, and discovered that the compression setting goes in the connection_options:
connection_options = {"path": file_path, "compression": "gzip", "partitionKeys": ["a_key"]}
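Applied to the write call from the question, that would look roughly like this (a sketch only, reusing the question's glueContext, datasource, output_path and partition key; whether gzip is the right codec for your parquet output is for you to decide):
datasource = datasource.repartition(1)
glueContext.write_dynamic_frame.from_options(
    connection_type = "s3",
    frame = datasource,
    # per the answer above, compression goes in connection_options, not format_options
    connection_options = {
        "path": output_path,
        "compression": "gzip",
        "partitionKeys": ["a_key"],
    },
    format = "glueparquet",
    transformation_ctx = "write_dynamic_frame")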
I have some issues importing a large set of relationships (2M records) from a CSV file.
I'm running Neo4j 2.1.7 on Mac OSX (10.9.5), 16GB RAM.
The file has the following schema:
user_id, shop_id
1,230
1,458
1,783
2,942
2,123
etc.
As mentioned above - it contains about 2M records (relationships).
Here is the query I'm running using the browser UI (I was also trying to do the same with a REST call):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
This query takes ages to run, about 800 seconds. I do have indexes on :User(id) and :Shop(id). Created them with:
CREATE INDEX ON :User(id)
CREATE INDEX ON :Shop(id)
Any ideas on how to increase the performance?
Thanks
Remove the space before shop_id.
Try to run:
LOAD CSV WITH HEADERS FROM "file:test.csv" AS r
return r.user_id, r.shop_id limit 10;
to see if it is loaded correctly. With your original data, r.shop_id is null because the column name is actually " shop_id" (with a leading space).
Also make sure that you didn't store the ids as numeric values in the first place; if you did, you have to use toInt(r.shop_id).
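For example, if the ids were stored as integers, the relationship import from the question would need the conversion on both sides (a sketch, using the same placeholder file path as in the question):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: toInt(relation.user_id)})
MATCH (shop:Shop {id: toInt(relation.shop_id)})
MERGE (user)-[:LIKES]->(shop)
Without the conversion, each MATCH compares a string from the CSV against an integer property and finds nothing.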
Try to profile your statement in the Neo4j Browser (2.2) or in neo4j-shell.
Remove the PERIODIC COMMIT for that purpose and limit the rows:
PROFILE
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
WITH relation LIMIT 10000
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
I want to move one table with a self-reference from PostgreSQL to Neo4j.
PostgreSQL:
COPY (SELECT * FROM "public".empbase) TO '/tmp/empbase.csv' WITH CSV header;
Result:
$ cat /tmp/empbase.csv | head
e_id,e_name,e_bossid
1,emp_no_1,
2,emp_no_2,
3,emp_no_3,
4,emp_no_4,
5,emp_no_5,3
6,emp_no_6,2
7,emp_no_7,3
8,emp_no_8,1
9,emp_no_9,4
Size:
$ du -h /tmp/empbase.csv
631M /tmp/empbase.csv
I import the data into Neo4j with:
neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
> CREATE (:EmpBase:_EmpBase { neo_eb_id: toInt(row.e_id),
> neo_eb_bossID: toInt(row.e_bossid),
> neo_eb_name: row.e_name});
and this works fine:
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 20505764
Properties set: 61517288
Labels added: 41011528
846284 ms
The Neo4j console says:
Location:
/home/neo4j/data/graph.db
Size:
5.54 GiB
But then I want to proceed with the relationship that each employee has a boss, i.e. a simple emp -> bossid SELF reference.
Now I do it like this:
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
But this runs for 5-6 hours and breaks in the end with system failures; it freezes the system.
I think this might be terribly wrong.
1. Am I doing something wrong, or is it a bug in Neo4j?
2. Why do I now get 5.5 GB out of a 631 MB CSV?
EDIT1:
$ du -h /home/neo4j/data/
20K /home/neo4j/data/graph.db/index
899M /home/neo4j/data/graph.db/schema/index/lucene/1
899M /home/neo4j/data/graph.db/schema/index/lucene
899M /home/neo4j/data/graph.db/schema/index
27M /home/neo4j/data/graph.db/schema/label/lucene
27M /home/neo4j/data/graph.db/schema/label
925M /home/neo4j/data/graph.db/schema
6,5G /home/neo4j/data/graph.db
6,5G /home/neo4j/data/
SOLUTION:
Wait until the :schema in the console says ONLINE, not POPULATING
Change the log size in the config file
Add USING PERIODIC COMMIT 1000 to the second CSV import
Index on only one label
Only match on one label: MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
Did you create the index: CREATE INDEX ON :EmpBase(neo_eb_id);?
Then wait for the index to come online (check :schema in the browser),
OR, if it is a unique id: CREATE CONSTRAINT ON (e:EmpBase) ASSERT e.neo_eb_id IS UNIQUE;
Otherwise your match will scan all nodes in the database for each MATCH.
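Putting that together with the points from the SOLUTION above, the relationship import would look roughly like this (a sketch, assuming the :EmpBase(neo_eb_id) index shows as ONLINE in :schema before you run it):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
Matching on the single :EmpBase label lets the planner use the index, the periodic commit keeps transaction state from growing unbounded over the ~20M rows, and rows with an empty e_bossid simply match no manager and create no relationship.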
For your second question, I think it's the transaction log files;
you can limit their size in conf/neo4j.properties with
keep_logical_logs=100M size
The actual node and property store files shouldn't be that large. Also, you don't have to store the boss id in the database; that's actually handled by the relationship :)
I have 2 CSV files which I want to convert into a Neo4j database. They look like this:
first file:
name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
second file:
name,motif
Aminomonas paucivorans,GGAGNNNNNGGC
Aminomonas paucivorans,GGAGNNNNNGGC
Bacillus cellulosilyticus,CCCNNNNNCTC
As you can see, the common factor is the name of the organism. Each organism will have a few enzymes and each enzyme will have one motif. Motifs can be the same between enzymes. I used the following statements to create my database:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file1.csv" AS csvLine
MATCH (o:Organism { name: csvLine.name}),(e:Enzyme { name: csvLine.enzyme})
CREATE (o)-[:has_enzyme]->(e) //or maybe CREATE UNIQUE?
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file2.csv" AS csvLine
MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})
CREATE (o)-[:has_motif]->(m) //or maybe CREATE UNIQUE?
This gives me errors on the very first line at USING PERIODIC COMMIT which says Invalid input 'S': expected. If I get rid of it, the next error I get is WITH is required between CREATE and LOAD CSV (line 6, column 1)
"MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})". I googled this issue, which led me to this answer. I tried the answer given there (refreshing the browser cache) but the problem persists. What am I doing wrong here? Is the query correct? Is there another solution to this issue? Any help will be greatly appreciated.
Your queries have two issues at once:
You can't refer to a local file just with "file1.csv", because neo4j is expecting a URL
You're using MATCH in cases where the data may not already exist; you need to use MERGE there instead, which basically acts like the CREATE UNIQUE comment you added.
I don't know what the source of your specific error message is, but as written it doesn't look like these queries could possibly work. Here are your queries reformulated so that they will work (I tested them on my machine with your CSV samples):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/myuser/tmp/file1.csv" AS csvLine
MERGE (o:Organism { name: coalesce(csvLine.name, "No Name")})
MERGE (e:Enzyme { name: csvLine.enzyme})
MERGE (o)-[:has_enzyme]->(e);
Notice the three MERGE statements here (MERGE basically does a MATCH plus a CREATE if the pattern doesn't already exist), and the fact that I've used a file: URL.
The second query gets formulated basically the same way:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/myuser/tmp/file2.csv" AS csvLine
MERGE (o:Organism { name: coalesce(csvLine.name, "No Name")})
MERGE (m:Motif { name: csvLine.motif})
MERGE (o)-[:has_motif]->(m);
EDIT: I added coalesce on the Organism's name property. If you have null values for name in the CSV, the query would otherwise fail. Coalesce guarantees that if csvLine.name is null, you'll get back "No Name" instead.
Hi there, I have some SQL tables and I want to convert these into a "Drupal node format", but I don't know how to do it. Does someone know at least which tables I have to write to in order to have a full node with all the keys etc.?
I will give an example :
I have theses Objects :
Anime
field animeID
field animeName
Producer
field producerID
field producerName
AnimeProducers
field animeID
field producerID
I have used the CCK module and I have created in my Drupal site a new content type Anime and a new data type Producer that exists in an Anime object.
How can I insert all the data from my simple MySQL database into Drupal?
Sorry for the long post; I wanted to give you the chance to understand my problem.
Thanks in advance for taking the time to read my post.
You can use either the Feeds module to import flat CSV files, or there is a module called Migrate that seems promising (albeit pretty intense). Both work on Drupal 6 or 7.
I think you can export a CSV from your SQL database and then use
http://drupal.org/project/node_import
to import this CSV data into nodes. I don't know if there is another non-programmatic way.
The main tables for node property data are node and node_revision; have a look at the columns in those and it should be fairly obvious what needs to go where.
As far as fields go, their storage is predictable, so you would be able to automate an import (although I don't envy you having to write that!). If your field is called 'field_anime', its data will live in two tables: field_data_field_anime and field_revision_field_anime, which are keyed by the entity ID (in this case the node ID), entity type (in this case 'node' itself) and bundle (in this case the name of your node type). You should keep both tables up to date to ensure the revision system functions correctly.
The simplest way to do it though is with PHP and the node API functions:
/* This is for a single node, obviously you'd want to loop through your custom SQL data here */
$node = new stdClass;
$node->type = 'my_type';
$node->title = 'Title';
node_object_prepare($node);
// Fields
$node->field_anime[LANGUAGE_NONE] = array(0 => array('value' => $value_for_field));
$node->field_producer[LANGUAGE_NONE] = array(0 => array('value' => $value_for_field));
// And so on...
// Finally save the node
node_save($node);
If you use this method, Drupal will handle a lot of the messy stuff for you (for example, updating the taxonomy_index table automatically when adding a taxonomy term field to a node).