I have some issues importing a large set of relationships (2M records) from a CSV file.
I'm running Neo4j 2.1.7 on Mac OSX (10.9.5), 16GB RAM.
The file has the following schema:
user_id, shop_id
1,230
1,458
1,783
2,942
2,123
etc.
As mentioned above - it contains about 2M records (relationships).
Here is the query I'm running using the browser UI (I was also trying to do the same with a REST call):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
This query takes ages to run, about 800 seconds. I do have indexes on :User(id) and :Shop(id). Created them with:
CREATE INDEX ON :User(id)
CREATE INDEX ON :Shop(id)
Any ideas on how to increase the performance?
Thanks
Remove the space before shop_id in the header. Then run:
LOAD CSV WITH HEADERS FROM "file:test.csv" AS r
RETURN r.user_id, r.shop_id LIMIT 10;
to see if the file is loaded correctly. With your original data, r.shop_id is null because the column name is actually " shop_id" (with a leading space).
Also make sure you didn't store the ids as numeric values in the first place; if you did, you have to convert with toInt(r.shop_id).
Try profiling your statement in the Neo4j Browser (2.2) or in neo4j-shell.
Remove the PERIODIC COMMIT for that purpose and limit the rows:
PROFILE
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
WITH relation LIMIT 10000
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
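Putting the two answers together, a corrected import might look like this (a sketch: it assumes the stray space in the header has been removed and that the node ids were stored as integers; toInt is the 2.x function, later versions use toInteger):

```
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: toInt(relation.user_id)})
MATCH (shop:Shop {id: toInt(relation.shop_id)})
MERGE (user)-[:LIKES]->(shop)
```

With the indexes in place and matching value types, both MATCH clauses become index lookups instead of label scans.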
I have a data frame made up of 3 columns named INTERNAL_ID, NT_CLONOTYPE and SAMPLE_ID. I need to write a script in R that will transfer this data into the appropriate 3 columns with the exact names in a MySQL table. However, the table has more than 3 columns, say 5 (INTERNAL_ID, COUNT, NT_CLONOTYPE, AA_CLONOTYPE, and SAMPLE_ID). The MySQL table already exists and may or may not include preexisting rows of data.
I'm using the dbx and RMariaDB libraries in R. I've been able to connect to the MySQL database with dbxConnect(), but when I try to run dbxUpsert() it fails:
-----
conx <- dbxConnect(adapter = "mysql", dbname = "TCR_DB", host = "127.0.0.1", user = "xxxxx", password = "xxxxxxx")
table <- "TCR"
records <- newdf #dataframe previously created with the update data.
dbxUpsert(conx, table, records, where_cols = c("INTERNAL_ID"))
dbxDisconnect(conx)
I expect to obtain an updated MySQL table with the new rows, which may or may not have NULL entries in the columns not covered by the data frame.
Ex.
INTERNAL_ID  COUNT  NT_CLONOTYPE  AA_CLONOTYPE  SAMPLE_ID
Pxxxxxx.01   NULL   CTTGGAACTG    NULL          PMA.01
The connection and disconnection both run fine, but instead of that output I get the following error:
Error in .local(conn, statement, ...) :
could not run statement: Field 'COUNT' doesn't have a default value
I suspect it's because the number of columns in the data frame and in the table don't match, but I'm not sure. If so, how can I get around this?
I figured it out. I changed the table definition so that COUNT defaults to NULL. This allowed the upsert to proceed by ignoring COUNT.
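For reference, that fix can be applied with ALTER TABLE. The INT type below is an assumption; match it to the column's actual type:

```sql
-- Allow COUNT to be omitted on insert by giving it a NULL default
ALTER TABLE TCR MODIFY COUNT INT NULL DEFAULT NULL;
```

After this, rows upserted without a COUNT value simply get NULL in that column.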
I have been using Neo4j for quite a while now. I ran this query earlier before my computer crashed 7 days ago and somehow unable to run it now. I need to create a graph database out of a csv of bank transactions. The original dataset has around 5 million rows and has around 60 columns.
This is the query I used, starting from 'Export CSV from real data' demo by Nicole White:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID})
SET transaction.base_currency_amount = toInteger(line.AMOUNT_IN_BASE_CURRENCY),
    transaction.base_currency = line.BASE_CURRENCY,
    transaction.cd_code = line.CREDIT_DEBIT_CODE,
    transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
    transaction.instrument = line.INSTRUMENT,
    transaction.region = line.REGION,
    transaction.scope = line.SCOPE,
    transaction.COUNTRY_RISK_SCORE = line.COUNTRY_RISK_SCORE,
    transaction.year = toInteger(date[2]),
    transaction.month = toInteger(date[1]),
    transaction.day = toInteger(date[0]);
I tried:
Using LIMIT 0 before running the query, as per Michael Hunger's suggestion in a post about loading large datasets.
Using a single MERGE per statement (this is the first MERGE; there are 4 more to follow), as suggested by Michael in another post.
CALL apoc.periodic.iterate and apoc.cypher.parallel, but they don't work with LOAD CSV (they seem to work only with MERGE and CREATE queries that don't use LOAD CSV).
I get following error with CALL apoc.periodic.iterate(""):
Neo.ClientError.Statement.SyntaxError: Invalid input 'f': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or ')' (line 2, column 29 (offset: 57))
Increasing the max heap size to 16G, as my laptop has 16GB of RAM. (It's hard to even write this post: I tried running again with PROFILE and it has now been running for an hour.)
I need help loading this 5-million-row dataset. Any help would be highly appreciated. Thanks in advance! I am using Neo4j 3.5.1 on a PC.
MOST IMPORTANT: Create Index/Constraint on the key property.
CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;
Don't set the max heap size to full of system RAM. Set it to 50%.
Try ON CREATE SET instead of SET.
You can also use apoc.periodic.iterate to load the data, but USING PERIODIC COMMIT is also fine.
Importantly, if you are using USING PERIODIC COMMIT and the query is not finishing or is running out of memory, it is likely because of DISTINCT. Avoid DISTINCT; duplicate transactions are handled by MERGE anyway.
NOTE: if you use apoc.periodic.iterate to MERGE nodes/relationships with the parameter parallel=true, it fails with a NullPointerException, so use it carefully.
Questioner edit: Removing DISTINCT on the 3rd line for the Transaction node and re-running the query worked!
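Putting the answer's points together, the corrected load might look like this (a sketch; the changes are that the constraint is created first, DISTINCT is dropped, and SET becomes ON CREATE SET):

```
CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id: line.TRANSACTION_ID})
ON CREATE SET transaction.base_currency_amount = toInteger(line.AMOUNT_IN_BASE_CURRENCY),
              // ...the other property assignments from the original query, unchanged...
              transaction.year = toInteger(date[2]),
              transaction.month = toInteger(date[1]),
              transaction.day = toInteger(date[0]);
```

With the uniqueness constraint in place, each MERGE is an index lookup, and ON CREATE SET skips the property writes for transactions that already exist.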
I am doing something like this:
data = Model.where('something="something"')
random_data = data.rand(100..200)
returns:
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Once I get this random data, I need to iterate through that data, like this:
random_data.each do |rd|
...
I know there's a way to fetch random data in MySQL, but I need to pick random data about 400 times, so I think loading the data from the database once and picking random records 400 times is more efficient than running the query 400 times against MySQL.
But - how to get rid of that error?
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Thank you in advance
I would add the following scope to the model (depends on the database you are using):
# to model/model.rb
# 'RANDOM' works with postgresql and sqlite, whereas mysql uses 'RAND'
scope :random, -> { order('RAND()') }
Then the following loads a random number of objects (between 200 and 399; the three-dot range excludes the upper bound) in one query:
Model.random.limit(rand(200...400))
If you really want to do that in Rails and not in the database, then load all records and use sample:
Model.all.sample(rand(200..400))
But that is likely to be slower (depending on the number of entries in the database), because Rails loads all records from the database and instantiates them, which can take a lot of memory.
It really depends how much effort you want to put into optimizing this, because there's more than one solution. Here are 2 options.
Something simple is to use ORDER BY RAND() LIMIT 400 to randomly select 400 items.
Alternatively, just select everything under the moon and then use Ruby to randomly pick 400 out of the total result set, ex:
data = Model.where(something: 'something').to_a # to_a executes the query and returns an Array
400.times do
data.sample # returns a random model
end
I wouldn't recommend the second method, but it should work.
Another way, which is not DB specific is :
def self.random_record
  # rand(count) returns 0..count-1, so shift by 1; assumes contiguous ids starting at 1
  self.where('something = ? and id = ?', "something", rand(self.count) + 1)
end
The only catch here is that 2 queries are performed. self.count issues one query - SELECT COUNT(*) FROM models - and the other is your actual query to fetch the random record.
Now suppose you want n random records. Then write it like:
def self.random_records(n)
  count = self.count
  # again assumes contiguous ids starting at 1
  rand_ids = Array.new(n) { rand(1..count) }
  self.where('something = ? and id IN (?)', "something", rand_ids)
end
Use data.sample(rand(100..200)).
For more info on why rand doesn't work on a relation, read here: https://rails.lighthouseapp.com/projects/8994-ruby-on-rails/tickets/4555
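The load-once-then-sample idea from the question can be sketched in plain Ruby. The array below stands in for the result of Model.where(...).to_a (an assumption, since no database is available here); any Array of records behaves the same way:

```ruby
# Stand-in for records loaded once from the database
data = (1..1_000).to_a

# Pick a random batch of 100..200 records, 400 times, without re-querying
batches = 400.times.map { data.sample(rand(100..200)) }
```

Array#sample(n) never repeats an element within one batch, which matches the behavior of the ORDER BY RAND() LIMIT n approach on the MySQL side.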
I want to move one table with self reference from PostgreSQL to Neo4j.
PostgreSQL:
COPY (SELECT * FROM "public".empbase) TO '/tmp/empbase.csv' WITH CSV header;
Result:
$ cat /tmp/empbase.csv | head
e_id,e_name,e_bossid
1,emp_no_1,
2,emp_no_2,
3,emp_no_3,
4,emp_no_4,
5,emp_no_5,3
6,emp_no_6,2
7,emp_no_7,3
8,emp_no_8,1
9,emp_no_9,4
Size:
$ du -h /tmp/empbase.csv
631M /tmp/empbase.csv
I import data to neo4j with:
neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
> CREATE (:EmpBase:_EmpBase { neo_eb_id: toInt(row.e_id),
> neo_eb_bossID: toInt(row.e_bossid),
> neo_eb_name: row.e_name});
and this works fine:
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 20505764
Properties set: 61517288
Labels added: 41011528
846284 ms
The Neo4j console says:
Location:
/home/neo4j/data/graph.db
Size:
5.54 GiB
But then I want to proceed with the relationship that each emp has a boss - a simple emp->bossid SELF reference.
Now I do it like this:
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
But this runs for 5-6 hours and fails in the end - it freezes the system.
I think something might be terribly wrong here.
1. Am I doing something wrong, or is it a bug in Neo4j?
2. Why does a 631 MB csv turn into 5.5 GB?
EDIT1:
$ du -h /home/neo4j/data/
20K /home/neo4j/data/graph.db/index
899M /home/neo4j/data/graph.db/schema/index/lucene/1
899M /home/neo4j/data/graph.db/schema/index/lucene
899M /home/neo4j/data/graph.db/schema/index
27M /home/neo4j/data/graph.db/schema/label/lucene
27M /home/neo4j/data/graph.db/schema/label
925M /home/neo4j/data/graph.db/schema
6,5G /home/neo4j/data/graph.db
6,5G /home/neo4j/data/
SOLUTION:
1. Wait until the :schema in console says ONLINE, not POPULATING
2. Change the log size in the config file
3. Add USING PERIODIC COMMIT 1000 to the second csv import
4. Index only on the label
5. Only match on one label: MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
Did you create the index: CREATE INDEX ON :EmpBase(neo_eb_id); ?
Then wait for the index to come online (check :schema in the browser).
Or, if it is a unique id: CREATE CONSTRAINT ON (e:EmpBase) ASSERT e.neo_eb_id IS UNIQUE;
Otherwise your MATCH will scan all nodes in the database for each row.
For your second question, I think it's the transaction log files;
you can limit their size in conf/neo4j.properties with
keep_logical_logs=100M
The actual nodes and properties files shouldn't be that large. Also you don't have to store the boss-id in the database. That's actually handled by the relationship :)
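Combining the points above, the relationship import becomes (a sketch; it assumes the index on :EmpBase(neo_eb_id) is already ONLINE):

```
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
```

Rows with an empty e_bossid (the top-level employees) won't match a manager, so no relationship is created for them.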
Due to an older version of MySQL I'm having to use some pretty outdated methods to get things done.
At the moment I am trying to copy similar rows to another table based on a few distinct columns. The table holddups will be taking data from assets where the SKU and Description match those of one in holdkey. The command I'm running is:
INSERT INTO holddups
SELECT *
FROM assets, holdkey
WHERE assets.SKU = holdkey.SKU
AND assets.Description = holdkey.Description
And the error I'm getting is:
#1136 - Column count doesn't match value count at row 1
I hope this is enough to sort this all out, but if not feel free to ask more.
Selecting just * will take all columns from assets and holdkey and try to put them into holddups. But holddups does not have that many columns. Using assets.* will take only the columns of assets, and that is what you want, right?
INSERT INTO holddups
SELECT assets.*
FROM assets, holdkey
WHERE assets.SKU = holdkey.SKU
AND assets.Description = holdkey.Description
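The implicit comma join can also be written with explicit JOIN syntax - equivalent, but it makes the join condition stand out from any filtering:

```sql
INSERT INTO holddups
SELECT assets.*
FROM assets
JOIN holdkey
  ON assets.SKU = holdkey.SKU
 AND assets.Description = holdkey.Description;
```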