Creating unique node and relationship NEO4J over huge dataset

Creating unique node and relationship NEO4J over huge dataset - csv

My question is very similar to this one:
How to create unique nodes and relationships by csv file imported in neo4j?
I have a textfile with around 2.5 million lines that has two columns, each one being node ids:
1234 345
1234 568
345 984
... ...
Each line represents a relationship (so 2.5 million relationships): first_column nodeid-> FOLLOWS -> second_column nodeid. There are around 80,000 unique nodes in this file.
Based on the link above, I did:
USING PERIODIC COMMIT 1000
LOAD CSV FROM 'file:///home/user_name/Desktop/bigfile.csv' AS line FIELDTERMINATOR ' '
MERGE (n:Userid { id: toInt(line[0]) })
WITH line, n
MERGE (m:Userid { id: toInt(line[1]) })
WITH m,n
MERGE (n)-[:FOLLOWS]->(m)
I am assuming this code
creates node n or m if it doesn't exist (and finds it if it does exist), and creates a relationship from n to m.
If n or m exists and already has many other edges (relationships) pointing to and from other nodes, this would just add another edge from n to m (not creating a brand new node when it already exists)
My main question is I am wondering how to make this process faster.
This is being done on Ubuntu, and I changed the values from 512 to 2048 MB for memory in the conf/neo4j-wrapper.conf file. (maximum I can increase on my Virtual Machine)
Should I try doing the Import tool?
Based on example on this website, neo4j.com/developer/guide-import-csv/ under "Super Fast Batch Importer For Huge Datasets",
./bin/neo4j-import --into mydatabase.db --id-type INTEGER \
--nodes allnodes.csv \
--delimiter " " \
--relationships:FOLLOWS bigfile.csv
And to do this, I need to reformat files so that:
allnodes.csv shows
userID:ID(Userid)
1234
5678
...
And bigfile.csv shows
:START_ID(Userid) :END_ID(Userid)
1234 345
1234 568
345 984
*Two columns delimited by space*
And when I run this import, I get this error:
Input error: Expected '--nodes' to have at least 1 valid item, but had 0 []
Caused by:Expected '--nodes' to have at least 1 valid item, but had 0 []
java.lang.IllegalArgumentException: Expected '--nodes' to have at least 1 valid item, but had 0 []
How do I fix this error? And for the csv files, do I put them in same folder where I run this command (neo4j folder)?

Your command line probably has the wrong paths for your two CSV files.

Related

Convert 10000-element Vector{BitVector} into a matrix of 1 and 0 and save it in Julia

I have a "10000-element Vector{BitVector}", each of those vector has a fixed length of 100 and I just want to save it into a csv file of 0 and 1 that is all. When I type my variable I almost see the kind of output I want in my csv file.
Amongst the many things I have tried, the closest to success was:
CSV.write("\\Folder\\file.csv", Tables.table(variable), writeheader=false)
But my csv file has a 10000 rows and 1 column where each row is something like Bool[0,1,0,0,1,1,0,1,0].

This is not the most efficient option, but I hope it should be good enough for you and it is relatively simple and does not require any packages:
open("out.csv", "w") do io
foreach(v -> println(io, join(Int8.(v), ',')), variable)
end
(the Int8 part is needed to make sure 1 and 0 are printed and not true and false)

LMDB: How to interpret output from mdb_stat and mdb_dump utilities

I have a functional LMDB that, for test purposes, currently contains only 21 key / value records. I've successfully tested inserting and reading records, and I'm comfortable with the database working as intended.
However, when I use the mdb_stat and mdb_dump utilities, I see the following output, respectively:
Status of Main DB
Tree depth: 1
Branch pages: 0
Leaf pages: 1
Overflow pages: 0
Entries: 1
VERSION=3
format=bytevalue
type=btree
mapsize=1073741824
maxreaders=126
db_pagesize=4096
HEADER=END
4d65737361676573
000000000000010000000000000000000100000000000000d81e0000000000001500000000000000ba1d000000000000
DATA=END
In particular, why would mdb_stat indicate only one entry when I have 21? Moreover, each entry comprises 1024 x 300 values of five bytes per value. mdb_dump obviously doesn't show anywhere near the 1,536,000 bytes I'd expect to see, yet the values I mdb_put() and mdb_get() on the fly are correct. Anyone know what's going on?

The relationship between an operating system's directory and an LMDB environment's data.mdb and lock.mdb files is one-to-one.
If the LMDB environment (in the OS directory) has more than one database, then the environment also contains a separate LMDB database containing all of its named databases.
The mdb_stat and mdb_dump utilities appear to contain minimal logic, so when they are fed a given directory via the command line, they appear to produce results only for the database storing database names and not the database(s) storing the actual data of interest.

4d65737361676573 is the Ascii for "Messages", which is the name of table ("sub-db" in lmdb terminology) storing the actual data in your case.
The mdb_dump command only dumps the main db by default. You can use the -s option to dump that sub-db, i.e.
mdb_dump -s Messages
or you can use the -a option to dump all the sub-dbs.

Since you are using a sub-database, the number of entries in the main database corresponds to the number of sub-databases you've created (ie just 1).
Try using mdb_stat -a. This will show you a break-down of all the sub-databases (as well as the main DB). In this breakdown it will list the number of entries for each sub-database. Here you should see your 21 entries.

splitting CSV file by columns

I have a really huge CSV files. There are about 1700 columns and 40000 rows like below:
x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)
I need to split this CSV file into multiple files which contain a less number of columns like:
# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)
# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1700
// (about 40000 more rows below)
#file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)
Is there any program or library doing this?
I've googled for it , but programs that I found only split a file by rows not by columns.
Or which language could I use to do this efficiently?
I can use R, shell script, Python, C/C++, Java

A one-line solution for your example data and desired output:
cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1004 huge.csv > file2.csv
cut -d, -f 1005- huge.csv > file3.csv
The cut program is available on most POSIX platforms and is part of GNU Core Utilities. There is also a Windows version.
update in python, since the OP asked for a program in an acceptable language:
# python 3 (or python 2, if you must)
import csv
import fileinput
output_specifications = ( # csv file name, selector function
('file1.csv', slice(3)),
('file2.csv', slice(3, 1003)),
('file3.csv', slice(1003, 1703)),
)
output_row_writers = [
(
csv.writer(open(file_name, 'wb'), quoting=csv.QUOTE_MINIMAL).writerow,
selector,
) for file_name, selector in output_specifications
]
reader = csv.reader(fileinput.input())
for row in reader:
for row_writer, selector in output_row_writers:
row_writer(row[selector])
This works with the sample data given and can be called with the input.csv as an argument or by piping from stdin.

Use a small python script like:
fin = 'file_in.csv'
fout1 = 'file_out1.csv'
fout1_fd = open(fout1,'w')
...
lines = []
with open(fin) as fin_fd:
lines = fin_fd.read().split('\n')
for l in lines:
l_arr = l.split(',')
fout1_fd.write(','.join(l_arr[0:3]))
fout1_fd.write('\n')
...
...
fout1_fd.close()
...

You can open the file in Microsoft Excel, delete the extra columns, save as csv for file #1. Repeat the same procedure for the other 2 tables.

I usually use open office ( or microsof excel in case you are using windows) to do that without writing any program and change the file and save it. Following are two useful links showing how to do that.
https://superuser.com/questions/407082/easiest-way-to-open-csv-with-commas-in-excel
http://office.microsoft.com/en-us/excel-help/import-or-export-text-txt-or-csv-files-HP010099725.aspx

Import csv file data to populate a Prolog knowledge base

I have a csv file example.csv which contains two columns with header var1 and var2.
I want to populate an initially empty Prolog knowledge base file import.pl with repeated facts, while each row of example.csv is treated same:
fact(A1, A2).
fact(B1, B2).
fact(C1, C2).
How can I code this in SWI-Prolog ?
EDIT, based on answer from #Shevliaskovic:
:- use_module(library(csv)).
import:-
csv_read_file('example.csv', Data, [functor(fact), separator(0';)]),
maplist(assert, Data).
When import. is run in console, we update the knowledge base exactly the way it is requested (except for the fact that the knowledge base is directly updated in memory, rather than doing this via a file and subsequent consult).
Check setof([X, Y], fact(X,Y), Z). :
Z = [['A1', 'A2'], ['B1', 'B2'], ['C1', 'C2'], [var1, var2]].

SWI Prolog has a built in process for this.
It is
csv_read_file(+File, -Rows)
Or you can add some options:
csv_read_file(+File, -Rows, +Options)
You can see it at the documentation. For more information
Here is the example that the documentation has:
Suppose we want to create a predicate table/6 from a CSV file that we
know contains 6 fields per record. This can be done using the code
below. Without the option arity(6), this would generate a predicate
table/N, where N is the number of fields per record in the data.
?- csv_read_file(File, Rows, [functor(table), arity(6)]),
maplist(assert, Rows).
For example:
If you have a File.csv that looks like:
A1 A2
B1 B2
C1 C2
You can import it to SWI like:
9 ?- csv_read_file('File.csv', Data).
The result would be:
Data = [row('A1', 'A2'), row('B1', 'B2'), row('C1', 'C2')].

How to merge two ipython notebooks correctly without getting json error?

I have tried:
cat file1.ipynb file2.ipynb > filecomplete.ipynb
since the notebooks are simply json files, but this gives me the error
Unreadable Notebook: Notebook does not appear to be JSON: '{\n "metadata": {'
I think these must be valid json files because file1 and file2 each load individually into nbviewer, and so I am not entirely sure what I am doing wrong.

This Python script concatenates all the notebooks named with a given prefix and present at the first level of a given folder. The resulting notebook is saved in the same folder under the name "compil_" + prefix + ".ipynb".
import json
import os
folder = "slides"
prefix = "quiz"
paths = [os.path.join(folder, name) for name in os.listdir(folder) if name.startswith(prefix) and name.endswith(".ipynb")]
result = json.loads(open(paths.pop(0), "r").read())
for path in paths:
result["worksheets"][0]["cells"].extend(json.loads(open(path, "r").read())["worksheets"][0]["cells"])
open(os.path.join(folder, "compil_%s.ipynb" % prefix), "w").write(json.dumps(result, indent = 1))
Warning: the metadata are those of the first notebook, and the cells those of the first worksheet only (which seems to contain all the cells, in my notebook at least).

Concatenating 2 object with some properties does not always yield object with the same properties. Here is a sequence of increasing number : 4 8 15 16 23 42, here is another one 1 2 3 4 5 6 7. The concatenation of the two is not strictly increasing :4 8 15 16 23 42 1 2 3 4 5 6 7. Same goes for Json.
You need to load json file using json lib and do the merge you want to do yourself. I suppose you "just" want to concatenate the cells, but maybe you want to concatenate worksheet; maybe you want to merge metadata.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Creating unique node and relationship NEO4J over huge dataset - csv

Your command line probably has the wrong paths for your two CSV files.

Related

Convert 10000-element Vector{BitVector} into a matrix of 1 and 0 and save it in Julia

LMDB: How to interpret output from mdb_stat and mdb_dump utilities

splitting CSV file by columns

Import csv file data to populate a Prolog knowledge base

How to merge two ipython notebooks correctly without getting json error?

Categories

Resources