I'm new to Neo4j.
I'm trying to load CSV files using import.bat
from the shell
(on Windows).
I have 500,000 nodes
and 37 million relationships.
The import.bat is not working.
The command I run in the shell:
../neo4j-community-3.0.4/bin/neo4j-import \
--into ../neo4j-community-3.0.4/data/databases/graph.db \
--nodes:Chain import/entity.csv \
--relationships import/roles.csv
but I don't know where to keep the CSV files
or how to use import.bat with the shell.
I'm not sure I'm in the right place:
neo4j-sh(?)$
(I've looked at a lot of examples; for me it just doesn't work.)
I also tried to start the server from the command line, and that's not working either. This is what I did:
neo4j-community-3.0.4/bin/neo4j.bat start
I also want to work with indexes. I set up the index, but when I try to use it,
it doesn't work:
start n= node:Chain(entity_id='1') return n;
I set the properties:
node_keys_indexable=entity_id
and also:
node_auto_indexing=true
Without indexes this query:
match p = (a:Chain)-[:tsuma*1..3]->(b:Chain)
where a.entity_id= 1
return p;
(it tries to get one node expanded to 3 levels)
returned 49 relationships in 5 minutes.
That's far too slow!
Your import command looks correct. You point to the CSV files where they are, just as you point to the --into directory. If you're unsure, use fully qualified paths like /home/me/some-directory/entities.csv. What does it say? It's really hard to help without knowing the error.
What's the error?
Legacy indexes don't work well with the importer, so enabling legacy indexing afterwards doesn't index your data. Could you instead use a schema index (CREATE INDEX ...)?
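For example, a schema index on the label and property from the question could look like this (a sketch for Neo4j 3.x):

```cypher
CREATE INDEX ON :Chain(entity_id);

// Once the index is online, an anchored lookup can use it:
MATCH p = (a:Chain)-[:tsuma*1..3]->(b:Chain)
WHERE a.entity_id = 1
RETURN p;
```

Note that `a.entity_id = 1` (a number) and `a.entity_id = '1'` (a string) match different values; the index only helps when the stored type matches the query.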
I am trying to load multiple CSV files into a new db using the neo4j-admin import tool on a machine running Debian 11. To try to ensure there are no collisions in the ID fields, I've given every one of my node and relationship files its own ID space.
However, I'm getting this error:
org.neo4j.internal.batchimport.input.HeaderException: Group 'INVS' not found. Available groups are: [CUST]
This is super frustrating, as I know that the INVS group definitely exists. I've checked every file that uses that ID space and they all include it. Another strange thing is that there are more ID spaces than just CUST and INVS. It feels like it's trying to load relationships before it finishes loading all of the nodes for some reason.
Here is what I'm seeing when I search through my input files
$ grep -r -h "(INV" ./import | sort | uniq
:ID(INVS),total,:LABEL
:START_ID(INVS),:END_ID(CUST),:TYPE
:START_ID(INVS),:END_ID(ITEM),:TYPE
The top one is from my $NEO4J_HOME/import/nodes folder, the other two are in my $NEO4J_HOME/import/relationships folder.
Is there a nice solution to this? Or have I just stumbled upon a bug here?
Edit: here's the command I've been using from within my $NEO4J_HOME directory:
neo4j-admin import --force=true --high-io=true --skip-duplicate-nodes --nodes=import/nodes/\.* --relationships=import/relationships/\.*
Indeed, such a thing would be great, but I don't think it's possible at the moment.
Anyway, it doesn't seem to be a bug.
I suppose it may be intended behavior and/or a feature not yet implemented.
In fact, the documentation regarding the regular expression says:
Assume that you want to include a header and then multiple files that matches a pattern, e.g. containing numbers.
In this case a regular expression can be used
while the description of the --nodes option says:
Node CSV header and data. Multiple files will be
logically seen as one big file from the
perspective of the importer. The first line must
contain the header. Multiple data sources like
these can be specified in one import, where each
data source has its own header.
So it appears that neo4j-admin import treats --nodes=import/nodes/\.* as a single CSV file whose header is the first one it finds, hence the error.
With multiple separate --nodes arguments, by contrast, there is no problem.
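So a workaround is to pass each group's files explicitly rather than one catch-all pattern; roughly like this (the file names are hypothetical, following the layout described in the question):

```shell
neo4j-admin import --force=true --high-io=true --skip-duplicate-nodes \
  --nodes=import/nodes/customers.csv \
  --nodes=import/nodes/invoices.csv \
  --relationships=import/relationships/invoice_customer.csv \
  --relationships=import/relationships/invoice_item.csv
```

Each --nodes argument is then treated as its own data source with its own header.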
What I'm trying to import is a CSV file with phone calls, representing each phone number as a node and each call as an arrow between them.
The file is separated by pipes.
I have tried a first version:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
return line
limit 5
and worked as expected, one node (outgoing number) is created for each row.
After that I could test what I've done with a simple
Match (a) return a
So I've tried the following step is creating the second node of the call (receiver)
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
merge (b:line {number:COALESCE(line[2],"" )})
return line
limit 5
After I run this code I get no response (I'm using the browser GUI at localhost:7474/browser), and if I try to perform any query on this server I get no result either.
So again if I run
match (a) return a
nothing happens.
The only way I've found to bring it back to life is stopping the server and starting it again.
Any ideas?
It is possible that opening that big file twice causes the problem, because handling big files depends heavily on the operating system.
In any case, if you accidentally run it without the 'limit 5' clause, this can happen, since you would then be trying to load the 26 GB in a single transaction.
Since LOAD CSV is meant for medium-sized datasets, I recommend two solutions:
- using the neo4j-import tool, or
- splitting the file into smaller parts and using periodic commit to prevent out-of-memory situations and hangs, like this:
USING PERIODIC COMMIT 100000
LOAD CSV FROM ...
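Applied to the statement from the question, that would look something like this (a sketch; the file name, separator, and column indices are taken from the question, and the :CALLED relationship type is a hypothetical name for the call arrow the question describes):

```cypher
USING PERIODIC COMMIT 100000
LOAD CSV FROM 'file:///com.csv' AS line FIELDTERMINATOR '|'
MERGE (a:line {number: COALESCE(line[1], "")})
MERGE (b:line {number: COALESCE(line[2], "")})
MERGE (a)-[:CALLED]->(b);
```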
I have a MySQL dump file over 1 terabyte big. I need to extract the CREATE TABLE statements from it so I can provide the table definitions.
I purchased Hex Editor Neo, but I'm kind of disappointed that I did. I created the regex CREATE\s+TABLE(.|\s)*?(?=ENGINE=InnoDB) to extract the CREATE TABLE clauses, and it seems to work well when testing in Notepad++.
However, the ETA for extracting all instances is over 3 hours, and I can't even be sure it is doing it correctly. I don't even know if those lines can be exported when done.
Is there a quick way I can do this on my Ubuntu box using grep or something?
UPDATE
Ran this overnight and the output file came out blank. I created a smaller subset of the data and the procedure is still not working. It works in regex testers, but grep doesn't like it and yields empty output. Here is the command I'm running. I'd provide a sample, but I don't want to breach my client's confidentiality. It's just a standard MySQL dump.
grep -oP "CREATE\s+TABLE(.|\s)+?(?=ENGINE=InnoDB)" test.txt > plates_schema.txt
UPDATE
It seems to not match on new lines right after the CREATE\s+TABLE part.
You can use Perl for this task; it should be really fast.
Perl's .. (range) operator is stateful: it remembers state between evaluations.
What this means is: if your table definitions start with CREATE TABLE and end with something like ENGINE=InnoDB DEFAULT CHARSET=utf8; then the one-liner below will do what you want.
perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' INPUT_FILE.sql > OUTPUT_FILE.sql
EDIT:
Since you are working with a really large file and would probably like to know the progress, pv can give you that as well:
pv INPUT_FILE.sql | perl -ne 'print if /CREATE TABLE/../ENGINE=InnoDB/' > OUTPUT_FILE.sql
This will show you a progress bar, speed, and ETA.
You can use the following (note that plain grep -P matches line by line, so -z is needed to let the pattern span newlines; it treats the input as NUL-separated, which for a normal text file means one big record):
grep -zoP "CREATE\s+TABLE[\s\S]*?(?=ENGINE=InnoDB)" file.txt | tr '\0' '\n' > output.txt
If you can run mysqldump again, simply add --no-data.
Got it! Plain grep does not support matching across multiple lines. I found this question helpful, and I ended up using pcregrep instead.
pcregrep -M "CREATE\s+TABLE(.|\n|\s)+?(?=ENGINE=InnoDB)" test.txt > plates.schema.txt
I am trying to import a CSV into a column family in Cassandra using the following syntax:
copy data (id, time, vol, speed, occupancy, status, flags) from 'C:\Users\Foo\Documents\reallybig.csv' with header = true;
The CSV file is about 700 MB, and for some reason when I run this command in cqlsh I get the following error:
"Request did not complete within rpc_timeout."
What is going wrong? There are no errors in the CSV, and it seems to me that Cassandra should be able to suck this CSV in without a problem.
The Cassandra installation folder has a .yaml file with a setting for the RPC timeout value, rpc_timeout_in_ms; you can raise the value and restart Cassandra.
Another way is to cut your big CSV into multiple files and import them one by one.
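That setting lives in conf/cassandra.yaml; for example (the value below is just an illustration, not a recommendation):

```yaml
# conf/cassandra.yaml -- raise the RPC timeout, then restart Cassandra
rpc_timeout_in_ms: 60000
```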
This actually ended up being my own misinterpretation of COPY FROM, as the CSV was about 17 million rows. In this case the best option was to follow the bulk loader example and run sstableloader. However, the answer above would certainly work if I wanted to break the CSV into 17 different CSVs, which is an option.
I want to access every value (~10,000) in .txt files (~1,000) stored in directories (~20) in the most efficient manner possible. Once the data is grabbed, I would like to place it in an HTML string, in order to display an HTML page with a table for each file. Pseudocode:
fh = open('MyHtmlFile.html', 'w')
fh.write('''<head>Lots of tables</head><body>''')
for eachDirectory in rootFolder:
    for eachFile in eachDirectory:
        concat = ''
        for eachData in eachFile:
            concat = concat + '<tr><td>' + eachData + '</td></tr>'
        table = '''
        <table>%s</table>
        ''' % (concat)
        fh.write(table)
fh.write('''</body>''')
fh.close()
There must be a better way (I imagine this would take forever)! I've checked out set() and read a bit about hash tables, but I'd rather ask the experts before the hole is dug.
Thank you for your time!
/Karl
import os, os.path

# If you're on Python 2.5 or newer, use 'with'
# (needs 'from __future__ import with_statement' on 2.5)
fh = open('MyHtmlFile.html', 'w')
fh.write('<html>\r\n<head><title>Lots of tables</title></head>\r\n<body>\r\n')
# this will recursively descend the tree
for dirpath, dirnames, filenames in os.walk(rootFolder):
    for filename in filenames:
        # again, use 'with' on Python 2.5 or newer
        infile = open(os.path.join(dirpath, filename))
        # format each line as a table row, join the rows, then wrap them in a table
        # (on Python 2.6 or newer you could use 'str.format' instead)
        fh.write('<table>\r\n%s\r\n</table>' %
                 '\r\n'.join('<tr><td>%s</td></tr>' % line for line in infile))
        infile.close()
fh.write('\r\n</body></html>')
fh.close()
Why do you "imagine it would take forever"? You are reading the files and then printing them out; that's pretty much the only thing you have as a requirement, and that's all you're doing.
You could tweak the script in a couple of ways (read blocks instead of lines, adjust buffers, print instead of concatenating, etc.), but if you don't know how much time it takes now, how do you know what is better or worse?
Profile first, then find out whether the script is too slow, then find the place where it's slow, and only then optimise (or ask about optimisation).
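A minimal sketch of that profiling step with the standard library (build_page is a hypothetical stand-in for the HTML-building loop above):

```python
import cProfile
import io
import pstats

def build_page(rows):
    # hypothetical stand-in for the table-building loop above
    return '<table>%s</table>' % ''.join(
        '<tr><td>%s</td></tr>' % r for r in rows)

profiler = cProfile.Profile()
profiler.enable()
html = build_page(str(i) for i in range(10000))
profiler.disable()

# print the five functions with the most cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

Only after seeing which calls dominate the listing does it make sense to decide what (if anything) to optimise.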