Very weird behaviour in Neo4j LOAD CSV

I'm trying to import a CSV file of phone calls and represent it as a graph: phone numbers as nodes and each call as a relationship (an arrow) between them.
The file is pipe-separated.
I tried a first version:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
return line
limit 5
and it worked as expected: one node (the outgoing number) was created for each row.
After that I could test what I've done with a simple
Match (a) return a
So the next step was creating the second node of the call (the receiver), and I tried:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
merge (b:line {number:COALESCE(line[2],"" )})
return line
limit 5
After I run this code I get no response from the operation (I'm using the browser GUI at localhost:7474/browser), and if I try to run any other query on the server I get no result either.
So again if I run
match (a) return a
nothing happens.
The only way I've found to bring it back to life is stopping the server and starting it again.
Any ideas?

It is possible that opening that big file twice causes the problem, since handling big files like this depends heavily on the operating system.
In any case, it can also happen if you accidentally run the query without the LIMIT 5 clause, since you would then be trying to load the 26 GB in a single transaction.
Since LOAD CSV is meant for medium-sized datasets, I recommend two solutions:
- use the neo4j-import tool, or
- split the file into smaller parts and use periodic commit to prevent out-of-memory situations and hangs, like this:
USING PERIODIC COMMIT 100000
LOAD CSV FROM ...
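Putting that together with the query from the question, a minimal sketch might look like this (the label and property handling are copied from the question; the :CALLED relationship type is my own assumption, CREATE is used so that repeated calls between the same numbers each get their own arrow, and the RETURN ... LIMIT part is dropped because you want to process the whole file):

USING PERIODIC COMMIT 100000
LOAD CSV FROM 'file:///com.csv' AS line FIELDTERMINATOR '|'
MERGE (a:line {number: COALESCE(line[1], "")})
MERGE (b:line {number: COALESCE(line[2], "")})
CREATE (a)-[:CALLED]->(b)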

Related

Error in a line on big CSV imported to BigQuery

I'm trying to import a big CSV file to BigQuery (2.2 GB+). This is the error I get:
"Error while reading data, error message: CSV table references column position 33, but line starting at position:254025076 contains only 26 columns."
There are more errors in that file, and in that file only (I have one file per state). Usually I would skip the faulty lines, but then I would lose a lot of data.
What can be a good way to check and correct the errors in a file that big?
EDIT: This is what seems to happen in the file. It's one single line and it breaks between "Instituto" and "Butantan". As a result, BigQuery parses it as one line with 26 columns and another with nine columns. That repeats a lot.
As far as I've seen, it's just with Butantan, but sometimes the first word is described differently (I caught "Instituto" and "Fundação"). Can I correct that maybe with grep on the command line? If so, what syntax?
Actually, 2.2 GB is quite a manageable size. It can be quickly pre-processed with command-line tools or a simple Python script on any more or less modern laptop/desktop, or on a small VM in GCP.
You can start by looking at the problematic row:
head -n 254025076 your_file.csv | tail -n 1
If the problematic rows just have missing values in the last columns, you can use the "--allow_jagged_rows" CSV loading option.
Otherwise, I usually use a simple Python script like this:
import fileinput

def process_line(line):
    # your logic to fix the line goes here
    return line

if __name__ == '__main__':
    for line in fileinput.input():
        # lines keep their trailing newline, so avoid print() adding a second one
        print(process_line(line), end='')
and run it with:
cat your_file.csv | python3 preprocess.py > new_file.csv
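If the real problem is records broken across two physical lines (the "Instituto" / "Butantan" case above), a per-line fix is not enough, because the repair has to look at more than one line at a time. Here is a hedged variant that re-joins split records by counting delimiters; the expected column count, the plain-comma assumption (no quoted commas) and the rejoin.py file name are all guesses you would need to adapt:

import sys

EXPECTED_COLS = 34   # assumption: set this to the real number of columns in your schema
SEP = ','            # assumption: plain comma-separated values, no commas inside quotes

buffer = ''
for raw in sys.stdin:
    buffer += raw.rstrip('\n')
    # consider a record complete once it has enough separators; otherwise a value
    # contained a raw newline and the next physical line still belongs to this record
    if buffer.count(SEP) >= EXPECTED_COLS - 1:
        print(buffer)
        buffer = ''
if buffer:
    print(buffer)

Run it the same way: cat your_file.csv | python3 rejoin.py > new_file.csv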
UPDATE:
For newline characters inside a value, try BigQuery's "Allow quoted newlines" option.
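With the bq command-line tool, those two options would be passed roughly like this (mydataset.mytable is a placeholder, and the destination table is assumed to exist already or to have a schema supplied separately):

bq load --source_format=CSV \
    --allow_jagged_rows \
    --allow_quoted_newlines \
    mydataset.mytable new_file.csv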

Skip error rows from flat file in SSIS

I am trying to load data from a flat file. The file is around 2.5 GB in size and the row count is close to a billion. I am using a flat file source inside a DFT. A few rows inside the file do not follow the column pattern: for example, there is an extra delimiter, or a text qualifier appears as the value of one column. I want to skip those rows and load the rest, which have the correct format. I am using SSIS 2014. The flat file source inside the DFT is failing. I have set the AlwaysCheckForRowDelimiter property to false but it still does not work. Since the file is huge, opening and fixing it manually is not possible. Kindly help.
I have the same idea as Nick.McDermaid but I can maybe help you a bit more.
You can clean your file with a regular expression (in a script).
You just need to define a regex that matches lines with the number of delimiters you want; other lines should be deleted.
Here is a visual example executed in Notepad++:
[Notepad++ example screenshot]
Here is the pattern used for my example:
^[A-Z]*;[A-Z]*;[A-Z]*;[A-Z]*$
And the data sample:
AA;BB;CC;DD
AA;BB;CC;DD
AA;BB;CC;DD;EE
AA;BB;CC;DD
AA;BB;CC
AA;BB;CC;DD
AA;BB;CC;DD
You can try it online: https://regex101.com/r/PIYIcY/1
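The same filtering idea as a standalone script, if you would rather not open the file in an editor (the delimiter, expected column count and file names are assumptions to adapt):

EXPECTED_COLS = 4        # assumption: the number of columns a good row should have
SEP = ';'                # assumption: the file delimiter

with open('input.txt', encoding='utf-8') as src, open('clean.txt', 'w', encoding='utf-8') as dst:
    for row in src:
        # keep only rows with exactly the expected number of delimiters;
        # rows with extra or missing delimiters are dropped, as with the regex above
        if row.rstrip('\n').count(SEP) == EXPECTED_COLS - 1:
            dst.write(row)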
Regards,
Arnaud

How to use import.bat

I'm new to Neo4j.
I'm trying to load CSV files using import.bat from a shell (on Windows).
I have 500,000 nodes and 37 million relationships.
The import.bat is not working.
This is the command I run in the shell:
../neo4j-community-3.0.4/bin/neo4j-import \
--into ../neo4j-community-3.0.4/data/databases/graph.db \
--nodes:Chain import\entity.csv \
--relationships import\roles.csv
but I don't know where to keep the CSV files or how to use import.bat from the shell.
I'm not sure I'm in the right place:
neo4j-sh(?)$
(I looked at a lot of examples; for me it just does not work.)
I tried to start the server from the command line and it's not working. This is what I did:
neo4j-community-3.0.4/bin/neo4j.bat start
I want to work with indexes. I set up the index, but when I try to use it,
it's not working:
start n= node:Chain(entity_id='1') return n;
I set the properties:
node_keys_indexable=entity_id
and also:
node_auto_indexing=true
Without indexes this query:
match p = (a:Chain)-[:tsuma*1..3]->(b:Chain)
where a.entity_id= 1
return p;
which tries to get one node up to 3 levels deep, returned 49 relationships in 5 minutes.
That's a lot of time!
Your import command looks correct. You point to the CSV files where they are, just like you point to the --into directory. If you're unsure, use fully qualified names like /home/me/some-directory/entities.csv. What does it say? (It's really hard to help you without knowing the error.)
What's the error?
Legacy indexes don't go well with the importer, so enabling legacy indexing afterwards doesn't index your data. Could you instead use a schema index (CREATE INDEX ...)?
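For example, with a schema index the lookup in the slow query becomes an index seek. A minimal sketch for Neo4j 3.x (note that neo4j-import stores entity_id as a string unless the CSV header declares it as :int, so the literal might need quotes):

CREATE INDEX ON :Chain(entity_id);

MATCH p = (a:Chain)-[:tsuma*1..3]->(b:Chain)
WHERE a.entity_id = 1
RETURN p;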

Cassandra RPC Timeout on import from CSV

I am trying to import a CSV into a column family in Cassandra using the following syntax:
copy data (id, time, vol, speed, occupancy, status, flags) from 'C:\Users\Foo\Documents\reallybig.csv' with header = true;
The CSV file is about 700 MB, and for some reason when I run this command in cqlsh I get the following error:
"Request did not complete within rpc_timeout."
What is going wrong? There are no errors in the CSV, and it seems to me that Cassandra should be able to suck in this CSV without a problem.
The Cassandra installation folder has a .yaml file (cassandra.yaml) where the RPC timeout value, "rpc_timeout_in_ms", is set; you can increase that value and restart Cassandra.
Another way is to cut your big CSV into multiple files and import them one by one.
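A hedged sketch of the relevant cassandra.yaml change (the default is typically 10000 ms; the exact option name depends on the Cassandra version, as newer releases split it into per-operation timeouts such as read_request_timeout_in_ms):

# conf/cassandra.yaml
rpc_timeout_in_ms: 60000    # raise from the 10000 ms default, then restart the node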
This actually ended up being my own misuse of COPY FROM, as the CSV was about 17 million rows. In this case the best option was to use the bulk loader example and run sstableloader. However, the answer above would certainly work if I wanted to break the CSV into 17 smaller CSVs, which is an option.

Load csv file with integers in Octave 3.2.4 under Windows

I am trying to import in Octave a file (i.e. data.txt) containing 2 columns of integers, such as:
101448,1077
96906,924
105704,1017
I use the following command:
data = load('data.txt')
However, the "data" matrix that results has a 1 x 1 dimension, with all the content of the data.txt file saved in just one cell. If I adjust the numbers to look like floats:
101448.0,1077.0
96906.0,924.0
105704.0,1017.0
the loading works as expected, and I obtain a matrix with 3 rows and 2 columns.
I looked at the various options that can be set for the load command but none of them seem to help. The data file has no headers, just plain integers, comma separated.
Any suggestions on how to load this type of data? How can I force Octave to cast the data as numeric?
The load function is not meant to read CSV files. It is meant to load files saved by Octave itself, which define variables.
To read a CSV file, use csvread("data.txt"). Also, 3.2.4 is a very old version that is no longer supported; you should upgrade.
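For the sample file above, a minimal sketch:

% read the comma-separated integers directly into a numeric matrix
data = csvread("data.txt");        % gives a 3x2 matrix for the sample above
% dlmread("data.txt", ",") is an equivalent alternative if you need to set the delimiter explicitly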