Neo4j jexp/batch-import weird error: java.lang.NumberFormatException - csv

I'm trying to import around 6M nodes using Michael Hunger's batch importer but I'm getting this weird error:
java.lang.NumberFormatException: For input string: "78rftark42lp5f8nadc63l62r3"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
It is weird because 78rftark42lp5f8nadc63l62r3 is the very first value in the big CSV file I'm trying to import, and its column's datatype is set to string.
These are the first three lines of that file:
name:string:sessions labels:label timestamp:long:timestamps visitor_pid referrer_url
78rftark42lp5f8nadc63l62r3 Session 1401277353000 cd7b76ef09b498e95b35b49de2925c5f http://someurl.com/blah?t=123
dt2gshq5pao8fg7bka8fdri123 Session 1401277329000 4036ac507698e4daf2ada98664da6d58 http://enter.url.com/signup/signup.php
As you can see from the header (name:string:sessions), the datatype of that column is set to string, so why is the importer trying to parse the value as a long?
I'm completely new to Neo4j and its ecosystem so I'm sure I'm missing something here.
This is the command I ran to import a bunch of nodes and relationships:
./import.sh \
-db-directory sessions.db \
-nodes "toImport/browser-nodes.csv.gz,toImport/country-nodes.csv.gz,toImport/device-nodes.csv.gz,toImport/ip-nodes.csv.gz,toImport/language-nodes.csv.gz,toImport/operatingSystem-nodes.csv.gz,toImport/referrerType-nodes.csv.gz,toImport/resolution-nodes.csv.gz,toImport/session-nodes.csv" \
-rels "toImport/rel-session-browser.csv.gz,toImport/rel-session-country.csv.gz,toImport/rel-session-device.csv.gz,toImport/rel-session-ip.csv.gz,toImport/rel-session-language.csv.gz,toImport/rel-session-operatingSystem.csv.gz,toImport/rel-session-referrerType.csv.gz,toImport/rel-session-resolution.csv.gz"
The file that fails is the last one in the list of node files, toImport/session-nodes.csv.
The other files were successfully processed by the importer.
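A quick sanity check that might narrow this down (assuming the file is tab-separated, which I believe is the batch importer's default, and that a stray delimiter could shift a string value into the :long column) is a short script along these lines:

# Sanity check for toImport/session-nodes.csv: every row should have the
# same number of tab-separated fields as the header, and every column
# declared as :long should actually parse as an integer.
node_file = "toImport/session-nodes.csv"

with open(node_file) as f:
    header = f.readline().rstrip("\n").split("\t")
    long_cols = [i for i, col in enumerate(header) if ":long" in col]
    for lineno, line in enumerate(f, start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            print(f"line {lineno}: expected {len(header)} fields, got {len(fields)}")
            continue
        for i in long_cols:
            try:
                int(fields[i])
            except ValueError:
                print(f"line {lineno}: {header[i]} is not a long: {fields[i]!r}")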
This is the content of the batch.properties file:
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=1G
neostore.propertystore.db.index.mapped_memory=3G
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=1G
batch_import.node_index.sessions=exact
batch_import.node_index.browsers=exact
batch_import.node_index.operatingsystems=exact
batch_import.node_index.referrertypes=exact
batch_import.node_index.devices=exact
batch_import.node_index.resolutions=exact
batch_import.node_index.countries=exact
batch_import.node_index.languages=exact
batch_import.node_index.ips=exact
batch_import.node_index.timestamps=exact
Any thoughts?
I can't see what the problem is here, so any help would be appreciated.
EDIT:
I'm using this binary:
https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip

Related

drop_duplicates() got an unexpected keyword argument 'ignore_index'

On my machine the code runs normally, but on my friend's machine there is an error about drop_duplicates(). The error is the same as in the title.
Open your command prompt and type pip show pandas to check the current version of your pandas.
If it's lower than 1.0.0, as @paulperry says, then type pip install --upgrade pandas --user
(the --user flag installs into your per-user site-packages; nothing needs to be substituted).
In Python, run import pandas as pd; pd.__version__ to see which version of pandas you are using, and make sure it's >= 1.0.
I was having the same problem as Wzh, but I am running pandas version 1.1.3, so it was not a version problem.
Ilya Chernov's comment pointed me in the right direction. I needed to extract a list of unique names from a single column in a more complicated DataFrame so that I could use that list in a lookup table. This seems like something others might need to do, so I will expand a bit on Chernov's comment with this example, using the sample csv file "iris.csv" that is available on GitHub. The file lists sepal and petal measurements for a number of iris varieties. Here we extract the variety names.
import pandas as pd

df = pd.read_csv('iris.csv')
# drop duplicates BEFORE extracting the column
names = df.drop_duplicates('variety', inplace=False, ignore_index=True)
# THEN extract the column you want
names = names['variety']
print(names)
Here is the output:
0 Setosa
1 Versicolor
2 Virginica
Name: variety, dtype: object
The key idea here is to get rid of the duplicate variety names while the object is still a DataFrame (without changing the original file), and then extract the one column that is of interest.
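If upgrading pandas is not an option, roughly the same result should be obtainable on older versions (pre-1.0) by dropping the ignore_index argument and resetting the index afterwards; a minimal sketch, using the same iris.csv:

import pandas as pd

df = pd.read_csv('iris.csv')
# Equivalent of drop_duplicates(..., ignore_index=True) on pandas < 1.0:
# drop the duplicates first, then rebuild a clean 0..n-1 index.
names = df.drop_duplicates('variety').reset_index(drop=True)['variety']
print(names)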

Import huge data from CSV into Neo4j

When I try to import huge data into Neo4j, it gives the following error:
there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Hello! I am trying to combine 2 variables to one variable. The variables are Public Folder Names and the ParentPath. Both can be found using Get-PublicFolder Basically I want an array of Public Folders Path and Name so I will have an array like /Engineering/NewUsers Below is my code $parentpath = Get-PublicFolder -ResultSize Unlimited -Identity """ "'
(The text after "This is what I read:" is the raw content of the field the importer choked on in my CSV.)
It seems that some information may be lacking from your question, especially about the data that is being parsed, the stack trace, and so on.
Anyway, I think you can get around this by changing which character is treated as the quote character. How are you calling the import tool, and which version of Neo4j are you doing this on?
Try including the argument --quote % (I'm just picking another character, %, as the quote character here). Would that help you?
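If changing the quote character doesn't help, a crude way to locate the offending row is to scan the file for suspicious quoting; a rough sketch (the file name is a placeholder):

import re

# Crude heuristic: flag lines with an unbalanced number of double quotes,
# or with a closing quote followed by something other than a delimiter
# (the "characters after that ending quote" case from the error message).
stray = re.compile(r'"[^"]*"(?=[^,\t\r\n"])')

with open('nodes.csv') as f:
    for lineno, line in enumerate(f, start=1):
        if line.count('"') % 2 == 1 or stray.search(line):
            print(f"line {lineno}: suspicious quoting: {line.rstrip()}")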

How to load CSV dataset with corrupted columns?

I've exported a client database to a csv file, and tried to import it to Spark using:
spark.sqlContext.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("table.csv")
After doing some validation, I found out that some ids were null because a column sometimes contains a carriage return. That shifted all the following columns with a domino effect, corrupting all the data.
What is strange is that when calling printSchema the resulting table structure is good.
How to fix the issue?
You seem to have been lucky that inferSchema worked fine (since it only reads a few records to infer the schema), and so printSchema gives you a correct result.
Since the CSV export file is broken, and assuming you want to process the file using Spark (given its size, for example), read it using textFile and fix the ids, then save it in CSV format and load it back.
I'm not sure what version of Spark you are using, but beginning in 2.2 (I believe), there is a multiLine option that can be used to keep together fields that have line breaks in them. From some other things I've read, you may need to apply some quote and/or escape character options to get it working just how you want it.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .csv("table.csv")

Ways to parse JSON using KornShell

I have working code for parsing a JSON output using KornShell by treating it as a string of characters. The issue I have is that the vendor keeps changing the position of the field that I am interested in. I understand that in JSON we can parse it by key-value pairs.
Is there something out there that can do this? I am interested in a specific field, and I would like to use it to run checks on the status of another REST API call.
My sample json output is like this:
JSONDATA value :
{
"status": "success",
"job-execution-id": 396805,
"job-execution-user": "flexapp",
"job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is that it relies on fields delimited by :, and the field position has been known to change between vendor releases.
So I am trying to see if there is a utility out there that would always give me the key-value pair "job-execution-id": 396805, no matter where it appears in the JSON output.
I started looking at jsawk, but it requires the js interpreter to be installed on our machines, which I don't want. Any hint on which RPM I would need to solve this?
I am using RHEL5.5.
Any help is greatly appreciated.
The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.
Python is included by default with CentOS so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
# get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and a plain "import json" won't work unless you have Python 2.6+.
try:
    import json
except ImportError:
    import simplejson as json

import sys

json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
KornShell script that executes it:
#!/bin/ksh
# get_job_execution_id.sh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
EXECUTION_ID=$(python get_job_execution_id.py "$JSON_DATA")
echo $EXECUTION_ID
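A small variation on the same idea, in case other fields are needed later: a hypothetical get_json_value.py that takes the key name as a second argument, so the KornShell side can ask for any key without editing the Python.

# get_json_value.py -- hypothetical generalization of the script above.
# Usage: python get_json_value.py '<json string>' <key>
# (kept Python 2.4-friendly, like the original script)
import sys

try:
    import json
except ImportError:
    import simplejson as json

data = json.loads(sys.argv[1])
sys.stdout.write(str(data[sys.argv[2]]))

It would be called the same way as above, for example EXECUTION_ID=$(python get_json_value.py "$JSON_DATA" job-execution-id).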

Sqoop HDFS to Couchbase: json file format

I'm trying to export data from HDFS to Couchbase and I have a problem with my file format.
My configuration:
Couchbase Server 2.0
Hadoop stack: CDH 4.1.2
Sqoop 1.4.2 (compiled with Hadoop 2.0.0)
Couchbase/Hadoop connector (compiled with Hadoop 2.0.0)
When I run the export command, I can easily export files with this kind of format:
id,"value"
or
id,42
or
id,{"key":"value"}
But when the JSON object has more than one key, it doesn't work!
id,{"key1":"value1","key2":"value2"}
The content is truncated at the first comma and displayed in base64 by Couchbase, because the content is no longer valid JSON...
So, my question is: how must the file be formatted to be stored as a JSON document?
Can we only export a key/value file?
I want to export JSON files from HDFS the way cbdocloader does with files from the local file system...
I'm afraid this is expected behavior, as Sqoop is parsing your input file as CSV with a comma as the separator. You might need to tweak your file format to either escape the separator or enclose the entire JSON string. I would recommend reading how exactly Sqoop deals with escaping separators and enclosing strings in the user guide [1].
Links:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#id387098
I think your best bet is to convert the files to tab-delimited, if you're still working on this. If you look at the Sqoop documentation (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_large_objects), there's an option --fields-terminated-by which allows you to specify which characters Sqoop splits fields on.
If you passed it --fields-terminated-by '\t', and a tab-delimited file, it would leave the commas in place in your JSON.
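If the files are currently comma-separated key,json pairs, a small pre-processing step (sketched in Python below, with placeholder file names, and assuming the files can be staged locally) could rewrite them as tab-delimited before running Sqoop with --fields-terminated-by '\t':

# Rewrite "key,{...json...}" lines as "key<TAB>{...json...}" so the commas
# inside the JSON are no longer treated as field separators by Sqoop.
with open('export.csv') as src, open('export.tsv', 'w') as dst:
    for line in src:
        key, sep, value = line.rstrip('\n').partition(',')
        dst.write(key + '\t' + value + '\n')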
@mpiffaretti, can you post your sqoop export command? I think each JSON object should have its own key.
key1 {"dataOne":"ValueOne"}
key2 {"dataTwo":"ValueTwo"}
http://ajanacs.weebly.com/blog
In your case, changing the data like below may help you solve the issue.
id,{"key":"value"}
id2,{"key2":"value2"}
Let me know if you have further questions on it.
[json] [sqoopexport] [couchbase]