drop_duplicates() got an unexpected keyword argument 'ignore_index' - duplicates

On my machine the code runs normally, but on my friend's machine drop_duplicates() raises the error shown in the title.

Open your command prompt and type pip show pandas to check the current version of your pandas.
If it's lower than 1.0.0, as #paulperry says, then type pip install --upgrade pandas --user
(the --user flag installs the upgrade for your own account only, so no administrator rights are needed).

Run import pandas as pd; pd.__version__ to see which version of pandas you are using, and make sure it's >= 1.0.

I was having the same problem as Wzh -- but am running pandas version 1.1.3. So it was not a version problem.
Ilya Chernov's comment pointed me in the right direction. I needed to extract a list of unique names from a single column in a more complicated DataFrame so that I could use that list in a lookup table. This seems like something others might need to do, so I will expand a bit on Chernov's comment with this example, using the sample csv file "iris.csv" that is available on GitHub. The file lists sepal and petal measurements for a number of iris varieties. Here we extract the variety names.
import pandas as pd

df = pd.read_csv('iris.csv')
# drop duplicates BEFORE extracting the column
names = df.drop_duplicates('variety', inplace=False, ignore_index=True)
# THEN extract the column you want
names = names['variety']
print(names)
Here is the output:
0 Setosa
1 Versicolor
2 Virginica
Name: variety, dtype: object
The key idea here is to get rid of the duplicate variety names while the object is still a DataFrame (without changing the original file), and then extract the one column that is of interest.
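If you are stuck on a pandas version older than 1.0, where drop_duplicates() does not accept ignore_index, the same result can be had by resetting the index yourself; a minimal sketch:
names = df.drop_duplicates('variety').reset_index(drop=True)['variety']
print(names)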


Admin import - Group not found

I am trying to load multiple csv files into a new db using the neo4j-admin import tool on a machine running Debian 11. To try to ensure there are no collisions in the ID fields, I've given every one of my node and relationship files its own ID space (group).
However, I'm getting this error:
org.neo4j.internal.batchimport.input.HeaderException: Group 'INVS' not found. Available groups are: [CUST]
This is super frustrating, as I know that the INVS group definitely exists. I've checked every file that uses that ID space and they all include it. Another strange thing is that there are more ID spaces than just the CUST and INVS ones. It feels like it's trying to load in relationships before it finishes loading all of the nodes for some reason.
Here is what I'm seeing when I search through my input files
$ grep -r -h "(INV" ./import | sort | uniq
:ID(INVS),total,:LABEL
:START_ID(INVS),:END_ID(CUST),:TYPE
:START_ID(INVS),:END_ID(ITEM),:TYPE
The top one is from my $NEO4J_HOME/import/nodes folder, the other two are in my $NEO4J_HOME/import/relationships folder.
Is there a nice solution to this? Or have I just stumbled upon a bug here?
Edit: here's the command I've been using from within my $NEO4J_HOME directory:
neo4j-admin import --force=true --high-io=true --skip-duplicate-nodes --nodes=import/nodes/\.* --relationships=import/relationships/\.*
Indeed, such a thing would be great, but I don't think it's possible at the moment.
Anyway, it doesn't seem to be a bug.
I suppose it may be intended behavior and/or a feature not yet implemented.
In fact, the documentation regarding the regular expression says:
Assume that you want to include a header and then multiple files that matches a pattern, e.g. containing numbers.
In this case a regular expression can be used
while the description of the --nodes option says:
Node CSV header and data. Multiple files will be logically seen as one big file from the perspective of the importer. The first line must contain the header. Multiple data sources like these can be specified in one import, where each data source has its own header.
So, it appears that neo4j-admin import treats --nodes=import/nodes/\.* as a single csv with the first header found, hence the error.
By contrast, with multiple --nodes options there are no problems; see the sketch below.
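A sketch of the multiple --nodes form (the file names here are hypothetical; the point is one --nodes or --relationships argument per header):
neo4j-admin import --force=true --high-io=true --skip-duplicate-nodes \
  --nodes=import/nodes/customers.csv \
  --nodes=import/nodes/invoices.csv \
  --nodes=import/nodes/items.csv \
  --relationships=import/relationships/invoice-customer.csv \
  --relationships=import/relationships/invoice-item.csv
Each argument is read as its own data source with its own header, so the INVS group gets registered before the relationship files that reference it are parsed.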

Load csv file with integers in Octave 3.2.4 under Windows

I am trying to import a file (data.txt) in Octave containing 2 columns of integers, such as:
101448,1077
96906,924
105704,1017
I use the following command:
data = load('data.txt')
However, the "data" matrix that results has a 1 x 1 dimension, with all the content of the data.txt file saved in just one cell. If I adjust the numbers to look like floats:
101448.0,1077.0
96906.0,924.0
105704.0,1017.0
the loading works as expected, and I obtain a matrix with 3 rows and 2 columns.
I looked at the various options that can be set for the load command but none of them seem to help. The data file has no headers, just plain integers, comma separated.
Any suggestions on how to load this type of data? How can I force Octave to cast the data as numeric?
The load function is not meant for reading csv files. It is meant to load files saved from Octave itself, which define variables.
To read a csv file use csvread("data.txt"). Also, 3.2.4 is a very old version that is no longer supported; you should upgrade.
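A minimal sketch with the file from the question:
% csvread parses purely numeric, comma-separated data, integers included
data = csvread("data.txt");
size(data)    % returns 3 2, i.e. 3 rows and 2 columns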

Neo4j jexp/batch-import weird error: java.lang.NumberFormatException

I'm trying to import around 6M nodes using Michael Hunger's batch importer but I'm getting this weird error:
java.lang.NumberFormatException: For input string: "78rftark42lp5f8nadc63l62r3" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
It is weird because 78rftark42lp5f8nadc63l62r3 is the very first value of the big CSV file that I'm trying to import and its datatype is set to string.
These are the first three lines of that file:
name:string:sessions labels:label timestamp:long:timestamps visitor_pid referrer_url
78rftark42lp5f8nadc63l62r3 Session 1401277353000 cd7b76ef09b498e95b35b49de2925c5f http://someurl.com/blah?t=123
dt2gshq5pao8fg7bka8fdri123 Session 1401277329000 4036ac507698e4daf2ada98664da6d58 http://enter.url.com/signup/signup.php
As you can see from name:string:sessions, the datatype of that column is set to string, so why is the importer trying to parse the value as long?
I'm completely new to Neo4j and its ecosystem so I'm sure I'm missing something here.
This is the command I ran to import a bunch of nodes and relations:
./import.sh \
-db-directory sessions.db \
-nodes "toImport/browser-nodes.csv.gz,toImport/country-nodes.csv.gz,toImport/device-nodes.csv.gz,toImport/ip-nodes.csv.gz,toImport/language-nodes.csv.gz,toImport/operatingSystem-nodes.csv.gz,toImport/referrerType-nodes.csv.gz,toImport/resolution-nodes.csv.gz,toImport/session-nodes.csv" \
-rels "toImport/rel-session-browser.csv.gz,toImport/rel-session-country.csv.gz,toImport/rel-session-device.csv.gz,toImport/rel-session-ip.csv.gz,toImport/rel-session-language.csv.gz,toImport/rel-session-operatingSystem.csv.gz,toImport/rel-session-referrerType.csv.gz,toImport/rel-session-resolution.csv.gz"
The file that fails is the last one in the list of nodes toImport/session-nodes.csv
The other files were successfully processed by the importer.
This is the content of the batch.properties file:
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=1G
neostore.propertystore.db.index.mapped_memory=3G
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=1G
batch_import.node_index.sessions=exact
batch_import.node_index.browsers=exact
batch_import.node_index.operatingsystems=exact
batch_import.node_index.referrertypes=exact
batch_import.node_index.devices=exact
batch_import.node_index.resolutions=exact
batch_import.node_index.countries=exact
batch_import.node_index.languages=exact
batch_import.node_index.ips=exact
batch_import.node_index.timestamps=exact
Any thoughts?
I can't see what the problem is here, so any help will be appreciated.
EDIT:
I'm using this binary:
https://dl.dropboxusercontent.com/u/14493611/batch_importer_20.zip

Ways to parse JSON using KornShell

I have working code that parses JSON output in KornShell by treating it as a string of characters. The issue is that the vendor keeps changing the position of the field I am interested in. I understand that in JSON we can parse by key-value pairs.
Is there something out there that can do this? I am interested in a specific field and would like to use it to run checks on the status of another REST API call.
My sample json output is like this:
JSONDATA value :
{
"status": "success",
"job-execution-id": 396805,
"job-execution-user": "flexapp",
"job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is that it relies on fields delimited by :, and the field position has been known to change between vendor releases.
So I am trying to see if I can use a utility out there that would always give me the key-value pair of "job-execution-id": 396805, no matter where it is in the json output.
I started looking at jsawk, but it requires the js interpreter to be installed on our machines, which I don't want. Any hint on how to find which RPM I need to solve this?
I am using RHEL5.5.
Any help is greatly appreciated.
The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.
Python is included by default with CentOS, so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
# get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and the straight "import json" statement won't work unless you have Python 2.6+.
try:
    import json
except ImportError:
    import simplejson as json
import sys
json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
KornShell script that executes it:
#!/bin/ksh
# get_job_execution_id.sh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
EXECUTION_ID=`python get_job_execution_id.py "$JSON_DATA"`
echo $EXECUTION_ID
echo $EXECUTION_ID
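Running it prints just the value, which the rest of your script can capture (assuming both files are in the current directory):
$ ksh get_job_execution_id.sh
396805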

importing categorical data from CSV into scikit-learn

I would like to import data from a CSV file to use in scikit-learn. It has a mix of numerical and categorical data, e.g.
someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
I need to convert this representation into a purely numerical one where categorical data points get converted into multiple binary columns, e.g.
someValue,colorIsRed,colorIsBlue,someOtherValue
1.2,1,0,55.6
1.9,0,1,20.5
3.2,1,0,16.5
Is there any utility that does this for me, or an easy way to iterate through the data and get this representation?
scikit-learn doesn't offer data-loading functions as far as I know, but it does prefer NumPy arrays as input. NumPy's loadtxt function together with its converters parameter can be used to load your csv and specify the types of each column. It does not binarize your second column, though; see the sketch below.
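A minimal sketch of that approach (the file name and the red/blue coding are assumptions for illustration; this maps each color to a single numeric code rather than binarizing it):
import numpy as np

# map each color string to a numeric code; extend the dict for more colors
color_codes = {'red': 0.0, 'blue': 1.0}
data = np.loadtxt('example.csv', delimiter=',', skiprows=1,
                  encoding='utf-8',  # converters then receive str, not bytes
                  converters={1: lambda s: color_codes[s.strip()]})
print(data)
# [[ 1.2   0.   55.6]
#  [ 1.9   1.   20.5]
#  [ 3.2   0.   16.5]]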
In this answer, I'm assuming that you're trying to convert your CSV into a file that LibSVM, LIBLINEAR, or scikit-learn can load.
You can use csv2libsvm, which is provided as part of the Ruby gem vector_embed:
$ gem install vector_embed
Successfully installed vector_embed-0.1.0
1 gem installed
You need Ruby 1.9+...
$ ruby -v
ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
If you don't have Ruby 1.9, it's easy to install with rvm, which does not require (or recommend using) root:
$ curl -#L https://get.rvm.io | bash -s stable
$ rvm install 1.9.3
Once you have successfully run gem install vector_embed, make sure your first column is called "label":
$ cat example.csv
label,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
$ csv2libsvm example.csv > example.libsvm
$ cat example.libsvm
1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1
Note that it handles both categorical and continuous data, and that it uses MurmurHash version 3 to generate the feature names ("colorIsBlue" corresponds to 1089740, "colorIsRed" is 1997960... though the Ruby code is really hashing something like "color\0red").
If you're using svm, be sure to scale your data like they recommend in "A practical guide to SVM classification".
Finally, let's say you're using scikit-learn's svmlight/libsvm loader:
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/example.libsvm")
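load_svmlight_file returns the features as a SciPy sparse matrix and the labels as a NumPy array. If you follow the scaling advice above, a sparse-friendly option (a sketch, not part of the original workflow) is scikit-learn's MaxAbsScaler:
>>> from sklearn.preprocessing import MaxAbsScaler
>>> X_scaled = MaxAbsScaler().fit_transform(X_train)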