importing categorical data from CSV into scikit-learn - csv

I would like to import data from a CSV file to use in scikit-learn. It has a mix of numerical and categorical data, e.g.
someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
I need to convert this representation into a purely numerical one where categorical data points get converted into multiple binary columns, e.g.
someValue,colorIsRed,colorIsBlue,someOtherValue
1.2,1,0,55.6
1.9,0,1,20.5
3.2,1,0,16.5
Is there any utility that does this for me, or an easy way to iterate through the data and get this representation?

scikit-learn doesn't offer data-loading functions as far as I know, but it does prefer Numpy arrays as input. Numpy's loadtxt function together with its converters parameter can be used to load your csv and specify the types of each column. It does not binarize your second column though.
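For the binarizing step, here is a minimal sketch that uses only the Python standard library. It assumes the CSV has a header row and treats any column whose values do not all parse as floats as categorical. The function name one_hot_rows and the colorIsRed-style column naming (borrowed from the question) are illustrative, not part of scikit-learn or NumPy.

```python
import csv
import io


def one_hot_rows(csv_text):
    """Return (header, rows) with categorical columns expanded to 0/1 columns."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(records[0].keys())

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    header = []
    plan = []  # (source column, category) pairs; category is None for numeric columns
    for col in columns:
        values = [rec[col] for rec in records]
        if all(is_number(v) for v in values):
            header.append(col)
            plan.append((col, None))
        else:
            # Keep categories in order of first appearance (red before blue).
            for cat in dict.fromkeys(values):
                header.append(f"{col}Is{cat.capitalize()}")
                plan.append((col, cat))

    rows = [
        [float(rec[col]) if cat is None else float(rec[col] == cat)
         for col, cat in plan]
        for rec in records
    ]
    return header, rows


sample = """someValue,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
"""
header, rows = one_hot_rows(sample)
print(header)
for row in rows:
    print(row)
```

The resulting list of rows is purely numeric, so it can be handed to numpy.array(rows) before passing it to an estimator.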

In this answer, I'm assuming that you're trying to convert your CSV into a file that LibSVM, LIBLINEAR, or scikit-learn can load.
You can use csv2libsvm, which is provided as part of the Ruby gem vector_embed:
$ gem install vector_embed
Successfully installed vector_embed-0.1.0
1 gem installed
You need Ruby 1.9+...
$ ruby -v
ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.0]
If you don't have Ruby 1.9, it's easy to install with rvm, which does not require (or recommend using) root:
$ curl -#L https://get.rvm.io | bash -s stable
$ rvm install 1.9.3
Once you have successfully run gem install vector_embed, make sure your first column is called "label":
$ cat example.csv
label,color,someOtherValue
1.2,red,55.6
1.9,blue,20.5
3.2,red,16.5
$ csv2libsvm example.csv > example.libsvm
$ cat example.libsvm
1.2 1139043:55.6 1997960:1
1.9 1089740:1 1139043:20.5
3.2 1139043:16.5 1997960:1
Note that it handles both categorical and continuous data, and that it uses MurmurHash version 3 to generate the feature names ("colorIsBlue" corresponds to 1089740, "colorIsRed" is 1997960... though the Ruby code is really hashing something like "color\0red").
If you're using svm, be sure to scale your data like they recommend in "A practical guide to SVM classification".
Finally, let's say you're using scikit-learn's svmlight/libsvm loader:
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/example.libsvm")


How do I get jq to preserve bigint values?

I have a large JSON file that contains bigints with their full values--not rounded like JavaScript loves to do by default.
We have a workaround to deal with the bigints in Node.js, but I'm trying to use jq (the command-line tool) to clean up our data.
However, when I ran jq on our JSON file, it rounded all of our bigints.
Is there a way to use jq so that it doesn't round the bigints or is there perhaps another command-line tool that works on a Mac that I may use instead?
As of right now, the best jq has to offer with respect to JSON numbers is the "master" version, which preserves the external numerical value very well. The updates were made on or about 22 Oct 2019, and the "master" version of jq seems to be as safe to use as the most recent release (jq 1.6).
Examples using a recent "master" version:
jqMaster -n -M '
[0000,
10000000000000000000000000000000000000012,
1.0000000000000000000000000000000000000012,
1000000000000000000000000000000000000001210000000000000000000000000000000000000012,
0.1e123456]'
Output
[
0,
10000000000000000000000000000000000000012,
1.0000000000000000000000000000000000000012,
1000000000000000000000000000000000000001210000000000000000000000000000000000000012,
1E+123455
]
Another option would be to use “gojq”, the Go implementation of jq that uses unbounded-precision representation of integer literals.
In fact, except for one bug that has only been fixed in the “master” version of gojq as of this writing, gojq supports unbounded-precision integer arithmetic. The bug fix: https://github.com/itchyny/gojq/commit/7a1840289029c9c038d61274ceac9b8d307c0358
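If neither of those builds is available, Python's standard json module may be another option, since it parses JSON integers into arbitrary-precision ints. The following is only a sketch; the field names id and ratio are made up for illustration.

```python
import json
from decimal import Decimal

text = (
    '{"id": 10000000000000000000000000000000000000012, '
    '"ratio": 1.0000000000000000000000000000000000000012}'
)

# Integers become Python ints (arbitrary precision); parse_float=Decimal
# keeps every digit of non-integer literals instead of rounding to a float.
data = json.loads(text, parse_float=Decimal)

print(data["id"])
print(data["ratio"])

# Writing the integer back out does not round it either.
round_tripped = json.dumps({"id": data["id"]})
print(round_tripped)
```

Note that json.dumps does not serialize Decimal values out of the box, so non-integer numbers parsed this way need a custom encoder (or conversion to strings) before being written back.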

drop_duplicates() got an unexpected keyword argument 'ignore_index'

On my machine, the code runs normally. But on my friend's machine, drop_duplicates() raises an error. The error message is the same as the title.
Open your command prompt, type pip show pandas to check the current version of your pandas.
If it's lower than 1.0.0, as @paulperry says, then type pip install --upgrade pandas --user
(--user is a literal pip option that installs into your per-user site-packages directory; it is not a placeholder for your Windows account name)
Type import pandas as pd; pd.__version__ and see what version of Pandas you are using and make sure it's >= 1.0 .
I was having the same problem as Wzh -- but am running pandas version 1.1.3. So, it was not a version problem.
Ilya Chernov's comment pointed me in the right direction. I needed to extract a list of unique names from a single column in a more complicated DataFrame so that I could use that list in a lookup table. This seems like something others might need to do, so I will expand a bit on Chernov's comment with this example, using the sample CSV file "iris.csv" that is available on GitHub. The file lists sepal and petal length for a number of iris varieties. Here we extract the variety names.
df = pd.read_csv('iris.csv')
# drop duplicates BEFORE extracting the column
names = df.drop_duplicates('variety', inplace=False, ignore_index=True)
# THEN extract the column you want
names = names['variety']
print(names)
Here is the output:
0 Setosa
1 Versicolor
2 Virginica
Name: variety, dtype: object
The key idea here is to get rid of the duplicate variety names while the object is still a DataFrame (without changing the original file), and then extract the one column that is of interest.
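On a machine that is stuck with pandas older than 1.0, roughly the same result can be had without the ignore_index keyword. This is a minimal sketch that assumes chaining reset_index(drop=True) is acceptable; the small inline DataFrame is a made-up stand-in for iris.csv.

```python
import pandas as pd

# Small stand-in for iris.csv; the values are made up for illustration.
df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3],
    "variety": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica"],
})

# reset_index(drop=True) renumbers the surviving rows 0..n-1, which is the
# effect ignore_index=True has on pandas >= 1.0.
names = df.drop_duplicates("variety").reset_index(drop=True)["variety"]
print(names.tolist())
```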

MATLAB: Read HTML-Codes (within XML)

I'm trying to read the following XML-file of a Polish treebank using MATLAB: http://zil.ipipan.waw.pl/Sk%C5%82adnica?action=AttachFile&do=view&target=Sk%C5%82adnica-frazowa-0.5-TigerXML.xml.gz
Polish letters seem to be encoded as HTML-codes: http://webdesign.about.com/od/localization/l/blhtmlcodes-pl.htm
For instance, &#322; stands for 'ł'. If I open the treebank using 'UTF-8', I get words like k&#322;ania&#322;, which should actually be displayed as 'kłaniał'.
Now, I see 2 options to read the treebank correctly:
Directly read the XML-file such that HTML-codes are transformed into the corresponding characters.
First save the words in non-decoded format (e.g. as k&#322;ania&#322;) and then transform the characters afterwards.
Is it possible to do one of the 2 options (or both) in MATLAB?
A non-MATLAB solution is to preprocess the file through some external utility. For instance, with Ruby installed, one could use the HTMLentities gem to unescape all the special characters.
sudo gem install htmlentities
Let file.xml be the filename which should consist of ascii-only chars. The Ruby code to convert the file could be like this:
#!/usr/bin/env ruby
require 'htmlentities'
xml = File.open("file.xml").read
converted_xml = HTMLEntities.new.decode xml
IO.write "decoded_file.xml", converted_xml
(To run the file, don't forget to chmod +x it to make it executable).
Or more compactly, as a one-liner
ruby -e "require 'htmlentities';IO.write(\"decoded_file.xml\",HTMLEntities.new.decode(File.open(\"file.xml\").read))"
You could then postprocess the xml however you wish.
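The same preprocessing can also be sketched in Python, whose standard library ships html.unescape. The helper name decode_entities is hypothetical; in practice the string would come from reading file.xml and the result would be written to decoded_file.xml, as in the Ruby version.

```python
import html


def decode_entities(text):
    """Replace character references such as &#322; with the characters they name."""
    return html.unescape(text)


print(decode_entities("k&#322;ania&#322;"))
```

One caveat: html.unescape also decodes references such as &amp;lt; and &amp;amp;, so if the XML contains escaped markup characters the decoded file may need re-escaping before it is parsed as XML again.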

Is there any way in Elasticsearch to get results as CSV file in curl API?

I am using Elasticsearch.
I need the results of a search as a CSV file.
Is there a curl URL or a plugin that can achieve this?
I've done just this using cURL and jq ("like sed, but for JSON"). For example, you can do the following to get CSV output for the top 20 values of a given facet:
$ curl -X GET 'http://localhost:9200/myindex/item/_search?from=0&size=0' -d '
{"from": 0,
"size": 0,
"facets": {
"sourceResource.subject.name": {
"global": true,
"terms": {
"order": "count",
"size": 20,
"all_terms": true,
"field": "sourceResource.subject.name.not_analyzed"
}
}
},
"sort": [
{
"_score": "desc"
}
],
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
}
}' | jq -r '.facets["sourceResource.subject.name"].terms[] | [.term, .count] | @csv'
"United States",33755
"Charities--Massachusetts",8304
"Almshouses--Massachusetts--Tewksbury",8304
"Shields",4232
"Coat of arms",4214
"Springfield College",3422
"Men",3136
"Trees",3086
"Session Laws--Massachusetts",2668
"Baseball players",2543
"Animals",2527
"Books",2119
"Women",2004
"Landscape",1940
"Floral",1821
"Architecture, Domestic--Lowell (Mass)--History",1785
"Parks",1745
"Buildings",1730
"Houses",1611
"Snow",1579
I've used Python successfully, and the scripting approach is intuitive and concise. The ES client for python makes life easy. First grab the latest Elasticsearch client for Python here:
http://www.elasticsearch.org/blog/unleash-the-clients-ruby-python-php-perl/#python
Then your Python script can include calls like:
import elasticsearch
import unicodedata
import csv
es = elasticsearch.Elasticsearch(["10.1.1.1:9200"])
# this returns up to 500 rows, adjust to your needs
res = es.search(index="YourIndexName", body={"query": {"match": {"title": "elasticsearch"}}}, size=500)
sample = res['hits']['hits']
# then open a csv file, and loop through the results, writing to the csv
# (Python 3: open in text mode with newline='', as the csv module expects)
with open('outputfile.tsv', 'w', newline='', encoding='utf-8') as csvfile:
    # we use TAB delimited, to handle cases where freeform text may have a comma
    filewriter = csv.writer(csvfile, delimiter='\t',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    # create column header row
    filewriter.writerow(["column1", "column2", "column3"])  # change the column labels here
    for hit in sample:
        # fill columns 1, 2, 3 with your data
        col1 = hit["some"]["deeply"]["nested"]["field"]  # replace these nested key names with your own
        col1 = col1.replace('\n', ' ')
        col2 = col3 = ""  # placeholders; fill these in the same way as col1
        filewriter.writerow([col1, col2, col3])
You may want to wrap the column['key'] lookups in try / except error handling, since documents are unstructured and may not always have the field (this depends on your index).
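One way to keep that error handling out of the main loop is a small lookup helper. This is only a sketch; get_nested is a hypothetical name, and the sample hits below are made up to show one document that lacks the field.

```python
def get_nested(document, keys, default=""):
    """Walk a chain of dict keys, returning default if any level is missing."""
    current = document
    for key in keys:
        try:
            current = current[key]
        except (KeyError, TypeError):
            return default
    return current


# Hypothetical hits: the second one lacks the nested field entirely.
hits = [
    {"some": {"deeply": {"nested": {"field": "line one\nline two"}}}},
    {"some": {"deeply": {}}},
]
cleaned = [
    get_nested(hit, ["some", "deeply", "nested", "field"]).replace("\n", " ")
    for hit in hits
]
print(cleaned)
```

Returning an empty string as the default keeps the row aligned with the header even when a document has no value for that column.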
I have a complete Python sample script using the latest ES python client available here:
https://github.com/jeffsteinmetz/pyes2csv
You can use the Elasticsearch head plugin.
After installing it, open it at
http://localhost:9200/_plugin/head/
Once you have the plugin installed, navigate to the structured query tab, provide query details and you can select 'csv' format from the 'Output Results' dropdown.
I don't think there is a plugin that will give you CSV results directly from the search engine, so you will have to query ElasticSearch to retrieve results and then write them to a CSV file.
Command line
If you're on a Unix-like OS, then you might be able to make some headway with es2unix which will give you search results back in raw text format on the command line and so should be scriptable.
You could then dump those results to text file or pipe to awk or similar to format as CSV. There is a -o flag available, but it only gives 'raw' format at the moment.
Java
I found an example using Java - but haven't tested it.
Python
You could query ElasticSearch with something like pyes and write the results set to a file with the standard csv writer library.
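The "write the result set with the csv library" half of that could look roughly like the sketch below. The querying itself is left out. The response dict is a made-up stand-in that only assumes the usual hits.hits[]._source layout of a search response, and hits_to_csv is a hypothetical helper name.

```python
import csv
import io

# Hypothetical search response, shaped like hits.hits[]._source.
response = {
    "hits": {
        "hits": [
            {"_source": {"name": "Tim", "color": "red"}},
            {"_source": {"name": "Alice"}},  # "color" is missing here
        ]
    }
}


def hits_to_csv(response, fieldnames):
    """Render the _source of each hit as CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for hit in response["hits"]["hits"]:
        writer.writerow(hit["_source"])
    return buffer.getvalue()


csv_text = hits_to_csv(response, ["name", "color"])
print(csv_text, end="")
```

csv.DictWriter is used here so that documents with missing or extra fields still produce rows that line up with the header.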
Perl
With Perl, you could use Clinton Gormley's Gist linked by Rakesh - https://gist.github.com/clintongormley/2049562
Shameless plug. I wrote estab - a command line program to export elasticsearch documents to tab-separated values.
Example:
$ export MYINDEX=localhost:9200/test/default/
$ curl -XPOST $MYINDEX -d '{"name": "Tim", "color": {"fav": "red"}}'
$ curl -XPOST $MYINDEX -d '{"name": "Alice", "color": {"fav": "yellow"}}'
$ curl -XPOST $MYINDEX -d '{"name": "Brian", "color": {"fav": "green"}}'
$ estab -indices "test" -f "name color.fav"
Brian green
Tim red
Alice yellow
estab can handle export from multiple indices, custom queries, missing values, list of values, nested fields and it's reasonably fast.
If you are using Kibana (app/discover in general), you can build your query in the UI, save it, and then choose Share -> CSV Reports. This creates a CSV file with one line per record and comma-separated columns.
I have been using stash-query (https://github.com/robbydyer/stash-query) for this.
I find it quite convenient and it works well, though I struggle with the install every time I redo it (this is because I am not very fluent with gems and Ruby).
On Ubuntu 16.04 though, what seemed to work was:
apt install ruby
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev
gem install stash-query
and then you should be good to go. In order, those commands:
Install Ruby.
Install the curl dependencies for Ruby, because the stash-query tool works via the Elasticsearch REST API.
Install stash-query.
This blog post describes how to build it as well:
https://robbydyer.wordpress.com/2014/08/25/exporting-from-kibana/
You can use elasticsearch2csv, a small and effective Python 3 script that uses the Elasticsearch scroll API and can handle large query responses.
You can use a GitHub Gist for this. It's simple.
It's written in Perl, and you can get some help from it.
Please download it and see the usage on GitHub. Here is the link:
GIST GitHub
Or, if you want a Java solution, go for elasticsearch-river-csv.

Ways to parse JSON using KornShell

I have working code that parses JSON output in KornShell by treating it as a string of characters. The issue I have is that the vendor keeps changing the position of the field that I am interested in. I understand that JSON can be parsed by key-value pairs.
Is there something out there that can do this? I am interested in a specific field, and I would like to use it to run checks on the status of another REST API call.
My sample json output is like this:
JSONDATA value :
{
"status": "success",
"job-execution-id": 396805,
"job-execution-user": "flexapp",
"job-execution-trigger": "RESTAPI"
}
I would need the job-execution-id value to monitor this job through the rest of the script.
I am using the following command to parse it:
RUNJOB=$(print ${DATA} |cut -f3 -d':'|cut -f1 -d','| tr -d [:blank:]) >> ${LOGDIR}/${LOGFILE}
The problem with this is, it is field delimited by :. The field position has been known to be changed by the vendors during releases.
So I am trying to see if I can use a utility out there that would always give me the key-value pair of "job-execution-id": 396805, no matter where it is in the json output.
I started looking at jsawk, and it requires the js interpreter to be installed on our machines which I don't want. Any hint on how to go about finding which RPM that I need to solve it?
I am using RHEL5.5.
Any help is greatly appreciated.
The ast-open project has libdss (and a dss wrapper) which supposedly could be used with ksh. Documentation is sparse and is limited to a few messages on the ast-user mailing list.
The regression tests for libdss contain some json and xml examples.
I'll try to find more info.
Python is included by default with CentOS so one thing you could do is pass your JSON string to a Python script and use Python's JSON parser. You can then grab the value written out by the script. An example you could modify to meet your needs is below.
Note that by specifying other dictionary keys in the Python script you can get any of the values you need without having to worry about the order changing.
Python script:
# get_job_execution_id.py
# The try/except is because you'll probably have Python 2.4 on CentOS 5.5,
# and the straight "import json" statement won't work unless you have Python 2.6+.
try:
    import json
except ImportError:
    import simplejson as json
import sys

json_data = sys.argv[1]
data = json.loads(json_data)
job_execution_id = data['job-execution-id']
sys.stdout.write(str(job_execution_id))
Kornshell script that executes it:
#!/bin/ksh
# get_job_execution_id.sh
JSON_DATA='{"status":"success","job-execution-id":396805,"job-execution-user":"flexapp","job-execution-trigger":"RESTAPI"}'
# The script name must match the Python file above.
EXECUTION_ID=`python get_job_execution_id.py "$JSON_DATA"`
echo "$EXECUTION_ID"