scikit-learn - converting features stored as strings into numbers - CSV

I'm currently dipping my toes into machine learning with the scikit-learn Python library and am trying to use some CSV data in the format
Date Name Average_Price_SA
1995-01-01 Barking And Dagenham 70885.331285935
1995-01-01 Barnet 99567.4268042005
1995-01-01 Barnsley 49608.33494746
....
....
....
2005-01-01 Barking And Dagenham 13294.12321312
I have read it in with pandas using the line
data = pd.read_csv('data.csv')
From what I have learned so far, I think I'm supposed to convert those 'Name' category strings into floats so that they can be accepted into a model.
I'm not sure how to go about this. Any help would be greatly appreciated.
Thanks

You can use scikit-learn's LabelBinarizer to convert the strings to one-hot vectors. These are vectors with N components (where N is the number of unique strings), all zero except for a one at a single component.
from __future__ import print_function
from sklearn import preprocessing
names = ["Barking And Dagenham", "Barnet", "Barnsley"]
lb = preprocessing.LabelBinarizer()
vectors = lb.fit_transform(names)
for name, vector in zip(names, vectors):
print("%s => %s" % (name, str(vector)))
Output:
Barking And Dagenham => [1 0 0]
Barnet => [0 1 0]
Barnsley => [0 0 1]
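To apply this to the question's DataFrame, you can binarize the Name column and use the result alongside the numeric features. A minimal sketch, assuming the data variable from the question:
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
# one column per unique name; lb.classes_ maps columns back to the original strings
name_vectors = lb.fit_transform(data['Name'])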

Related

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

I'm trying to get the sentiments for comments with the help of a Hugging Face pretrained sentiment-analysis model. It returns the error: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).
Below I'm attaching the code; please take a look at it.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by this code, where value is the output of the classifier call above:
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending all the rows to an empty list:
text = []
for index, row in data.iterrows():
    text.append(row['Review'])
I'm trying to get the sentiment for all the rows
sent = []
for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)
The error is :
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self
Some of the sentences in your Review column of the data frame are too long. When these sentences are converted to tokens and sent into the model, they exceed the 512-token sequence limit: the embedding of the model used in the sentiment-analysis task was trained with 512 token positions.
To fix this issue, you can filter out the long sentences and keep only the smaller ones (with token length < 512),
or you can truncate the sentences with truncation=True:
sentiment = classifier(data.iloc[i,0], truncation=True)
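For the first option (filtering), here is a minimal sketch, assuming the classifier and the token tokenizer loaded above; it counts tokens with the tokenizer before classifying:
# keep only reviews that fit within the model's 512-token limit
short_reviews = [r for r in data['Review']
                 if len(token.encode(r, add_special_tokens=True)) <= 512]
sent = [classifier(r) for r in short_reviews]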
If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
MAX_TOKENS = 510  # maximum content tokens, leaving room for the start/end tokens
token = AutoTokenizer.from_pretrained("your model")
tokens = token.tokenize(
    text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.

Python, how to import datasets with vertically stacked column headers (#relation, #attribute, #data)?

I'm trying to load a dataset from timeseriesclassification.com, but the datasets are formatted in a way that I've never seen before.
The .csv file looks as follows:
#relation Wine
#attribute att0 numeric
#attribute att1 numeric
#attribute target {1 2}
#data
0,1,1
0,0,0
1,0,0
This is how the data should be formatted.
att0,att1,target
0,1,1
0,0,0
1,0,0
This is my current strategy:
- read the file with open('filename.csv')
- count the number of rows until #data appears
- remove all the headers and import the data with pandas
- add new column names
Does anyone know what type of formatting this dataset is in? Also, could anyone point me to a resource where I can look up different dataset formats?
Use Scipy's scipy.io.arff.loadarff to read ARFF files.
In [94]: from scipy.io.arff import loadarff
In [95]: dataset = loadarff(open('filename.csv','r'))
In [96]: df = pd.DataFrame(dataset[0], columns=dataset[1].names())
In [97]: df
Out[97]:
att0 att1 target
0 0.0 1.0 1
1 0.0 0.0 0
2 1.0 0.0 0
That format is an .arff (Attribute-Relation File Format) file. You can read it with the scipy.io.arff Python module.
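A minimal sketch of the whole round trip, assuming the file from the question; note that loadarff returns nominal attributes (such as target) as byte strings, which you may want to decode:
import pandas as pd
from scipy.io.arff import loadarff

raw_data, meta = loadarff('filename.csv')
df = pd.DataFrame(raw_data, columns=meta.names())
# nominal columns come back as bytes; decode them to plain strings
df['target'] = df['target'].str.decode('utf-8')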

Summing binary numbers representing fractions in Sagemath

I'm just starting to learn how to code in SageMath; I know it's similar to Python, but I don't have much experience with that either.
I'm trying to add two binary numbers representing fractions. That is, something like
a = '110'
b = '011'
bin(int(a,2) + int(b,2))
But using values representing fractions, such as '1.1'.
Thanks in advance!
If you want to do this in vanilla Python, parsing the binary fractions by hand isn't too bad (the first part is from this answer):
def binstr_to_float(s):
    # split on the binary point: integer bits, then fractional bits
    t = s.split('.')
    return int(t[0], 2) + int(t[1], 2) / 2.**len(t[1])

def float_to_binstr(f):
    # double f until it is an integer, tracking the number of shifts
    i = 0
    while int(f) != f:
        f *= 2
        i += 1
    as_str = str(bin(int(f)))
    if i == 0:
        return as_str[2:]
    # re-insert the binary point i places from the right
    return as_str[2:-i] + '.' + as_str[-i:]

float_to_binstr(binstr_to_float('11.1') + binstr_to_float('0.111'))  # is '100.011'
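If you want exact results instead of floating-point approximations, the same parsing idea works with Python's fractions module; a small sketch, where binstr_to_fraction is a hypothetical counterpart to the parser above:
from fractions import Fraction

def binstr_to_fraction(s):
    # '11.1' -> 7/2: integer bits, plus fractional bits over 2**len(frac)
    whole, _, frac = s.partition('.')
    denom = 2 ** len(frac)
    return Fraction(int(whole, 2) * denom + (int(frac, 2) if frac else 0), denom)

binstr_to_fraction('11.1') + binstr_to_fraction('0.111')  # Fraction(35, 8), i.e. '100.011' in binary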
In Python you can use the binary-fractions package. With this package you can convert binary-fraction strings into floats and vice versa. Then, you can perform operations on them.
Example:
>>> from binary_fractions import Binary
>>> sum = Binary("1.1") + Binary("10.01")
>>> str(sum)
'0b11.11'
>>> float(sum)
3.75
>>>
It has many more helper functions to manipulate binary strings such as: shift, add, fill, to_exponential, invert...
PS: Shameless plug, I'm the author of this package.

Scientific data

I want to import data from a corrupted CSV file. It contains scientific numbers, and it's a big data set with about 300000 rows and 27 columns. When I import it using
Import["data.csv", "HeaderLines" -> 1]
the data format is string. So I change it to a data-table format with
StringSplit[ToString[data[[#]]], ";"] & /@ Range[Dimensions[Import["data.csv"]][[1]]]
and I need to use the first column to analyse the data. But the problem is that this column contains scientific-notation numbers stored as strings! I want to convert them to numbers. I used this command:
ToExpression[Internal`StringToDouble[fdata[[All, 1]][[#]]]] & /@ Range[291407];
But it takes hours to run! Do you have any idea how I can do this without wasting time?
You could try the following:
(* read the first 5 rows *)
d = ReadList["data.csv", Table[Number, {27}], 5]
(* read the rows 100 to 150 *)
s = OpenRead["data.csv"];
Skip[s, Record, 99]
d = ReadList[s, Table[Number, {27}], 51]
Close[s]
And d[[All,1]] will get you the first column.

How do I feed in my own data into PyAlgoTrade?

I'm trying to use PyAlgoTrade's event profiler.
However, I don't want to use data from Yahoo! Finance; I want to use my own, but I can't figure out how to parse in the CSV. It is in the format:
Timestamp Low Open Close High BTC_vol USD_vol [8] [9]
2013-11-23 00 800 860 847.666666 886.876543 853.833333 6195.334452 5248330 0
2013-11-24 00 745 847.5 815.01 860 831.255 10785.94131 8680720 0
The complete CSV is here
I want to do something like:
def main(plot):
instruments = ["AA", "AES", "AIG"]
feed = yahoofinance.build_feed(instruments, 2008, 2009, ".")
Then replace yahoofinance.build_feed(instruments, 2008, 2009, ".") with my CSV
I tried:
import csv
with open('FinexBTCDaily.csv', 'rb') as csvfile:
    data = csv.reader(csvfile)

def main(plot):
    feed = data
But it throws an attribute error. Any ideas how to do this?
I suggest creating your own RowParser and Feed, which is much easier than it sounds; have a look here: yahoofeed
This also allows you to work with intraday data and clean up the data if needed, like your timestamp.
Another possibility, of course, would be to parse your file and save it so that it looks like a Yahoo feed. In your case, you would have to adapt the columns and the timestamp.
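For that second approach, a minimal pandas sketch of the conversion; the input column names follow the sample at the top of the question and the output file name is made up, so treat both as assumptions to adapt:
import pandas as pd

# hypothetical column names, taken from the question's sample rows
cols = ['Timestamp', 'Low', 'Open', 'Close', 'High', 'BTC_vol', 'USD_vol', 'col8', 'col9']
df = pd.read_csv('FinexBTCDaily.csv', names=cols)
# rearrange into the Date Time, Open, High, Low, Close, Volume, Adj Close layout
out = pd.DataFrame({'Date Time': pd.to_datetime(df['Timestamp']),
                    'Open': df['Open'], 'High': df['High'],
                    'Low': df['Low'], 'Close': df['Close'],
                    'Volume': df['BTC_vol'], 'Adj Close': df['Close']})
out.to_csv('FinexBTCDaily_formatted.csv', index=False)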
Step A: follow PyAlgoTrade doc on GenericBarFeed class
On this link, see addBarsFromCSV() in the CSV section of the BarFeed class in v0.16.
On this link, see addBarsFromCSV() in the CSV section of the BarFeed class in v0.17.
Note
- The CSV file must have the column names in the first row.
- It is ok if the Adj Close column is empty.
- When working with multiple instruments:
--- If all the instruments loaded are in the same timezone, then the timezone parameter may not be specified.
--- If any of the instruments loaded are in different timezones, then the timezone parameter should be set.
addBarsFromCSV( instrument, path, timezone = None )
Loads bars for a given instrument from a CSV formatted file. The instrument gets registered in the bar feed.
Parameters:
(string) instrument – Instrument identifier.
(string) path – The path to the CSV file.
(pytz) timezone – The timezone to use to localize bars. Check pyalgotrade.marketsession.
Next:
A BarFeed loads bars from CSV files that have the following format:
Date Time, Open, High, Low, Close, Volume, Adj Close
2013-01-01 13:59:00,13.51001,13.56,13.51,13.56789,273.88014126,13.51001
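Once a CSV matches that format, loading it is short. A minimal sketch, assuming v0.17's GenericBarFeed and the hypothetical converted file from the answer above:
from pyalgotrade.bar import Frequency
from pyalgotrade.barfeed import csvfeed

# build a daily bar feed from a CSV in the documented format
feed = csvfeed.GenericBarFeed(Frequency.DAY)
feed.addBarsFromCSV('btc', 'FinexBTCDaily_formatted.csv')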
Step B: implement a documented CSV-file pre-formatting
Your CSV data will need a bit of sanitizing before it can be used in PyAlgoTrade methods; however, it is doable, and you can create an easy transformer either by hand or with the powerful lambda-based converters facility of numpy.genfromtxt().
This sample code is intended for illustration purposes, to show the power of converters for your own transformations, as CSV structures differ.
with open( getCsvFileNAME( ... ), "r" ) as aFH:
    numpy.genfromtxt( aFH,
                      skip_header = 1,   # Ref. pyalgotrade
                      delimiter   = ",",
                      # the raw rows look like this:
                      # 2011.08.30,12:00,1791.20,1792.60,1787.60,1789.60,835
                      # 2011.08.30,13:00,1789.70,1794.30,1788.70,1792.60,550
                      # 2011.08.30,14:00,1792.70,1816.70,1790.20,1812.10,1222
                      # 2011.08.30,15:00,1812.20,1831.50,1811.90,1824.70,2373
                      # 2011.08.30,16:00,1824.80,1828.10,1813.70,1817.90,2215
                      converters  = { # column 0: "YYYY.MM.DD" -> matplotlib date number, a float >= 1.0
                                      0: lambda aString: mPlotDATEs.date2num( datetime.datetime.strptime( aString, "%Y.%m.%d" ) ),
                                      # column 1: "HH:MM" -> fraction of a day, a float in [0.0, 1.0)
                                      1: lambda aString: ( int( aString[0:2] ) * 60 + int( aString[3:] ) ) / 60. / 24.
                                      }
                      )
You can use pyalgotrade.barfeed's addBarsFromSequence with a list comprehension to feed in data from a CSV row by row / bar by bar. Basically, you create a bar from each row, passing OHLCV as init parameters and the extra columns with additional data in a dictionary. You can try something like this (with all the required imports):
import numpy as np
import pandas as pd
from pyalgotrade.bar import BasicBar, Frequency
from pyalgotrade.barfeed import yahoofeed

data = pd.DataFrame(index=pd.date_range(start='2021-11-01', end='2021-11-05'),
                    columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
                             'ExtraCol1', 'ExtraCol3', 'ExtraCol4', 'ExtraCol5'],
                    data=np.random.rand(5, 10))

feed = yahoofeed.Feed()
feed.addBarsFromSequence('instrumentID', data.index.map(lambda i:
    BasicBar(
        i,
        data.loc[i, 'Open'],
        data.loc[i, 'High'],
        data.loc[i, 'Low'],
        data.loc[i, 'Close'],
        data.loc[i, 'Volume'],
        data.loc[i, 'Adj Close'],
        Frequency.DAY,
        data.loc[i, 'ExtraCol1':].to_dict())
    ).values)
The input data frame was created with random values to make this example easier to reproduce, but the part where the bars are added to the feed should work the same for data frames built from CSVs, given that valid column names are used.