Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging Face sentiment classifier - deep-learning

I'm trying to get the sentiment of comments with the help of a Hugging Face pretrained sentiment-analysis model. It's returning the error "Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512)".
Below I'm attaching the code; please take a look:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd

# Load the pretrained model and its tokenizer from a local directory
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

# Read the reviews to be classified
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code (where value holds the output of a classifier call):
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending all the rows to an empty list:
text = []
for index, row in data.iterrows():
    text.append(row['Review'])
I'm trying to get the sentiment for all the rows:
sent = []
for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)
The error is:
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self

Some of the sentences in your Review column of the data frame are too long. When these sentences are converted to tokens and sent into the model, they exceed the model's 512-token sequence-length limit; the embedding layer of the model used in the sentiment-analysis task was trained on 512-token embeddings.
To fix this issue you can filter out the long sentences and keep only the smaller ones (with token length < 512),
or you can truncate the sentences with truncation=True:
sentiment = classifier(data.iloc[i,0], truncation=True)
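For example, a minimal sketch of the full loop from the question with truncation enabled (reusing the classifier and data defined above):
sent = []
for i in range(len(data)):
    # truncation=True cuts each review down to the model's 512-token limit
    sentiment = classifier(data.iloc[i,0], truncation=True)
    sent.append(sentiment)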

If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
MAX_TOKENS = 510  # 512 minus the sequence-start and sequence-end tokens

token = AutoTokenizer.from_pretrained("your model")
tokens = token.tokenize(
    text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.
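If you need to hand the truncated text back to a pipeline (which expects strings, not token lists), one option is to detokenize first. This is a hedged sketch reusing the token and classifier objects from above:
short_text = token.convert_tokens_to_string(tokens)
result = classifier(short_text)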

Related

What type of model can I use to train this data?

I have downloaded and labeled data from
http://archive.ics.uci.edu/ml/datasets/pamap2+physical+activity+monitoring
My task is to gain insight into the data from what is given. I have around 34 attributes in a data frame (all clean, no NaN values)
and want to train a model for one target attribute, 'heart_rate', given the rest of the attributes (all are measurements of a participant performing various activities).
I wanted to use a linear regression model but cannot use my dataframe for some reason; however, I do not mind starting from 0 if you think I am doing it wrong.
My DataFrame columns:
> Index(['timestamp', 'activity_ID', 'heart_rate', 'IMU_hand_temp',
> 'hand_acceleration_16_1', 'hand_acceleration_16_2',
> 'hand_acceleration_16_3', 'hand_gyroscope_rad_7',
> 'hand_gyroscope_rad_8', 'hand_gyroscope_rad_9',
> 'hand_magnetometer_μT_10', 'hand_magnetometer_μT_11',
> 'hand_magnetometer_μT_12', 'IMU_chest_temp', 'chest_acceleration_16_1',
> 'chest_acceleration_16_2', 'chest_acceleration_16_3',
> 'chest_gyroscope_rad_7', 'chest_gyroscope_rad_8',
> 'chest_gyroscope_rad_9', 'chest_magnetometer_μT_10',
> 'chest_magnetometer_μT_11', 'chest_magnetometer_μT_12',
> 'IMU_ankle_temp', 'ankle_acceleration_16_1', 'ankle_acceleration_16_2',
> 'ankle_acceleration_16_3', 'ankle_gyroscope_rad_7',
> 'ankle_gyroscope_rad_8', 'ankle_gyroscope_rad_9',
> 'ankle_magnetometer_μT_10', 'ankle_magnetometer_μT_11',
> 'ankle_magnetometer_μT_12', 'Intensity'],
> dtype='object')
first 5 rows:
timestamp activity_ID heart_rate IMU_hand_temp hand_acceleration_16_1 hand_acceleration_16_2 hand_acceleration_16_3 hand_gyroscope_rad_7 hand_gyroscope_rad_8 hand_gyroscope_rad_9 ... ankle_acceleration_16_1 ankle_acceleration_16_2 ankle_acceleration_16_3 ankle_gyroscope_rad_7 ankle_gyroscope_rad_8 ankle_gyroscope_rad_9 ankle_magnetometer_μT_10 ankle_magnetometer_μT_11 ankle_magnetometer_μT_12 Intensity
2928 37.66 lying 100.0 30.375 2.21530 8.27915 5.58753 -0.004750 0.037579 -0.011145 ... 9.73855 -1.84761 0.095156 0.002908 -0.027714 0.001752 -61.1081 -36.8636 -58.3696 low
2929 37.67 lying 100.0 30.375 2.29196 7.67288 5.74467 -0.171710 0.025479 -0.009538 ... 9.69762 -1.88438 -0.020804 0.020882 0.000945 0.006007 -60.8916 -36.3197 -58.3656 low
2930 37.68 lying 100.0 30.375 2.29090 7.14240 5.82342 -0.238241 0.011214 0.000831 ... 9.69633 -1.92203 -0.059173 -0.035392 -0.052422 -0.004882 -60.3407 -35.7842 -58.6119 low
2931 37.69 lying 100.0 30.375 2.21800 7.14365 5.89930 -0.192912 0.019053 0.013374 ... 9.66370 -1.84714 0.094385 -0.032514 -0.018844 0.026950 -60.7646 -37.1028 -57.8799 low
2932 37.70 lying 100.0 30.375 2.30106 7.25857 6.09259 -0.069961 -0.018328 0.004582 ... 9.77578 -1.88582 0.095775 0.001351 -0.048878 -0.006328 -60.2040 -37.1225 -57.8847 low
If you check the timestamp attribute you will see that the data was acquired every few milliseconds, so it might be a good idea to resample this dataframe to one value every 2-5 seconds and train the model on that.
As an option, I want to use one of these models for this task: linear, polynomial, or multiple linear regression, or agglomerative or k-means clustering.
my code:
target = subject1.DataFrame(data.target, columns=["heart_rate"])
X = df
y = target["heart_rate"]
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
predictions = lm.predict(X)
print(predictions)[0:5]
Error:
AttributeError Traceback (most recent call last)
<ipython-input-93-b0c3faad3a98> in <module>()
3 #heart_rate
4 # Put the target (housing value -- MEDV) in another DataFrame
----> 5 target = subject1.DataFrame(data.target, columns=["heart_rate"])
c:\python36\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5178 return self[name]
-> 5179 return object.__getattribute__(self, name)
5180
5181 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'DataFrame'
To fix the error I have used:
subject1.columns = subject1.columns.str.strip()
but it still did not work.
Thank you, and sorry if I was not precise enough.
Try this:
from scipy.stats import zscore
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Note: non-numeric columns such as activity_ID and Intensity must be
# dropped or encoded before standardizing
X = df.drop("heart_rate", axis=1)
y = df[["heart_rate"]]
X = X.apply(zscore)  # standardize each feature column

test_size = 0.30
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)  # fit on the training split only
predictions = lm.predict(X_test)
print(predictions[0:5])
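A quick sanity check on the fit (my addition, not part of the original answer) is to score it on the held-out split:
print(lm.score(X_test, y_test))  # R^2 on unseen data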

Zabbix API - Is there a way to request a reduced number of 'trend' or 'history' records for a specific time range?

I have been working on a project for a while which needs to convert Zabbix 'trends' and 'history' data to various types of charts, such as line chart or pie chart.
The problem is that there might be too much data (time-value pairs), especially in the case of 'history' data. Of course, I do not want to send 10,000+ points to the frontend, so I want to reduce the number of points such that the result still remains representative of that specific time range.
Of course, one way to solve this is to implement the reduction on the server side but, if not necessary, I do not want to burden my resources (CPU, network, etc.).
I have searched through the documentation of the Zabbix API for 'history' and 'trends' but I have not found what I needed.
I would like to know if there is any way to request a reduced number of 'history' or 'trend' points from the Zabbix API for a specific time period, such that the result is still representative of all the data.
Zabbix API version: 4.0
from datetime import datetime
import math
import sys
import time
from pyzabbix import ZabbixAPI
def n_sized_chunks(lst, n):
    """Yield successive n-sized chunks from 'lst'."""
    for i in range(0, len(lst), n):
        yield lst[i:i+n]
# The hostname at which the Zabbix web interface is available
ZABBIX_SERVER = '<zabbix-server>'
MAX_POINTS = 300
zapi = ZabbixAPI(ZABBIX_SERVER)
# Login to the Zabbix API
zapi.login('<username>', '<password>')
item_id = '<item-id>'
# Create a time range
time_till = time.mktime(datetime.now().timetuple())
time_from = time_till - 60 * 60 * 24 * 7 # 1 week
# Query item's history (integer) data
history = zapi.history.get(itemids=[item_id],
                           time_from=time_from,
                           time_till=time_till,
                           output='extend',
                           )
length = len(history)
print(f"Before: {length}") # ~10097
###################################################################
# Can Zabbix API do the followings (or something similar) for me? #
###################################################################
if length <= MAX_POINTS:
    sys.exit(0)
chunk_size = math.ceil(length / MAX_POINTS)
x = list(map(lambda point: float(point['clock']), history))
y = list(map(lambda point: float(point['value']), history))
x_chunks = list(n_sized_chunks(lst=x, n=chunk_size))
y_chunks = list(n_sized_chunks(lst=y, n=chunk_size))
history = []
for x, y in zip(x_chunks, y_chunks):
    history.append({'clock': (x[0]+x[-1])/2, 'value': sum(y)/len(y)})
######################################################################
print(f"After: {len(history)}") ## ~297
This is not currently possible. You might want to vote on https://support.zabbix.com/browse/ZBXNEXT-656.
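As a partial workaround (my addition, not from the original answer): Zabbix trends already store one aggregate record per hour, so for longer ranges you can fetch far fewer points via trend.get (available since API 3.4, so it applies to your 4.0) instead of raw history. A minimal sketch reusing zapi, item_id and the time range from the question:
trends = zapi.trend.get(itemids=[item_id],
                        time_from=time_from,
                        time_till=time_till,
                        output=['clock', 'num', 'value_avg'],
                        )
# one record per hour, with value_avg already computed server-side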

Calculating the average of a column in csv per hour

I have a csv file that contains data in the following format.
Layer relative_time Ht BSs Vge Temp Message
57986 2:52:46 0.00m 87 15.4 None CMSG
20729 0:23:02 45.06m 82 11.6 None BMSG
20729 0:44:17 45.06m 81 11.6 None AMSG
I want to read in this csv file and calculate the average BSs for every hour. My csv file is quite large, about 2000 values. However, the values are not evenly distributed across the hours. For example,
I have 237 samples from hour 3 and only 4 samples from hour 6. I should also mention that the BSs can be collected from multiple sources; the value always ranges from 20-100. Because of this it is giving a skewed result. For each hour I am calculating the sum of BSs for that hour divided by the number of samples in that hour.
The primary purpose is to understand how BSs evolves over time.
But what is the common approach to this problem? Is this where people apply normalization? It would be great if someone could explain how to apply normalization in such a situation.
The code I am using for my processing is shown below; I believe it is correct.
import csv
import datetime
import pprint

# This 2x24 matrix will contain the number of values recorded per hour
hours_no_values = [[0 for i in range(24)] for j in range(2)]
# This 2x24 matrix will contain the mean BSs stats per hour
mean_bss_stats = [[0 for i in range(24)] for j in range(2)]
with open(PREFINAL_OUTPUT_FILE) as fin, open(FINAL_OUTPUT_FILE, "w", newline='') as f:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(f)
    header = next(reader)  # <--- Pop header out
    writer.writerow([header[0], header[1], header[2], header[3], header[4], header[5], header[6]])  # <--- Write header
    sortedlist = sorted(reader, key=lambda row: datetime.datetime.strptime(row[1], "%H:%M:%S"), reverse=True)
    print(sortedlist)
    for item in sortedlist:
        rel_time = datetime.datetime.strptime(item[1], "%H:%M:%S")
        if rel_time.hour not in hours_no_values[0]:
            print('item[6] {}'.format(item[6]))
            if 'MAN' in item[6]:
                print('Hour found {}'.format(rel_time.hour))
                hours_no_values[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] += 1
        else:
            if 'MAN' in item[6]:
                print('Hour Previous {}'.format(rel_time.hour))
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] += 1
for i in range(0, 24):
    if hours_no_values[1][i] != 0:
        mean_bss_stats[1][i] = mean_bss_stats[1][i] / hours_no_values[1][i]
    else:
        mean_bss_stats[1][i] = 0
pprint.pprint('mean bss stats {} \n hour_no_values {} \n'.format(mean_bss_stats, hours_no_values))
The number of values per hour, for hours 0 to 23, is as follows:
[31, 117, 85, 237, 3, 67, 11, 4, 57, 0, 5, 21, 2, 5, 10, 8, 29, 7, 14, 3, 1, 1, 0, 0]
You could do it with pandas, using groupby and aggregate on the appropriate column:
import pandas as pd
import numpy as np
df = pd.read_csv("your_file")
df.groupby('hour')['BSs'].aggregate(np.mean)
If you don't have that column in initial dataframe you could add it:
df['hour'] = your_hour_data
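For this file, one way to derive that hour column is from the relative_time column (a sketch assuming the column names shown in the question):
df['hour'] = pd.to_datetime(df['relative_time'], format='%H:%M:%S').dt.hour
print(df.groupby('hour')['BSs'].mean())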
numpy.mean - calculates the mean of the array.
Compute the arithmetic mean along the specified axis.
pandas.groupby
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns
From pandas docs:
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
Aggregation: computing a summary statistic (or statistics) about each group.
Some examples:
Compute group sums or means
Compute group sizes / counts
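Given how uneven your per-hour sample counts are, it can also help to compute the mean and the count together (again assuming the derived 'hour' column from the sketch above):
df.groupby('hour')['BSs'].agg(['mean', 'count'])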

scikit learn - converting features stored as strings into numbers

I'm currently dipping my toes into machine learning using the scikit-learn python library and am trying to use some .CSV data in the format
Date Name Average_Price_SA
1995-01-01 Barking And Dagenham 70885.331285935
1995-01-01 Barnet 99567.4268042005
1995-01-01 Barnsley 49608.33494746
....
....
....
2005-01-01 Barking And Dagenham 13294.12321312
I have read it in with pandas using the line
data = pd.read_csv('data.csv')
From what I have learned so far, I think I'm supposed to convert those 'Name' category strings into floats so that they can be accepted into a model.
I'm not sure how to go about this. Any help would be greatly appreciated.
Thanks
You can use scikit-learn's LabelBinarizer to convert the strings to one-hot vectors. These have N components (where N is the number of unique strings), all zero except for a one at a single component.
from __future__ import print_function
from sklearn import preprocessing

names = ["Barking And Dagenham", "Barnet", "Barnsley"]
lb = preprocessing.LabelBinarizer()
vectors = lb.fit_transform(names)
for name, vector in zip(names, vectors):
    print("%s => %s" % (name, str(vector)))
Output:
Barking And Dagenham => [1 0 0]
Barnet => [0 1 0]
Barnsley => [0 0 1]
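To apply this to your DataFrame, a hedged sketch reusing the data variable from the question (it binarizes the whole Name column at once):
lb = preprocessing.LabelBinarizer()
name_vectors = lb.fit_transform(data['Name'])  # one row per sample, one column per unique name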

Rpy2 - Select Results and Output to CSV File

I'm currently doing Cox Proportional Hazards Modeling using Rpy2 - I imagine my question will cover other functions and the results from calling them as well though.
After I run the function, I have a variable which contains the results from the function, in the form of a vector. I have tried explicitly converting this to a DataFrame (resultsDataFrame = DataFrame(resultVector)). There are no errors returned when doing this. However, when I do resultsDataFrame.to_csvfile(filename) I get the following error:
Traceback (most recent call last):
File "<pyshell#171>", line 1, in <module>
modelFrame.to_csvfile('/Users/fortylashes/Documents/Matthews_Research/Cox_PH/ResultOutput_Exp1.csv')
File "/Library/Python/2.7/site-packages/rpy2/robjects/vectors.py", line 1031, in to_csvfile
'col.names': col_names, 'qmethod': qmethod, 'append': append})
RRuntimeError: Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""coxph"" to a data.frame
Furthermore, when I simply do:
for result in resultVector:
    print(result)
I get an extremely long list of results, including information on each entry in the dataset used in the model, for each variable (so 9,000 records x 9 variables = 81,000 unneeded results). The results I really need are at the bottom of this vector and look like this:
coef exp(coef) se(coef) z p
age_age6574 -0.057775 0.944 0.05469 -1.056 2.9e-01
age_age75plus -0.020795 0.979 0.04891 -0.425 6.7e-01
sex_female -0.005304 0.995 0.03961 -0.134 8.9e-01
stage_late -0.261609 0.770 0.04527 -5.779 7.5e-09
access -0.000494 1.000 0.00069 -0.715 4.7e-01
Likelihood ratio test=36.6 on 5 df, p=7.31e-07 n= 9752, number of events= 2601
*NOTE: There were several more variables for which data was reported in the initial results (the 9,000 x 9 that I was talking about) but which weren't actually used in the model.
I was wondering if there was a way to explicitly get this data, put it in one long ordered row, and then output it to a csv file?
::::UPDATE::::
When I call theModel.names I get a list of the various measures which can be called by numerical index:
[1] "coefficients" "var" "loglik"
[4] "score" "iter" "linear.predictors"
[7] "residuals" "means" "concordance"
[10] "method" "n" "nevent"
[13] "terms" "assign" "wald.test"
[16] "y" "formula" "call"
From this I can get the coefficients, which can then be exponentiated. I have not found, however, the p-value, the z score, or the likelihood ratio test, which I will need.
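One way to get at those (a hedged sketch, assuming rpy2's robjects interface and that theModel is the fitted coxph object): R's summary() of the fit has a 'coefficients' component that is exactly the coef/exp(coef)/se(coef)/z/p table printed above, and that matrix can be written straight to CSV:
from rpy2.robjects import r
model_summary = r['summary'](theModel)
coef_matrix = model_summary.rx2('coefficients')  # columns: coef, exp(coef), se(coef), z, p
r['write.csv'](coef_matrix, 'coxph_results.csv')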