Calculating the average of a column in a CSV file per hour

I have a csv file that contains data in the following format.
Layer   relative_time   Ht       BSs   Vge    Temp   Message
57986   2:52:46         0.00m    87    15.4   None   CMSG
20729   0:23:02         45.06m   82    11.6   None   BMSG
20729   0:44:17         45.06m   81    11.6   None   AMSG
I want to read in this CSV file and calculate the average BSs for every hour. My CSV file is quite large, about 2000 values. However, the values are not evenly distributed across the hours. For example,
I have 237 samples from hour 3 and only 4 samples from hour 6. I should also mention that the BSs can be collected from multiple sources; the value always ranges from 20-100. Because of this the result is skewed. For each hour I am calculating the sum of BSs for that hour divided by the number of samples in that hour.
The primary purpose is to understand how BSs evolves over time.
But what is the common approach to this problem? Is this where people apply normalization? It would be great if someone could explain how to apply normalization in such a situation.
The code I am using for my processing is shown below; I believe it is correct.
import csv
import datetime
import pprint

# This 2x24 matrix will contain the number of values recorded per hour
hours_no_values = [[0 for i in range(24)] for j in range(2)]
# This 2x24 matrix will contain the mean BSs stats per hour
mean_bss_stats = [[0 for i in range(24)] for j in range(2)]

with open(PREFINAL_OUTPUT_FILE) as fin, open(FINAL_OUTPUT_FILE, "w", newline='') as f:
    reader = csv.reader(fin, delimiter=",")
    writer = csv.writer(f)
    header = next(reader)  # <--- Pop header out
    writer.writerow([header[0], header[1], header[2], header[3], header[4], header[5], header[6]])  # <--- Write header
    sortedlist = sorted(reader, key=lambda row: datetime.datetime.strptime(row[1], "%H:%M:%S"), reverse=True)
    print(sortedlist)
    for item in sortedlist:
        rel_time = datetime.datetime.strptime(item[1], "%H:%M:%S")
        if rel_time.hour not in hours_no_values[0]:
            print('item[6] {}'.format(item[6]))
            if 'MAN' in item[6]:
                print('Hour found {}'.format(rel_time.hour))
                hours_no_values[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[0][rel_time.hour] = rel_time.hour
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] += 1
            else:
                pass
        else:
            if 'MAN' in item[6]:
                print('Hour Previous {}'.format(rel_time.hour))
                mean_bss_stats[1][rel_time.hour] += int(item[3])
                hours_no_values[1][rel_time.hour] += 1
            else:
                pass

for i in range(0, 24):
    if hours_no_values[1][i] != 0:
        mean_bss_stats[1][i] = mean_bss_stats[1][i] / hours_no_values[1][i]
    else:
        mean_bss_stats[1][i] = 0

pprint.pprint('mean bss stats {} \n hour_no_values {} \n'.format(mean_bss_stats, hours_no_values))
The number of values recorded in each hour is as follows, for hours 0 through 23:
[31, 117, 85, 237, 3, 67, 11, 4, 57, 0, 5, 21, 2, 5, 10, 8, 29, 7, 14, 3, 1, 1, 0, 0]

You could do it with pandas, using groupby and aggregating the appropriate column:
import pandas as pd
import numpy as np
df = pd.read_csv("your_file")
df.groupby('hour')['BSs'].aggregate(np.mean)
If you don't have that column in the initial dataframe, you can add it:
df['hour'] = your_hour_data
numpy.mean - calculates the mean of the array.
Compute the arithmetic mean along the specified axis.
pandas.groupby
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns
From pandas docs:
By “group by” we are referring to a process involving one or more of the following steps
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
Aggregation: computing a summary statistic (or statistics) about each group.
Some examples:
Compute group sums or means
Compute group sizes / counts
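For this particular file, a minimal sketch might look like the following (assuming a comma-separated file with the header shown in the question; the file name is a placeholder, and relative_time is parsed with the same %H:%M:%S format the question uses):

import pandas as pd

# Placeholder file name; columns as in the question's header row.
df = pd.read_csv("prefinal_output.csv")

# Derive an 'hour' column from relative_time (e.g. "2:52:46" -> 2).
df['hour'] = pd.to_datetime(df['relative_time'], format="%H:%M:%S").dt.hour

# Mean BSs per hour, plus the sample count behind each mean.
print(df.groupby('hour')['BSs'].agg(['mean', 'count']))

The count column makes the uneven sampling visible, which helps when deciding whether the hourly means need any further normalization or weighting.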

Doing something after every 10th element of an iterable without a counter variable

I have an iterable list of over 100 elements. I want to do something after every 10th element. I don't want to use a counter variable; I'm looking for a solution that does not include one.
Currently I do it like this:
count = 0
for i in range(0, len(mylist)):
    if count == 10:
        count = 0
        # do something
        print i
    count += 1
Is there some way in which I can omit counter variable?
for count, element in enumerate(mylist, 1):  # Start counting from 1
    if count % 10 == 0:
        # do something
Use enumerate. It's built for this.
Just to show another option... hopefully I understood your question correctly. Slicing will give you exactly the elements of the list that you want, without having to loop through every element or keep any enumerations or counters. See Explain Python's slice notation.
If you want to start on the 1st element and get every 10th element from that point:
# 1st element, 11th element, 21st element, etc. (index 0, index 10, index 20, etc.)
for e in myList[::10]:
    <do something>
If you want to start on the 10th element and get every 10th element from that point:
# 10th element, 20th element, 30th element, etc. (index 9, index 19, index 29, etc.)
for e in myList[9::10]:
    <do something>
Example of the 2nd option (Python 2):
myList = range(1, 101)  # list(range(1, 101)) for Python 3 if you need a list
for e in myList[9::10]:
    print e  # print(e) for Python 3
Prints:
10
20
30
...etc...
100
for i in range(0, len(mylist)):
    if (i + 1) % 10 == 0:
        # do something
        print i
A different way to approach the problem is to split the iterable into your chunks before you start processing them.
The grouper recipe does exactly this:
from itertools import izip_longest  # needed for grouper (zip_longest in Python 3)

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
You would use it like this:
>>> i = [1,2,3,4,5,6,7,8]
>>> by_twos = list(grouper(i, 2))
>>> by_twos
[(1, 2), (3, 4), (5, 6), (7, 8)]
Now, simply loop over the by_twos list.
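Applied to the original problem, a rough sketch (mylist and the per-element work are placeholders) could look like this:

for chunk in grouper(mylist, 10):
    for element in chunk:
        if element is None:   # fill-value padding in the last, shorter chunk
            break
        # process each element here
    # "do something" here; this runs after every 10 elements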
You can use a range loop to iterate over the indices of mylist in steps of 10, like this:
for i in range(0, len(mylist), 10):
    # do something

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

I'm trying to get the sentiments for comments with the help of a Hugging Face sentiment analysis pretrained model. It returns an error like: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).
I'm attaching the code below, please take a look at it.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)
data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')
data.head()
Output is
Review
0 If you've ever been to Disneyland anywhere you...
1 Its been a while since d last time we visit HK...
2 Thanks God it wasn t too hot or too humid wh...
3 HK Disneyland is a great compact park. Unfortu...
4 the location is not in the city, took around 1...
Followed by
classifier("My name is mark")
Output is
[{'label': 'POSITIVE', 'score': 0.9953688383102417}]
Followed by code
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment
Output is
['POSITIVE']
Appending all the rows to an empty list:
text = []
for index, row in data.iterrows():
    text.append(row['Review'])
I'm trying to get the sentiment for all the rows
sent = []
for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)
The error is :
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
2
3 for i in range(len(data)):
----> 4 sentiment = classifier(data.iloc[i,0])
5 sent.append(sentiment)
11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1914 # remove once script supports set_grad_enabled
1915 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1917
1918
IndexError: index out of range in self
Some of the sentences in the Review column of your data frame are too long. When these sentences are converted to tokens and sent into the model, they exceed the model's 512-token sequence-length limit; the embeddings of the model used in the sentiment-analysis task were trained with 512 token positions.
To fix this you can filter out the long sentences and keep only the shorter ones (with token length < 512),
or you can truncate the sentences with truncation=True:
sentiment = classifier(data.iloc[i,0], truncation=True)
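Applied to the loop from the question, a minimal sketch would be:

sent = []
for i in range(len(data)):
    # truncation=True caps each review at the model's 512-token limit
    sentiment = classifier(data.iloc[i, 0], truncation=True)
    sent.append(sentiment)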
If you're tokenizing separately from your classification step, this warning can be output during tokenization itself (as opposed to classification).
In my case, I am using a BERT model, so I have MAX_TOKENS=510 (leaving room for the sequence-start and sequence-end tokens).
token = AutoTokenizer.from_pretrained("your model")
tokens = token.tokenize(
    text, max_length=MAX_TOKENS, truncation=True
)
Now, when you run your classifier, the tokens are guaranteed not to exceed the maximum length.
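One hypothetical way to connect the two steps (reusing the token, tokens and classifier names from above) is to rebuild a string from the truncated tokens before classifying it:

# Turn the truncated tokens back into plain text, then classify that.
truncated_text = token.convert_tokens_to_string(tokens)
sentiment = classifier(truncated_text)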

Python3: Adding equal elements together from json format

Data = [{'Ferrari': 51078},
        {'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
        {'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
        {'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
        {'Ferrari': 2313172, 'Renault': 2475},
        {'Volvo': 198671},
        {'Volvo': 15762}]
I want to add together the numbers for each car so that I get the total amount for each key (the numbers below don't match the Data above; they are just an example of the desired output):
Ferrari: 152455
Volvo: 13515
Skoda: 1532
Lambo: 4366
Renault: 4262
Maserati: 2345
Lexus: 235
Jeep: 124
Saab: 15
I've tried sum(), appending to new lists, collections, and many other potential solutions, but I just cannot get this one right. I'm looking for a general solution, not one only applicable to my data, so that if I change my dataset (and hence the numbers and cars) it still works for the new Data.
I'm using Python 3.
You can use defaultdict. The code below iterates over the list of dicts, popping key-value pairs until each dict is empty and summing the results.
from collections import defaultdict
data = [{'Ferrari': 51078},
{'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
{'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
{'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
{'Ferrari': 2313172, 'Renault': 2475},
{'Volvo': 198671},
{'Volvo': 15762}]
output = defaultdict(int)
for d in data:
    while d:
        k, v = d.popitem()
        output[k] += v

print(output)
Outputs
defaultdict(<class 'int'>, {'Ferrari': 2741097,
'Lambo': 862,
'Skoda': 29428,
'Volvo': 297937,
'Lexus': 161,
'Saab': 2708,
'Jeep': 4025,
'Renault': 5669,
'Maserati': 11030,
'Fiat': 27583})
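As a side note, popitem() empties the input dicts as it runs. If you need to keep the data intact, a non-destructive sketch using collections.Counter (whose update() adds values for existing keys instead of replacing them) would be:

from collections import Counter

totals = Counter()
for d in data:
    totals.update(d)   # adds each value onto the running total per key

print(dict(totals))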

How to find intersection or subset of two CSV files

I have 2 CSV files containing two columns and a large number of rows. The first column is the id, and the second is the set of paired values.
e.g.:
CSV1:
1 {[1,2],[1,4],[5,6],[3,1]}
2 {[2,4] ,[6,3], [8,3]}
3 {[3,2], [5,2], [3,5]}
CSV2:
1 {[2,4] ,[6,3], [8,3]}
2 {[3,4] ,[3,3], [2,3]}
3 {[1,4],[5,6],[3,1],[5,5]}
Now I need to get a CSV file which contains the items (exact matches or subsets) that belong to both CSVs.
Here the result should be:
{[2,4] ,[6,3], [8,3]}
{[1,4],[5,6],[3,1]}
Can anyone suggest python code to do this?
As suggested by this answer, you can use set.intersection to get the intersection of two sets; however, this does not work with lists as items, because lists are not hashable. Instead you can use filter (comparable to this answer):
>>> l1 = [[1,2],[1,4],[5,6],[3,1]]
>>> l2 = [[1,4],[5,6],[3,1],[5,5]]
>>> filter(lambda q: q in l2, l1)
[[1, 4], [5, 6], [3, 1]]
In Python 3 you should convert it to list since there filter returns an iterable:
>>> list(filter(lambda x: x in l2,l1))
You can load CSV files (if they are really comma [or some other character] separated files) with csv.reader or pandas.read_csv for example.
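If you would rather stick with sets, a minimal sketch is to convert the inner lists to tuples first, since set elements must be hashable; note that a set does not preserve the original order:

l1 = [[1, 2], [1, 4], [5, 6], [3, 1]]
l2 = [[1, 4], [5, 6], [3, 1], [5, 5]]

# Tuples are hashable, so they can go into sets and be intersected.
common = set(map(tuple, l1)) & set(map(tuple, l2))
print(common)   # {(1, 4), (5, 6), (3, 1)} in some order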

Egg dropping in worst case

I have been trying to write an algorithm to compute the maximum number of trials required in the worst case of the egg dropping problem. Here is my Python code:
def eggDrop(n,k):
    eggFloor=[ [0 for i in range(k+1) ] ]* (n+1)

    for i in range(1, n+1):
        eggFloor[i][1] = 1
        eggFloor[i][0] = 0

    for j in range(1, k+1):
        eggFloor[1][j] = j

    for i in range (2, n+1):
        for j in range (2, k+1):
            eggFloor[i][j] = 'infinity'
            for x in range (1, j + 1):
                res = 1 + max(eggFloor[i-1][x-1], eggFloor[i][j-x])
                if res < eggFloor[i][j]:
                    eggFloor[i][j] = res

    return eggFloor[n][k]

print eggDrop(2, 100)
The code outputs a value of 7 for 2 eggs and 100 floors, but the answer should be 14. I don't know what mistake I have made in the code. What is the problem?
The problem is in this line:
eggFloor=[ [0 for i in range(k+1) ] ]* (n+1)
You want this to create a list containing (n+1) lists of (k+1) zeroes. What the * (n+1) does is slightly different - it creates a list containing (n+1) copies of the same list.
This is an important distinction - because when you start modifying entries in the list - say,
eggFloor[i][1] = 1
this actually changes element [1] of all of the lists, not just the ith one.
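A quick interactive check makes the aliasing visible:

>>> grid = [[0 for i in range(3)]] * 2   # two references to the SAME inner list
>>> grid[0][1] = 5
>>> grid
[[0, 5, 0], [0, 5, 0]]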
To instead create separate lists that can be modified independently, you want something like:
eggFloor=[ [0 for i in range(k+1) ] for j in range(n+1) ]
With this modification, the program returns 14 as expected.
(To debug this, it might have been a good idea to write a function to print out the eggFloor array and display it at various points in your program, so you can compare it with what you were expecting. It would soon become pretty clear what was going on!)
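For example, a throwaway helper along these lines (the name is purely illustrative) would have shown every 'row' changing at once:

def print_egg_floor(eggFloor):
    # One row per egg count; aliased rows show up as identical lines.
    for i, row in enumerate(eggFloor):
        print i, row   # print(i, row) in Python 3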