Formatting data in a CSV file (calculating average) in Python - csv

import csv

with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print(parts[4])

f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
The code is meant to read the CSV file and, for each row, add the three scores (column 0 being the user's name) and divide by three. However, it doesn't calculate a proper average; it just takes the score from the last column.

Basically you want statistics of each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        scores = [int(s) for s in row[1:]]  # convert to int so max/min/sum and the format specifiers work
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX': max(scores),
            'MIN': min(scores),
            'AVE': 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether specific steps are locally inefficient. A good Pythonic script should, above all, be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row won't change anything, since row is a local variable in the for loop and is thrown away after each iteration.
row[1:3] only gives the second and third elements. row[1:4] gives what you want, as does row[1:]. Slicing in Python is normally end-exclusive.
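A quick illustration of that end-exclusive behaviour (a tiny sketch with a made-up row):

row = ['Alice', 7, 4, 5]  # hypothetical row: name plus three scores
print(row[1:3])  # [7, 4]  -- stops before index 3
print(row[1:4])  # [7, 4, 5]
print(row[1:])   # [7, 4, 5]  -- everything after the name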
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get the work done as quickly as possible with the least effort? Can I get this task done in 30 seconds?

Here is one way to do it. See both parts. First, we create a dictionary with names as the keys and lists of results as the values.
import csv

fileLineList = []
averageScoreDict = {}

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter

fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list of 3 integers to John:
    # highest, lowest and average (which is a float we round).
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

averageScoreList = []

# Here we "unpack" the dictionary we have created and build a list of the keys
# (which are the names) and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])

# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)

print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]
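As a side note, the same ranking can be produced straight from the dictionary with sorted() and a key function; a small sketch using the dictionary shown above:

from operator import itemgetter

averageScoreDict = {'Milky': [7, 4, 5], 'Billy': [6, 5, 6],
                    'Adam': [5, 2, 4], 'John': [10, 7, 9]}
averageScoreList = sorted(([name, stats[2]] for name, stats in averageScoreDict.items()),
                          key=itemgetter(1), reverse=True)
print(averageScoreList)  # [['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]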

Related

COUNTIFS: Excel to pandas and remove counted elements

I have a COUNTIFS equation in Excel, (COUNTIFS($A$2:$A$6, "<=" & $C4))-SUM(D$2:D3), where A2:A6 is my_list, C4 is the current 'bin' with the condition, and D* are previous summed results from my_list that meet the condition. I am attempting to implement this in Python.
I have looked at previous COUNTIF questions but I am struggling to complete the final '-SUM(D$2:D3)' part of the code.
See the COUNTIFS($A$2:$A$6, "<=" & $C4) section below.
'''
my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)
out = []
for iteration, num in enumerate(bins):
    n = []
    out.append(n)
    count = sum(1 for elem in my_list if elem <= num)
    n.append(count)
print(out)
'''
out = [[1], [3], [4]]
I need to sum the previously counted elements and remove them from the next count so that they are not counted twice (the Excel representation is -SUM(D$2:D3)). This is where I need some help! I used enumerate to track iterations. I have tried the code below in the same loop, but I can't resolve this and I get errors:
'''
count1 = sum(out[0:i[0]]) for i in (out)
and
count1 = out(n) - out(n-1)
'''
See the expected output values in the 'out1' array for the bin conditions below.
I was able to achieve the required output values by adding an if/elif statement to subtract out the previously counted elements and generate a new output array, 'out1'. This works but may not be the most efficient way to achieve the end goal:
'''
import numpy as np

my_list = (-1, -0.5, 0, 1, 2)
#bins = np.arange(-1.0, 1.05, 0.05)
bins = (-1, 0, 1)
out = []
out1 = []
for iteration, num in enumerate(bins):
    count = sum(1 for elem in my_list if elem <= num)
    out.append(count)
    if iteration == 0:
        count1 = out[iteration]
        out1.append(count1)
    elif iteration > 0:
        count1 = out[iteration] - out[iteration - 1]
        out1.append(count1)
print(out1)
'''
I also tried using the code below, as suggested in other answers, but it didn't work for me:
'''
-np.diff([out])
print(out)
'''
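For reference, a minimal working sketch of the np.diff approach that reproduces the loop above (it assumes NumPy 1.16+ for the prepend argument):

import numpy as np

my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)

# cumulative counts per bin, as in the loop above: [1, 3, 4]
out = np.array([sum(1 for elem in my_list if elem <= num) for num in bins])

# differencing removes the previously counted elements;
# prepend=0 keeps the first bin's count intact
out1 = np.diff(out, prepend=0)
print(out1)  # [1 2 1]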

NxN detect win for tic-tac-toe

I have tried to generalise my tic-tac-toe game for an NxN grid. I have everything working but am finding it hard to get the code needed to detect a win.
This is my function at the moment, where I loop over the rows and columns of the board. I can't figure out why it isn't working. Thanks.
def check_win(array_board):
    global winner
    for row in range(N):
        for i in range(N-1):
            if array_board[row][i] != array_board[row][i+1] or array_board[row][i] == 0:
                break
            if i == N-1:
                winner = array_board[row][0]
                pygame.draw.line(board, (0, 0, 0), (75, (row * round(height / N) + 150)), (825, (row * round(height / N) + 150)), 3)
    for col in range(N):
        for j in range(N-1):
            if array_board[j][col] == 0 or array_board[col][j] != array_board[col][i+1]:
                break
            if j == N - 1:
                winner = array_board[0][col]
                pygame.draw.line(board, (0, 0, 0), (col * round(width / N) + 150, 75), (col * round(width / N) + 150, 825), 3)
You don't specify the board representation in your question, so my noughts-and-crosses grid is a 2D array of one-character strings, with a default "empty" value (a single space).
def getEmptyBoard( size, default=' ' ):
    """ Create a 2D array <size> by <size> of empty strings """
    grid = []
    for j in range( size ):
        row = []
        for i in range( size ):  # makes a full empty row
            row.append( default )
        grid.append( row )
    return ( size, grid )
So given a 2D grid of strings, how does one check for a noughts-and-crosses win? A win occurs when the count of the same character in a particular row or column equals the size of the grid.
Thus if you have a 5x5 grid, any row with 5 of the same item (say 'x') is a winner. Similarly for a column: 5 lots of 'o' is a win.
So given a 2D array, how do you check for these conditions? One way is to tally the number of occurrences of each symbol across the cells of a column. If that tally reaches the grid size (5 here), then whatever that symbol is, it's a winner.
def checkForWin( board, default=' ' ):
    winner = None
    size = board[0]
    grid = board[1]
    ### Tally the rows and columns
    for j in range( size ):
        col_results = {}
        ### Count the symbols in this column
        for i in range( size ):
            value = grid[i][j]
            if ( value in col_results.keys() ):
                col_results[ value ] += 1
            else:
                col_results[ value ] = 1
        ### Check the tally for a winning count
        for k in col_results.keys():
            if ( k != default and col_results[k] >= size ):
                winner = k  # Found a win
                print("Winner: column %d" % ( j ) )
                break
        if ( winner != None ):
            break
    # TODO: also implement for rows
    # TODO: also implement for diagonals
    return winner  # returns None, or 'o', 'x' (or whatever is used for symbols)
The above function uses two loops and a Python dictionary to keep a tally of what's been found. It's possible to check both the rows and the columns inside the same pair of loops, so it's not really row-by-row or column-by-column, just two loops over size.
Anyway, during the loop, when we first encounter an 'x' it is added to the dictionary with a value of 1. The next time we find an 'x', the dictionary is used to tally that occurrence, dict['x'] → 2, and so forth for the entire column.
At the end of the loop, we iterate through the dictionary keys (which might be ' ', 'o', and 'x'), checking the counts. When a count equals the size of a row or column, it's a winning line.
Obviously, if no win is found, we reset the tally and move on to the next column/row with the outer loop.
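To fill in those TODOs, here is a minimal sketch of the row and diagonal checks, assuming the same (size, grid) board tuple; an all() test replaces the dictionary tally:

def checkRowsAndDiagonals( board, default=' ' ):
    """ Hypothetical companion to checkForWin: rows and both diagonals """
    size, grid = board
    lines = [ grid[j] for j in range( size ) ]                    # each row
    lines.append( [ grid[i][i] for i in range( size ) ] )         # main diagonal
    lines.append( [ grid[i][size-1-i] for i in range( size ) ] )  # anti-diagonal
    for line in lines:
        first = line[0]
        if first != default and all( cell == first for cell in line ):
            return first  # the winning symbol
    return None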

Retrieve data in sets Pandas

I'm retrieving data from the OpenWeatherMap API. I have the following code, where I'm extracting the current weather for more than 500 cities, and I want the log it produces to separate the data into sets of 50 each.
I wrote an inefficient version that I would really like to improve!
Many many thanks!
x = 1
for index, row in df.iterrows():
    base_url = "http://api.openweathermap.org/data/2.5/weather?"
    units = "imperial"
    query_url = f"{base_url}appid={api_key}&units={units}&q="
    city = row['Name']  # this comes from a df
    response = requests.get(query_url + city).json()
    try:
        df.loc[index, "Max Temp"] = response["main"]["temp_max"]
        if index < 50:
            print(f"Processing Record {index} of Set {x} | {city}")
        elif index < 100:
            x = 2
            print(f"Processing Record {index} of Set {x} | {city}")
        elif index < 150:
            x = 3
            print(f"Processing Record {index} of Set {x} | {city}")
    except (KeyError, IndexError):
        pass
        print("City not found. Skipping...")

Python - filter column from a .dat file and return a given value from other columns

I'm new to Python and have been practicing with sample data I created (150 rows) of student ID numbers, grades, ages, class codes, area codes, etc. What I'm trying to do is not only filter by a certain column (grade, age, etc.) but also build a list of values from a different column (student ID) for the matching rows. I've managed to find how to isolate the column I need to filter by, but I can't figure out how to create the list of values I need to return.
So here's a sample of 5 rows of the data:
1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
I need a list of the 1st column (student ID), based on filtering the 2nd column (grade)... (and eventually the 3rd column, 4th column, etc., but if I see how it looks with just the 2nd, I think I can figure out the others). Also note: I didn't use headers in the .dat file.
I figured out how to isolate/view the 2nd column.
import numpy
data = numpy.genfromtxt('/testdata.dat', delimiter='/', dtype='unicode')
grades = data[:,1]
print (grades)
to print:
['A' 'I' 'C' 'I' 'I' 'A']
But now, how can I pull just the first-column values that correspond to the A's, C's, and I's into separate lists?
So I'd want to see lists, with commas between the integers of column 1, for the A's, C's, and I's:
list from A = [1, 6]
list from C = [3]
list from I = [2, 4, 5]
Again, if I can just see how it's done with the 2nd column and just one of the values (say the A's), I think I can figure out how to do it for the B's, C's, D's, etc., and probably the other columns. I just need to see one example of how the syntax would be applied, and then I'd like to play around with the rest.
Also, I've been using numpy, but I have also read about pandas and csv, and I think those libraries could be possibilities too. But like I said, I've been using numpy for the .dat files. I don't know if the other libraries would be easier to use?
You actually don't need any additional modules for such a simple task. A pure-Python solution would be to read the file line by line and 'parse' each line using str.split(); that gives you your lists, and then you can filter on pretty much any parameter. Something like:
students = {}  # store for our students by grade
with open("testdata.dat", "r") as f:  # open the file
    for line in f:  # read the file line by line
        row = line.strip().split("/")  # split the line into individual columns
        # you can now directly filter your row, or you can store the row in a list for later
        # let's split them by grade:
        grade = row[1]  # second column of our row is the grade
        # create/append the sublist in our `students` dict keyed by the grade
        students[grade] = students.get(grade, []) + [row]

# now your students dict contains all students split by grade, e.g.:
a_students = students["A"]
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']]

# if you want only to collect the A-grade student IDs, you can get a list of them as:
student_ids = [entry[0] for entry in students["A"]]
# ['1', '6']
But let's go back a few steps: if you want a more generalized solution, you should just store your list and then create a function to filter it by passed parameters, like so:
# define a filter function
# `filters` should contain a list of filters, where a filter is defined as:
# [position, [values]]
# and you can define as many as you want
def filter_sublists(source, filters=None):
    result = []  # store for our result
    filters = filters or []  # in case no filter is passed
    for element in source:  # go through every element of our source data
        try:
            if all(element[f[0]] in f[1] for f in filters):  # check if all our filters match
                result.append(element)  # add the element
        except IndexError:  # invalid filter position or data position, ignore
            pass
    return result  # return the result

# now we can use it to filter our data; first let's load our data:
with open("testdata.dat", "r") as f:  # open the file
    students = [line.strip().split("/") for line in f]  # store all our students as a list

# now we have all the data in the `students` list and we can filter it by any element
a_students = filter_sublists(students, [[1, ["A"]]])
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']]

# or again, if you just need the IDs:
a_student_ids = [entry[0] for entry in filter_sublists(students, [[1, ["A"]]])]
# ['1', '6']

# but you can filter by any parameter, for example:
age_15_students = filter_sublists(students, [[2, ["15"]]])
# [['1', 'A', '15', '13', '43214'], ['2', 'I', '15', '21', '58322']]

# or you can get all I-grade students aged 14 or 15:
i_students = filter_sublists(students, [[1, ["I"]], [2, ["14", "15"]]])
# [['2', 'I', '15', '21', '58322'], ['5', 'I', '14', '4', '00000']]
Pandas solution:
import pandas as pd
df = pd.read_csv('data.txt', header=None, sep='/')
dfs = {k: v for k, v in df.groupby(1)}
As a result we have a dictionary of DataFrames:
In [59]: dfs.keys()
Out[59]: dict_keys(['I', 'C', 'A'])

In [60]: dfs['I']
Out[60]:
   0  1   2   3      4
1  2  I  15  21  58322
3  4  I  18   6  57362
4  5  I  14   4      0

In [61]: dfs['C']
Out[61]:
   0  1   2   3      4
2  3  C  17  89  68470

In [62]: dfs['A']
Out[62]:
   0  1   2   3      4
0  1  A  15  13  43214
5  6  A  16  23  34567
If you want grouped lists of the first column:
In [67]: dfs['I'].iloc[:, 0].tolist()
Out[67]: [2, 4, 5]
In [68]: dfs['C'].iloc[:, 0].tolist()
Out[68]: [3]
In [69]: dfs['A'].iloc[:, 0].tolist()
Out[69]: [1, 6]
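If you only need the ID lists keyed by grade, the same grouping collapses to a one-liner; a small sketch, assuming the same unnamed integer columns:

import pandas as pd

df = pd.read_csv('data.txt', header=None, sep='/')
ids_by_grade = df.groupby(1)[0].apply(list).to_dict()
# {'A': [1, 6], 'C': [3], 'I': [2, 4, 5]}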
You can go through the list and build a boolean mask to select the rows matching a particular grade. This may require some refinement.
import numpy as np

grades = np.genfromtxt('data.txt', delimiter='/', skip_header=0, dtype='unicode')
res = {}
for grade in set(grades[:, 1].tolist()):
    res[grade] = grades[grades[:, 1] == grade][:, 0].tolist()
print(res)

R- collapse rows based on contents of two columns

I apologize in advance if this question is too specific or involved for this type of forum. I have been a long-time lurker on this site, and this is the first time I haven't been able to solve my issue by looking at previous questions, so I finally decided to post. Please let me know if there is a better place to post this, or if you have advice on making it clearer. Here goes.
I have a data.table with the following structure:
library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))
These are genomic coordinates with associated values. The file is sorted by chromosome ("chr") (1 through 22, then X, then Y), then by start and end position, so that the first row contains the lowest-numbered start position on chromosome 1 and proceeds sequentially through all data points on chromosome 1, then 2, etc. At this point, every single row has a start-end length of 1. After collapsing, the start-end lengths will vary depending on how many rows were collapsed and their distance from adjacent rows.
1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:
The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
The two adjacent rows have "start" coordinate within 500 of one another (if row 1 "start" = 1000, and row 2 "start" <= 1499, collapse these into a single row; if row1 = 1000 and row2 = 1500, keep separate)
The adjacent rows must have the same sign in the "meth.diff" column (i.e. even if the chr values match and the starts are within 500, if diff1 = +5 and diff2 = -5, keep the entries separate)
2nd: I would like to calculate the coverage_ weighted averages of the collapsed mean_KO/WT columns with the weighting by the coverage_KO/WT columns:
Ex: collapse 2 rows:
row 1: mean_1 = 5.0, coverage_1 = 20
row 2: mean_1 = 40.0, coverage_1 = 45
weighted avg mean_1 = (((5.0*20)/(20+45)) + ((40.0*45)/(20+45))) = 29.23
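A quick sanity check of that arithmetic in Python (values taken from the example above, variable names hypothetical):

means = [5.0, 40.0]
coverages = [20, 45]
weighted_avg = sum(m * c for m, c in zip(means, coverages)) / sum(coverages)
print(round(weighted_avg, 2))  # 29.23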
What I would like the output to look like (except the collapsed row means would be calculated rather than left in string form):
library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327,
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))",
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))",
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0",
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))",
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr",
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
Help with either part 1 or part 2, or any advice, is appreciated.
I have been using R for most of my data manipulation, but I am open to any language that can provide a solution. Thanks in advance.
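For what it's worth, since any language is acceptable, here is a rough pandas sketch of both parts; it uses the four chr1 rows from the example data, and the grouping rule and coverage-weighted means are taken directly from the criteria stated above:

import numpy as np
import pandas as pd

# the four chr1 rows from the example data
df = pd.DataFrame({
    'chr': ['chr1', 'chr1', 'chr1', 'chr1'],
    'start': [842326, 855423, 855426, 855739],
    'end': [842327, 855424, 855427, 855740],
    'meth.diff': [9.35, 19.18, 29.67, -12.34],
    'mean_KO': [9.35, 19.18, 32.96, 1.85],
    'mean_WT': [0.0, 0.0, 3.29, 14.19],
    'coverage_KO': [139, 55, 55, 270],
    'coverage_WT': [120, 86, 87, 444],
})

# Part 1: start a new group whenever the chromosome changes, the gap to the
# previous start is 500 or more, or the sign of meth.diff flips.
breaks = (
    df['chr'].ne(df['chr'].shift())
    | df['start'].diff().ge(500)
    | np.sign(df['meth.diff']).ne(np.sign(df['meth.diff']).shift())
)
grp = breaks.cumsum()

# Part 2: coverage-weighted means within each collapsed group.
def collapse(g):
    return pd.Series({
        'chr': g['chr'].iat[0],
        'start': g['start'].min(),
        'end': g['end'].max(),
        'mean_KO': np.average(g['mean_KO'], weights=g['coverage_KO']),
        'mean_WT': np.average(g['mean_WT'], weights=g['coverage_WT']),
    })

print(df.groupby(grp).apply(collapse))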