COUNTIFS: Excel to pandas and remove counted elements - csv

I have a COUNTIFS equation in Excel, (COUNTIFS($A$2:$A$6, "<=" & $C4))-SUM(D$2:D3), where A2:A6 is my_list, C4 is the current 'bin' holding the condition, and D* are the previously summed results from my_list that met the condition. I am attempting to implement this in Python.
I have looked at previous COUNTIF questions, but I am struggling to implement the final '-SUM(D$2:D3)' part.
The code below covers the COUNTIFS($A$2:$A$6, "<=" & $C4) part.
'''
my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)
out = []
for iteration, num in enumerate(bins):
    n = []
    out.append(n)
    count = sum(1 for elem in my_list if elem <= num)
    n.append(count)
print(out)
'''
out = [[1], [3], [4]]
I need to sum the previously counted elements and remove them from the next count so that they are not counted twice (the Excel equivalent of -SUM(D$2:D3)). This is where I need some help! I used enumerate to track iterations. I have tried the code below in the same loop, but I can't resolve it and I get errors:
'''
count1 = sum(out[0:i[0]]) for i in (out)
and
count1 = out(n) - out(n-1)
'''
The expected output for these bin conditions is [1, 2, 1], i.e. each element of my_list counted in exactly one bin.

I was able to achieve the required output values by adding an if/elif statement that subtracts the previously counted elements and builds a new output array, 'out1'. This works, but it may not be the most efficient way to achieve the end goal:
'''
import numpy as np

my_list = (-1, -0.5, 0, 1, 2)
#bins = np.arange(-1.0, 1.05, 0.05)
bins = (-1, 0, 1)
out = []
out1 = []
for iteration, num in enumerate(bins):
    count = sum(1 for elem in my_list if elem <= num)
    out.append(count)
    if iteration == 0:
        count1 = out[iteration]
        out1.append(count1)
    elif iteration > 0:
        count1 = out[iteration] - out[iteration - 1]
        out1.append(count1)
print(out1)
'''
I also tried using the below code as suggested in other answers but this didn't work for me:
'''
-np.diff([out])
print(out)
'''
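For what it's worth, np.diff does work here if it is applied to the cumulative counts and the first count is kept. A minimal sketch (using np.diff's prepend argument, available in NumPy 1.16+):
'''
import numpy as np

my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)

# Cumulative counts: how many elements are <= each bin edge,
# i.e. the COUNTIFS($A$2:$A$6, "<=" & $C4) part.
cumulative = np.array([np.sum(np.asarray(my_list) <= b) for b in bins])

# First differences turn cumulative counts into per-bin counts;
# prepending 0 keeps the first cumulative count as the first bin,
# i.e. the -SUM(D$2:D3) part.
per_bin = np.diff(cumulative, prepend=0)
print(per_bin.tolist())  # [1, 2, 1]
'''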

How to calculate a probability vector and an observation count vector for a range of bins?

I want to test the hypothesis that some 30 observed occurrences fit a Poisson distribution.
#GNU Octave
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; #30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; #each bin can be single value or multiple values
I am trying to use Pearson's chi-square statistic here and coded the function below. I want a Poisson vector containing the corresponding Poisson probability for each bin, and a count of the observations for each bin. I feel the loop is rather redundant and ugly. Can you please let me know how I can refactor the function without the loop and make the whole calculation cleaner and more vectorized?
function result = poissonGoodnessOfFit(bins, observed)
  assert(iscell(bins), "bins should be a cell array");
  assert(all(cellfun("ismatrix", bins)) == 1, "bin entries either scalars or matrices");
  assert(ismatrix(observed) && rows(observed) == 1, "observed data should be a 1xn matrix");
  lambda_head = mean(observed); #poisson lambda parameter estimate
  k = length(bins); #number of bin groups
  n = length(observed); #number of observations
  poisson_probability = []; #variable for poisson probability for each bin
  observations = []; #variable for observation counts for each bin
  for i = 1:k
    if isscalar(bins{1,i}) #this bin contains a single value
      poisson_probability(1,i) = poisspdf(bins{1,i}, lambda_head);
      observations(1,i) = histc(observed, bins{1,i});
    else #this bin contains a range of values
      inner_bins = bins{1,i}; #retrieve the range
      inner_bins_k = length(inner_bins); #number of values inside
      inner_poisson_probability = []; #individual probability of each value inside this bin
      inner_observations = []; #observation counts of each value inside this bin
      for j = 1:inner_bins_k
        inner_poisson_probability(1,j) = poisspdf(inner_bins(1,j), lambda_head);
        inner_observations(1,j) = histc(observed, inner_bins(1,j));
      endfor
      poisson_probability(1,i) = sum(inner_poisson_probability, 2); #sum of all inner probabilities
      observations(1,i) = sum(inner_observations, 2); #sum of all inner observation counts
    endif
  endfor
  expected = n .* poisson_probability; #expected counts if indeed Poisson with lambda_head
  chisq = sum((observations - expected).^2 ./ expected, 2); #Pearson chi-square statistic
  pvalue = 1 - chi2cdf(chisq, k-1-1);
  result = struct("actual", observations, "expected", expected, "chi2", chisq, "pvalue", pvalue);
  return;
endfunction
There are a couple of things worth noting in the code.
First, the 'scalar' case in your if block is actually identical to your 'range' case, since a scalar is simply a range of one element, so it needs no special treatment.
Second, you don't need to create such explicit subranges; your bin groups are amenable to being used as indices into a larger result (as long as you add 1 to convert from 0-indexed values to 1-indexed indices).
Therefore my approach would be to calculate the expected and observed numbers over the entire domain of interest (as inferred from your bin groups), and then use the bin groups themselves as 1-based indices to obtain the desired subgroups, summing accordingly.
Here's example code, written in the Octave/MATLAB compatible subset of both languages:
function Result = poissonGoodnessOfFit( BinGroups, Observations )
  % POISSONGOODNESSOFFIT( BinGroups, Observations ) calculates the [... etc, etc.]
  pkg load statistics; % only needed in Octave; for MATLAB buy the statistics toolbox.
  assert( iscell( BinGroups ), 'Bins should be a cell array' );
  assert( all( cellfun( @ismatrix, BinGroups ) ) == 1, 'Bin entries either scalars or matrices' );
  assert( ismatrix( Observations ) && rows( Observations ) == 1, 'Observed data should be a 1xn matrix' );

  % Define helpful variables
  RangeMin = min( cellfun( @min, BinGroups ) );
  RangeMax = max( cellfun( @max, BinGroups ) );
  Domain = RangeMin : RangeMax;
  LambdaEstimate = mean( Observations );
  NBinGroups = length( BinGroups );
  NObservations = length( Observations );

  % Get expected and observed numbers per 'bin' (i.e. discrete value) over the *entire* domain.
  Expected_Domain = NObservations * poisspdf( Domain, LambdaEstimate );
  Observed_Domain = histc( Observations, Domain );

  % Apply BinGroup values as indices
  Expected_byBinGroup = cellfun( @(c) sum( Expected_Domain(c+1) ), BinGroups );
  Observed_byBinGroup = cellfun( @(c) sum( Observed_Domain(c+1) ), BinGroups );

  % Perform a chi-square test on the bin-wise expected and observed outputs
  O = Observed_byBinGroup;  E = Expected_byBinGroup;  df = NBinGroups - 1 - 1;
  ChiSquareTestStatistic = sum( (O - E) .^ 2 ./ E );
  PValue = 1 - chi2cdf( ChiSquareTestStatistic, df );
  Result = struct( 'actual', O, 'expected', E, 'chi2', ChiSquareTestStatistic, 'pvalue', PValue );
end
Running with your example gives:
X = [8 0 0 1 3 4 0 2 12 5 1 8 0 2 0 1 9 3 4 5 3 3 4 7 4 0 1 2 1 2]; % 30 observations
bins = {0, 1, [2:3], [4:5], [6:20]}; % each bin can be single value or multiple values
Result = poissonGoodnessOfFit( bins, X )
% Result =
%   scalar structure containing the fields:
%     actual = 6 5 8 6 5
%     expected = 1.2643 4.0037 13.0304 8.6522 3.0493
%     chi2 = 21.989
%     pvalue = 0.000065574
A general comment about the code: it is always preferable to write self-explanatory code rather than code that does not make sense by itself in the absence of a comment. Comments should generally explain the 'why' rather than the 'how'.

NxN detect win for tic-tac-toe

I have tried to generalise my tic-tac-toe game for an NxN grid. I have everything working but am finding it hard to get the code needed to detect a win.
This is my function at the moment where I loop over the rows and columns of the board. I can't figure out why it's not working currently. Thanks
def check_win(array_board):
    global winner
    for row in range(N):
        for i in range(N-1):
            if array_board[row][i] != array_board[row][i+1] or array_board[row][i] == 0:
                break
            if i == N-1:
                winner = array_board[row][0]
                pygame.draw.line(board, (0, 0, 0), (75, (row * round(height / N) + 150)), (825, (row * round(height / N) + 150)), 3)
    for col in range(N):
        for j in range(N-1):
            if array_board[j][col] == 0 or array_board[col][j] != array_board[col][i+1]:
                break
            if j == N - 1:
                winner = array_board[0][col]
                pygame.draw.line(board, (0, 0, 0), (col * round(width / N) + 150, 75), (col * round(width / N) + 150, 825), 3)
You don't specify in your question, so my noughts-and-crosses grid is a 2D array of characters, with some default "empty" string (a single space).
def getEmptyBoard( size, default=' ' ):
    """ Create a 2D array <size> by <size> of empty strings """
    grid = []
    for j in range( size ):
        row = []
        for i in range( size ):  # makes a full empty row
            row.append( default )
        grid.append( row )
    return ( size, grid )
So given a 2D grid of strings, how does one check for a noughts-and-crosses win? A win occurs when the count of the same character in a particular row or column equals the size of the grid.
Thus if you have a 5x5 grid, any row with 5 of the same item (say 'x') is a winner. Similarly for a column: 5 lots of 'o' is a win.
So given a 2D array, how do you check for these conditions? One way is to tally the occurrences of each symbol as you walk the cells of a row or column. If that tally reaches the grid size (5 here), then whatever that symbol is, it's a winner.
def checkForWin( board, default=' ' ):
    winner = None
    size = board[0]
    grid = board[1]
    ### Tally the row and column
    for j in range( size ):
        col_results = {}
        ### Count the symbols in this column
        for i in range( size ):
            value = grid[i][j]
            if ( value in col_results.keys() ):
                col_results[ value ] += 1
            else:
                col_results[ value ] = 1
        ### Check the tally for a winning count
        for k in col_results.keys():
            if ( k != default and col_results[k] >= size ):
                winner = k  # Found a win
                print("Winner: column %d" % ( j ) )
                break
        if ( winner != None ):
            break
    # TODO: also implement for rows
    # TODO: also implement for diagonals
    return winner  # returns None, or 'o', 'x' (or whatever is used for symbols)
The above function uses two loops and a Python dictionary to keep a tally of what's been found. It's possible to check both a row and a column in the same pair of loops, so it's not really row-by-row or column-by-column, just two loops of size.
Anyway, during the loop, when we first encounter an 'x' it is added to the dictionary with a value of 1. The next time we find an 'x', the dictionary is used to tally that occurrence, dict['x'] → 2, and so forth for the entire column.
At the end of the loop, we iterate through the dictionary keys (which might be ' ', 'o', and 'x'), checking the counts. When a count equals the size of a row or column, it's a winning line.
Obviously if no win is found, we zero the tally and move on to the next column/row with the outer loop.
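For completeness, here is a minimal sketch of the row check under the same tally scheme (a hypothetical checkRowsForWin helper filling in the first TODO; diagonals would follow the same pattern):
def checkRowsForWin( board, default=' ' ):
    # Same tally idea as checkForWin, applied row-by-row.
    size, grid = board
    for j in range( size ):
        row_results = {}
        for i in range( size ):
            value = grid[j][i]
            row_results[ value ] = row_results.get( value, 0 ) + 1
        for k, count in row_results.items():
            if k != default and count >= size:
                return k  # winning symbol found in row j
    return None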

Octave error: element undefined in return list

I am writing the following octave code:
function p = predict(Theta1, Theta2, X)
  m = size(X, 1);
  num_labels = size(Theta2, 1);
  global a = zeros(size(Theta2, 2), m);
  global delta = zeros(m, 1);
  p = zeros(size(X, 1), 1);
  X = [ones(size(X,1),1) X];
  a = sigmoid(Theta1*X');
  a = [ones(1,size(X,1)); a];
  [delta p] = max(sigmoid(Theta2*a))';
It gives me the error: "element number 2 undefined in return list".
The error occurs when I use delta in the last line to store max values.
I have searched a lot but couldn't find any relevant answer.
The line
[delta p] = max( sigmoid( Theta2*a ) )'; # transpose operator over the result
is equivalent to
[delta p] = transpose( max( sigmoid( Theta2*a ) ) ); # transpose function over the result
which means you are trying to get a two-output result out of the transpose operation. This fails, since the transpose function only returns one output, so Octave is informing you that it cannot find a second output in the results list.
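For reference, max called on a matrix returns the per-column maxima, and with two outputs it also returns their row indices; a quick illustration:
A = [1 5; 4 2];
[m, idx] = max(A)   % m = [4 5], idx = [2 1]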
Presumably you either meant to do something along the lines of:
[delta p] = max( sigmoid( Theta2*a )' );
and misplaced the transpose operator, or you actually did want to obtain the maxima and their indices as a column vector, in which case you need to do this in two steps, i.e.
[delta p] = max( sigmoid( Theta2*a ) );
ColVector = [delta p]';
PS: Incidentally, you should use .' instead of ' as the transpose operator. ' is not the plain transpose operator; it is the conjugate transpose.
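For real inputs the two coincide, so the distinction only shows up with complex values; a quick illustration:
v = [1+2i, 3];
v'    % conjugate transpose: gives [1-2i; 3]
v.'   % plain transpose:     gives [1+2i; 3]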

Why isn't my return statement working?

I have a function that converts a decimal value to binary. I understand I have the logic correct as I can get it to work outside of a function.
def decimaltobinary(value):
    invertedbinary = []
    value = int(value)
    while value >= 1:
        value = (value / 2)
        invertedbinary.append(value)
        value = int(value)
    for n, i in enumerate(invertedbinary):
        if (round(i) == i):
            invertedbinary[n] = 0
        else:
            invertedbinary[n] = 1
    invertedbinary.reverse()
    value = ''.join(str(e) for e in invertedbinary)
    return value

decimaltobinary(firstvalue)
print(firstvalue)
decimaltobinary(secondvalue)
print(secondvalue)
Let's say firstvalue = 5 and secondvalue = 10. The values returned by the function should be 101 and 1010 respectively. However, what gets printed are the starting values of five and ten. Why is this happening?
The code works as expected, but you didn't assign the returned value:
>>> firstvalue = decimaltobinary(5)
>>> firstvalue
'101'
Note that there are easier ways to accomplish your goal:
>>> bin(5)[2:]
'101'
>>> "{0:b}".format(10)
'1010'
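For completeness, a minimal sketch of the original calls wired up correctly (assuming firstvalue and secondvalue hold the integers from the question):
firstvalue = 5
secondvalue = 10

firstvalue = decimaltobinary(firstvalue)    # assign the returned string
print(firstvalue)    # 101

secondvalue = decimaltobinary(secondvalue)
print(secondvalue)   # 1010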

Formatting data in a CSV file (calculating average) in python

import csv

with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print(parts[4])

f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
The code is meant to read the CSV file and, for each row (column 0 being the user's name), add the three scores and divide by three. But it doesn't calculate a proper average; it just takes the score from the last column.
Basically you want statistics for each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        scores = [int(s) for s in row[1:]]  # csv gives strings; convert before doing math
        # calculate statistics of the scores
        attributes = {
            'NAME': name,
            'MAX': max(scores),
            'MIN': min(scores),
            'AVE': 1.0 * sum(scores) / len(scores),
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether specific steps are locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row has no lasting effect, since row is rebound on each loop iteration and only row[0:4] is kept in newlist.
row[1:3] only gives the second and third elements; row[1:4] gives what you want, as does row[1:]. Slicing in Python is end-exclusive. (A corrected sketch follows below.)
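Putting those two fixes together, a minimal corrected version of the original loop might look like this (a sketch, assuming the same CSV layout of a name followed by three integer scores):
import csv

newlist = []
with open('Class1scores.csv') as f:
    for row in csv.reader(f):
        scores = [int(x) for x in row[1:4]]   # row[1:4], not row[1:3]
        maximum = max(scores)
        average = round(sum(scores) / 3)
        newlist.append([row[0], maximum, average])   # keep what we need instead of mutating row

averageScore = [[x[2], x[0]] for x in newlist]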
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
Here is one way to do it, in two parts. First, we create a dictionary with names as the keys and a list of results as the values.
import csv

fileLineList = []
averageScoreDict = {}

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter

fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We assign John a list of 3 integers:
    # highest, lowest and average (which is a float we round).
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

averageScoreList = []

# Here we "unpack" the dictionary we have created and build a list holding
# the keys (the names) and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])

# Sort the list by the average instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)

print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]