Function does not return the list correctly

I have written code for adding the numbers from two different text files. For very big data files (2-3 GB) I get a MemoryError, so I am writing new code using some functions to avoid loading the whole data into memory.
This code opens an input file 'd.txt' and reads the numbers that follow certain header lines in the bigger data, which looks like this:
SCALAR
ND 3
ST 0
TS 1000
1.0
1.0
1.0
SCALAR
ND 3
ST 0
TS 2000
3.3
3.4
3.5
SCALAR
ND 3
ST 0
TS 3000
1.7
1.8
1.9
and adds them to the numbers read from a smaller text file 'e.txt', which looks like this:
SCALAR
ND 3
ST 0
TS 0
10.0
10.0
10.0
The result should be written to a text file 'output.txt' like this:
SCALAR
ND 3
ST 0
TS 1000
11.0
11.0
11.0
SCALAR
ND 3
ST 0
TS 2000
13.3
13.4
13.5
SCALAR
ND 3
ST 0
TS 3000
11.7
11.8
11.9
The code I prepared:
def add_list_same(list1, list2):
    """
    list2 has the same size as list1
    """
    c = [a + b for a, b in zip(list1, list2)]
    print(c)
    return c

def list_numbers_after_ts(n, f):
    result = []
    for line in f:
        if line.startswith('TS'):
            for node in range(n):
                result.append(float(next(f)))
    return result

def writing_TS(f1):
    TS = []
    ND = []
    for line1 in f1:
        if line1.startswith('ND'):
            ND = float(line1.split()[-1])
        if line1.startswith('TS'):
            x = float(line1.split()[-1])
            TS.append(x)
    return TS, ND

with open('d.txt') as depth_dat_file, \
     open('e.txt') as elev_file, \
     open('output.txt', 'w') as out:
    m = writing_TS(depth_dat_file)
    print('number of TS', m[1])
    for j in range(0, int(m[1]) - 1):
        i = m[1] * j
        out.write('SCALAR\nND {0:2f}\nST 0\nTS {0:2f}\n'.format(m[1], m[0][j]))
        list1 = list_numbers_after_ts(int(m[1]), depth_dat_file)
        list2 = list_numbers_after_ts(int(m[1]), elev_file)
        Eh = add_list_same(list1, list2)
        out.writelines(["%.2f\n" % item for item in Eh])
The output.txt I actually get looks like this:
SCALAR
ND 3.000000
ST 0
TS 3.000000
SCALAR
ND 3.000000
ST 0
TS 3.000000
SCALAR
ND 3.000000
ST 0
TS 3.000000
The addition of the lists does not work, even though I checked the functions separately and they work on their own. I can't find the error. I have changed it a lot, but it still does not work. Any suggestion? I really appreciate any help you can provide!

You can use a grouper to read the files in fixed-size groups of lines. The following code should work if the order of lines within the groups is unchanged.
from itertools import zip_longest

# Split-by-group iterator
# See http://stackoverflow.com/questions/434287/what-is-the-most-pythonic-way-to-iterate-over-a-list-in-chunks
def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

add_numbers = []
with open("e.txt") as f:
    # Read data by 7 lines
    for lines in grouper(f, 7):
        # Suppress the first SCALAR line
        for line in lines[1:]:
            # add the last number in every line to the array (6 elements)
            add_numbers.append(float(line.split()[-1].strip()))

# template for every group
template = 'SCALAR\nND {:.2f}\nST {:.2f}\nTS {:.2f}\n{:.2f}\n{:.2f}\n{:.2f}\n'

with open("d.txt") as f, open('output.txt', 'w') as out:
    # As before
    for lines in grouper(f, 7):
        data_numbers = []
        for line in lines[1:]:
            data_numbers.append(float(line.split()[-1].strip()))
        # result_numbers sums the elements of the two arrays pairwise (6 elements)
        result_numbers = [x + y for x, y in zip(data_numbers, add_numbers)]
        # * unpacks result_numbers as 6 arguments of format
        out.write(template.format(*result_numbers))
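In case the one-line grouper idiom looks opaque: it passes n references to the same iterator to zip_longest, so each output tuple pulls the next n consecutive items. A small standalone demonstration:

from itertools import zip_longest

def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

# 7 items grouped in threes; the last group is padded with the fill value
print(list(grouper(range(7), 3, padvalue='-')))
# [(0, 1, 2), (3, 4, 5), (6, '-', '-')]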

I had to change some small things in the code and now it works, but only for small input files, because many variables are loaded into memory. Can you please tell me how I can work with yield?
from itertools import zip_longest

def grouper(iterable, n, padvalue=None):
    return zip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

def writing_ND(f1):
    for line1 in f1:
        if line1.startswith('ND'):
            ND = float(line1.split()[-1])
            return ND

def writing_TS(f):
    for line2 in f:
        if line2.startswith('TS'):
            x = float(line2.split()[-1])
            TS.append(x)
    return TS

TS = []
ND = []
x = 0.0
n = 0
add_numbers = []

with open("e.txt") as f, open("d.txt") as f1, \
        open('output.txt', 'w') as out:
    ND = writing_ND(f)
    TS = writing_TS(f1)
    n = int(ND) + 4
    f.seek(0)
    f1.seek(0)  # rewind d.txt as well, since writing_TS consumed it
    for lines in grouper(f, int(n)):
        for item in lines[4:]:
            add_numbers.append(float(item))
    i = 0
    for l in grouper(f1, n):
        data_numbers = []
        for line in l[4:]:
            data_numbers.append(float(line.split()[-1].strip()))
        result_numbers = [x + y for x, y in zip(data_numbers, add_numbers)]
        del data_numbers
        out.write('SCALAR\nND %d\nST 0\nTS %0.2f\n' % (ND, TS[i]))
        i += 1
        for item in result_numbers:
            out.write('%s\n' % item)
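One way to work with yield here is to turn the block reading into a generator, so only one group of numbers is materialized at a time. This is a minimal sketch rather than a drop-in replacement: the function name blocks is mine, and it assumes the layout from the files above (4 header lines per block, followed by ND = 3 values).

def blocks(path, n_values, header_lines=4):
    """Yield (header, values) one block at a time, keeping memory use flat."""
    with open(path) as f:
        while True:
            header = [f.readline() for _ in range(header_lines)]
            if not header[0]:  # an empty string here means end of file
                return
            values = [float(f.readline()) for _ in range(n_values)]
            yield header, values

# e.txt holds a single block; take its values once
# (assumes ND = 3 values per block, as in the files above)
_, add_numbers = next(blocks('e.txt', 3))

# stream d.txt block by block and write each result immediately
with open('output.txt', 'w') as out:
    for header, data_numbers in blocks('d.txt', 3):
        out.writelines(header)  # SCALAR / ND / ST / TS lines, unchanged
        for s in (x + y for x, y in zip(data_numbers, add_numbers)):
            out.write('%.2f\n' % s)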

Related

Apply a function to every element of each column in a loop

I have a df named bb:
A B C D
0.5 5 2.3 1.7
1.7 2.1 4.5 2.5
3.2 4 8.5 7.9
For each column, calculate the median,
and then apply this function to each element of the df:
median([abs(x - y)])
where x is each element of the column and y is the median obtained in the first step (for that column).
and this is what I did:
mad = []
for col in bb.iteritems():
    for x in col[1:]:
        for y in medianx:
            zz = median([abs(x - y)])
            mad.append(zz)

medianx = []
for column in qq:
    print(column)
    medianx.append(median(qq[column].values))

# create a list of column names
cc = qq.columns.values.tolist()
columns_list = cc

# create a list of median values
for column in qq:
    print(median(qq[column].values))
median_list = medianx

# calculate MAD for each value of the column
from scipy.stats import median_abs_deviation

mad_list = []
for i, col in enumerate(columns_list):
    mad_col = qq[col].apply(lambda x: median_abs_deviation(x))
    mad_list.append(mad_col)
mad_list1 = mad_list[mad_list > 1.5]
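If the goal is the classic per-column MAD (the median of |x - column median|), a vectorized pandas sketch might look like this; bb is reconstructed here from the table above, and the rest is an assumption about what the loops are trying to do:

import pandas as pd

# bb reconstructed from the question's table
bb = pd.DataFrame({'A': [0.5, 1.7, 3.2],
                   'B': [5.0, 2.1, 4.0],
                   'C': [2.3, 4.5, 8.5],
                   'D': [1.7, 2.5, 7.9]})

# |x - median(column)| for every element, column by column
abs_dev = (bb - bb.median()).abs()
print(abs_dev)

# the median of those deviations per column is the MAD
print(abs_dev.median())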

Sentence similarity

Can anyone explain how this line works?
X_set = {w for w in X_list if not w in sw}
Specifically, I need to know why the variable w is used 3 times, and what each w refers to.
I've also posted my code below for further reference
# Program to measure the similarity between
# two sentences using cosine similarity.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X = " Ravi went to the market and buy 4 oranges and 2 apples in total how many fruits did Ravi buy"
Y = " Ram went to the shopping mall and buy 1pant and 5 shirts. how many clothes does Ram buy"

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []
l2 = []

# remove stop words from the string
X_set = {w for w in X_list if not w in sw}
Y_set = {w for w in Y_list if not w in sw}
print(X_set)
print(Y_set)

# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    # print(w)
    if w in X_set:
        l1.append(1)  # create a vector
    else:
        l1.append(0)
    if w in Y_set:
        l2.append(1)
    else:
        l2.append(0)

# cosine formula
c = 0
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)

How many binary numbers with N bits if no more than M zeros/ones in a row

Is there an equation I can use for arbitrary M and N?
Example, N=3 and M=2:
3 bits allow for 8 different combinations, but only 2 of them do not contain 2 or more of the same symbol in a row:
000 - Fails
001 - Fails
010 - OK
011 - Fails
100 - Fails
101 - OK
110 - Fails
111 - Fails
One way to frame the problem is as follows: we would like to count binary words of length n without runs of length m or larger. Let g(n, m) denote the number of such words. In the example, n = 3 and m = 2.
If n < m, every binary word works, and we get g(n, m) = 2^n words in total.
When n >= m, a valid word starts with a run of 1, 2, ..., or m-1 equal symbols, and the remainder must itself be a valid word that starts with the opposite symbol. By symmetry, exactly half of the valid words of any given length start with each symbol, and the leading run's symbol can be chosen in 2 ways, so these two factors cancel; the remainders therefore contribute g(n-1, m), g(n-2, m), ..., g(n-m+1, m) choices respectively. Combined, we get the following recursion (in Python):
from functools import lru_cache

@lru_cache(None)  # memoization
def g(n, m):
    if n < m:
        return 2 ** n
    else:
        return sum(g(n - j, m) for j in range(1, m))
To test for correctness, we can compute the number of such binary sequences directly:
from itertools import product, groupby

def brute_force(n, k):
    # generate all binary sequences of length n
    products = product([0, 1], repeat=n)
    count = 0
    for prod in products:
        has_run = False
        # group consecutive digits
        for _, gp in groupby(prod):
            gp_size = sum(1 for _ in gp)
            if gp_size >= k:
                # there are k or more consecutive digits in a row
                has_run = True
                break
        if not has_run:
            count += 1
    return count

assert 2 == g(3, 2) == brute_force(3, 2)
assert 927936 == g(20, 7) == brute_force(20, 7)
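As a side note, the recursion is an (m-1)-term Fibonacci-style recurrence, so for large n an iterative version avoids Python's recursion depth limit. A small sketch (g_iter is a name chosen here, not from the original post):

def g_iter(n, m):
    if n < m:
        return 2 ** n
    # table[i] holds the count for word length i; lengths below m are base cases
    table = [2 ** i for i in range(m)]
    for i in range(m, n + 1):
        table.append(sum(table[i - j] for j in range(1, m)))
    return table[n]

assert g_iter(20, 7) == 927936  # agrees with g and brute_force above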

problem with bootMer CI: upper and lower limits are identical

I'm having the hardest time generating confidence intervals for my glmer poisson model. After following several very helpful tutorials (such as https://drewtyre.rbind.io/classes/nres803/week_12/lab_12/) as well as stackoverflow posts, I keep getting very strange results, i.e. the upper and lower limits of the CI are identical.
Here is a reproducible example containing a response variable called "production," a fixed effect called "Treatment_Num" and a random effect called "Genotype":
# load the needed packages first; glmer() and bootMer() come from lme4
require(lme4)
require(magrittr)

df1 <- data.frame(production = c(15,12,10,9,6,8,9,5,3,3,2,1,0,0,0,0),
                  Treatment_Num = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                  Genotype = c(1,1,2,2,1,1,2,2,1,1,2,2,1,1,2,2))

# run the glmer model
df1_glmer <- glmer(production ~ Treatment_Num + (1|Genotype),
                   data = df1, family = poisson(link = "log"))

# make an empty data set to predict from, containing the explanatory variables but no response
df_empty <- df1 %>%
  tidyr::expand(Treatment_Num, Genotype)

# create a new column containing predictions
df_empty$PopPred <- predict(df1_glmer, newdata = df_empty, type = "response", re.form = ~0)

# function for bootMer
myFunc_df1_glmer <- function(mm) {
  predict(df1_glmer, newdata = df_empty, type = "response", re.form = ~0)
}

# run bootMer
merBoot_df1_glmer <- bootMer(df1_glmer, myFunc_df1_glmer, nsim = 10)

# get confidence intervals out of it
predCL <- t(apply(merBoot_df1_glmer$t, MARGIN = 2, FUN = quantile, probs = c(0.025, 0.975)))

# enter the lower and upper limits of the confidence interval into df_empty
df_empty$lci <- predCL[, 1]
df_empty$uci <- predCL[, 2]

# when viewing df_empty the problem becomes clear: the lci and uci are identical!
df_empty
Any insights you can give me will be much appreciated!
The issue is with the function you created to pass to bootMer(). You wrote:
myFunc_df1_glmer <- function(mm) {
  predict(df1_glmer, newdata = df_empty, type = "response", re.form = ~0)
}
The argument mm should be a fitted model object derived from the bootstrapped data. However, you don't pass this object to predict(), but rather the original model object. If you change the function to:
myFunc_df1_glmer <- function(mm) {
  predict(mm, newdata = df_empty, type = "response", re.form = ~0)
  #      ^^ pass in the object created by bootMer
}
then it works:
> df_empty
# A tibble: 8 x 5
Treatment_Num Genotype PopPred lci uci
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 12.9 9.63 15.7
2 1 2 12.9 9.63 15.7
3 2 1 5.09 3.87 5.89
4 2 2 5.09 3.87 5.89
5 3 1 2.01 1.20 2.46
6 3 2 2.01 1.20 2.46
7 4 1 0.796 0.361 1.14
8 4 2 0.796 0.361 1.14
As an aside -- how many genotypes are in your actual data? If fewer than 5-7, you might do better using a straight-up glm() with Genotype as a factor using sum-to-zero contrasts.

Function on each row of pandas DataFrame but not generating a new column

I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce some constraints for an optimization problem using the information in each row of this data frame, so I don't want to generate an output and add it to the data frame. What I have done is below:
def Computation(row):
    App = pd.Series(row['A'])
    App = App.tolist()
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT, CS, DS, App))
    return m.addConstr(quicksum(y[r, c, d, a] for r, c, d, a in File3) == 1)
But it does not work when calling:
df.apply(Computation, axis = 1)
Could you please let me know if there is anyway to do this process?
.apply will attempt to convert the value returned by the function into a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows():

# .iterrows() yields (index, row) pairs
for idx, row in df.iterrows():
    constrained = Computation(row)
Also, your Computation can be expressed as:
def Computation(row):
    App = list(row['A'])  # Will work as long as row['A'] is iterable
    # For the next 3 lines, see the note below.
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT, CS, DS, App))
    return m.addConstr(quicksum(y[r, c, d, a] for r, c, d, a in File3) == 1)
Note: [<list>] * n creates n references to the same <list>, not n independent lists, so changes made through one reference are visible through all of them. If that is not what you want, use a list comprehension instead. See this question and its answers for details. Specifically, this answer.
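A tiny demonstration of that pitfall (standalone, not specific to the question):

aliased = [[]] * 3                     # three references to ONE list
aliased[0].append('x')
print(aliased)                         # [['x'], ['x'], ['x']]

independent = [[] for _ in range(3)]   # three separate lists
independent[0].append('x')
print(independent)                     # [['x'], [], []]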