Scala Spark - For loop in Data Frame and compare date

I have a Data Frame which has 3 columns like this:
---------------------------------------------
| x(string) | date(date) | value(int) |
---------------------------------------------
I want to SELECT all the rows [i] that satisfy all 4 conditions:
1) row [i] and row [i - 1] have the same value in column 'x'
AND
2) 'date' at row [i] == 'date' at row [i - 1] + 1 (two consecutive days)
AND
3) 'value' at row [i] > 5
AND
4) 'value' at row [i - 1] <= 5
I think maybe I need a For loop, but don't know how exactly! Please help me!
Any help is much appreciated!

This can be done easily with window functions; look at the lag function:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
// test data
val list = Seq(
  ("x", "2016-12-13", 1),
  ("x", "2016-12-14", 7)
)
val df = sc.parallelize(list).toDF("x", "date", "value")
// add lags: read the previous row's values, with rows ordered by date
val byDate = Window.orderBy($"date")
val withPrevs = df
  .withColumn("prevX", lag('x, 1).over(byDate))
  .withColumn("prevDate", lag('date, 1).over(byDate))
  .withColumn("prevValue", lag('value, 1).over(byDate))
// filter values and select only needed fields
withPrevs
  .where('x === 'prevX)
  .where('value > lit(5))
  .where('prevValue <= lit(5)) // condition 4 is "<= 5", not "< 5"
  .where('date === date_add('prevDate, 1))
  .select('x, 'date, 'value)
  .show()
Note that this cannot be done without an ordering, here by date. A Dataset has no meaningful order of its own, so you must specify the order explicitly.
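To make the role of lag concrete, here is a minimal plain-Python sketch of the same pairing logic. It is not Spark code, only an illustration of what "compare each row with the previous one in date order" means. It assumes rows are (x, date, value) tuples; the helper name rows_after_threshold_cross and the third sample row are hypothetical.

```python
from datetime import date, timedelta

def rows_after_threshold_cross(rows, threshold=5):
    """Return rows whose predecessor (in date order) has the same x, is
    exactly one day earlier, and has value <= threshold, while the row's
    own value is > threshold."""
    # Explicit ordering, playing the role of Window.orderBy("date").
    ordered = sorted(rows, key=lambda r: r[1])
    selected = []
    # zip pairs every row with its predecessor, which is what lag(..., 1) provides.
    for prev, cur in zip(ordered, ordered[1:]):
        same_x = cur[0] == prev[0]
        next_day = cur[1] == prev[1] + timedelta(days=1)
        if same_x and next_day and cur[2] > threshold and prev[2] <= threshold:
            selected.append(cur)
    return selected

sample = [
    ("x", date(2016, 12, 14), 7),
    ("x", date(2016, 12, 13), 1),
    ("x", date(2016, 12, 16), 9),  # hypothetical extra row: two days after the previous one
]
result = rows_after_threshold_cross(sample)
print(result)
```

Only the 2016-12-14 row is selected: the 2016-12-16 row fails the consecutive-day check, and the first row in date order has no predecessor at all.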

If you already have a DataFrame, all you need to do is call the filter function on it with all your conditions.
For example:
df1.filter($"Column1" === 2 || $"Column2" === 3)
You can pass as many conditions as you want. It will return you a new DataFrame with filtered data.

Related

COUNTIFS: Excel to pandas and remove counted elements

I have a COUNTIFS equation in Excel, (COUNTIFS($A$2:$A$6, "<=" & $C4))-SUM(D$2:D3), where A2 to A6 is my_list, C4 is the current 'bin' with the condition, and D* are the previously summed results from my_list that meet the condition. I am attempting to implement this in Python.
I have looked at previous COUNTIF questions but I am struggling to complete the final '-SUM(D$2:D3)' part of the code.
See the COUNTIFS($A$2:$A$6, "<=" & $C4) section below.
'''
my_list=(-1,-0.5, 0, 1, 2)
bins = (-1, 0, 1)
out = []
for iteration, num in enumerate(bins):
    n = []
    out.append(n)
    count = sum(1 for elem in my_list if elem<=(num))
    n.append(count)
print(out)
'''
out = [[1], [3], [4]]
I need to sum previous elements, that have already been counted, and remove these elements from the next count so that they are not counted twice ( Excel representation -SUM(D$2:D3) ). This is where I need some help! I used enumerate to track iterations. I have tried the code below in the same loop but I can't resolve this and I get errors:
'''
count1 = sum(out[0:i[0]]) for i in (out)
and
count1 = out(n) - out(n-1)
'''
See expected output values in 'out' array for bin conditions below:
I was able to achieve the required output array values by creating an additional if/elif statement to factor out previous array elements and generate a new output array 'out1'. This works but may not be the most efficient way to achieve the end goal:
'''
import numpy as np
my_list=(-1,-0.5, 0, 1, 2)
#bins = np.arange(-1.0, 1.05, 0.05)
bins = (-1, 0, 1)
out = []
out1 = []
for iteration, num in enumerate(bins):
    count = sum(1 for elem in my_list if elem<=(num))
    out.append(count)
    if iteration == 0:
        count1 = out[iteration]
        out1.append(count1)
    elif iteration > 0:
        count1 = out[iteration] - out[iteration - 1]
        out1.append(count1)
print(out1)
'''
I also tried using the below code as suggested in other answers but this didn't work for me:
'''
-np.diff([out])
print(out)
'''
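One possible way to avoid the if/elif branch is to compute the cumulative counts first and then subtract each count's predecessor, treating the predecessor of the first bin as 0. The sketch below assumes the bin edges are sorted in ascending order; the function name counts_per_bin is hypothetical.

```python
def counts_per_bin(values, bins):
    """Count the values in each bin, where a bin covers everything greater
    than the previous edge and less than or equal to its own edge."""
    # Cumulative counts: the COUNTIFS(A, "<=" & edge) part.
    cumulative = [sum(1 for v in values if v <= edge) for edge in bins]
    # Subtract what earlier bins already counted: the -SUM(D$2:D3) part.
    previous = [0] + cumulative[:-1]
    return [c - p for c, p in zip(cumulative, previous)]

my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)
out1 = counts_per_bin(my_list, bins)
print(out1)  # [1, 2, 1]
```

With NumPy, np.diff(cumulative, prepend=0) should give the same numbers. The attempt -np.diff([out]) differs in three ways: wrapping out in another list makes the input two-dimensional, the leading minus sign flips the signs, and the result is never assigned, so print(out) shows the unchanged list.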

Function on each row of pandas DataFrame but not generating a new column

I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce some constraints for an optimization problem using the information in each row of this data frame so I don't want to generate an output and add it to the data frame. The way that I have done that is as below:
def Computation(row):
    App = pd.Series(row['A'])
    App = App.tolist()
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT,CS,DS,App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
But it does not work out by calling:
df.apply(Computation, axis = 1)
Could you please let me know if there is any way to do this?
.apply will attempt to convert the value returned by the function to a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows:
# iterrows() yields (index, row) pairs:
for index, row in df.iterrows():
    constrained = Computation(row)
Also, your Computation can be expressed as:
def Computation(row):
    App = list(row['A']) # Will work as long as row['A'] is iterable
    # For the next 3 lines, see note below.
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT,CS,DS,App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
Note: [<list>] * n will create n pointers or references to the same <list>, not n independent lists. A change made through any one of those references is visible through all of them. If that is not what you want, use a function. See this question and its answers for details. Specifically, this answer.
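The aliasing described in the note can be seen in a few lines of plain Python. This is only an illustrative sketch with made-up values; it also shows that repeating an immutable value such as a number is harmless.

```python
shared = [[0]] * 3                     # three references to one inner list
shared[0].append(1)
print(shared)                          # [[0, 1], [0, 1], [0, 1]]

independent = [[0] for _ in range(3)]  # three separate inner lists
independent[0].append(1)
print(independent)                     # [[0, 1], [0], [0]]

scalars = [4] * 3                      # immutable elements: nothing to alias
scalars[0] = 9
print(scalars)                         # [9, 4, 4]
```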

Formatting data in a CSV file (calculating average) in python

import csv
with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print (parts[4])
f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
Here the code is meant to read the CSV file and, for each row (column 0 being the user's name), add the three scores and divide by three. But it doesn't calculate a proper average; it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv
with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        # csv.reader yields strings, so convert the scores before doing arithmetic
        scores = [int(score) for score in row[1:]]
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX' : max(scores),
            'MIN' : min(scores),
            'AVE' : 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Don't worry about whether individual steps are locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row has no effect on the result: the appended maximum and average land at indices 4 and 5, but only row[0:4] is stored in newlist, so x[3] is the last score rather than the average.
row[1:3] only gives the second and the third element. row[1:4] gives what you want, as does row[1:]. Slicing in Python is end-exclusive.
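The end-exclusive slicing can be checked on a single row. The row below is a hypothetical example in the shape the question describes (a name followed by three scores, all read as strings).

```python
row = ["John", "10", "7", "9"]            # hypothetical CSV row
two_scores = [int(s) for s in row[1:3]]   # indices 1 and 2 only
scores = [int(s) for s in row[1:4]]       # indices 1, 2 and 3; index 4 is excluded

average = sum(scores) / len(scores)
print(two_scores)      # [10, 7]
print(scores)          # [10, 7, 9]
print(round(average))  # 9
```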
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
Here is one way to do it. See both parts. First, we create a dictionary with names as the key and a list of results as values.
import csv
fileLineList = []
averageScoreDict = {}
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter
fileLineList = []
averageScoreDict = {} # Creating an empty dictionary here.
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list to John which is 3 integers:
    # highest, lowest and average (which is a float we round)
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
averageScoreList = []
# Here we "unpack" the dictionary we have created and build a list of pairs:
# the key (the name) and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])
# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)
print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]

Python. If value on column 1 (row X) = value from column 2 (row Y), print row Y of column 3

I have a .csv file, df, with 3 columns (C1, C2 and C3). All columns are of the same length (approx. 600000 rows) and have unique values. Values in C1, which represent SNPs (single nucleotide polymorphisms), are ordered according to their location on chromosomes. C2 has the same values as C1 but in a different order. Values in C2 are coupled to corresponding values (chromosome locations) in the same row of C3. What I want to do is to couple the chromosomal locations in C3 to the values in C1, keeping the row order of C1. In other words, generate another column with chromosome locations for the ordered SNPs in C1. So far, I tried to create a dictionary with keys from C2 and values from C3 and then use a for loop to match values in C1 and print the ordered chromosome positions, but I get C3. I understand why I get that, but I can't manage to get what I want.
Any suggestion/help would be welcome. I am new to programming.
import csv
from collections import OrderedDict # to save keys order
import sys
sys.stdout = open("output1.csv", "w")
# C1= rows[0], C2= rows[1], C3= rows[2]
with open('df1.csv', 'rU') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader) #skip header
    d = OrderedDict((rows[1], rows[2]) for rows in reader)
    for rows in reader:
        if rows[0] in d:
            print rows[2]
Input example:
C1 C2 C3
12082473 2980300 785989
11240776 4245756 799463
2980300 12082473 740857
2905036 2341354 918573
4245756 3748597 888659
3748597 11240776 765269
2341354 2905036 792480
2465126 2465126 947034
Desired output:
C1 C4
12082473 740857
11240776 765269
2980300 785989
2905036 792480
4245756 799463
3748597 888659
2341354 918573
2465126 947034
I am not entirely sure I understand what you are trying to do.
I think your error is from using the generator expression d = OrderedDict((rows[0], rows[3]) for rows in reader1) and then referring to it after the file has been closed at the end of the with block.
You might try something along these lines:
import csv
from collections import OrderedDict
d=OrderedDict()
with open('df1.csv', 'rU') as csv1, open('df2.csv', 'rU') as csv2:
    reader1 = csv.reader(csv1, delimiter=',')
    reader2 = csv.reader(csv2, delimiter=',')
    next(reader1) #skip header
    next(reader2) #skip header
    for row in reader1:
        d[row[0]]=row[3]
    # d = OrderedDict(("a", "b") for rows in reader1)
    for row in reader2:
        if row[0] in d:
            print d[row[0]]
I do not see any reason you need an OrderedDict since this is just a mapping between row[0] and row[3] as written. You are not using the order currently.
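For the single three-column file shown in the question, a possible Python 3 sketch is to read the rows into a list once, build a plain dict from C2 to C3, and then look up each C1 value in its original order. A csv.reader is an iterator and can be consumed only once, which is why the rows are stored in a list before being used twice. The in-memory string below is a hypothetical stand-in for the file, and the names location_by_snp and ordered are made up for illustration.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file, using the sample rows above.
data = """C1,C2,C3
12082473,2980300,785989
11240776,4245756,799463
2980300,12082473,740857
2905036,2341354,918573
4245756,3748597,888659
3748597,11240776,765269
2341354,2905036,792480
2465126,2465126,947034
"""

# Materialise the rows once and drop the header.
rows = list(csv.reader(io.StringIO(data)))[1:]

# Map each SNP in C2 to the location stored next to it in C3.
location_by_snp = {c2: c3 for _, c2, c3 in rows}

# Walk C1 in its original order and look up each SNP's location.
ordered = [(c1, location_by_snp[c1]) for c1, _, _ in rows]

for c1, c4 in ordered:
    print(c1, c4)
```

For the real file, io.StringIO(data) would be replaced by an open file handle. With around 600000 rows, the dict lookup keeps the second pass linear in the number of rows.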

How do I sum up properties of a JSON object in CoffeeScript?

I have an object that looks like this one:
object =
  title : 'an object'
  properties :
    attribute1 :
      random_number: 2
      attribute_values:
        a: 10
        b: 'irrelevant'
    attribute2 :
      random_number: 4
      attribute_values:
        a: 15
        b: 'irrelevant'
  some_random_stuff: 'random stuff'
I want to extract the sum of the 'a' values on attribute1 and attribute2.
What would be the best way to do this in Coffeescript?
(I have already found one way to do it but that just looks like Java-translated-to-coffee and I was hoping for a more elegant solution.)
Here is what I came up with (edited to be more generic based on comment):
sum_attributes = (x) =>
  sum = 0
  for name, value of object.properties
    sum += value.attribute_values[x]
  sum
alert sum_attributes('a') # 25
alert sum_attributes('b') # 0irrelevantirrelevant
So, that does what you want... but it probably doesn't do exactly what you want with strings.
You might want to pass in the accumulator seed, like sum_attributes 0, 'a' and sum_attributes '', 'b'
Brian's answer is good. But if you wanted to bring in a functional programming library like Underscore.js, you could write a more succinct version:
sum = (arr) -> _.reduce arr, ((memo, num) -> memo + num), 0
# 'a' is nested under attribute_values, so pluck that level first
sum _.pluck(_.pluck(object.properties, 'attribute_values'), 'a')
total = (attr.attribute_values.a for key, attr of obj.properties).reduce (a,b) -> a+b
or
sum = (arr) -> arr.reduce((a, b) -> a+b)
total = sum (attr.attribute_values.a for k, attr of obj.properties)