Python. If value on column 1 (row X) = value from column 2 (row Y), print row Y of column 3 - csv

I have a .csv file, df, with 3 columns (C1, C2 and C3). All columns have the same length (approx. 600000 rows) and contain unique values. The values in C1, which represent SNPs (single nucleotide polymorphisms), are ordered according to their location on the chromosomes. C2 holds the same values as C1, but in a different order. Each value in C2 is paired with the corresponding value (chromosome location) in the same row of C3. I want to pair the chromosomal locations in C3 with the values in C1 while keeping the order of C1. In other words, I want to generate another column with the chromosome locations for the ordered SNPs in C1. So far, I have tried creating a dictionary with keys from C2 and values from C3 and then using a for loop to match values in C1 and print the ordered chromosome positions, but I just get C3 back. I understand why that happens, but I can't work out how to get what I want.
Any suggestion/help would be welcome. I am new to programming.
import csv
from collections import OrderedDict # to save keys order
import sys
sys.stdout = open("output1.csv", "w")
# C1= rows[0], C2= rows[1], C3= rows[2]
with open('df1.csv', 'rU') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader) #skip header
    d = OrderedDict((rows[1], rows[2]) for rows in reader)
    for rows in reader:
        if rows[0] in d:
            print rows[2]
Input example:
C1 C2 C3
12082473 2980300 785989
11240776 4245756 799463
2980300 12082473 740857
2905036 2341354 918573
4245756 3748597 888659
3748597 11240776 765269
2341354 2905036 792480
2465126 2465126 947034
Desired output:
C1 C4
12082473 740857
11240776 765269
2980300 785989
2905036 792480
4245756 799463
3748597 888659
2341354 918573
2465126 947034

I am not entirely sure I understand what you are trying to do.
I think your error is from using the generator expression d = OrderedDict((rows[0], rows[3]) for rows in reader1) and then referring to it after the file has been closed at the end of the with block.
You might try something along these lines:
import csv
from collections import OrderedDict
d=OrderedDict()
with open('df1.csv', 'rU') as csv1, open('df2.csv', 'rU') as csv2:
    reader1 = csv.reader(csv1, delimiter=',')
    reader2 = csv.reader(csv2, delimiter=',')
    next(reader1) #skip header
    next(reader2) #skip header
    for row in reader1:
        d[row[0]]=row[3]
    # d = OrderedDict(("a", "b") for rows in reader1)
    for row in reader2:
        if row[0] in d:
            print d[row[0]]
I do not see any reason you need an OrderedDict since this is just a mapping between row[0] and row[3] as written. You are not using the order currently.
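For the single-file layout described in the question, the dictionary idea can be sketched as below. This is only an illustration under assumptions: the file is comma-delimited with a header row named C1,C2,C3, the sample text stands in for the real file, and map_positions is a hypothetical helper name. The rows are read once into a list, a plain dict maps each C2 value to its C3 value, and the C1 column is then walked in its original order.

```python
import csv
import io

# Sample data copied from the question; a comma delimiter is assumed.
SAMPLE = """C1,C2,C3
12082473,2980300,785989
11240776,4245756,799463
2980300,12082473,740857
2905036,2341354,918573
4245756,3748597,888659
3748597,11240776,765269
2341354,2905036,792480
2465126,2465126,947034
"""


def map_positions(csv_text):
    """Return (C1, position) pairs in the original C1 order."""
    # Read everything once; a csv reader cannot be iterated a second time.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Map every SNP in C2 to the chromosome position on the same row.
    lookup = {row["C2"]: row["C3"] for row in rows}
    # Walk C1 in file order; SNPs missing from C2 get None.
    return [(row["C1"], lookup.get(row["C1"])) for row in rows]


result = map_positions(SAMPLE)
for snp, position in result:
    print(snp, position)
```

Reading the rows into a list first avoids the exhausted-reader problem, because the same rows are used both to build the lookup and to walk C1.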

Related

Extract and explode inner nested element as rows from string nested structure

I would like to explode a column to rows in a dataframe on pyspark hive.
There are two columns in the dataframe.
The column "business_id" is a string.
The column "sports_info" is a struct type, each element value is an array of string.
Data:
business_id sports_info
"abc-123" {"sports_type":
["{sport_name:most_recent,
sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"
]
}
I need to get a dataframe like:
business_id sport_id
"abc-123" 568
"abc-123" 171
I defined:
schema = StructType([ \
    StructField("sports_type",ArrayType(),True)
])
df = spark.createDataFrame(data=data, schema=schema) # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)
def get_ids(val):
    sports_type = 'sports_type'
    sport_events = 'sport_events'
    sport_id = 'sport_id'
    sport_ids_vals = eval(val.sports_type[0])['sport_events']
    ids = [s['sport_id'] for s in sport_ids_scores]
    return ids
df2 = df.withColumn('sport_new', F.udf(lambda x: get_ids(x),
    ArrayType(ArrayType(StringType())))('sports_info'))
How could I create the df and extract/explode the inner nested elements?
df2 = df.withColumn('sport_new', expr("transform (sports_type, x -> regexp_extract( x, 'sport_id:([0-9]+)',1))"))
df2.show() # show() returns None, so call it separately instead of assigning its result to df2
Explained:
expr( # use a SQL expression, only way to access transform (pre spark 3)
    "transform ( # run a SQL function on an array
        sports_type, # declare column to use
        x # declare the name of the variable to use for each element in the array
        -> # start writing SQL code to run on each element in the array
        regexp_extract( # use SQL regex functions to pull out from the string
            x, # string to run regex on
            'sport_id:([0-9]+)',1))" # find sport_id and capture the number following it.
)
This will likely run faster than a Python UDF, since the expression runs inside Spark's SQL engine instead of calling out to Python for each row.
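The regular expression itself can be checked outside Spark with Python's re module. The sketch below is only an illustration: the sample string is an assumption modelled on the question's data, with both events written using the sport_id key (the original shows id:171 for the second one). It also shows that a single extraction only returns the first match, which is how regexp_extract behaves, so collecting every id from one string needs a find-all style approach.

```python
import re

# Same pattern as in the SQL expression above.
SPORT_ID = re.compile(r"sport_id:([0-9]+)")

# Assumed sample element of the sports_type array (both events use sport_id).
element = (
    "{sport_name:most_recent, "
    "sport_events:[{sport_id:568, val:10.827},{sport_id:171, val:8.61}]}"
)

# A single search returns only the first match.
first_id = SPORT_ID.search(element).group(1)

# findall collects the captured group from every match.
all_ids = SPORT_ID.findall(element)

print(first_id)
print(all_ids)
```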

Convert a key value pair in a column as new column in python

I want to parse a column, and get the key-value pair as column
Input:
I have a dataframe (called df) with the following structure:
ID data
A1 {"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"}
A2 {"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"}
Expected Output:
I wanted to create new column called 'score' and get the value from the key value pair
ID score1 score2
A1 1 1
A2 0.934 0.952
Attempted Solution:
data_json = df['data'].transform(lambda x: json.loads(x))
df['score1'] = data_json.str.get('userMatch').str.get('match').str.get('phone').str.get('name').str.get('score')
df['score2'] = data_json.str.get('userMatch').str.get('match').str.get('phone').str.get('name').str.get('name').str.get('score')
Error:
TypeError: the JSON object must be str, bytes or bytearray, not Series
Notes:
I am not even sure how to get the next score2
Building on my previous thought about using a regex, this is how I would approach your problem:
import re
def getOffset(row, offset):
    vals = re.findall(r"[-+]?\d*\.\d+|\d+", row.data['userMatch'])
    if len(vals)> offset:
        return vals[offset]
    return None
df['score1'] = df.apply(lambda row: getOffset(row, 0), axis= 1)
df['score2'] = df.apply(lambda row: getOffset(row, 1), axis = 1)
df.drop(['data'], axis= 1, inplace=True)
This yields a dataframe of the form:
ID score1 score2
0 A1 1 1
1 A2 0.934 0.952
This isn't pretty, but works with split(). Couldn't get a dictionary to be read, kept getting invalid syntax or missing delimiter.
import io
import pandas as pd
df = pd.read_csv(io.StringIO('''ID data
A1 {"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"}
A2 {"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"}'''), sep=' ', engine='python')
df['score1'] = df['data'].apply(lambda x: x.split('{"userMatch": "{"match":{"phone":{"name":{"score":')[1].split('}', 1)[0])
df['score2'] = df['data'].apply(lambda x: x.split('{"userMatch": "{"match":{"phone":{"name":{"score":')[1].split(',"name":{"score":')[1].split('}', 1)[0])
Output:
ID data score1 score2
0 A1 {"userMatch": "{"match":{"phone":{"name":{"score":1}},"name":{"score":1}}}"} 1 1
1 A2 {"userMatch": "{"match":{"phone":{"name":{"score":0.934}},"name":{"score":0.952}}}"} 0.934 0.952
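If the data column holds valid JSON, with the inner userMatch string properly escaped, decoding it twice avoids regexes and string splitting altogether. The sketch below rests on that assumption (the sample as pasted in the question is not valid JSON, because the inner quotes are unescaped), and extract_scores is a hypothetical helper name. The sample value is built with json.dumps so that the escaping is correct.

```python
import json

# Build a sample cell whose inner JSON string is properly escaped (assumption).
inner = {"match": {"phone": {"name": {"score": 0.934}}, "name": {"score": 0.952}}}
raw = json.dumps({"userMatch": json.dumps(inner)})


def extract_scores(raw_value):
    """Decode the outer object, then the JSON string stored under userMatch."""
    outer = json.loads(raw_value)
    match = json.loads(outer["userMatch"])["match"]
    # score1 sits under phone -> name, score2 directly under name.
    return match["phone"]["name"]["score"], match["name"]["score"]


score1, score2 = extract_scores(raw)
print(score1, score2)
```

With a DataFrame, the same function could be applied to each cell, for example with df['data'].apply(extract_scores), and the resulting pairs split into two columns.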

Function on each row of pandas DataFrame but not generating a new column

I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce some constraints for an optimization problem using the information in each row of this data frame so I don't want to generate an output and add it to the data frame. The way that I have done that is as below:
def Computation(row):
    App = pd.Series(row['A'])
    App = App.tolist()
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT,CS,DS,App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
But it does not work out by calling:
df.apply(Computation, axis = 1)
Could you please let me know if there is any way to do this?
.apply will attempt to convert the value returned by the function to a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows:
# iterrows() yields (index, row) pairs:
for index, row in df.iterrows():
    constraint = Computation(row)
Also, your Computation can be expressed as:
def Computation(row):
    App = list(row['A']) # Will work as long as row['A'] is iterable
    # For the next 3 lines, see note below.
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT,CS,DS,App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
Note: [<list>] * n creates n references to the same <list>, not n independent lists, so a change made through one of them shows up in all of them. If that is not what you want, use a function. See this question and its answers for details. Specifically, this answer.
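The aliasing described in the note can be seen in a few lines of plain Python. This is a small illustration with made-up values, not part of the original answer: repeating a list that contains a mutable object copies the reference, a list comprehension builds separate objects, and repeating an immutable scalar (the case when row['B'] is a number) is harmless.

```python
# Repeating a list that contains a mutable object copies the reference only.
inner = [0]
shared = [inner] * 3
shared[0].append(1)  # visible through all three entries

# A list comprehension builds a fresh list for every position.
independent = [[0] for _ in range(3)]
independent[0].append(1)  # only the first entry changes

# With immutable scalars, repetition is safe: assignment replaces one slot.
scalars = [4] * 3
scalars[0] = 9

print(shared)
print(independent)
print(scalars)
```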

Formatting data in a CSV file (calculating average) in python

import csv
with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print (parts[4])
f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
The code is meant to read the CSV file and, for each row, add up the three scores in columns 1 to 3 (column 0 being the user's name) and divide by three. But it doesn't calculate a proper average; it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv
with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        # csv.reader yields strings, so convert the scores to int first
        scores = [int(s) for s in row[1:]]
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX' : max(scores),
            'MIN' : min(scores),
            'AVE' : 1.0 * sum(scores) / len(scores)
        }
        output_mesg ="name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether a specific step is locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row does not affect your result: newlist.append(row[0:4]) copies only the name and the three scores, so the appended maximum and average (at indices 4 and 5) are dropped, and x[3] is the last score rather than the average.
row[1:3] only gives the second and the third element. row[1:4] gives what you want, as does row[1:] (before anything is appended). Slicing in Python is end-exclusive.
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
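The two points about slicing and averaging can be put together in a short sketch. It is only an illustration: the sample rows are made up, the file is assumed to have a name followed by exactly three integer scores per line, and summarize is a hypothetical helper name.

```python
import csv
import io

# Made-up sample standing in for Class1scores.csv (name, then three scores).
SAMPLE = "Adam,5,2,4\nJohn,10,7,9\n"


def summarize(csv_text):
    """Return (name, highest, average) tuples sorted by average, highest first."""
    results = []
    for row in csv.reader(io.StringIO(csv_text)):
        name = row[0]
        # row[1:4] covers indices 1, 2 and 3; the end of a slice is exclusive.
        scores = [int(value) for value in row[1:4]]
        results.append((name, max(scores), sum(scores) / len(scores)))
    results.sort(key=lambda item: item[2], reverse=True)
    return results


summary = summarize(SAMPLE)
print('\nStudents Average Scores From Highest to Lowest\n')
for name, highest, average in summary:
    print(name, highest, round(average, 2))
```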
Here is one way to do it. See both parts. First, we create a dictionary with names as the key and a list of results as values.
import csv
fileLineList = []
averageScoreDict = {}
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter
fileLineList = []
averageScoreDict = {} # Creating an empty dictionary here.
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list to John which is 3 integers:
    # highest, lowest and average (which is a float we round)
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
averageScoreList = []
# Here we "unpack" the dictionary we have created and build a list of the keys,
# which are the names, and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])
# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)
print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]

write items from a list to csv file column by column using pandas dataframe.to_csv

I have a list named items
items=['a' , 'b','c']
Code is:
df = pandas.DataFrame(items)
df.to_csv("myfile.csv",header=False,index=False)
The values are written to the file in different rows of the same column (vertically).
But I want the values to be written as a b c, i.e. in the same row but in different columns.
Help please
You get each element in a different row because that is how the list is loaded into the df.
If you want them in different columns, I would suggest transposing:
df = df.T
or you can load as one row like below,
items=[['a' , 'b','c']]
df = pd.DataFrame(items)
df
Out[22]:
0 1 2
0 a b c
And then write the output to csv,
eg:
df = pandas.DataFrame(items)
df = df.T
df.to_csv("myfile.csv",header=False,index=False) # the keyword is header, not headers
df = pd.DataFrame(items)
df
Out[5]:
0
0 a
1 b
2 c
df.T
Out[11]:
0 1 2
0 a b c
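If pandas is not needed for anything else, the standard csv module can write the list as a single row directly. This swaps in a different technique from the to_csv approach above and is only a sketch: it writes to an in-memory buffer so the result can be inspected, whereas a real script would pass a file opened with newline='' instead.

```python
import csv
import io

items = ['a', 'b', 'c']

# writerow() emits the whole list as one row, one item per column.
buffer = io.StringIO()
csv.writer(buffer).writerow(items)
row_text = buffer.getvalue()

print(row_text.strip())
```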