Python - filter column from a .dat file and return a given value from other columns - csv

I'm new to Python and have been practicing with sample data I created (150 rows) of student ID numbers, grades, ages, class codes, area codes, etc. What I'm trying to do is not only filter by a certain column (grade, age, etc.), but also build a list of values from a different column (student ID) for the matching rows. I've managed to find how to isolate the column I want to filter by, but I can't figure out how to build the list of values I need to return.
So here's a sample of six rows of the data:
1/A/15/13/43214
2/I/15/21/58322
3/C/17/89/68470
4/I/18/6/57362
5/I/14/4/00000
6/A/16/23/34567
I need a list of the 1st column (student ID), filtered by the 2nd column (grade). (Eventually I'll filter by the 3rd column, 4th column, etc., but if I see how it works with just the 2nd, I think I can figure out the others.) Also note: I didn't use headers in the .dat file.
I figured out how to isolate/view the 2nd column.
import numpy
data = numpy.genfromtxt('/testdata.dat', delimiter='/', dtype='unicode')
grades = data[:,1]
print (grades)
which prints:
['A' 'I' 'C' 'I' 'I' 'A']
But now, how can I pull just the first-column values that correspond to the A's, C's, and I's into separate lists?
So I'd want to see lists of the column-1 integers, comma-separated, for the A's, C's, and I's:
list from A = [1, 6]
list from C = [3]
list from I = [2, 4, 5]
Again, if I can just see how it's done for the 2nd column with one of the values (say the A's), I think I could figure out how to do it for B's, C's, D's, etc., and probably for the other columns. I just need one example of how the syntax is applied, and then I'd like to play around with the rest.
Also, I've been using numpy, but I've read about pandas and csv too, and I think those libraries could be possibilities as well. Like I said, though, I've been using numpy for the .dat files; I don't know if the other libraries would be easier to use.

You actually don't need any additional modules for such a simple task. A pure-Python solution is to read the file line by line and 'parse' each line with str.split(); that gives you your lists, and then you can filter on pretty much any parameter. Something like:
students = {}  # store for our students by grade

with open("testdata.dat", "r") as f:  # open the file
    for line in f:  # read the file line by line
        row = line.strip().split("/")  # split the line into individual columns
        # you can now directly filter your row, or you can store the row in a list for later
        # let's split them by grade:
        grade = row[1]  # second column of our row is the grade
        # create/append the sublist in our `students` dict keyed by the grade
        students[grade] = students.get(grade, []) + [row]

# now your students dict contains all students split by grade, e.g.:
a_students = students["A"]
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']]

# if you want only to collect the A-grade student IDs, you can get a list of them as:
student_ids = [entry[0] for entry in students["A"]]
# ['1', '6']
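As a small aside, students.get(grade, []) + [row] rebuilds the list on every line; collections.defaultdict does the same grouping more directly. A minimal sketch of the equivalent loop:

from collections import defaultdict

students = defaultdict(list)  # missing grades start as empty lists
with open("testdata.dat", "r") as f:
    for line in f:
        row = line.strip().split("/")
        students[row[1]].append(row)  # group rows by grade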
But let's go back a few steps: if you want a more generalized solution, store your rows in a list first and then write a function that filters them by whatever parameters you pass, so:
# define a filter function
# `filters` should be a list of filters, where each filter is defined as:
#     [position, [values]]
# and you can define as many as you want
def filter_sublists(source, filters=None):
    result = []  # store for our result
    filters = filters or []  # in case no filter is passed
    for element in source:  # go through every element of our source data
        try:
            if all(element[f[0]] in f[1] for f in filters):  # check if all our filters match
                result.append(element)  # add the element
        except IndexError:  # invalid filter position or data position, ignore
            pass
    return result  # return the result
# now we can use it to filter our data; first, let's load it:
with open("testdata.dat", "r") as f:  # open the file
    students = [line.strip().split("/") for line in f]  # store all our students as a list
# now we have all the data in the `students` list and we can filter it by any element
a_students = filter_sublists(students, [[1, ["A"]]])
# [['1', 'A', '15', '13', '43214'], ['6', 'A', '16', '23', '34567']]
# or again, if you just need the IDs:
a_student_ids = [entry[0] for entry in filter_sublists(students, [[1, ["A"]]])]
# ['1', '6']
# but you can filter by any parameter, for example:
age_15_students = filter_sublists(students, [[2, ["15"]]])
# [['1', 'A', '15', '13', '43214'], ['2', 'I', '15', '21', '58322']]
# or you can get all I-grade students aged 14 or 15:
i_students = filter_sublists(students, [[1, ["I"]], [2, ["14", "15"]]])
# [['2', 'I', '15', '21', '58322'], ['5', 'I', '14', '4', '00000']]
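And since you mentioned the csv module: it can do the splitting for you if you hand it the '/' delimiter, so loading the rows becomes (a sketch, same file assumed):

import csv

with open("testdata.dat", "r", newline="") as f:
    students = [row for row in csv.reader(f, delimiter="/")]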

Pandas solution:
import pandas as pd
df = pd.read_csv('data.txt', header=None, sep='/')
dfs = {k:v for k,v in df.groupby(1)}
As a result we have a dictionary of DataFrames:
In [59]: dfs.keys()
Out[59]: dict_keys(['I', 'C', 'A'])
In [60]: dfs['I']
Out[60]:
   0  1   2   3      4
1  2  I  15  21  58322
3  4  I  18   6  57362
4  5  I  14   4      0
In [61]: dfs['C']
Out[61]:
   0  1   2   3      4
2  3  C  17  89  68470
In [62]: dfs['A']
Out[62]:
   0  1   2   3      4
0  1  A  15  13  43214
5  6  A  16  23  34567
If you want the grouped lists of the first column:
In [67]: dfs['I'].iloc[:, 0].tolist()
Out[67]: [2, 4, 5]
In [68]: dfs['C'].iloc[:, 0].tolist()
Out[68]: [3]
In [69]: dfs['A'].iloc[:, 0].tolist()
Out[69]: [1, 6]
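And if you only need the per-grade lists of IDs rather than whole DataFrames, the groupby can be collapsed to one line; on the same df this should give (untested sketch):

ids_by_grade = df.groupby(1)[0].apply(list).to_dict()
# {'A': [1, 6], 'C': [3], 'I': [2, 4, 5]}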

You can iterate over the grades and build a boolean mask to select the rows matching a particular grade. This may require some refinement.
import numpy as np

grades = np.genfromtxt('data.txt', delimiter='/', skip_header=0, dtype='unicode')
res = {}
for grade in set(grades[:, 1].tolist()):
    res[grade] = grades[grades[:, 1] == grade][:, 0].tolist()
print(res)
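Note that iterating over a set yields an arbitrary order; numpy's np.unique returns the unique grades sorted, so a deterministic variant of the same idea would be (a sketch):

import numpy as np

grades = np.genfromtxt('data.txt', delimiter='/', dtype='unicode')
res = {}
for grade in np.unique(grades[:, 1]):  # unique grades, sorted
    res[grade] = grades[grades[:, 1] == grade][:, 0].tolist()
print(res)  # e.g. {'A': ['1', '6'], 'C': ['3'], 'I': ['2', '4', '5']}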


COUNTIFS: Excel to pandas and remove counted elements

I have a COUNTIFS equation in Excel, (COUNTIFS($A$2:$A$6, "<=" & $C4)) - SUM(D$2:D3), where A2:A6 is my_list, C4 is the current 'bin' holding the condition, and D* are the previously summed results from my_list that met the condition. I am attempting to implement this in Python.
I have looked at previous COUNTIF questions, but I am struggling to implement the final '-SUM(D$2:D3)' part of the code.
See the COUNTIFS($A$2:$A$6, "<=" & $C4) section below.
my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)
out = []
for iteration, num in enumerate(bins):
    n = []
    out.append(n)
    count = sum(1 for elem in my_list if elem <= num)
    n.append(count)
print(out)
out = [[1], [3], [4]]
I need to sum the elements that have already been counted and remove them from the next count so that they are not counted twice (the Excel -SUM(D$2:D3) part). This is where I need some help! I used enumerate to track iterations. I have tried the code below in the same loop, but I can't resolve this and I get errors:
count1 = sum(out[0:i[0]]) for i in (out)

and

count1 = out(n) - out(n-1)
See the expected output values in the out1 array below.
I was able to achieve the required output by adding an if/elif statement that factors out the previously counted elements and builds a new output array, out1. This works, but it may not be the most efficient way to achieve the end goal:
import numpy as np

my_list = (-1, -0.5, 0, 1, 2)
# bins = np.arange(-1.0, 1.05, 0.05)
bins = (-1, 0, 1)
out = []
out1 = []
for iteration, num in enumerate(bins):
    count = sum(1 for elem in my_list if elem <= num)
    out.append(count)
    if iteration == 0:
        count1 = out[iteration]
        out1.append(count1)
    elif iteration > 0:
        count1 = out[iteration] - out[iteration - 1]
        out1.append(count1)
print(out1)
I also tried the code below, as suggested in other answers, but it didn't work for me:

-np.diff([out])
print(out)
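For what it's worth, np.diff is very close to working here: the first count just needs to be kept, which the prepend argument (available in NumPy 1.16+) handles. A minimal sketch, assuming out = [1, 3, 4] as produced by the loop above:

import numpy as np

out = [1, 3, 4]  # cumulative counts from the loop above
out1 = np.diff(out, prepend=0).tolist()  # differences, keeping the first count
print(out1)  # [1, 2, 1]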

Formatting data in a CSV file (calculating average) in python

import csv

with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print(parts[4])

f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
Here the code is meant to read the CSV file; using the three score columns (column 0 being the user's name), it should add the three scores and divide by three, but it doesn't calculate a proper average: it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        scores = [int(s) for s in row[1:]]  # convert the score strings to integers
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX': max(scores),
            'MIN': min(scores),
            'AVE': 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether specific pieces are locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending the maximum and average to row doesn't help, because newlist.append(row[0:4]) keeps only the first four elements, so the values you appended are sliced away again.
row[1:3] only gives the second and the third element; row[1:4] gives what you want, as does row[1:]. Indexing in Python is normally end-exclusive.
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get the work done as quickly as possible with the least effort? Can I get this task done in 30 seconds?
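Concretely, applying both fixes above to the original loop (a sketch, keeping the question's variable names, with csv_f and newlist as defined there):

for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:4])             # widened from row[1:3]
    row.append(maximum)
    average = round(sum(row[1:4]) / 3)  # widened from row[1:3]
    row.append(average)
    newlist.append(row)                 # keep the appended values instead of slicing them off

With that change, the appended average sits at index 5, so the final list comprehension should collect x[5] rather than x[3].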
Here is one way to do it. See both parts. First, we create a dictionary with names as the key and a list of results as values.
import csv

fileLineList = []
averageScoreDict = {}

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter

fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list of 3 integers to John:
    # highest, lowest and average (which is a float we round).
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

averageScoreList = []

# Here we "unpack" the dictionary we have created into a list of keys
# (the names) and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])

# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)

print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]
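As an aside, pandas makes this kind of per-row statistic very short. A hedged sketch, assuming a headerless Class1scores.csv whose columns are a name followed by three integer scores:

import pandas as pd

df = pd.read_csv('Class1scores.csv', header=None)
df['avg'] = df.iloc[:, 1:4].mean(axis=1).round().astype(int)  # row-wise average of the 3 scores
print(df.sort_values('avg', ascending=False)[[0, 'avg']].values.tolist())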

Python. If value on column 1 (row X) = value from column 2 (row Y), print row Y of column 3

I have a .csv file, df, with 3 columns (C1, C2 and C3). All columns are of the same length (approx. 600000 rows) and have unique values. Values in C1, which represent SNPs (single nucleotide polymorphisms), are ordered according to their location on chromosomes. C2 has the same values as C1, but disordered. Values in C2 are coupled to corresponding values (chromosome locations) in the same row of C3. What I want to do is couple the chromosomal locations in C3 to the values in C1, keeping the column order of C1; in other words, generate another column with chromosome locations for the ordered SNPs in C1. So far, I tried to create a dictionary with keys from C2 and values from C3 and then use a for loop to match values in C1 and print the ordered chromosome positions, but I just get C3. I understand why I get that, but I don't manage to get what I want.
Any suggestion/help would be welcome. I am new into programming.
import csv
from collections import OrderedDict  # to save key order
import sys

sys.stdout = open("output1.csv", "w")

# C1 = rows[0], C2 = rows[1], C3 = rows[2]
with open('df1.csv', 'rU') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    d = OrderedDict((rows[1], rows[2]) for rows in reader)

for rows in reader:
    if rows[0] in d:
        print(rows[2])
Input example:
C1        C2        C3
12082473  2980300   785989
11240776  4245756   799463
2980300   12082473  740857
2905036   2341354   918573
4245756   3748597   888659
3748597   11240776  765269
2341354   2905036   792480
2465126   2465126   947034
Desired output:
C1        C4
12082473  740857
11240776  765269
2980300   785989
2905036   792480
4245756   799463
3748597   888659
2341354   918573
2465126   947034
I am not entirely sure I understand what you are trying to do.
I think your error is from using the generator expression d = OrderedDict((rows[0], rows[3]) for rows in reader1) and then referring to it after the file has been closed at the end of the with block.
You might try something along these lines:
import csv
from collections import OrderedDict

d = OrderedDict()

with open('df1.csv', 'rU') as csv1, open('df2.csv', 'rU') as csv2:
    reader1 = csv.reader(csv1, delimiter=',')
    reader2 = csv.reader(csv2, delimiter=',')
    next(reader1)  # skip header
    next(reader2)  # skip header
    for row in reader1:
        d[row[0]] = row[3]
        # d = OrderedDict(("a", "b") for rows in reader1)
    for row in reader2:
        if row[0] in d:
            print(d[row[0]])
I do not see any reason you need an OrderedDict since this is just a mapping between row[0] and row[3] as written. You are not using the order currently.
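For completeness, here is a minimal single-file sketch of what the question seems to ask for (assuming all three columns live in df1.csv and every C1 value also appears in C2): build the C2 -> C3 mapping first, then walk C1 in its original order:

import csv

with open('df1.csv', 'rU') as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    rows = list(reader)

lookup = dict((row[1], row[2]) for row in rows)  # C2 -> C3

with open('output1.csv', 'w') as out:
    out.write('C1,C4\n')
    for row in rows:
        out.write('%s,%s\n' % (row[0], lookup[row[0]]))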

Grouping multiple rows in R

I've generated a heatmap in R for microbiome data, using the following link
My data as far as rows is concerned looks like this:
781
782
783
547
519
575
044
045
049
If I want to group 781-783, 547-575 and 044-049 as individual groups and give them separate colours using the below idea:
Assigning animals to different groups (2 random groups in this case)
var1 <- round(runif(n=12, min=1, max=2))
var1 <- replace (var1, which(var1 == 1), "deepskyblue")
var1 <- replace (var1, which(var1 == 2), "magenta")
cbind(row.names(data.prop), var1)
How do I go about it? I understand that the above code, randomly generates 2 groups, but how can I specify which rows go into which group?
Thank you,
Susheel
Because rownames are necessarily character, and the only good range operator in R is ":" for numeric values, you need to coerce the numeric ranges to the desired "0nn" format. This is untested in the absence of a proper test case (which questioners are asked to provide):
# look at...
sprintf("%03i", c(781:783, 547:575, 44:49))
# then...
data.prop[sprintf("%03i", c(781:783, 547:575, 44:49)), 'var1'] <-
    mapply(function(clr, rng) {rep(clr, length(rng))},
           c("deepskyblue", "magenta", "green"),
           list(781:783, 547:575, 44:49))

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z){
    a <- subset(x, (x$var1 == y))
    b <- subset(a, (a$var2 == z))
    n <- nrow(b)
    return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar: instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1).
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments, like foobar(foo, df), and use y[, c(var1)] and y[, c(var2)] instead of the two parameters y and z, it still doesn't work. What is the way to do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
library(plyr)

match_df <- function(x, match) {
    vars <- names(match)
    # Create unique id for each row
    x_id <- id(match[vars])
    match_id <- id(x[vars])
    # Match identifiers and return subsetted data frame
    x[match(x_id, match_id, nomatch = 0), ]
}

match_df(foo, df)
#   var1 var2
# 1    1    A
# 3    1    B
# 5    2    C
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at figuring out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that the use of c() is not needed here; you can reference the column you want by placing its name in quotes OR by number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column named var1, which has length three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your dataframe, grab a single value, process it, and then move to the next row of the data.frame. That is what the apply call does. It is equivalent to something along the lines of for (i in 1:nrow(df)), but is a more idiomatic way of handling such issues.
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treat these as numeric, in my opinion. Compare:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
    .dofun <- function(y, z){
        a <- subset(x, x$var1 == y)
        b <- subset(a, a$var2 == z)
        n <- nrow(b)
        return(n)
    }
    ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
    names(ans) <- NULL
    return(ans)
}