Grouping CSV file by ID and extracting JSON column

I currently have a CSV like this:
A B C
1 10 {"a":"one","b":"two","c":"three"}
1 10 {"a":"four","b":"five","c":"six"}
1 10 {"a":"seven","b":"eight","c":"nine"}
1 10 {"a":"ten","b":"eleven","c":"twelve"}
2 10 {"a":"thirteen","b":"fourteen","c":"fifteen"}
2 10 {"a":"sixteen","b":"seventeen","c":"eighteen"}
2 10 {"a":"nineteen","b":"twenty","c":"twenty-one"}
3 10 {"a":"twenty-two","b":"twenty-three","c":"twenty-four"}
3 10 {"a":"twenty-five","b":"twenty-six","c":"twenty-seven"}
3 10 {"a":"twenty-eight","b":"twenty-nine","c":"thirty"}
3 10 {"a":"thirty-one","b":"thirty-two","c":"thirty-three"}
I want to group by column A, ignore column B, and take only the "b" field from the JSON in C, to get output like:
A C
1 ['two','five','eight','eleven']
2 ['fourteen','seventeen','twenty']
3 ['twenty-three','twenty-six','twenty-nine','thirty-two']
Can I do this? I have pandas if that will be useful! Also, I would like the output file to be tab-delimited.

Try this:
import pandas as pd
import json
# read file that looks exactly as given above
df = pd.read_csv("file.csv", sep=r"\s+")  # delim_whitespace=True also works but is deprecated in newer pandas
# drop the 'B' column
del df['B']
# 'C' will start life as a string. convert from json, extract values, return as list
df['C'] = df['C'].map(lambda x: json.loads(x)['b'])
# 'C' now holds just the 'b' values. group these together:
df = df.groupby('A').C.apply(list)
print(df)
This returns:
A
1 [two, five, eight, eleven]
2 [fourteen, seventeen, twenty]
3 [twenty-three, twenty-six, twenty-nine, thirty...
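The question also asks for a tab-delimited output file. The grouped result above is a Series indexed by A, so it can be written out like this (a small sketch; "out.tsv" is an assumed file name):
# turn the Series back into columns A and C, then write with a tab separator
df.reset_index().to_csv("out.tsv", sep="\t", index=False)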

IIUC (if I understand correctly), and assuming column C has already been parsed into dicts with json.loads:
df.groupby('A').C.apply(lambda x: [y['b'] for y in x])
A
1 [two, five, eight, eleven]
2 [fourteen, seventeen, twenty]
3 [twenty-three, twenty-six, twenty-nine, thirty...
Name: C, dtype: object
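To make this answer self-contained (column C starts life as JSON strings, so it must be parsed first), a combined sketch:
import json
import pandas as pd

df = pd.read_csv("file.csv", sep=r"\s+")
# parse each JSON string and keep only its 'b' value, grouped by A
df.groupby('A').C.apply(lambda col: [json.loads(s)['b'] for s in col])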

Alternative to extract function when working with raster objects

I wonder how to sum the pixel values of a raster (val_r) for each category of another raster (cat_r). In other words, does an alternative to the function "extract" exist when working with raster objects? Thank you very much!
library(raster)
# sample raster with categories
cat_r <- raster(ncol=3, nrow=3, xmn=-10, xmx=10, ymn=-10, ymx=10)
cat_r[] <- c(1,2,1,3,4,3,4,4,4)  # 4 categories: 1, 2, 3 and 4
# sample raster with pixel values
val_r <- raster(ncol=3, nrow=3, xmn=-10, xmx=10, ymn=-10, ymx=10)
val_r[] <- c(1,0,1,5,2,5,2,2,2)
# extract() doesn't work for this:
extract(val_r, cat_r, fun=sum)
# I should find the following values: category 1: 2, cat 2: 0, cat 3: 10, cat 4: 8
You can use the zonal method:
library(raster)
cat_r <- raster(ncol=3, nrow=3, xmn=-10, xmx=10, ymn=-10, ymx=10, vals=c(1,2,1,3,4,3,4,4,4))
val_r <- setValues(cat_r, c(1,0,1,5,2,5,2,2,2))
zonal(val_r, cat_r, "sum")
# zone sum
#[1,] 1 2
#[2,] 2 0
#[3,] 3 10
#[4,] 4 8
This is equivalent to
s <- stack(cat_r, val_r)
v <- values(s)
tapply(v[,2], v[,1], sum)
# 1 2 3 4
# 2 0 10 8

Transform a CSV of Ids into a CSV of Names

I need to transform a csv of Ids into a csv of Names.
I have:
FOLDER table:
ID  NAME
1   A
2   AB
3   B
4   BC
5   BCD

FILE table:
ID  NAME  PATH
1   fX    1
2   fZ    1,2
3   fY    3,4
4   fW    3,4,5
I get info about FILEs and their sizes from the FILEDATA table:
SELECT FILE.NAME, FILE.PATH, FILEDATA.SIZE
FROM FILEDATA INNER JOIN FILE ON FILEDATA.fileid = FILE.id
WHERE FILEDATA.PropName = 'Size'
Currently I get:
fX 1 23805
fZ 1,2 27205
fY 3,4 23608
fW 3,4,5 21501
I need to replace the IDs in PATH with the FOLDER names:
fX A 23805
fZ A/AB 27205
fY B/BC 23608
fW B/BC/BCD 21501
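One way to do the replacement outside SQL is a pandas sketch like the one below; the frame names and the post-processing approach are assumptions, not part of the original question:
import pandas as pd

# hypothetical frames standing in for the FOLDER table and the query result
folders = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                        'NAME': ['A', 'AB', 'B', 'BC', 'BCD']})
files = pd.DataFrame({'NAME': ['fX', 'fZ', 'fY', 'fW'],
                      'PATH': ['1', '1,2', '3,4', '3,4,5'],
                      'SIZE': [23805, 27205, 23608, 21501]})

# map folder ids (as strings, since PATH is a string) to folder names
id_to_name = dict(zip(folders['ID'].astype(str), folders['NAME']))

# split each PATH on commas, map every id to its name, rejoin with '/'
files['PATH'] = files['PATH'].map(
    lambda p: '/'.join(id_to_name[i] for i in p.split(',')))
print(files)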

Write items from a list to a CSV file column by column using pandas DataFrame.to_csv

I have a list named items
items=['a' , 'b','c']
Code is:
import pandas
df = pandas.DataFrame(items)
df.to_csv("myfile.csv", header=False, index=False)  # note: header, not headers
The values are written to the file in different rows but the same column (vertically).
But I want the values written as: a b c, i.e. in the same row but different columns.
Help please
You get each element in a different row because you load the DataFrame that way.
If you want them in different columns, transpose the DataFrame:
df = df.T
or you can load the data as a single row, like below:
items=[['a' , 'b','c']]
df = pd.DataFrame(items)
df
Out[22]:
0 1 2
0 a b c
And then write the output to csv, e.g.:
df = pd.DataFrame(items)
df = df.T
df.to_csv("myfile.csv", header=False, index=False)  # myfile.csv now holds: a,b,c
df = pd.DataFrame(items)
df
Out[5]:
0
0 a
1 b
2 c
df.T
Out[11]:
0 1 2
0 a b c

Iterating through CSV reader to slice data frame

I have a data frame that contains 508383 rows. I am only showing the first 10 rows.
0 1 2
0 chr3R 4174822 4174922
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
I want to iterate through the rows and compare the value in column 2 of the first row to the value in each following row. When the difference exceeds 5000, I want to slice the data frame from the first row up to the previous row and make that a subset data frame.
I then want to repeat this process from the next row onward and create a second subset data frame. I've only managed to get this done by using csv.reader in combination with pandas.
Here is my code:
#!/usr/bin/env python
import pandas as pd
data = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None, index_col=None)
import csv
file = open('sort_cov_emb_sg.bed')
readCSV = csv.reader(file, delimiter="\t")
first_row = readCSV.next()
print first_row
count_1 = 0
while count_1 < 100000:
    next_row = readCSV.next()
    value_1 = int(next_row[1]) - int(first_row[1])
    count_1 = count_1 + 1
    if value_1 < 5000:
        continue
    else:
        break
print next_row
print count_1
print value_1
window_1 = data[0:63]
print window_1
first_row = readCSV.next()
print first_row
count_2 = 0
while count_2 < 100000:
    next_row = readCSV.next()
    value_2 = int(next_row[1]) - int(first_row[1])
    count_2 = count_2 + 1
    if value_2 < 5000:
        continue
    else:
        break
print next_row
print count_2
print value_2
window_2 = data[0:74]
print window_2
I wanted to know if there is a better way to do this process (without repeating the code every time) and get all the subset data frames I need.
Thanks.
Rodrigo
This is yet another example of the compare-cumsum-groupby pattern. Using only the rows you showed (and so changing the threshold to 100 instead of 5000):
jumps = df[2] > df[2].shift() + 100
grouped = df.groupby(jumps.cumsum())
for k, group in grouped:
    print(k)
    print(group)
produces
0
0 1 2
0 chr3R 4174822 4174922
1
0 1 2
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
2
0 1 2
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
This works because the comparison gives us a new True every time a new group starts, and when we take the cumulative sum of that, we get what is effectively a group id, which we can group on:
>>> jumps
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: 2, dtype: bool
>>> jumps.cumsum()
0 0
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: 2, dtype: int32
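To collect every subset frame in one pass, which is what the question's repeated blocks were building toward, a short sketch at the original 5000 threshold:
import pandas as pd

df = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None)

# True wherever a new window starts; the cumulative sum labels each window
jumps = df[2] > df[2].shift() + 5000
windows = [group for _, group in df.groupby(jumps.cumsum())]
# windows[0], windows[1], ... are the subset data frames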

How to select/add a column to a pandas DataFrame based on a non-trivial function of other columns

This is a followup question for this one: how to select/add a column to pandas dataframe based on a function of other columns?
I have a data frame and I want to select the rows that match some criteria. The criteria are a function of the values of other columns and some additional values.
Here is a toy example:
>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
...                    'B': [randint(1,9) for x in range(9)],
...                    'C': [4,10,3,5,4,5,3,7,1]})
>>> df
A B C
0 1 6 4
1 2 8 10
2 3 8 3
3 4 4 5
4 5 2 4
5 6 1 5
6 7 1 3
7 8 2 7
8 9 8 1
I want to select all rows for which some non-trivial function returns True, e.g. f(a, c, L), where L is a list of lists and f returns True iff a and c are not part of the same sublist.
That is, if L = [[1,2,3],[4,2,10],[8,7,5,6,9]] I want to get:
A B C
0 1 6 4
3 4 4 5
4 5 2 4
6 7 1 3
8 9 8 1
Thanks!
Here is a VERY VERY hacky and inelegant solution. As another disclaimer: since your question doesn't state what should happen when a value appears in none of the sublists, this code doesn't handle that case beyond the default behaviour of isin().
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
                   'B': [6,8,8,4,2,1,1,2,8],
                   'C': [4,10,3,5,4,5,3,7,1]})
L = [[1,2,3],[4,2,10],[8,7,5,6,9]]
df['passed1'] = df['A'].isin(L[0])
df['passed2'] = df['C'].isin(L[0])
df['1&2'] = (df['passed1'] ^ df['passed2'])
df['passed4'] = df['A'].isin(L[1])
df['passed5'] = df['C'].isin(L[1])
df['4&5'] = (df['passed4'] ^ df['passed5'])
df['passed7'] = df['A'].isin(L[2])
df['passed8'] = df['C'].isin(L[2])
df['7&8'] = (df['passed7'] ^ df['passed8'])
df['PASSED'] = (df['1&2'] & df['4&5']) ^ df['7&8']
del df['passed1'], df['passed2'], df['1&2'], df['passed4'], df['passed5'], df['4&5'], df['passed7'], df['passed8'], df['7&8']
df = df[df['PASSED']]
del df['PASSED']
With an output that looks like:
A B C
0 1 6 4
3 4 4 5
4 5 2 4
6 7 1 3
8 9 8 1
I implemented this rather quickly, hence the utter and complete ugliness of this code, but I believe you can refactor it any way you would like (e.g. iterate over the original set of lists with for sub_list in L, improve variable names, come up with a better solution, etc.); see the sketch after this answer.
Hope this helps. Oh, and did I mention this was hacky and not very good code? Because it is.
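Following the refactoring suggestion above, a minimal sketch of the loop-based version (this rewrite is mine, not the original answer's): mark a row as failing whenever A and C fall inside the same sublist, then keep the rest.
import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
                   'B': [6,8,8,4,2,1,1,2,8],
                   'C': [4,10,3,5,4,5,3,7,1]})
L = [[1,2,3],[4,2,10],[8,7,5,6,9]]

# a row fails when A and C both fall in the same sublist;
# OR the per-sublist masks together, then negate
same_sublist = pd.Series(False, index=df.index)
for sub_list in L:
    same_sublist |= df['A'].isin(sub_list) & df['C'].isin(sub_list)

print(df[~same_sublist])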