Iterating through CSV reader to slice data frame


I have a data frame that contains 508383 rows. I am only showing the first 10 rows.
0 1 2
0 chr3R 4174822 4174922
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
I want to iterate through the rows and compare the value in column #2 of the first row with the value in each following row, checking whether the difference between these values is less than 5000. As soon as the difference is greater than 5000, I want to slice the data frame from the first row to the previous row and keep this as a subset data frame.
I then want to repeat this process, starting from the next row, to create a second subset data frame, and so on. I've only managed to get this done by using the CSV reader in combination with Pandas.
Here is my code:
#!/usr/bin/env python
import pandas as pd
data = pd.read_csv('sort_cov_emb_sg.bed', sep='\t', header=None, index_col=None)
import csv
file = open('sort_cov_emb_sg.bed')
readCSV = csv.reader(file, delimiter="\t")
first_row = readCSV.next()
print first_row
count_1 = 0
while count_1 < 100000:
    next_row = readCSV.next()
    value_1 = int(next_row[1]) - int(first_row[1])
    count_1 = count_1 + 1
    if value_1 < 5000:
        continue
    else:
        break
print next_row
print count_1
print value_1
window_1 = data[0:63]
print window_1
first_row = readCSV.next()
print first_row
count_2 = 0
while count_2 < 100000:
    next_row = readCSV.next()
    value_2 = int(next_row[1]) - int(first_row[1])
    count_2 = count_2 + 1
    if value_2 < 5000:
        continue
    else:
        break
print next_row
print count_2
print value_2
window_2 = data[0:74]
print window_2
I wanted to know if there is a better way to do this process (without repeating the code every time) and get all the subset data frames I need.
Thanks.
Rodrigo

This is yet another example of the compare-cumsum-groupby pattern. Using only rows you showed (and so changing the diff to 100 instead of 5000):
jumps = df[2] > df[2].shift() + 100
grouped = df.groupby(jumps.cumsum())
for k, group in grouped:
    print(k)
    print(group)
produces
0
0 1 2
0 chr3R 4174822 4174922
1
0 1 2
1 chr3R 4175400 4175500
2 chr3R 4175466 4175566
3 chr3R 4175521 4175621
4 chr3R 4175603 4175703
5 chr3R 4175619 4175719
6 chr3R 4175692 4175792
2
0 1 2
7 chr3R 4175889 4175989
8 chr3R 4175966 4176066
9 chr3R 4176044 4176144
This works because the comparison gives us a new True every time a new group starts, and when we take the cumulative sum of that, we get what is effectively a group id, which we can group on:
>>> jumps
0 False
1 True
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: 2, dtype: bool
>>> jumps.cumsum()
0 0
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
Name: 2, dtype: int32
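For anyone who wants to try this without the original .bed file, here is a minimal self-contained sketch of the same compare-cumsum-groupby idea. The helper name split_on_jumps and its parameters are made up for illustration; the data is just the ten rows shown in the question, and the threshold is 100 as above rather than 5000.

```python
import pandas as pd

# The ten rows shown in the question, with integer column labels 0, 1, 2.
df = pd.DataFrame({
    0: ["chr3R"] * 10,
    1: [4174822, 4175400, 4175466, 4175521, 4175603,
        4175619, 4175692, 4175889, 4175966, 4176044],
    2: [4174922, 4175500, 4175566, 4175621, 4175703,
        4175719, 4175792, 4175989, 4176066, 4176144],
})


def split_on_jumps(frame, column, threshold):
    """Hypothetical helper: split frame wherever column rises by more than threshold."""
    # True on every row that starts a new window (the first row compares against NaN, so False).
    jumps = frame[column] > frame[column].shift() + threshold
    # The running total of the True values acts as a window id to group on.
    return [group for _, group in frame.groupby(jumps.cumsum())]


windows = split_on_jumps(df, 2, 100)
sizes = [len(w) for w in windows]
print(sizes)  # [1, 6, 3]
```

Each element of windows is an ordinary data frame, so windows[0], windows[1], ... play the role of window_1, window_2, ... in the question. Note that this compares each row with the row directly before it, not with the first row of the current window.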

Related

Count cells with same string in dynamic range

I've read many articles on Google and StackOverflow, but haven't found any that mention how to count cells (under the same column) containing same string value. The count only considers a part of the sheet: many cells are added/removed in a short time, so the range keeps changing length. In the same sheet there are several ranges, separated by a blank row.
The counters should refer to a single range (counter_1 --> range_1; counter_2 --> range_2 , etc.).
e.g.: if cells can show 4 different options AND there are 5 dynamic ranges in the sheet --> there will be 4 counters for each range (4*5).
Following several websites (like this, this and this), I attempted to implement this check function directly from the sheet, without involving AppsScript.
E.g.: if I add these functions in D2, E2, F2, G2 (see the table below for reference):
COUNTIF(B2:B2,"1st option") in D2 ; COUNTIF(B2:B2,"2nd option") in E2 ; COUNTIF(B2:B2,"3rd option") in F2 ; COUNTIF(B2:B2,"4th option") in G2
Each counter will check its condition and update its cell value. This will be done only for cells grouped under "1st department".
The problem is that I have to add 16 counters manually (4 options for 4 departments) and, if an item is added/removed, all counters will throw an error. I can't divide departments in different sheets as a workaround.
My sheet is as follows:
Department  Option        1st option counter  2nd option counter  3rd option counter  4th option counter
1st         "2nd option"  0                   1                   0                   0
2nd         "1st option"  1                   1                   0                   0
2nd         "2nd option"
3rd         "4th option"  0                   1                   0                   1
3rd         "2nd option"
4th         "3rd option"  0                   2                   1                   0
4th         "2nd option"
4th         "2nd option"
After some items were added/removed:
Department  Option        1st option counter  2nd option counter  3rd option counter  4th option counter
1st         "2nd option"  0                   2                   0                   0
1st         "2nd option"
2nd         "1st option"  2                   1                   0                   0
2nd         "2nd option"
2nd         "1st option"
3rd         "4th option"  0                   1                   1                   1
3rd         "2nd option"
3rd         "3rd option"
4th         "3rd option"  0                   0                   2                   0
4th         "3rd option"
Any help would be appreciated.

Here's a non-array answer. The only reason this might be helpful compared to the other two answers is if you have a lot of calculations going on and you begin to hit performance issues. The obvious drawback of the formula below is that you have to reapply it by dragging it down after changes are made. You could build an Apps Script that reapplies the formula in R1C1 notation during an onEdit event.
Put this in all cells in columns D:G, assuming D1:G1 hold the text to match (i.e. D1 = 1st option):
=if(And($A2<>"",OR(Row($A2)=2,$A1="")),SUMPRODUCT((--($A:$A=$A2))*(--(D$1=$B:$B))),)
Again the first two answers offer a dynamic solution, which is probably better, but I figured I'd add this just for illustration or maybe to ignite some other ideas.

You can try with this formula in D2:
=MAKEARRAY(ROWS(A2:A);4;LAMBDA(r;c;IF(AND(INDEX(A2:A;r)<>"";INDEX(A1:A;r)<>INDEX(A2:A;r));COUNTIFS(A2:A;INDEX(A2:A;r);B2:B;INDEX(D1:1;c));"")))
You can see it working here

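function countcellswithsamestring() {
  const ss = SpreadsheetApp.getActive();
  const sh = ss.getSheetByName("Sheet0");
  const osh = ss.getSheetByName("Sheet1");
  const sr = 2; // data start row
  // Note: the column count below is sh.getLastRow(), so with the sample data (last row 12)
  // the range spans columns A:L and the empty cells in K:L are counted as well.
  // Use sh.getLastColumn() there to read only the populated columns.
  const rg = sh.getRange(sr, 1, sh.getLastRow() - sr + 1, sh.getLastRow());
  const row = rg.getRow();
  const col = rg.getColumn();
  const vs = rg.getDisplayValues();
  // co maps each distinct string to its count and cell locations; pA keeps first-seen order
  let co = {pA:[]};
  vs.forEach((r,i) => {
    r.forEach((c,j) => {
      if(!co.hasOwnProperty(c)) {
        co[c] = {count:1,loc:[sh.getRange(row + i,col + j).getA1Notation()]}
        co.pA.push(c);
      } else {
        co[c].count++;
        co[c].loc.push(sh.getRange(row + i,col + j).getA1Notation())
      }
    })
  })
  // one output row per distinct string: [string, count, comma-separated locations]
  let o = co.pA.map(c => [c,co[c].count,co[c].loc.join(',')]);
  osh.clearContents();
  o.unshift(["String","Count","Locations"])
  osh.getRange(1,1, o.length,o[0].length).setValues(o);
}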
Data:
COL1  COL2  COL3  COL4  COL5  COL6  COL7  COL8  COL9  COL10
6     10    0     5     1     2     4     5     2     3
5     7     1     5     8     0     9     8     3     8
5     1     5     5     0     4     8     6     0     3
7     4     0     6     3     8     9     8     3     5
4     7     5     1     7     9     4     6     3     9
0     0     0     7     4     7     9     2     6     1
4     2     10    10    4     4     6     6     6     9
7     0     10    0     2     10    8     0     8     1
0     0     0     0     6     9     1     4     7     8
8     9     5     3     5     8     1     4     1     6
9     5     6     7     1     4     2     5     8     7
Output:
String   Count  Locations
6        11     A2,H4,D5,H6,I7,G8,H8,I8,E10,J11,C12
10       5      B2,C8,D8,C9,F9
0        15     C2,F3,E4,I4,C5,A7,B7,C7,B9,D9,H9,A10,B10,C10,D10
5        13     D2,H2,A3,D3,A4,C4,D4,J5,C6,C11,E11,B12,H12
1        10     E2,C3,B4,D6,J7,J9,G10,G11,I11,E12
2        6      F2,I2,H7,B8,E9,G12
4        12     G2,F4,B5,A6,G6,E7,A8,E8,F8,H10,H11,F12
3        7      J2,I3,J4,E5,I5,I6,D11
(blank)  22     K2,L2,K3,L3,K4,L4,K5,L5,K6,L6,K7,L7,K8,L8,K9,L9,K10,L10,K11,L11,K12,L12
7        10     B3,A5,B6,E6,D7,F7,A9,I10,D12,J12
8        12     E3,H3,J3,G4,F5,H5,G9,I9,J10,A11,F11,I12
9        9      G3,G5,F6,J6,G7,J8,F10,B11,A12

Grouping CSV file by ID and extracting JSON column

I currently have a CSV like this:
A B C
1 10 {"a":"one","b":"two","c":"three"}
1 10 {"a":"four","b":"five","c":"six"}
1 10 {"a":"seven","b":"eight","c":"nine"}
1 10 {"a":"ten","b":"eleven","c":"twelve"}
2 10 {"a":"thirteen","b":"fourteen","c":"fifteen"}
2 10 {"a":"sixteen","b":"seventeen","c":"eighteen"}
2 10 {"a":"nineteen","b":"twenty","c":"twenty-one"}
3 10 {"a":"twenty-two","b":"twenty-three","c":"twenty-four"}
3 10 {"a":"twenty-five","b":"twenty-six","c":"twenty-seven"}
3 10 {"a":"twenty-eight","b":"twenty-nine","c":"thirty"}
3 10 {"a":"thirty-one","b":"thirty-two","c":"thirty-three"}
I want to group by column A, ignore column B, and take only the "b" field in C, and get an output like:
A C
1 ['two','five','eight','eleven']
2 ['fourteen','seventeen','twenty']
3 ['twenty-three','twenty-six','twenty-nine','thirty-two']
Can I do this? I have pandas if that will be useful! Also I would like the output file to be tab delimited.

Try this:
import pandas as pd
import json
# read file that looks exactly as given above
# (sep=r"\s+" splits on whitespace; delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv("file.csv", sep=r"\s+")
# drop the 'B' column
del df['B']
# 'C' will start life as a string. convert from json, extract values, return as list
df['C'] = df['C'].map(lambda x: json.loads(x)['b'])
# 'C' now holds just the 'b' values. group these together:
df = df.groupby('A').C.apply(list)
print(df)
This returns:
A
1 [two, five, eight, eleven]
2 [fourteen, seventeen, twenty]
3 [twenty-three, twenty-six, twenty-nine, thirty...

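IIUC (if I understand correctly), and assuming column C already holds parsed dicts rather than JSON strings: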
df.groupby('A').C.apply(lambda x : [y['b'] for y in x ])
A
1 [two, five, eight, eleven]
2 [fourteen, seventeen, twenty]
3 [twenty-three, twenty-six, twenty-nine, thirty...
Name: C, dtype: object
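Neither snippet above writes the tab-delimited file the question asks for, so here is a rough end-to-end sketch. It reads a shortened copy of the sample from an in-memory string (the variable names raw, result and out are just for illustration) and writes to a StringIO buffer; passing a file path to to_csv instead should work the same way. quoting=csv.QUOTE_NONE is a defensive choice so the double quotes inside the JSON are kept as literal characters.

```python
import csv
import io
import json

import pandas as pd

# Shortened version of the sample data, whitespace-delimited as in the question.
raw = """A B C
1 10 {"a":"one","b":"two","c":"three"}
1 10 {"a":"four","b":"five","c":"six"}
2 10 {"a":"thirteen","b":"fourteen","c":"fifteen"}
3 10 {"a":"twenty-two","b":"twenty-three","c":"twenty-four"}
"""

# Split on whitespace and keep the JSON quotes untouched.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", quoting=csv.QUOTE_NONE)

# Parse each JSON string and keep only its "b" field.
df["b"] = df["C"].map(lambda s: json.loads(s)["b"])

# Collect the "b" values per A into lists; column B is simply never used.
result = df.groupby("A")["b"].apply(list).reset_index(name="C")

# Write tab-delimited output (to a buffer here; a filename works too).
out = io.StringIO()
result.to_csv(out, sep="\t", index=False)
print(out.getvalue())
```

The lists are written using Python's own list formatting, e.g. ['two', 'five'], which matches the layout requested in the question.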

how to select/add a column to pandas dataframe based on a non trivial function of other columns

This is a followup question for this one: how to select/add a column to pandas dataframe based on a function of other columns?
I have a data frame and I want to select the rows that match some criterion. The criterion is a function of the values of other columns and some additional values.
Here is a toy example:
>> df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
'B': [randint(1,9) for x in xrange(9)],
'C': [4,10,3,5,4,5,3,7,1]})
>>
A B C
0 1 6 4
1 2 8 10
2 3 8 3
3 4 4 5
4 5 2 4
5 6 1 5
6 7 1 3
7 8 2 7
8 9 8 1
I want to select all rows for which some non-trivial function returns True, e.g. f(a,c,L), where L is a list of lists and f returns True iff a and c are not part of the same sublist.
That is, if L = [[1,2,3],[4,2,10],[8,7,5,6,9]] I want to get:
A B C
0 1 6 4
3 4 4 5
4 5 2 4
6 7 1 3
8 9 8 1
Thanks!

Here is a VERY VERY hacky and non-elegant solution. As another disclaimer, since your question doesn't state what you want to do if a number in the column is in none of the sub lists this code doesn't handle that in any real way besides any default functionality within isin().
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
'B': [6,8,8,4,2,1,1,2,8],
'C': [4,10,3,5,4,5,3,7,1]})
L = [[1,2,3],[4,2,10],[8,7,5,6,9]]
df['passed1'] = df['A'].isin(L[0])
df['passed2'] = df['C'].isin(L[0])
df['1&2'] = (df['passed1'] ^ df['passed2'])
df['passed4'] = df['A'].isin(L[1])
df['passed5'] = df['C'].isin(L[1])
df['4&5'] = (df['passed4'] ^ df['passed5'])
df['passed7'] = df['A'].isin(L[2])
df['passed8'] = df['C'].isin(L[2])
df['7&8'] = (df['passed7'] ^ df['passed8'])
df['PASSED'] = df['1&2'] & df['4&5'] ^ df['7&8']
del df['passed1'], df['passed2'], df['1&2'], df['passed4'], df['passed5'], df['4&5'], df['passed7'], df['passed8'], df['7&8']
df = df[df['PASSED'] == True]
del df['PASSED']
With an output that looks like:
A B C
0 1 6 4
3 4 4 5
4 5 2 4
6 7 1 3
8 9 8 1
I implemented this rather quickly hence the utter and complete ugliness of this code, but I believe you can refactor it any way you would like (e.g. iterate over the original set of lists with for sub_list in L, improve variable names, come up with a better solution, etc).
Hope this helps. Oh, and did I mention this was hacky and not very good code? Because it is.
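As a possible refactoring along those lines, here is a sketch that writes the condition as a plain function and builds a boolean mask from it. The function name in_different_sublists is made up for illustration, and it assumes "not part of the same sublist" means that no single sublist contains both values (so a value that appears in no sublist at all always passes).

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "B": [6, 8, 8, 4, 2, 1, 1, 2, 8],
    "C": [4, 10, 3, 5, 4, 5, 3, 7, 1],
})
L = [[1, 2, 3], [4, 2, 10], [8, 7, 5, 6, 9]]


def in_different_sublists(a, c, sublists):
    """Return True iff no single sublist contains both a and c."""
    return not any(a in sub and c in sub for sub in sublists)


# Evaluate the function row by row on columns A and C to get a boolean mask.
mask = [in_different_sublists(a, c, L) for a, c in zip(df["A"], df["C"])]
selected = df[mask]
print(selected.index.tolist())  # [0, 3, 4, 6, 8]
```

This works for any number of sublists without adding helper columns, at the cost of a Python-level loop over the rows.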

octave: using find() on cell array {} subscript and assigning it to another cell array

This is an example in Section 6.3.1 Comma Separated Lists Generated from Cell Arrays of the Octave documentation (I browsed it through the doc command on the Octave prompt) which I don't quite understand.
in{1} = [10, 20, 30, 40, 50, 60, 70, 80, 90];
in{2} = inf;
in{3} = "last";
in{4} = "first";
out = cell(4, 1);
[out{1:3}] = find(in{1 : 3}); % line which I do not understand
So at the end of this section, we have in looking like:
in =
{
[1,1] =
10 20 30 40 50 60 70 80 90
[1,2] = Inf
[1,3] = last
[1,4] = first
}
and out looking like:
out =
{
[1,1] =
1 1 1 1 1 1 1 1 1
[2,1] =
1 2 3 4 5 6 7 8 9
[3,1] =
10 20 30 40 50 60 70 80 90
[4,1] = [](0x0)
}
Here, find is called with 3 output parameters (forgive me if I'm wrong on calling them output parameters, I am pretty new to Octave) from [out{1:3}], which represents the first 3 empty cells of the cell array out.
When I run find(in{1 : 3}) with 3 output parameters, as in:
[i,j,k] = find(in{1 : 3})
I get:
i = 1 1 1 1 1 1 1 1 1
j = 1 2 3 4 5 6 7 8 9
k = 10 20 30 40 50 60 70 80 90
which kind of explains why out looks like it does, but when I execute in{1:3}, I get:
ans = 10 20 30 40 50 60 70 80 90
ans = Inf
ans = last
which are the 1st to 3rd elements of the in cell array.
My question is: Why does find(in{1 : 3}) drop off the 2nd and 3rd entries in the comma separated list for in{1 : 3}?
Thank you.

The documentation for find should help you answer your question:
When called with 3 output arguments, find returns the row and column indices of non-zero elements (that's your i and j) and a vector containing the non-zero values (that's your k). That explains the 3 output arguments, but not why it only considers in{1}. To answer that you need to look at what happens when you pass 3 input arguments to find as in find (x, n, direction):
If three inputs are given, direction should be one of "first" or
"last", requesting only the first or last n indices, respectively.
However, the indices are always returned in ascending order.
so in{1} is your x (your data, if you like), in{2} is how many indices find should return (all of them in your case, since in{2} = Inf), and in{3} is whether find should return the first or last indices of the vector in{1} (last in your case).

Subsetting in a function to calculate a row total

I have a data frame with results for certain instruments, and I want to create a new column which contains the totals of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
   A   B   Total
1  10  30  40
2  15  40  55
3  20  50  70
4  25  60  85
5      70  70
6      80  80
7      90  90
I’ve tried various methods, but this way holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this way did succeed in combining two columns of different lengths, the ‘rowSums’ function returns NA instead of the correct totals for the rows where ‘a’ is missing. Besides that, my original data isn't in a list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how can I …
Subset data from a data frame on the basis of ‘Type’,
Insert these individual subsets of different lengths into a new data frame,
Add a ‘Total’ column to this data frame which is the correct sum of the individual subsets.
An added complication to this problem is that this needs to be done in a function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
  tmpNumberOfTypes <- c(levels(x$Type))
  for( i in tmpNumberOfTypes){
    subSetofData <- subset(x, Type = i, select = Value)
    if( i == 1) {
      totalDf <- subSetOfData }
    else{
      totalDf <- cbind(totalDf, subSetofData)}
  }
  return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to the comment of Joris (see below) I got a push in the right direction. However, when trying to translate his solution to my data frame, I run into additional problems. His proposed answer works, and gives me the following (correct) sum of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing idea that I’m going around in circles…

Two thoughts :
1) you could use na.rm=T in rowSums
2) How do you know which one has to go with which? You might add some indexing.
eg :
DF <- data.frame(
  type=c(rep("A",4),rep("B",6)),
  value = 1:10,
  stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum on the original dataframe
tapply(DF$value,DF$id,sum)
And, more importantly, get your dataframe in the correct form :
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10

TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
, Value = c(10,15,20,25,30,40,50,60,70,80,90)
, stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the <NA>, replace missing values with 0:
# tvWide[is.na(tvWide)] <- 0
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))
