R- collapse rows based on contents of two columns - mysql

I apologize in advance if this question is too specific or involved for this type of forum. I have been a long-time lurker on this site, and this is the first time I haven't been able to solve my issue by looking at previous questions, so I finally decided to post. Please let me know if there is a better place to post this, or if you have advice on making it clearer. Here goes.
I have a data.table with the following structure:
library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))
These are genomic coordinates with associated values. The file is sorted by chromosome ("chr") (1 through 22, then X, then Y), then by start and end position, so that the first row contains the lowest-numbered start position on chromosome 1, and the rows proceed sequentially through all data points on chromosome 1, then 2, etc. At this point, every single row has a start-end length of 1. After collapsing, the start-end lengths will vary depending on how many rows were collapsed and their distance from the adjacent row.
1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:
The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
The two adjacent rows have "start" coordinate within 500 of one another (if row 1 "start" = 1000, and row 2 "start" <= 1499, collapse these into a single row; if row1 = 1000 and row2 = 1500, keep separate)
The adjacent rows must have the same sign in the "meth.diff" column (i.e. even if the "chr" values match and the starts are within 500, if diff1 = +5 and diff2 = -5, keep the entries separate)
2nd: I would like to calculate coverage-weighted averages of the collapsed mean_KO/mean_WT columns, weighting by the corresponding coverage_KO/coverage_WT columns:
Ex: collapse 2 rows,
row 1 mean_1 = 5.0, coverage_1 = 20.
row 2 mean_1 =40.0, coverage_1 = 45.
weighted avg mean_1 = (((5.0*20)/(20+45)) + ((40.0*45)/(20+45))) = 29.23
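The arithmetic in this example can be checked with a short sketch. Python is used here only for illustration, and the helper name `weighted_mean` is hypothetical, not something from the question.

```python
def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w), i.e. the coverage-weighted mean."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# The two rows from the example: means 5.0 and 40.0, coverages 20 and 45.
result = weighted_mean([5.0, 40.0], [20, 45])
print(round(result, 2))  # 29.23
```

Summing `v * w` first and dividing once is algebraically the same as the per-row fractions written out above.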
What I would like the output to look like (except collapsed row means would be calculated and not in string form):
library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327,
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))",
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))",
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0",
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))",
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr",
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
Help with either part 1 or 2 or any advice is appreciated.
I have been using R for most of my data manipulations, but I am open to any language that can provide a solution. Thanks in advance.
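Since any language is welcome, here is a minimal Python sketch of part 1 only (the grouping rule). It assumes the rows are already sorted as described, compares each row with the row immediately before it, and treats a meth.diff of exactly 0 as non-positive; the function name `collapse_ranges` and the rounded meth.diff values are illustrative assumptions.

```python
def collapse_ranges(rows, max_gap=500):
    """Merge adjacent rows with the same chr, starts closer than max_gap,
    and the same sign of meth.diff.

    rows: (chr, start, end, meth_diff) tuples, sorted by chr and start.
    Returns (chr, start, end, n_rows_merged) tuples.
    """
    merged = []
    prev = None
    for chrom, start, end, diff in rows:
        same_group = (
            prev is not None
            and chrom == prev[0]
            and start - prev[1] < max_gap      # 1000 vs 1499 merges, 1500 does not
            and (diff > 0) == (prev[3] > 0)    # assumption: 0 counts as non-positive
        )
        if same_group:
            c, s, _, n = merged[-1]
            merged[-1] = (c, s, end, n + 1)    # extend the current range
        else:
            merged.append((chrom, start, end, 1))
        prev = (chrom, start, end, diff)
    return merged


# Coordinates from the question; meth.diff rounded to two decimals.
rows = [
    ("chr1", 842326, 842327, 9.35),
    ("chr1", 855423, 855424, 19.18),
    ("chr1", 855426, 855427, 29.67),
    ("chr1", 855739, 855740, -12.34),
    ("chrX", 153880833, 153880834, 50.58),
    ("chrX", 153880841, 153880842, 52.75),
    ("chrX", 154298086, 154298087, 46.58),
    ("chrX", 154298089, 154298090, 41.87),
]
collapsed = collapse_ranges(rows)
for r in collapsed:
    print(r)
```

This yields the five start/end ranges listed in dt_output; the weighted means of part 2 can then be computed per merged group.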

Related

R - specifying interaction contrasts for aov

How do I specify the contrasts (point estimates, 95% CIs and p-values) for the between-group differences of the within-group delta changes?
In the example below, I would be interested in the between-group difference (group = 1 minus group = 2) of the delta changes (time = 3 minus time = 1).
df and model:
demo3 <- read.csv("https://stats.idre.ucla.edu/stat/data/demo3.csv")
## Convert variables to factor
demo3 <- within(demo3, {
  group <- factor(group)
  time <- factor(time)
  id <- factor(id)
})
par(cex = .6)
demo3$time <- as.factor(demo3$time)
demo3.aov <- aov(pulse ~ group * time + Error(id), data = demo3)
summary(demo3.aov)
Neither of these chunks of code achieves my goal, correct?
library(emmeans)
m2 <- emmeans(demo3.aov, "group", by = "time")
pairs(m2)
m22 <- emmeans(demo3.aov, c("group", "time"))
pairs(m22)
Look at the documentation for emmeans::contrast and in particular the argument interaction. If I understand your question correctly, you might want
summary(contrast(m22, interaction = c("pairwise", "dunnett")),
infer = c(TRUE, TRUE))
which would compute Dunnett-style contrasts for time (each time vs. time1), and compare those for group1 - group2. The summary(..., infer = c(TRUE, TRUE)) part overrides the default that tests but not CIs are shown.
You could also do this in stages:
time.con <- contrast(m22, "dunnett", by = "group", name = "timediff")
summary(pairs(time.con, by = NULL), infer = c(TRUE, TRUE))
If you truly want just time 3 - time 1, then replace time.con with
time.con1 <- contrast(m22, list(`time3-time1` = c(-1, 0, 1, 0, 0)), by = "group")
(I don't know how many times you have. I assumed 5 in the above.)
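For intuition, the interaction contrast discussed here is a difference of differences between cell means. The sketch below (in Python, with made-up cell means that are NOT taken from demo3) shows only the point-estimate arithmetic; the confidence interval and p-value still have to come from the fitted model, as in the emmeans calls above.

```python
# Hypothetical cell means keyed by (group, time); illustrative values only.
cell_means = {
    (1, 1): 10.0, (1, 2): 12.0, (1, 3): 15.0,
    (2, 1): 10.0, (2, 2): 11.0, (2, 3): 12.0,
}


def delta(group, t_from=1, t_to=3):
    """Within-group change: mean at t_to minus mean at t_from."""
    return cell_means[(group, t_to)] - cell_means[(group, t_from)]


# Between-group difference (group 1 minus group 2) of the time 3 - time 1 changes.
interaction = delta(1) - delta(2)
print(interaction)  # 3.0
```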

COUNTIFS: Excel to pandas and remove counted elements

I have a COUNTIFS formula in Excel, (COUNTIFS($A$2:$A$6, "<=" & $C4))-SUM(D$2:D3), where A2 to A6 is my_list, C4 is the current 'bin' with the condition, and D* are the previously summed results from my_list that meet the condition. I am attempting to implement this in Python.
I have looked at previous COUNTIF questions but I am struggling to complete the final '-SUM(D$2:D3)' part of the code.
See the COUNTIFS($A$2:$A$6, "<=" & $C4) section below.
'''
my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)
out = []
for iteration, num in enumerate(bins):
    n = []
    out.append(n)
    count = sum(1 for elem in my_list if elem <= num)
    n.append(count)
print(out)
'''
out = [[1], [3], [4]]
I need to sum previous elements, that have already been counted, and remove these elements from the next count so that they are not counted twice ( Excel representation -SUM(D$2:D3) ). This is where I need some help! I used enumerate to track iterations. I have tried the code below in the same loop but I can't resolve this and I get errors:
'''
count1 = sum(out[0:i[0]]) for i in (out)
and
count1 = out(n) - out(n-1)
'''
See expected output values in 'out' array for bin conditions below:
I was able to achieve the required output array values by creating an additional if/elif statement to factor out previous array elements and generate a new output array 'out1'. This works but may not be the most efficient way to achieve the end goal:
'''
import numpy as np
my_list = (-1, -0.5, 0, 1, 2)
#bins = np.arange(-1.0, 1.05, 0.05)
bins = (-1, 0, 1)
out = []
out1 = []
for iteration, num in enumerate(bins):
    count = sum(1 for elem in my_list if elem <= num)
    out.append(count)
    if iteration == 0:
        count1 = out[iteration]
        out1.append(count1)
    elif iteration > 0:
        count1 = out[iteration] - out[iteration - 1]
        out1.append(count1)
print(out1)
'''
I also tried using the below code as suggested in other answers but this didn't work for me:
'''
-np.diff([out])
print(out)
'''
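One possible way to avoid the if/elif branch is to compute the cumulative counts first and then subtract each count's predecessor. The sketch below uses only the standard library and assumes the bins are in ascending order; the function name `counts_per_bin` is hypothetical.

```python
from bisect import bisect_right


def counts_per_bin(values, bins):
    """Count values <= each bin edge, excluding those already counted."""
    ordered = sorted(values)
    # Cumulative count of values <= each edge (the COUNTIFS part).
    cumulative = [bisect_right(ordered, edge) for edge in bins]
    # Subtract the previous cumulative count (the -SUM(D$2:D3) part).
    previous = [0] + cumulative[:-1]
    return [cur - prev for cur, prev in zip(cumulative, previous)]


my_list = (-1, -0.5, 0, 1, 2)
bins = (-1, 0, 1)
out1 = counts_per_bin(my_list, bins)
print(out1)  # [1, 2, 1]
```

If NumPy is preferred, `np.diff(out, prepend=0)` applied to the flat list of cumulative counts should give the same result; note that `np.diff` returns a new array rather than modifying `out`, so the result has to be assigned.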

How to add alphanumeric values in a spreadsheet if they are comma separated?

Suppose, we have cells as below:
Cell Value Legend
==========================
A1 1,A // A = 1
A2 2,AA // AA = 2
A3 3,L // L = -1
A4 4,N // N = 0
I want the total to be calculated separately in other cells as:
A5 = SUM(1, 2, 3, 4) = 1 + 2 + 3 + 4 = 10
A6 = SUM(1*A, 2*AA, 3*L, 4*N) = 1 + 4 - 3 + 0 = 2
Considering it may require separate functions in App Script, I tried to use SPLIT and SUM them, but it's not accepting the values. I asked a related question: How to pass multiple comma separated values in a cell to a custom function?
However, being a novice in spreadsheet, I am not sure if my approach is correct.
How to add alphanumeric values separately as stated above?
You can create a small lookup table (legend), here assumed to be in D1:E4, and then for the first sum try something like
=ArrayFormula(sum(iferror(REGEXEXTRACT(A1:A4, "[0-9-.]+")+0)))
and for the last
=sum(ArrayFormula(iferror(regexextract(A1:A4, "[0-9-.]+")*vlookup(regexextract(A1:A4, "[^,]+$"),D1:E4, 2, 0 ))))
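To make the logic of the two formulas explicit, here is the same calculation as a small Python sketch: split each cell on the comma, sum the numeric parts, then sum the numeric parts multiplied by the legend value of the code. The names `legend`, `cells` and `split_cell` are illustrative assumptions; the values come from the question.

```python
# Legend from the question: A = 1, AA = 2, L = -1, N = 0.
legend = {"A": 1, "AA": 2, "L": -1, "N": 0}
cells = ["1,A", "2,AA", "3,L", "4,N"]


def split_cell(cell):
    """Split 'number,code' into (float(number), code)."""
    number, code = cell.split(",")
    return float(number), code.strip()


pairs = [split_cell(c) for c in cells]
plain_total = sum(n for n, _ in pairs)                       # A5
weighted_total = sum(n * legend[code] for n, code in pairs)  # A6
print(plain_total, weighted_total)  # 10.0 2.0
```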

How to color the background of a cell in datatable (DT package) in R with column and row names or indices?

Here is an example. I created a data frame and use that to create a datatable for visualization. As you can see, my column name and the row from the first column indicate conditions from A and B. What I want to do is to change the background color of a specific cell in this datatable. It is easy to select the column to change, as explained in this link (https://rstudio.github.io/DT/010-style.html). However, it is not obvious to me how to specify the row I want to select.
To give you more context, I am developing a Shiny app, and I would like to design a datatable that allows me to color a cell based on the conditions for A and B. For example, if A is less than 1 and B is between 1 and 2, I would like to be able to select the second cell of the "A is less than 1" column. To achieve this, I need to know how to specify the row number or row name. For now, I only know how to specify rows based on their contents, as this example shows.
library(tibble)
library(DT)
dat <- tribble(
  ~`A/B`, ~`A is less than 1`, ~`A is between 1 and 2`, ~`A is larger than 2`,
  "B is less than 1", 10, 30, 30,
  "B is between 1 and 2", 20, 10, 30,
  "B is larger than 2", 20, 20, 10
)
datatable(dat, filter = "none", rownames = FALSE, selection = "none",
          options = list(dom = 't', ordering = FALSE)) %>%
  formatStyle(
    'A is less than 1',
    backgroundColor = styleEqual(20, "orange")
  )
I'm not sure I understand the question, but if you want to change the background color of a cell given by its row index and its column index (that's what I understand), you can do:
changeCellColor <- function(row, col){
  c(
    "function(row, data, num, index){",
    sprintf("  if(index == %d){", row-1),
    sprintf("    $('td:eq(' + %d + ')', row)", col),
    "    .css({'background-color': 'orange'});",
    "  }",
    "}"
  )
}
datatable(dat,
  options = list(
    dom = "t",
    rowCallback = JS(changeCellColor(1, 2))
  )
)

Formatting data in a CSV file (calculating average) in python

import csv
with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print (parts[4])
f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
This code is meant to read the CSV file and, for each row, add up the three scores in columns 1 to 3 (column 0 being the user's name) and divide by three. However, it doesn't calculate a proper average; it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv
with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        # csv.reader yields strings, so convert the scores to integers first
        scores = [int(s) for s in row[1:]]
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX' : max(scores),
            'MIN' : min(scores),
            'AVE' : 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether a particular step is locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending to row has no visible effect, because newlist.append(row[0:4]) keeps only the first four elements (the name and the three scores), so the appended maximum and average are dropped.
row[1:3] only gives the second and the third element. row[1:4] gives what you want, as well as row[1:]. Indexing in Python normally is end-exclusive.
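The slicing point can be seen with a small sketch; the row below is a hypothetical example (a name followed by three scores), not a line from the actual file.

```python
row = ["John", 10, 7, 9]  # hypothetical row: name followed by three scores

print(row[1:3])  # [10, 7]     the end index 3 is excluded
print(row[1:4])  # [10, 7, 9]
print(row[1:])   # [10, 7, 9]

scores = row[1:]
average = sum(scores) / len(scores)
print(round(average, 2))  # 8.67
```

Dividing by `len(scores)` instead of a hard-coded 3 keeps the average correct if the number of score columns changes.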
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
Here is one way to do it. See both parts. First, we create a dictionary with names as the key and a list of results as values.
import csv
fileLineList = []
averageScoreDict = {}
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter
fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list to John which is 3 integers:
    # highest, lowest and average (which is a float we round).
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
averageScoreList = []
# Here we "unpack" the dictionary we have created and build a list of the keys,
# which are the names, and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])
# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)
print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]