PowerQuery: How can I transform a table by dividing the columns by column-specific factors?

I have a table ("Index") which I want to transform by dividing all columns except "Date" by column-specific factors. The factors are stored in a table named "normfactor", which has the same column headers as "Index" (apart from "Date") and only one row.
Table "Index":
Date  Zone 1  Zone 2
01    1.8     1.4
02    1.9     1.5
Table "normfactor":
Zone 1  Zone 2
0.98    0.97
I found a function I could use to divide the columns of "Index" by a fixed factor (here the normfactor of the first column):
let
Source = Excel.CurrentWorkbook(){[Name="Index"]}[Content],
ZoneColumnNames = List.RemoveMatchingItems(Table.ColumnNames(Source), {"Date"}),
fn_devide_column_by_factor = (fnRec as record) as list => List.Transform(ZoneColumnNames, each {_, each _ / Table.Column(normfactor, ZoneColumnNames{0}){0}}),
#"transform" = Table.FromRecords(Table.TransformRows(Source, (Rec) => Record.TransformFields(Rec, fn_devide_column_by_factor(Rec))))
in
#"transform"
How can I write this function so that it divides each column not by the normfactor of the first column, but by the normfactor in the column with the same column name?

One way:
// code for index table using normfactor table
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
Columns = List.RemoveItems(Table.ColumnNames(Source),{"Date"}),
Transforms = List.Transform(Columns,(x)=>{x, each _ / Table.Column(normfactor,x){0} , type number}),
Transformit = Table.TransformColumns(Source, Transforms)
in Transformit
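For readers more comfortable with Python, the same idea (look up each column's factor by the column's own name instead of by position) can be sketched as follows. This is only an illustration, not Power Query code: the function name divide_by_column_factors and the list-of-dicts table representation are assumptions made for the example; the values are the ones from the question's tables.

```python
def divide_by_column_factors(rows, factors, skip=("Date",)):
    """Divide every value by the factor stored under the same column name.

    rows    -- list of dicts, one dict per table row (hypothetical layout)
    factors -- dict mapping column name to its single normalisation factor
    skip    -- column names that are passed through unchanged
    """
    return [
        {
            name: value if name in skip else value / factors[name]
            for name, value in row.items()
        }
        for row in rows
    ]


index = [
    {"Date": "01", "Zone 1": 1.8, "Zone 2": 1.4},
    {"Date": "02", "Zone 1": 1.9, "Zone 2": 1.5},
]
normfactor = {"Zone 1": 0.98, "Zone 2": 0.97}

result = divide_by_column_factors(index, normfactor)
for row in result:
    print(row)
```

The lookup factors[name] plays the role of Table.Column(normfactor, x){0} in the M code above.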
Another way:
// code for index table using normfactor table
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index", "Date"}, "Attribute", "Value"),
#"Added Custom" = Table.AddColumn(#"Unpivoted Other Columns", "NewValue", each [Value] / Table.Column(normfactor,[Attribute]){0}),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Value"}),
#"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Attribute]), "Attribute", "NewValue", List.Sum),
#"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"Index"})
in #"Removed Columns1"


I need to make a dynamic aggregation in Power Query, by summing or concatenating the duplicated values in my tables

Here's an example of my data:
Sample
Method A
Method B
Method C
Method D
Method E
BATCH Nu
Lab Data
Sample 1
1
2
8
TX_0001
LAB1
Sample 1
5
9
TX_0002
LAB2
Sample 2
7
8
8
23
TX_0001
LAB1
Sample 2
41
TX_0001
LAB2
Sample 3
11
55
TX_0394
LAB2
Sample 4
2
9
5
9
TX_0394
LAB1
I need to write M code that merges the rows of duplicated samples. Note that duplicated samples might be in the same batch and/or in the same lab, but the same method is never run twice on one sample.
I can't hard-code the column names, because they keep changing, so I want to pass the column names dynamically.
Note: I could link the source table to Microsoft Access and do this with SQL, but I couldn't find a text aggregation function in the MS Access library. There, referring to each column name is no problem. (The issue is that no one else in my company knows M, and I can't leave this non-automated.)
This is what I have been trying to improve, but I keep getting errors:
1. Both grouped columns have "Error" in all of the cells
2. Evaluation runs out of memory
I can't figure out what I'm doing wrong here.
let
Source = ALS,
schema = Table.Schema(Source),
columns = schema[Name],
types = schema[Kind],
Table = Table.FromColumns({columns,types}),
Number_Columns = Table.SelectRows(Table, each ([Column2] = "number")),
Other_Columns = Table.SelectRows(Table, each ([Column2] <> "number")),
numCols = Table.Column(Number_Columns, "Column1"),
textColsSID = List.Select(Table.ColumnNames(Source), each Table.Column(Source, _) <> type number),
textCols = List.RemoveItems(textColsSID, {"Sample ID"}),
groupedNum = Table.Group(Source, {"Sample ID"},List.Transform(numCols, each {_, (nmr) => List.Sum(nmr),type nullable number})),
groupedText = Table.Group(Source,{"Sample ID"},List.Transform(textCols, each {_, (tbl) => Text.Combine(tbl, "_")})),
merged = Table.NestedJoin(groupedNum, {"Sample ID"}, groupedText, {"Sample ID"}, "merged"),
expanded = Table.ExpandTableColumn(merged, "merged", Table.ColumnNames(merged{1}[merged]))
in
expanded
This is what I expected to have:
Sample
Method A
Method B
Method C
Method D
Method E
BATCH Nu
Lab Data
Sample 1
1
2
5
9
8
TX_0001_TX_0002
LAB1_LAB2
Sample 2
7
8
8
23
41
TX_0001_TX_0001
LAB1_LAB1
Sample 3
11
55
TX_0394
LAB2
Sample 4
2
9
5
9
TX_0394
LAB1
Here is a method which assumes only that the first column is a column which will be used to group the different samples.
It makes no assumptions about any column names, or the numbers of columns.
It tests the first 10 rows in each column (after removing any nulls) to determine if the column type can be type number, otherwise it will assume type text.
If there are other possible data types, the type detection code can be expanded.
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
//dynamically detect data types from first ten rows
//only detecting "text" and "number"
colNames = Table.ColumnNames(Source),
checkRows = 10,
colTestTypes = List.Generate(
()=>[t=
let
Values = List.FirstN(Table.Column(Source,colNames{0}),checkRows),
tryNumber = List.Transform(List.RemoveNulls(Values), each (try Number.From(_))[HasError])
in
tryNumber, idx=0],
each [idx] < List.Count(colNames),
each [t=
let
Values = List.FirstN(Table.Column(Source,colNames{[idx]+1}),checkRows),
tryNumber = List.Transform(List.RemoveNulls(Values), each (try Number.From(_))[HasError])
in
tryNumber, idx=[idx]+1],
each [t]),
colTypes = List.Transform(colTestTypes, each if List.AllTrue(_) then type text else type number),
//Group and Sum or Concatenate columns, keying on the first column
group = Table.Group(Source,{colNames{0}},
{"rw", (t)=>
Record.FromList(
List.Generate(
()=>[rw=if colTypes{1} = type number
then List.Sum(Table.Column(t,colNames{1}))
else Text.Combine(Table.Column(t,colNames{1}),"_"),
idx=1],
each [idx] < List.Count(colNames),
each [rw=if colTypes{[idx]+1} = type number
then List.Sum(Table.Column(t,colNames{[idx]+1}))
else Text.Combine(Table.Column(t,colNames{[idx]+1}),"_"),
idx=[idx]+1],
each [rw]), List.RemoveFirstN(colNames,1)), type record}
),
//expand the record column and set the data types
#"Expanded rw" = Table.ExpandRecordColumn(group, "rw", List.RemoveFirstN(colNames,1)),
#"Set Data Type" = Table.TransformColumnTypes(#"Expanded rw", List.Zip({colNames, colTypes}))
in
#"Set Data Type"
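The grouping logic of the query above (key on the first column, sum the numeric columns, concatenate the text columns with "_") can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a translation of the M code: the function name aggregate_rows and the list-of-dicts layout are made up for the example, empty cells are represented as None, and numbers are recognised by their Python type rather than by probing the first ten rows. The cell placement for "Sample 1" is inferred from the expected result in the question.

```python
def aggregate_rows(rows, key, sep="_"):
    """Merge rows sharing the same key: sum numbers, join text with sep."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    result = []
    for key_value, members in groups.items():
        merged = {key: key_value}
        for col in members[0]:
            if col == key:
                continue
            # Ignore empty cells, like List.RemoveNulls in the M code
            values = [m[col] for m in members if m.get(col) is not None]
            if not values:
                merged[col] = None
            elif all(isinstance(v, (int, float)) for v in values):
                merged[col] = sum(values)
            else:
                merged[col] = sep.join(str(v) for v in values)
        result.append(merged)
    return result


rows = [
    {"Sample": "Sample 1", "Method A": 1, "Method B": 2, "Method C": None,
     "Method D": None, "Method E": 8, "BATCH Nu": "TX_0001", "Lab Data": "LAB1"},
    {"Sample": "Sample 1", "Method A": None, "Method B": None, "Method C": 5,
     "Method D": 9, "Method E": None, "BATCH Nu": "TX_0002", "Lab Data": "LAB2"},
]

merged = aggregate_rows(rows, key="Sample")
print(merged)
```

No column name other than the key is mentioned in the function, so it keeps working when the method columns change.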
One way. You could probably do this all within the group as well
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
names = List.Distinct(List.Select(Table.ColumnNames(Source), each Text.Contains(_,"Method"))),
#"Grouped Rows" = Table.Group(Source, {"Sample"}, {{"data", each _, type table }}),
#"Added Custom" = Table.AddColumn(#"Grouped Rows", "Batch Nu", each Text.Combine(List.Distinct([data][BATCH Nu]),"_")),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Lab Data", each Text.Combine(List.Distinct([data][Lab Data]),"_")),
#"Added Custom2" = Table.AddColumn(#"Added Custom1", "Custom", each Table.SelectRows(Table.UnpivotOtherColumns([data], {"Sample"}, "Attribute", "Value"), each List.Contains(names,[Attribute]))),
#"Added Custom3" = Table.AddColumn(#"Added Custom2", "Custom.1", each Table.Pivot([Custom], List.Distinct([Custom][Attribute]), "Attribute", "Value", List.Sum)),
#"Expanded Custom.1" = Table.ExpandTableColumn(#"Added Custom3" , "Custom.1", names,names),
#"Removed Columns" = Table.RemoveColumns(#"Expanded Custom.1",{"data", "Custom"})
in #"Removed Columns"

How to color the background of a cell in datatable (DT package) in R with column and row names or indices?

Here is an example. I created a data frame and used it to create a datatable for visualization. As you can see, the column names and the values in the first column describe conditions on A and B. What I want to do is change the background color of a specific cell in this datatable. It is easy to select the column to change, as explained in this link (https://rstudio.github.io/DT/010-style.html). However, it is not obvious to me how to specify the row I want to select.
To give you more context, I am developing a Shiny app, and I would like to design a datatable that allows me to color a cell based on the conditions on A and B. For example, if A is less than 1 and B is between 1 and 2, I would like to be able to select the second cell of the "A is less than 1" column. To achieve this, I need to know how to specify the row number or row name. For now, I only know how to select rows based on their contents, as this example shows.
library(tibble)
library(DT)
dat <- tribble(
~`A/B`, ~`A is less than 1`, ~`A is between 1 and 2`, ~`A is larger than 2`,
"B is less than 1", 10, 30, 30,
"B is between 1 and 2", 20, 10, 30,
"B is larger than 2", 20, 20, 10
)
datatable(dat, filter = "none", rownames = FALSE, selection = "none",
options = list(dom = 't', ordering = FALSE)) %>%
formatStyle(
'A is less than 1',
backgroundColor = styleEqual(20, "orange")
)
I'm not sure I understand the question, but if you want to change the background color of a cell given by its row index and its column index (that's how I understand it), you can do:
changeCellColor <- function(row, col){
c(
"function(row, data, num, index){",
sprintf(" if(index == %d){", row-1),
sprintf(" $('td:eq(' + %d + ')', row)", col),
" .css({'background-color': 'orange'});",
" }",
"}"
)
}
datatable(dat,
options = list(
dom = "t",
rowCallback = JS(changeCellColor(1, 2))
)
)

Move values in a data frame from one column to another based on matching criteria

I am receiving output from a JSON object; however, the JSON returns sometimes three fields, sometimes two, and sometimes one, depending on the input. As a result I have a data frame which looks like this:
mixed score type
1 1 0.0183232 positive
2 neutral <NA> <NA>
3 -0.566558 negative <NA>
4 0.473484 positive <NA>
5 0.856743 positive <NA>
6 -0.422655 negative <NA>
Mixed can take values of 1 or 0
Score can take a positive or negative value between -1 and +1
Type can take a value of either positive, negative or neutral
I'm wondering how I can rearrange the values in the data.frame so that they are in the correct column i.e.
mixed score type
1 1 0.018323 positive
2 <NA> <NA> neutral
3 <NA> -0.566558 negative
4 <NA> 0.473484 positive
5 <NA> 0.856743 positive
6 <NA> -0.422655 negative
Not an elegant solution at all, but the best I could come up with.
### Seed the initial data frame ("0" marks an empty cell)
mixed = c("1", "neutral", "0.473484", "-0.566558", "0.856743", "-0.422655", "-0.692675")
score = c("0.0183232", "0", "positive", "negative", "positive", "negative", "negative")
type = c("positive", "0", "0", "0", "0", "0", "0")
df = data.frame(mixed, score, type, stringsAsFactors = FALSE)
# Create a shadow data frame (3 columns, one row per input row) for output
df.2 <- as.data.frame(matrix(0, ncol = 3, nrow = nrow(df)))
names(df.2) <- c("mixed", "score", "type")
df.2
# Check each column cell by cell and copy matching values to the shadow data frame
# Set all <NA> values to 0
df[is.na(df)] <- 0
# Set iteration length to the number of rows
l <- nrow(df)
# Checked the mixed column for '1' and then copy it to the new frame
for(l in 1:l)
if (df$mixed[l] == '1')
{
df.2$mixed[l] <-df$mixed[l]
}
# Checked the mixed column for a value which is less than 1 and then copy it to the score column in the new frame
for(l in 1:l)
if (df$mixed[l] < '1')
{
df.2$score[l] <-df$mixed[l]
}
# Checked the mixed column for positive/negative/neutral and then copy it to the type column in the new frame
for(l in 1:l)
if (df$mixed[l] == "positive" | df$mixed[l] == "negative" | df$mixed[l] == "neutral")
{
df.2$type[l] <-df$mixed[l]
}
# Checked the score column for '1' and then copy it to mixed column in the new frame
for(l in 1:l)
if (df$score[l] == '1')
{
df.2$mixed[l] <-df$score[l]
}
# Checked the score column for a value which is less than 1 and then copy it to the score column in the new frame
for(l in 1:l)
if (df$score[l] < '1')
{
df.2$score[l] <-df$score[l]
}
# Checked the score column for positive/negative/neutral and then copy it to the type column in the new frame
for(l in 1:l)
if (df$score[l] == "positive" | df$score[l] == "negative" | df$score[l] == "neutral")
{
df.2$type[l] <-df$score[l]
}
# Checked the type column for '1' and then copy it to mixed column in the new frame **This one works***
for(l in 1:l)
if (df$type[l] == '1')
{
df.2$mixed[l] <-df$type[l]
}
# Checked the type column for a value which is less than 1 and then copy it to the score column in the new frame
# (the "0" placeholder is skipped, otherwise it overwrites scores copied earlier)
for(l in 1:l)
if (df$type[l] < '1' & df$type[l] != '0')
{
df.2$score[l] <-df$type[l]
}
# Checked the type column for positive/negative/neutral and then copy it to the type column in the new frame **This one works***
for(l in 1:l)
if (df$type[l] == "positive" | df$type[l] == "negative" | df$type[l] == "neutral")
{
df.2$type[l] <-df$type[l]
}
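The cell-by-cell idea above (decide from a value's content which column it belongs to) can be condensed into one classification function. The Python sketch below is only an illustration of that technique, not a port of the R code: the function name sort_into_columns is hypothetical, missing values are represented as None, and it assumes that a bare "0" or "1" is always the mixed flag, so a score of exactly 0 or 1 would be misfiled.

```python
TYPES = {"positive", "negative", "neutral"}


def sort_into_columns(values):
    """Place each value of one row into the column its content belongs to."""
    row = {"mixed": None, "score": None, "type": None}
    for value in values:
        if value is None:
            continue
        if value in TYPES:
            row["type"] = value
        elif value in ("0", "1"):
            # Assumption: a bare 0/1 is the "mixed" flag, never a score
            row["mixed"] = value
        else:
            row["score"] = float(value)
    return row


# Rows taken from the question's data frame, in their shifted positions
shifted = [
    ["1", "0.0183232", "positive"],
    ["neutral", None, None],
    ["-0.566558", "negative", None],
]

fixed = [sort_into_columns(r) for r in shifted]
for r in fixed:
    print(r)
```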

Formatting data in a CSV file (calculating average) in python

import csv
with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print (parts[4])
f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
The code is meant to read the CSV file and, for each row, add up the three scores in columns 1 to 3 (column 0 being the user's name) and divide by three. However, it doesn't calculate a proper average; it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv
with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        # csv.reader yields strings, so convert the scores to integers first
        scores = [int(s) for s in row[1:]]
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX' : max(scores),
            'MIN' : min(scores),
            'AVE' : 1.0 * sum(scores) / len(scores)
        }
        output_mesg ="name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Try not to worry about whether a particular step is locally inefficient. A good Pythonic script should be as readable as possible to everyone.
In your code, I spot two mistakes:
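Appending to row has no effect on the result, because newlist.append(row[0:4]) keeps only the first four elements (the name and the three scores), so the appended maximum and average are discarded. That is why x[3] is just the score from the last column.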
row[1:3] only gives the second and the third element. row[1:4] gives what you want, as well as row[1:]. Indexing in Python normally is end-exclusive.
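A quick way to see both points is to run the slices on a single row. The row below is a made-up example (the actual contents of Class1scores.csv are not shown in the question):

```python
# Hypothetical row: a name followed by three scores
row = ["Alice", 4, 7, 10]

two_scores = row[1:3]    # end index is exclusive: only the 2nd and 3rd element
all_scores = row[1:4]    # the three scores; row[1:] gives the same here

average = sum(all_scores) / len(all_scores)

row.append(max(all_scores))
row.append(average)
kept = row[0:4]          # the two appended values sit at index 4 and 5

print(two_scores, all_scores, average, kept)
```

Dividing by len(all_scores) rather than a literal 3 keeps the average correct if the number of score columns changes.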
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
Here is one way to do it. See both parts. First, we create a dictionary with names as the key and a list of results as values.
import csv
fileLineList = []
averageScoreDict = {}
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter
fileLineList = []
averageScoreDict = {} # Creating an empty dictionary here.
with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)
for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list to John which is 3 integers:
    # highest, lowest and average (which is a float we round)
    averageScoreDict[row[0]] = [highest, lowest, round(average)]
averageScoreList = []
# Here we "unpack" the dictionary we have created and build a list of the keys,
# which are the names, each with the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])
# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)
print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]

R- collapse rows based on contents of two columns

I apologize in advance if this question is too specific or involved for this type of forum. I have been a long-time lurker on this site, and this is the first time I haven't been able to solve my issue by looking at previous questions, so I finally decided to post. Please let me know if there is a better place to post this, or if you have advice on making it clearer. Here goes.
I have a data.table with the following structure:
library(data.table)
dt = structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chrX",
"chrX", "chrX", "chrX"), start = c(842326, 855423, 855426, 855739,
153880833, 153880841, 154298086, 154298089), end = c(842327L,
855424L, 855427L, 855740L, 153880834L, 153880842L, 154298087L,
154298090L), meth.diff = c(9.35200555410902, 19.1839617944039,
29.6734426495636, -12.3375577709254, 50.5830043986142, 52.7503561092491,
46.5783738475184, 41.8662800742733), mean_KO = c(9.35200555410902,
19.1839617944039, 32.962962583692, 1.8512250859083, 51.2741224212646,
53.0928367727283, 47.4901932463221, 44.8441659366298), mean_WT = c(0,
0, 3.28951993412841, 14.1887828568337, 0.69111802265039, 0.34248066347919,
0.91181939880374, 2.97788586235646), coverage_KO = c(139L, 55L,
55L, 270L, 195L, 194L, 131L, 131L), coverage_WT = c(120L, 86L,
87L, 444L, 291L, 293L, 181L, 181L)), .Names = c("chr", "start",
"end", "meth.diff", "mean_KO", "mean_WT", "coverage_KO", "coverage_WT"
), class = c("data.table", "data.frame"), row.names = c(NA, -8L
))
These are genomic coordinates with associated values. The file is sorted by chromosome ("chr") (1 through 22, then X, then Y), then by start and end position, so that the first row contains the lowest-numbered start position on chromosome 1, and the rows proceed sequentially through all data points on chromosome 1, then 2, etc. At this point, every single row has a start-end length of 1. After collapsing, the start-end lengths will vary depending on how many rows were collapsed and their distance from the adjacent row.
1st: I would like to collapse adjacent rows into larger start/end ranges based on the following criteria:
The two adjacent rows share the same value for the "chr" column (row 1 "chr" = chr1, and row 2 "chr" = chr1)
The two adjacent rows have "start" coordinate within 500 of one another (if row 1 "start" = 1000, and row 2 "start" <= 1499, collapse these into a single row; if row1 = 1000 and row2 = 1500, keep separate)
The adjacent rows must have the same sign in the "meth.diff" column (i.e. even if chr = chr and start is within 500, if diff1 = +5 and diff2 = -5, keep the entries separate)
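The three criteria can be read as a single pass over the sorted rows in which each row either joins the current group or starts a new one. The Python sketch below illustrates that reading; it is not an accepted solution from the thread. The function name collapse_adjacent is hypothetical, each row is compared with the row directly before it (so chains of close rows end up in one group, which is an assumption about the intended behaviour), and the meth.diff values are rounded from the question's data.

```python
def collapse_adjacent(rows, max_gap=500):
    """Group consecutive rows with equal chr, close starts and equal sign."""
    groups = []
    for row in rows:
        prev = groups[-1][-1] if groups else None
        if (
            prev is not None
            and prev["chr"] == row["chr"]
            and row["start"] - prev["start"] < max_gap
            and (prev["meth.diff"] > 0) == (row["meth.diff"] > 0)
        ):
            groups[-1].append(row)
        else:
            groups.append([row])
    # Each collapsed range runs from the first start to the last end
    return [(g[0]["chr"], g[0]["start"], g[-1]["end"]) for g in groups]


rows = [
    {"chr": "chr1", "start": 842326, "end": 842327, "meth.diff": 9.35},
    {"chr": "chr1", "start": 855423, "end": 855424, "meth.diff": 19.18},
    {"chr": "chr1", "start": 855426, "end": 855427, "meth.diff": 29.67},
    {"chr": "chr1", "start": 855739, "end": 855740, "meth.diff": -12.34},
    {"chr": "chrX", "start": 153880833, "end": 153880834, "meth.diff": 50.58},
    {"chr": "chrX", "start": 153880841, "end": 153880842, "meth.diff": 52.75},
    {"chr": "chrX", "start": 154298086, "end": 154298087, "meth.diff": 46.58},
    {"chr": "chrX", "start": 154298089, "end": 154298090, "meth.diff": 41.87},
]

ranges = collapse_adjacent(rows)
for r in ranges:
    print(r)
```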
2nd: I would like to calculate the coverage-weighted averages of the collapsed mean_KO/WT columns, using the coverage_KO/WT columns as weights:
Ex: collapse 2 rows,
row 1 mean_1 = 5.0, coverage_1 = 20.
row 2 mean_1 =40.0, coverage_1 = 45.
weighted avg mean_1 = (((5.0*20)/(20+45)) + ((40.0*45)/(20+45))) = 29.23
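The worked example can be checked with a few lines of Python. The helper name weighted_mean is an assumption for illustration; the numbers are the ones from the example above.

```python
def weighted_mean(values, weights):
    """Average of values in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# row 1: mean 5.0 with coverage 20; row 2: mean 40.0 with coverage 45
collapsed_mean = weighted_mean([5.0, 40.0], [20, 45])
print(round(collapsed_mean, 2))
```

Summing value times weight first and dividing once by the total weight gives the same result as adding the per-row fractions in the formula above.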
What I would like the output to look like (except collapsed row means would be calculated and not in string form):
library(data.table)
dt_output = structure(list(chr = c("chr1", "chr1", "chr1", "chrX", "chrX"
), start = c(842326, 855423, 855739, 153880833, 154298086), end = c(842327,
855427, 855740, 153880842, 154298090), mean_1 = c("9.35", "((19.18*55)/(55+55)) + ((32.96*55)/(55+55))",
"1.85", "((51.27*195)/(195+194)) + ((53.09*194)/(195+194))",
"((47.49*131)/(131+131)) + ((44.84*131)/(131+131))"), mean_2 = c("0",
"((0.00*86)/(86+87)) + ((3.29*87)/(86+87))", "14.19", "((0.69*291)/(291+293)) + ((0.34*293)/(291+293))",
"((0.91*181)/(181+181)) + ((2.98*181)/(181+181))")), .Names = c("chr",
"start", "end", "mean_1", "mean_2"), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
Help with either part 1 or 2 or any advice is appreciated.
I have been using R for most of my data manipulations, but I am open to any language that can provide a solution. Thanks in advance.