I need to make a dynamic aggregation in Power Query, summing or concatenating the values of duplicated rows in my tables - ms-access

Here's an example of my data:
Sample    Method A  Method B  Method C  Method D  Method E  BATCH Nu  Lab Data
Sample 1  1         2                             8         TX_0001   LAB1
Sample 1                      5         9                   TX_0002   LAB2
Sample 2  7         8         8         23                  TX_0001   LAB1
Sample 2                                          41        TX_0001   LAB2
Sample 3  11        55                                      TX_0394   LAB2
Sample 4  2         9         5         9                   TX_0394   LAB1
I need to write M Language code that unites them, based on duplicated samples. Note that duplicates might be in the same batch and/or the same lab, but the same method will never be run twice for one sample.
So I can't hard-code the column names, because they keep changing; I want to pass the column names dynamically.
OBS: I do have the option of making a linked table of the source in Microsoft Access and doing this with SQL, but I couldn't find a text aggregation function in the MS Access library. (There, referencing each column name is no problem; it's just that no one else in my company knows M Language, and I can't let this be non-automated.)
This is what I have been trying to improve, but I keep getting errors:
1. Both grouped columns have "Error" in all of their cells
2. The evaluation runs out of memory
I can't work out what I'm doing wrong here.
let
    Source = ALS,
    schema = Table.Schema(Source),
    columns = schema[Name],
    types = schema[Kind],
    Table = Table.FromColumns({columns, types}),
    Number_Columns = Table.SelectRows(Table, each ([Column2] = "number")),
    Other_Columns = Table.SelectRows(Table, each ([Column2] <> "number")),
    numCols = Table.Column(Number_Columns, "Column1"),
    textColsSID = List.Select(Table.ColumnNames(Source), each Table.Column(Source, _) <> type number),
    textCols = List.RemoveItems(textColsSID, {"Sample ID"}),
    groupedNum = Table.Group(Source, {"Sample ID"}, List.Transform(numCols, each {_, (nmr) => List.Sum(nmr), type nullable number})),
    groupedText = Table.Group(Source, {"Sample ID"}, List.Transform(textCols, each {_, (tbl) => Text.Combine(tbl, "_")})),
    merged = Table.NestedJoin(groupedNum, {"Sample ID"}, groupedText, {"Sample ID"}, "merged"),
    expanded = Table.ExpandTableColumn(merged, "merged", Table.ColumnNames(merged{1}[merged]))
in
    expanded
This is what I expected to have:
Sample    Method A  Method B  Method C  Method D  Method E  BATCH Nu         Lab Data
Sample 1  1         2         5         9         8         TX_0001_TX_0002  LAB1_LAB2
Sample 2  7         8         8         23        41        TX_0001_TX_0001  LAB1_LAB1
Sample 3  11        55                                      TX_0394          LAB2
Sample 4  2         9         5         9                   TX_0394          LAB1

Here is a method which assumes only that the first column is the one used to group the different samples.
It makes no assumptions about any column names, or the numbers of columns.
It tests the first 10 rows in each column (after removing any nulls) to determine if the column type can be type number, otherwise it will assume type text.
If there are other possible data types, the type detection code can be expanded.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    //dynamically detect data types from first ten rows
    //only detecting "text" and "number"
    colNames = Table.ColumnNames(Source),
    checkRows = 10,
    colTestTypes = List.Generate(
        ()=>[t=
            let
                Values = List.FirstN(Table.Column(Source, colNames{0}), checkRows),
                tryNumber = List.Transform(List.RemoveNulls(Values), each (try Number.From(_))[HasError])
            in
                tryNumber, idx=0],
        each [idx] < List.Count(colNames),
        each [t=
            let
                Values = List.FirstN(Table.Column(Source, colNames{[idx]+1}), checkRows),
                tryNumber = List.Transform(List.RemoveNulls(Values), each (try Number.From(_))[HasError])
            in
                tryNumber, idx=[idx]+1],
        each [t]),
    colTypes = List.Transform(colTestTypes, each if List.AllTrue(_) then type text else type number),
    //Group and Sum or Concatenate columns, keying on the first column
    group = Table.Group(Source, {colNames{0}},
        {"rw", (t)=>
            Record.FromList(
                List.Generate(
                    ()=>[rw=if colTypes{1} = type number
                            then List.Sum(Table.Column(t, colNames{1}))
                            else Text.Combine(Table.Column(t, colNames{1}), "_"),
                        idx=1],
                    each [idx] < List.Count(colNames),
                    each [rw=if colTypes{[idx]+1} = type number
                            then List.Sum(Table.Column(t, colNames{[idx]+1}))
                            else Text.Combine(Table.Column(t, colNames{[idx]+1}), "_"),
                        idx=[idx]+1],
                    each [rw]), List.RemoveFirstN(colNames,1)), type record}
    ),
    //expand the record column and set the data types
    #"Expanded rw" = Table.ExpandRecordColumn(group, "rw", List.RemoveFirstN(colNames,1)),
    #"Set Data Type" = Table.TransformColumnTypes(#"Expanded rw", List.Zip({colNames, colTypes}))
in
    #"Set Data Type"

One way. You could probably do this all within the group step as well:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    names = List.Distinct(List.Select(Table.ColumnNames(Source), each Text.Contains(_, "Method"))),
    #"Grouped Rows" = Table.Group(Source, {"Sample"}, {{"data", each _, type table}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Batch Nu", each Text.Combine(List.Distinct([data][BATCH Nu]), "_")),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Lab Data", each Text.Combine(List.Distinct([data][Lab Data]), "_")),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "Custom", each Table.SelectRows(Table.UnpivotOtherColumns([data], {"Sample"}, "Attribute", "Value"), each List.Contains(names, [Attribute]))),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "Custom.1", each Table.Pivot([Custom], List.Distinct([Custom][Attribute]), "Attribute", "Value", List.Sum)),
    #"Expanded Custom.1" = Table.ExpandTableColumn(#"Added Custom3", "Custom.1", names, names),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Custom.1", {"data", "Custom"})
in
    #"Removed Columns"
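For comparison only (not part of either answer): the same dynamic "sum the numeric columns, concatenate the text columns" grouping can be sketched in pandas, detecting the columns by dtype so nothing is hard-coded except the grouping key. The frame below is a toy subset of the sample data.

```python
import pandas as pd

# Toy subset of the question's data; None marks methods not run for a row.
df = pd.DataFrame({
    "Sample":   ["Sample 1", "Sample 1", "Sample 2", "Sample 2"],
    "Method A": [1, None, 7, None],
    "Method E": [8, None, None, 41],
    "BATCH Nu": ["TX_0001", "TX_0002", "TX_0001", "TX_0001"],
})

# Detect column roles dynamically, as the question requires.
num_cols = df.select_dtypes("number").columns
txt_cols = [c for c in df.columns if c not in num_cols and c != "Sample"]

# Sum numeric columns (NaNs are skipped), join text columns with "_".
agg = {c: "sum" for c in num_cols}
agg.update({c: (lambda s: "_".join(s.dropna())) for c in txt_cols})
result = df.groupby("Sample", as_index=False).agg(agg)
print(result)
```

This mirrors `Text.Combine(..., "_")` and `List.Sum` from the M answers; swap the lambda for `lambda s: "_".join(s.drop_duplicates())` to get the `List.Distinct` behaviour of the second answer.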

Related

PowerQuery: How can I transform a table by dividing the columns by columnspecific factors?

I have a table ("Index") which I want to transform by dividing all columns except "Date" by column-specific factors. The factors are stored in a table named "normfactor" which has the same column headers as "Index" and only one row.
Table "Index":

Date  Zone 1  Zone 2
01    1.8     1.4
02    1.9     1.5

Table "normfactor":

Zone 1  Zone 2
0.98    0.97
I found a function I could use to divide the columns of "Index" by a fixed factor (here the normfactor of the first column):
let
    Source = Excel.CurrentWorkbook(){[Name="Index"]}[Content],
    ZoneColumnNames = List.RemoveMatchingItems(Table.ColumnNames(Source), {"Date"}),
    fn_devide_column_by_factor = (fnRec as record) as list => List.Transform(ZoneColumnNames, each {_, each _ / Table.Column(normfactor, ZoneColumnNames{0}){0}}),
    #"transform" = Table.FromRecords(Table.TransformRows(Source, (Rec) => Record.TransformFields(Rec, fn_devide_column_by_factor(Rec))))
in
    #"transform"
How can I change this so it doesn't always divide by the normfactor of the first column, but by the normfactor in the column with the same column name?
One way:
// code for index table using normfactor table
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    Columns = List.RemoveItems(Table.ColumnNames(Source), {"Date"}),
    Transforms = List.Transform(Columns, (x)=>{x, each _ / Table.Column(normfactor, x){0}, type number}),
    Transformit = Table.TransformColumns(Source, Transforms)
in
    Transformit
Another way:
// code for index table using normfactor table
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index", 0, 1, Int64.Type),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index", "Date"}, "Attribute", "Value"),
    #"Added Custom" = Table.AddColumn(#"Unpivoted Other Columns", "NewValue", each [Value] / Table.Column(normfactor, [Attribute]){0}),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"Value"}),
    #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Attribute]), "Attribute", "NewValue", List.Sum),
    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column", {"Index"})
in
    #"Removed Columns1"
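As an aside (not from the answers above), the same label-aligned division is easy to sketch in pandas, since a one-row table collapses to a Series keyed by column name and division then aligns on matching labels automatically:

```python
import pandas as pd

# Toy versions of the question's two tables.
index_df = pd.DataFrame({"Date": ["01", "02"],
                         "Zone 1": [1.8, 1.9],
                         "Zone 2": [1.4, 1.5]})
normfactor = pd.DataFrame({"Zone 1": [0.98], "Zone 2": [0.97]})

zone_cols = [c for c in index_df.columns if c != "Date"]
# .iloc[0] turns the one-row table into a Series indexed by column name,
# so each zone column is divided by the factor with the same label.
index_df[zone_cols] = index_df[zone_cols] / normfactor.iloc[0]
print(index_df)
```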

Function on each row of pandas DataFrame but not generating a new column

I have a data frame in pandas as follows:
A B C D
3 4 3 1
5 2 2 2
2 1 4 3
My final goal is to produce some constraints for an optimization problem using the information in each row of this data frame so I don't want to generate an output and add it to the data frame. The way that I have done that is as below:
def Computation(row):
    App = pd.Series(row['A'])
    App = App.tolist()
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT, CS, DS, App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
But it does not work out by calling:
df.apply(Computation, axis = 1)
Could you please let me know if there is any way to do this?
.apply will attempt to convert the value returned by the function to a pandas Series or DataFrame. So, if that is not your goal, you are better off using .iterrows, which yields (index, row) pairs:
# In pseudocode:
for idx, row in df.iterrows():
    constrained = Computation(row)
Also, your Computation can be expressed as:
def Computation(row):
    App = list(row['A'])  # Will work as long as row['A'] is iterable
    # For the next 3 lines, see note below.
    PT = [row['B']] * len(App)
    CS = [row['C']] * len(App)
    DS = [row['D']] * len(App)
    File3 = tuplelist(zip(PT, CS, DS, App))
    return m.addConstr(quicksum(y[r,c,d,a] for r,c,d,a in File3) == 1)
Note: [<list>] * n creates n references to the same <list>, not n independent lists, so a mutation made through one reference is visible through all of them. (With immutable values such as the numbers here, that aliasing is harmless.) If that is not what you want, build each list separately, e.g. with a comprehension. See this question and its answers for details. Specifically, this answer.
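A quick illustration of that aliasing behaviour (values invented for the example):

```python
# [obj] * n repeats the *reference*, not the object.
inner = [1, 2]
aliased = [inner] * 3        # three references to the same list
inner.append(3)
print(aliased)               # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]

# Independent copies require building a new list each time.
independent = [[1, 2] for _ in range(3)]
independent[0].append(3)
print(independent)           # [[1, 2, 3], [1, 2], [1, 2]]
```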

Formatting data in a CSV file (calculating average) in python

import csv

with open('Class1scores.csv') as inf:
    for line in inf:
        parts = line.split()
        if len(parts) > 1:
            print(parts[4])

f = open('Class1scores.csv')
csv_f = csv.reader(f)
newlist = []
for row in csv_f:
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    maximum = max(row[1:3])
    row.append(maximum)
    average = round(sum(row[1:3])/3)
    row.append(average)
    newlist.append(row[0:4])
averageScore = [[x[3], x[0]] for x in newlist]
print('\nStudents Average Scores From Highest to Lowest\n')
Here the code is meant to read the CSV file; for each row (column 0 being the user's name) it should add the three scores and divide by three, but it doesn't calculate a proper average: it just takes the score from the last column.
Basically you want statistics of each row. In general you should do something like this:
import csv

with open('data.csv', 'r') as f:
    rows = csv.reader(f)
    for row in rows:
        name = row[0]
        scores = [int(s) for s in row[1:]]  # csv gives strings; convert before doing math
        # calculate statistics of scores
        attributes = {
            'NAME': name,
            'MAX' : max(scores),
            'MIN' : min(scores),
            'AVE' : 1.0 * sum(scores) / len(scores)
        }
        output_mesg = "name: {NAME:s} \t high: {MAX:d} \t low: {MIN:d} \t ave: {AVE:f}"
        print(output_mesg.format(**attributes))
Don't worry too much about whether specific things are locally inefficient; a good Pythonic script should above all be as readable as possible to everyone.
In your code, I spot two mistakes:
Appending maximum and average to row doesn't help, because newlist.append(row[0:4]) keeps only the first four elements, so the values you appended are sliced away.
row[1:3] only gives the second and third elements. row[1:4] gives what you want, as does row[1:]. Slicing in Python is normally end-exclusive.
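A quick sketch of those slice bounds (name and scores invented):

```python
row = ['John', 10, 7, 9]            # a name followed by three scores
print(row[1:3])                     # [10, 7] - end-exclusive, misses the last score
print(row[1:4])                     # [10, 7, 9]
print(row[1:])                      # [10, 7, 9] - no need to hard-code the end
print(sum(row[1:]) / len(row[1:]))  # the intended average
```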
And some questions for you to think about:
If I can open the file in Excel and it's not that big, why not just do it in Excel? Can I make use of all the tools I have to get work done as soon as possible with least effort? Can I get done with this task in 30 seconds?
Here is one way to do it. See both parts. First, we create a dictionary with names as the key and a list of results as values.
import csv

fileLineList = []
averageScoreDict = {}

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

print(averageScoreDict)
Output:
{'Milky': [7, 4, 5], 'Billy': [6, 5, 6], 'Adam': [5, 2, 4], 'John': [10, 7, 9]}
Now that we have our dictionary, we can create your desired final output by sorting the list. See this updated code:
import csv
from operator import itemgetter

fileLineList = []
averageScoreDict = {}  # Creating an empty dictionary here.

with open('Class1scores.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        fileLineList.append(row)

for row in fileLineList:
    highest = 0
    lowest = 0
    total = 0
    average = 0
    for column in row:
        if column.isdigit():
            column = int(column)
            if column > highest:
                highest = column
            if column < lowest or lowest == 0:
                lowest = column
            total += column
    average = total / 3
    # Here is where we put the empty dictionary created earlier to good use.
    # We assign the key, in this case the contents of the first column of
    # the CSV, to the list of values.
    # For the first line of the file, the key would be 'John'.
    # We are assigning a list to John which is 3 integers:
    # highest, lowest and average (which is a float we round)
    averageScoreDict[row[0]] = [highest, lowest, round(average)]

averageScoreList = []

# Here we "unpack" the dictionary we have created into a list holding the
# keys (the names) and the single value we want, in this case the average.
for key, value in averageScoreDict.items():
    averageScoreList.append([key, value[2]])

# Sorting the list using the value instead of the name.
averageScoreList.sort(key=itemgetter(1), reverse=True)

print('\nStudents Average Scores From Highest to Lowest\n')
print(averageScoreList)
Output:
Students Average Scores From Highest to Lowest
[['John', 9], ['Billy', 6], ['Milky', 5], ['Adam', 4]]

Lua: Functions with Tables for Beginners - Proper Naming/Retrieving of Tables within Tables

I am having a horrible time at grasping functions and tables. I've asked a question before that is similar to this but still am having problems getting this to work properly. So I will be more descriptive. But just when I think I understand it I completely confuse myself again. Here is what I am trying to accomplish:
I have a program that is receiving its input from an outside source. It needs to take that input, and basically "dissect" the strings to get the required information. Based on the information it receives, it moves onto the next phase or functions to do the appropriate actions. For example:
input is received as NY345,de,M,9900
I created a table that has all of the different ways the specific input can begin, such as:
local t = {["NY"] = 5, ["MS"] = 7, ["HG"] = 10, ["JX"] = 14, ["UY"] = 20}
Now I want to use a function to receive the input, look for k in t, and use that to gather other variables...
function seek(input)
    for k, v in pairs(seek) do
        local info = string.match(input, k)
        if info then
            return {seekData = string.match(input, k..",(%d*),.*"), seekMult = seekData*v}
        end
    end
end
How far off am I?
If I had the table "t = {...}" above, and that contained other tables; how can I name each table inside of "t = {...}" and retrieve it for other equations? Such as if ["a"] = 8, the rest of that table was to be utilized? For example:
t = {{["a"] = 2, ["b"] = 3, ["c"] = "IOS"}, {["a"] = 8, ["b"] = 9, ["c"] = "NVY"}, {["a"] = 1, ["b"] = 5, ["c"] = "CWQ"}}
if a = 8, then b = 9 and c = "NVY"
I would like my function to search k (of each table) and compare it with the input. If that was found, then it would set the other two local variables to b and c?
Thanks for your help!
I will only answer question 1, as 2 and 3 should be separate questions. There are many ways to do this based on specifics you don't mention but assuming you have a table t like this:
t={
{["a"] = 2, ["b"] = 3, ["c"] = "IOS"},
{["a"] = 8, ["b"] = 9, ["c"] = "NVY"},
{["a"] = 1, ["b"] = 5, ["c"] = "CWQ"}
}
then a function that takes an a key value to look for and returns b and c:
function findItem(a, yourTable)
    for i, tt in ipairs(yourTable) do
        if tt.a == a then
            return i, tt.b, tt.c
        end
    end
end
With this, if the input is k, then
i, b, c = findItem(k, t)
if i == nil then
    print('could not find k')
else
    print('found k at index ' .. i)
end
The findItem could of course just return the subtable found, and maybe you don't need index:
function findItem(a, yourTable)
    for i, tt in ipairs(yourTable) do
        if tt.a == a then
            return tt
        end
    end
end

HowTo: select all rows in a cell array, where a particular column has a particular value

I have a cell array, A. I would like to select all rows where the first column (for example) has the value 1234 (for example).
When A is not a cell array, I can accomplish this by:
B = A(A(:,1) == 1234,:);
But when A is a cell array, I get this error message:
error: binary operator `==' not implemented for `cell' by `scalar' operations
Does anyone know how to accomplish this, for a cell array?
The problem is the expression a(:,1) == 1234 (and also a{:,1} == 1234).
For example:
octave-3.4.0:48> a
a =
{
[1,1] = 10
[2,1] = 13
[3,1] = 15
[4,1] = 13
[1,2] = foo
[2,2] = 19
[3,2] = bar
[4,2] = 999
}
octave-3.4.0:49> a(:,1) == 13
error: binary operator `==' not implemented for `cell' by `scalar' operations
octave-3.4.0:49> a{:,1} == 13
error: binary operator `==' not implemented for `cs-list' by `scalar' operations
I don't know if this is the simplest or most efficient way to do it, but this works:
octave-3.4.0:49> cellfun(#(x) isequal(x, 13), a(:,1))
ans =
0
1
0
1
octave-3.4.0:50> a(cellfun(@(x) isequal(x, 13), a(:,1)), :)
ans =
{
[1,1] = 13
[2,1] = 13
[1,2] = 19
[2,2] = 999
}
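For comparison only (this is Python, not Octave): the same "keep the rows whose first column equals a value" filter, sketched over a plain list of heterogeneous rows:

```python
# Toy rows mirroring the cell array above; the first element is the key.
a = [[10, 'foo'], [13, 19], [15, 'bar'], [13, 999]]

# Keep only the rows whose first element equals 13,
# like the cellfun-based logical indexing in the answer.
b = [row for row in a if row[0] == 13]
print(b)  # [[13, 19], [13, 999]]
```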
I guess the class of A is cell (you can see it in the Workspace box).
So you may need to convert A to a matrix with cell2mat(A).
Then, just as in Matlab, you can do what you did: B = A(A(:,1) == 1234,:);
I don't have Octave available at the moment to try it out, but I believe that the following would do it:
B = A(A{:,1} == 1234,:);
When dealing with cells, () returns the cell and {} returns the contents of the cell.