Deriving multiple fields in SPSS Modeler (using unique fields each time)

I'm using SPSS Modeler to calculate the accuracy of various weekly estimates. I have a data table with a couple of weeks of estimates and a couple of weeks of actual data points. Currently, to calculate the errors per week I have to use a derive node for every week, e.g. Wk1 error <- (Wk1 estimate - Wk1 actual). This is naturally inefficient when considering many weeks. Is there a way to derive all of these error columns at once?
In the data below, can we get 3 new columns for week1, week2, week3 error?
Sample Data

I'm not sure if this is possible within the normal Modeler GUI, as you would need to parse the respective field names and use them as parameters within a derive (multiple) node.
You could, however, use the Modeler Scripting Language. See the following script, which creates a userinput node, fills it with some sample data, and dynamically creates the derive nodes based on the weekX columns:
import modeler.api
import re
s = modeler.script.stream()
# create user input node and fill with sample data
input_node = s.create("userinput", "Userinput Node")
input_node.setPropertyValue("names", ["group", "week1 (estimate)", "week2 (estimate)", "week3 (estimate)", "week1 (actual)", "week2 (actual)", "week3 (actual)"])
input_node.setKeyedPropertyValue("data", "group", '"A" "B" "C" "D"')
input_node.setKeyedPropertyValue("data", "week1 (estimate)", '1 2 5 1')
input_node.setKeyedPropertyValue("data", "week2 (estimate)", '2 3 1 4')
input_node.setKeyedPropertyValue("data", "week3 (estimate)", '1 1 1 1')
input_node.setKeyedPropertyValue("data", "week1 (actual)", '1 3 2 5')
input_node.setKeyedPropertyValue("data", "week2 (actual)", '1 3 6 2')
input_node.setKeyedPropertyValue("data", "week3 (actual)", '1 1 1 1')
input_node.setKeyedPropertyValue("custom_storage", "group", "String")
input_node.setKeyedPropertyValue("custom_storage", "week1 (estimate)", "Integer")
input_node.setKeyedPropertyValue("custom_storage", "week2 (estimate)", "Integer")
input_node.setKeyedPropertyValue("custom_storage", "week3 (estimate)", "Integer")
input_node.setKeyedPropertyValue("custom_storage", "week1 (actual)", "Integer")
input_node.setKeyedPropertyValue("custom_storage", "week2 (actual)", "Integer")
input_node.setKeyedPropertyValue("custom_storage", "week3 (actual)", "Integer")
input_node.setPropertyValue("data_mode", "Ordered")
# alternatively (e.g. data from database, flatfile): find input node by id and read fieldlist
#input_node = s.findByID("id1UTWII7ZNZF")
#fieldlist = []
#for field in input_node.getOutputDataModel().iterator():
#    fieldlist.append(field)
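# note: iterating the data model yields field objects; to feed the regex below
# you likely want the field *name* rather than the object, e.g. (assuming the
# modeler.api field object exposes getColumnName()):
#     fieldlist.append(field.getColumnName())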
# get fieldlist from input node
fieldlist = input_node.getPropertyValue("names")
print(fieldlist)
# loop through field list and assemble dict
weekdic = {}
p = re.compile('[wW]eek.?[0-9]')
for field in fieldlist:
    print(field)
    m = p.match(field)
    if m:
        # key on the week number, i.e. the last character of the match
        key = m.group()[-1]
        try:
            weekdic[key].append(field)
        except KeyError:
            weekdic[key] = [field]
derive = False
# loop through the dict and create a derive node per week, chaining them together
for i in weekdic:
    derive_temp = s.create("derive", "My node" + i)
    if derive:
        s.link(derive, derive_temp)
    else:
        s.link(input_node, derive_temp)
    derive = derive_temp
    # estimate minus actual (the estimate field is listed before the actual
    # field in the names list above, so weekdic[i][0] is the estimate)
    derive.setPropertyValue("new_name", "week" + i + " (error)")
    derive.setPropertyValue("result_type", "Formula")
    derive.setPropertyValue("formula_expr", "'" + weekdic[i][0] + "' - '" + weekdic[i][1] + "'")
# link last derive node to outputtable
tablenode = s.createAt("table", "Results", 288, 64)
s.link(derive, tablenode)
results = []
tablenode.run(results)
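In recent Modeler versions you can paste this into the stream script pane (Tools > Stream Properties > Execution) and run it from there. One caveat: the pattern [wW]eek.?[0-9] only captures a single digit, so week1 and week10 would end up under the same key. If your data spans ten or more weeks, a small tweak to the key extraction avoids that (a sketch, assuming the same "weekN (estimate)" / "weekN (actual)" naming convention):

import re

# allow one optional non-digit separator, then capture all the digits,
# so "week1" -> "1" and "week12" -> "12" get distinct keys
p = re.compile(r'[wW]eek[^0-9]?([0-9]+)')

m = p.match("week12 (estimate)")
if m:
    key = m.group(1)  # "12" -- use this as the weekdic key instead of m.group()[-1]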

Related

I need to make a dynamic aggregation in Power Query, by summing or concatenating the duplicated values in my tables

Here's an example of my data:
Sample    Method A  Method B  Method C  Method D  Method E  BATCH Nu  Lab Data
Sample 1  1         2                             8         TX_0001   LAB1
Sample 1                      5         9                   TX_0002   LAB2
Sample 2  7         8         8         23                  TX_0001   LAB1
Sample 2                                          41        TX_0001   LAB2
Sample 3  11        55                                      TX_0394   LAB2
Sample 4  2         9         5         9                   TX_0394   LAB1
I need to write M Language code that unites them, based on duplicated samples. Note that they might be in the same batch and/or in the same lab, but the same method won't ever be run twice on a sample.
So I can't hard-code the column names, because they keep changing; I want to pass the column names dynamically.
OBS: I have the possibility of making a linked table of the source in Microsoft Access and doing this with SQL, but I couldn't find a text aggregation function in the MS Access library. There it's possible to reference each column name with no problem. (It's just that no one else in my company knows M Language, and I can't let this be non-automated.)
This is what I have been trying to improve, but I keep getting errors:
1. Both grouped columns have "Error" in all of the cells
2. Evaluation runs out of memory
I can't figure out what I'm doing wrong here.
let
    Source = ALS,
    schema = Table.Schema(Source),
    columns = schema[Name],
    types = schema[Kind],
    Table = Table.FromColumns({columns, types}),
    Number_Columns = Table.SelectRows(Table, each ([Column2] = "number")),
    Other_Columns = Table.SelectRows(Table, each ([Column2] <> "number")),
    numCols = Table.Column(Number_Columns, "Column1"),
    textColsSID = List.Select(Table.ColumnNames(Source), each Table.Column(Source, _) <> type number),
    textCols = List.RemoveItems(textColsSID, {"Sample ID"}),
    groupedNum = Table.Group(Source, {"Sample ID"}, List.Transform(numCols, each {_, (nmr) => List.Sum(nmr), type nullable number})),
    groupedText = Table.Group(Source, {"Sample ID"}, List.Transform(textCols, each {_, (tbl) => Text.Combine(tbl, "_")})),
    merged = Table.NestedJoin(groupedNum, {"Sample ID"}, groupedText, {"Sample ID"}, "merged"),
    expanded = Table.ExpandTableColumn(merged, "merged", Table.ColumnNames(merged{1}[merged]))
in
    expanded
This is what I expected to have:
Sample    Method A  Method B  Method C  Method D  Method E  BATCH Nu         Lab Data
Sample 1  1         2         5         9         8         TX_0001_TX_0002  LAB1_LAB2
Sample 2  7         8         8         23        41        TX_0001_TX_0001  LAB1_LAB1
Sample 3  11        55                                      TX_0394          LAB2
Sample 4  2         9         5         9                   TX_0394          LAB1
Here is a method which assumes only that the first column will be used to group the different samples.
It makes no assumptions about any column names, or the number of columns.
It tests the first 10 rows in each column (after removing any nulls) to determine whether the column type can be type number; otherwise it will assume type text.
If there are other possible data types, the type detection code can be expanded.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],

    //dynamically detect data types from first ten rows
    //only detecting "text" and "number"
    colNames = Table.ColumnNames(Source),
    checkRows = 10,
    colTestTypes = List.Generate(
        () => [t =
                let
                    Values = List.FirstN(Table.Column(Source, colNames{0}), checkRows),
                    tryNumber = List.Transform(List.RemoveNulls(Values), each (try Number.From(_))[HasError])
                in
                    tryNumber, idx = 0],
        each [idx] < List.Count(colNames),
        each [t =
                let
                    Values = List.FirstN(Table.Column(Source, colNames{[idx] + 1}), checkRows),
                    tryNumber = List.Transform(List.RemoveNulls(Values), each (try Number.From(_))[HasError])
                in
                    tryNumber, idx = [idx] + 1],
        each [t]),
    colTypes = List.Transform(colTestTypes, each if List.AllTrue(_) then type text else type number),

    //Group and Sum or Concatenate columns, keying on the first column
    group = Table.Group(Source, {colNames{0}},
        {"rw", (t) =>
            Record.FromList(
                List.Generate(
                    () => [rw = if colTypes{1} = type number
                                then List.Sum(Table.Column(t, colNames{1}))
                                else Text.Combine(Table.Column(t, colNames{1}), "_"),
                           idx = 1],
                    each [idx] < List.Count(colNames),
                    each [rw = if colTypes{[idx] + 1} = type number
                                then List.Sum(Table.Column(t, colNames{[idx] + 1}))
                                else Text.Combine(Table.Column(t, colNames{[idx] + 1}), "_"),
                           idx = [idx] + 1],
                    each [rw]),
                List.RemoveFirstN(colNames, 1)), type record}
    ),

    //expand the record column and set the data types
    #"Expanded rw" = Table.ExpandRecordColumn(group, "rw", List.RemoveFirstN(colNames, 1)),
    #"Set Data Type" = Table.TransformColumnTypes(#"Expanded rw", List.Zip({colNames, colTypes}))
in
    #"Set Data Type"
Original Data
Results
One way. You could probably do this all within the group step as well:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    names = List.Distinct(List.Select(Table.ColumnNames(Source), each Text.Contains(_, "Method"))),
    #"Grouped Rows" = Table.Group(Source, {"Sample"}, {{"data", each _, type table}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Batch Nu", each Text.Combine(List.Distinct([data][BATCH Nu]), "_")),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Lab Data", each Text.Combine(List.Distinct([data][Lab Data]), "_")),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "Custom", each Table.SelectRows(Table.UnpivotOtherColumns([data], {"Sample"}, "Attribute", "Value"), each List.Contains(names, [Attribute]))),
    #"Added Custom3" = Table.AddColumn(#"Added Custom2", "Custom.1", each Table.Pivot([Custom], List.Distinct([Custom][Attribute]), "Attribute", "Value", List.Sum)),
    #"Expanded Custom.1" = Table.ExpandTableColumn(#"Added Custom3", "Custom.1", names, names),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Custom.1", {"data", "Custom"})
in
    #"Removed Columns"

Sorting with csv library, error says my dates don't match '%Y-%m-%d' format when they do

I'm trying to sort a CSV by date first then time second. With Pandas, it was easy by using df = df.sort_values(by=['Date', 'Time_UTC']). In the csv library, the code is (from here):
with open('eqph_csv_29May2020_noF_5lines.csv') as file:
    reader = csv.DictReader(file, delimiter=',')
    date_sorted = sorted(reader, key=lambda Date: datetime.strptime('Date', '%Y-%m-%d'))
    print(date_sorted)
The datetime documentation clearly says these codes are right. Here's a sample CSV (shown without delimiters):
Date Time_UTC Latitude Longitude
2020-05-28 05:17:31 16.63 120.43
2020-05-23 02:10:27 15.55 121.72
2020-05-20 12:45:07 5.27 126.11
2020-05-09 19:18:12 14.04 120.55
2020-04-10 18:45:49 5.65 126.54
csv.DictReader returns an iterator that yields a dict for each row in the csv file. To sort it on a column from each row, you need to specify that column in the sort function:
date_sorted = sorted(reader, key=lambda row: datetime.strptime(row['Date'], '%Y-%m-%d'))
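(The error in the title comes from the original lambda parsing the literal string 'Date' rather than the row's value; calling strptime on that literal, purely for illustration, reproduces the same message:)

from datetime import datetime

datetime.strptime('Date', '%Y-%m-%d')
# ValueError: time data 'Date' does not match format '%Y-%m-%d'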
To sort on both Date and Time_UTC, you could combine them into one string and convert that to a datetime:
date_sorted = sorted(reader, key=lambda row: datetime.strptime(row['Date'] + ' ' + row['Time_UTC'], '%Y-%m-%d %H:%M:%S'))
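As a side note, because ISO-8601 dates and times compare lexicographically in chronological order, a plain string tuple also works as a key; a minimal sketch, assuming the two columns always use these formats:

date_sorted = sorted(reader, key=lambda row: (row['Date'], row['Time_UTC']))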
Nick's answer worked, and I used it to revise mine. I used csv.reader() instead.
import csv
from datetime import datetime

lat, lon = [], []
with open('eqph_csv_29May2020_noF_20lines.csv') as file:
    reader = csv.reader(file, delimiter=',')
    next(reader)  # skip the header row
    date_sorted = sorted(reader, key=lambda row: datetime.strptime(
        row[0] + ' ' + row[1], '%Y-%m-%d %H:%M:%S'))
    for row in date_sorted:
        lat.append(float(row[2]))  # column 2 is Latitude
        lon.append(float(row[3]))  # column 3 is Longitude
# build the pairs after the lists are filled; zip() is lazy in Python 3,
# so creating it before the appends happened to work, but this is clearer
for i in zip(lat, lon):
    print(i)
Result
(6.14, 126.2)
(14.09, 121.36)
(13.74, 120.9)
(6.65, 125.42)
(6.61, 125.26)
(5.49, 126.57)
(5.65, 125.61)
(11.33, 124.64)
(11.49, 124.42)
(15.0, 119.79) # 2020-03-19 06:33:00
(14.94, 120.17) # 2020-03-19 06:49:00
(6.7, 125.18)
(5.76, 125.14)
(9.22, 124.01)
(20.45, 122.12)
(5.65, 126.54)
(14.04, 120.55)
(5.27, 126.11)
(15.55, 121.72)
(16.63, 120.43)

uniting lists out of the googleway package for google directions?

I'm working on looping through longitude and latitude points for the googleway API. I've come up with two ways to do that, in an effort to access the points sections shown in the following link:
https://cran.r-project.org/web/packages/googleway/vignettes/googleway-vignette.html
Unfortunately, since this uses a unique key I can't provide a reproducible example, but below are my attempts: one using mapply and the other with a loop. Both work in producing a response in list format; however, I am not sure how to unpack it to pull out the points route as you would when passing only one point:
df$routes$overview_polyline$points
Any suggestions?
library(googleway)
dir_results = mapply(
  myfunction,
  origin = feed$origin,
  destination = feed$destination,
  departure = feed$departure
)
OR
empty_df = NULL
for (i in 1:nrow(feed)) {
  print(i)
  output = google_directions(feed[i, "origin"],
                             feed[i, "destination"],
                             mode = c("driving"),
                             departure_time = feed[i, "departure"],
                             arrival_time = NULL,
                             waypoints = NULL, alternatives = FALSE, avoid = NULL,
                             units = c("metric"), key = chi_directions, simplify = T)
  empty_df = rbind(empty_df, output)
}
EDIT: The intended output would be a data frame like the one below, where "id" represents the original trip fed in.
lat lon id
1 40.71938 -73.99323 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
2 40.71992 -73.99292 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
3 40.71984 -73.99266 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
4 40.71932 -73.99095 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
5 40.71896 -73.98981 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
6 40.71824 -73.98745 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
7 40.71799 -73.98674 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
8 40.71763 -73.98582 40.7193908691406+-73.9932174682617 40.7096214294434+-73.9497909545898
EDIT: dput output provided in response to the question about converting the data frame to a pair list:
structure(list(origin = c("40.7193908691406 -73.9932174682617",
"40.7641792297363 -73.9734268188477", "40.7507591247559 -73.9739990234375"
), destination = c("40.7096214294434-73.9497909545898", "40.7707366943359-73.9031448364258",
"40.7711143493652-73.9871368408203")), .Names = c("origin", "destination"
), row.names = c(NA, 3L), class = "data.frame")
The SQL code is basic and looks like this:
feed = sqlQuery(con, paste("select top 10
                           longitude as px,
                           latitude as py,
                           dlongitude as dx,
                           dlatitude as dy
                           from mydb"))
and then, before feeding it in, my data frame feed looks like this (you can ignore departure; I was using that for the distance API):
origin destination departure
1 40.7439613342285 -73.9958724975586 40.716911315918-74.0121383666992 2017-03-03 01:00:32
2 40.7990493774414 -73.9685516357422 40.8066520690918-73.9610137939453 2017-03-03 01:00:33
3 40.7406234741211 -74.0055618286133 40.7496566772461-73.9834671020508 2017-03-03 01:00:33
4 40.7172813415527 -73.9953765869141 40.7503852844238-73.9811019897461 2017-03-03 01:00:33
5 40.7603607177734 -73.9817123413086 40.7416114807129-73.9795761108398 2017-03-03 01:00:34
As you know, the API query returns a list, and if you're doing multiple calls to the API you'll get back multiple lists.
So to extract the data of interest you have to do standard operations on lists. In this example it can be done with a couple of *applys.
Using the data.frame feed, where each row consists of an origin lat/lon (px/py) and a destination lat/lon (dx/dy):
feed <- data.frame(px = c(40.7193, 40.7641),
py = c(-73.993, -73.973),
dx = c(40.7096, 40.7707),
dy = c(-73.949, -73.903))
You can use an apply to query the google_directions() API for each row of the data.frame. And within the same apply you can do whatever you want with the result to extract/format it how you want.
lst <- apply(feed, 1, function(x){

  ## query the Google Directions API
  res <- google_directions(key = key,
                           origin = c(x[['px']], x[['py']]),
                           destination = c(x[['dx']], x[['dy']]))

  ## decode the polyline
  df_route <- decode_pl(res$routes$overview_polyline$points)

  ## append the original coordinates as an 'id' column
  df_route[, "id"] <- paste0(paste(x[['px']], x[['py']], sep = "+")
                             , " "
                             , paste(x[['dx']], x[['dy']], sep = "+")
                             , collapse = " ")

  ## store the result of the query, the decoded polyline,
  ## and the original query coordinates in a list
  lst_result <- list(route = df_route,
                     full_result = res,
                     origin = c(x[['px']], x[['py']]),
                     destination = c(x[['dx']], x[['dy']]))
  return(lst_result)
})
So now lst is a list that contains the result of each query, plus the decoded polyline as a data.frame. To get all the decoded polylines as a single data.frame you can do another lapply, and then rbind it all together
## do what we want with the result, for example bind all the route coordinates into one data.frame
df <- do.call(rbind, lapply(lst, function(x) x[['route']]))
head(df)
lat lon id
1 40.71938 -73.99323 40.7193+-73.993 40.7096+-73.949
2 40.71992 -73.99292 40.7193+-73.993 40.7096+-73.949
3 40.71984 -73.99266 40.7193+-73.993 40.7096+-73.949
4 40.71932 -73.99095 40.7193+-73.993 40.7096+-73.949
5 40.71896 -73.98981 40.7193+-73.993 40.7096+-73.949
6 40.71824 -73.98745 40.7193+-73.993 40.7096+-73.949

Move values in a data frame from one column to another based on matching criteria

I am receiving output from a JSON object; however, the JSON returns three fields, sometimes two, sometimes one, depending on the input. As a result I have a data frame which looks like this:
mixed score type
1 1 0.0183232 positive
2 neutral <NA> <NA>
3 -0.566558 negative <NA>
4 0.473484 positive <NA>
5 0.856743 positive <NA>
6 -0.422655 negative <NA>
Mixed can take values of 1 or 0
Score can take a positive or negative value between -1 and +1
Type can take a value of either positive, negative or neutral
I'm wondering how I can rearrange the values in the data.frame so that they are in the correct column i.e.
mixed score type
1 1 0.018323 positive
2 <NA> <NA> neutral
3 <NA> -0.566558 negative
4 <NA> 0.473484 positive
5 <NA> 0.856743 positive
6 <NA> -0.422655 negative
Not an elegant solution at all, but the best I could come up with.
### Seed initial Dataframe
library(data.table)  # for setnames()

mixed = c("1", "neutral", "0.473484", "-0.566558", "0.856743", "-0.422655", "-0.692675")
score = c("0.0183232", "0", "positive", "negative", "positive", "negative", "negative")
type = c("positive", "0", "0", "0", "0", "0", "0")
df = data.frame(mixed, score, type, stringsAsFactors = FALSE)

# Create a new DF (3 cols by nrow(df) size) for output, filled with NA
df.2 <- as.data.frame(matrix(NA, ncol = 3, nrow = nrow(df)))
setnames(df.2, old = c("V1", "V2", "V3"), new = c("mixed", "score", "type"))
df.2

# Check each column cell by cell; whenever a value matches, copy it to the shadow dataframe
# Set all <NA> values to 0
df[is.na(df)] <- 0
# Set iteration length to the number of rows
n <- nrow(df)

# Check the mixed column for '1' and copy it to the new frame
for (l in 1:n)
  if (df$mixed[l] == '1') {
    df.2$mixed[l] <- df$mixed[l]
  }
# Check the mixed column for a value less than '1' (a string comparison, which works
# here because scores start with '-' or '0.') and copy it to the score column
for (l in 1:n)
  if (df$mixed[l] != '0' & df$mixed[l] < '1') {
    df.2$score[l] <- df$mixed[l]
  }
# Check the mixed column for positive/negative/neutral and copy it to the type column
for (l in 1:n)
  if (df$mixed[l] %in% c("positive", "negative", "neutral")) {
    df.2$type[l] <- df$mixed[l]
  }
# The same three checks for the score column
for (l in 1:n)
  if (df$score[l] == '1') {
    df.2$mixed[l] <- df$score[l]
  }
for (l in 1:n)
  if (df$score[l] != '0' & df$score[l] < '1') {
    df.2$score[l] <- df$score[l]
  }
for (l in 1:n)
  if (df$score[l] %in% c("positive", "negative", "neutral")) {
    df.2$type[l] <- df$score[l]
  }
# And the same three checks for the type column; the extra != '0' guard keeps the
# 0 placeholders from erasing data already copied into the new frame
for (l in 1:n)
  if (df$type[l] == '1') {
    df.2$mixed[l] <- df$type[l]
  }
for (l in 1:n)
  if (df$type[l] != '0' & df$type[l] < '1') {
    df.2$score[l] <- df$type[l]
  }
for (l in 1:n)
  if (df$type[l] %in% c("positive", "negative", "neutral")) {
    df.2$type[l] <- df$type[l]
  }

Available Filters With Specified Ranges In SSRS

I am working on a Chart in my report.
As I have too many records where CountId = 1, I have set up a filter showing an available values list like this:
CountId :
1
2
3
Between 4 to 6
Between 7 to 9
Above 10
If I set the available value to 1, 2 or 3 it shows results, but I don't know how to set a filter for 'between' and 'above'.
I want a filter something like this - the available filters are:
1
2
3
4
Above 5 (i.e. greater than or equal to 5)
You've got a mix of operators, so maybe you should look at an expression-based filter to handle these different cases, something like:
Expression (type Text):
=Switch(Parameters!Count.Value = "1" and Fields!Count.Value = 1, "Include"
, Parameters!Count.Value = "2" and Fields!Count.Value = 2, "Include"
, Parameters!Count.Value = "3" and Fields!Count.Value = 3, "Include"
, Parameters!Count.Value = "4 to 6" and Fields!Count.Value >= 4 and Fields!Count.Value <= 6, "Include"
, Parameters!Count.Value = "7 to 9" and Fields!Count.Value >= 7 and Fields!Count.Value <= 9, "Include"
, Parameters!Count.Value = "Above 10" and Fields!Count.Value >= 10, "Include"
, true, "Exclude")
Operator:
=
Value:
Include
This assumes a string parameter Count populated with the above values.
This works by calculating the parameter and field combinations to produce a constant, either Include or Exclude, then displaying all rows that return Include.
As mentioned in a comment, it's difficult to follow exactly what you're asking here. I've done my best but if you have more questions it would be best to update the question with some sample data and how you'd like this data displayed.