Tidymodels recipes - add a step that just applies a feature engineering function?

A lot of feature engineering steps are transforms that do not need to be 'trained' on a dataset, for example creating a new column x2 as x2 = 2*x1. These 'static transforms' are different from 'trainable' transforms such as demeaning and rescaling.
Instead of relying on recipes package functions such as step_mutate(), I would like to define a function, e.g. do_static_transforms(), that takes in a tibble and outputs a transformed tibble, and add this as the first step of a recipe. Alternatively, I would like to add it as the first step of a workflow (another tidymodels package).
Is this a sensible and possible thing to do?

I recommend that you consider implementing these kinds of "static transforms" in a data manipulation step before you start using recipes or other tidymodels packages. For example, if you wanted to take the log() of an outcome such as price or divide a column by a scalar, you could do this before starting with tidymodels:
library(tidymodels)
#> ── Attaching packages ─────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8 ✓ rsample 0.0.7
#> ✓ dplyr 1.0.0 ✓ tibble 3.0.3
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.0
#> ✓ infer 0.5.3 ✓ tune 0.1.1
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.2
#> ✓ parsnip 0.1.2 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
data(ames)
ames_transformed <- ames %>%
  mutate(Sale_Price = log(Sale_Price),
         Lot_Area = Lot_Area / 1e3)
Created on 2020-07-17 by the reprex package (v0.3.0)
Then this ames_transformed object is what you would start from when splitting into training and testing sets. For predicting on new observations, you would apply the same transformations. Because these transformations are not learned from the data, there is no risk of data leakage.
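One way to keep those static transforms consistent between the training data and new observations is to wrap them in a plain function and call the same function everywhere. This is only a sketch (it is not from the original answer; initial_split() is the usual rsample splitting function loaded with tidymodels):
transform_static <- function(df) {
  df %>%
    mutate(Sale_Price = log(Sale_Price),
           Lot_Area = Lot_Area / 1e3)
}
ames_transformed <- transform_static(ames)
ames_split <- initial_split(ames_transformed)  # then split into training/testing as usual
# Later, apply the identical function to new observations before predicting:
# new_data_transformed <- transform_static(new_data)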

Related

Duplication of terra rasters and side effects

When modifying some of the attributes of a duplicated SpatRaster, the original is also modified:
library(terra)
r <- rast(ncol=2, nrow=2, vals=c(5.3, 7.1, 3, 1.2))
r
#class : SpatRaster
#dimensions : 2, 2, 1 (nrow, ncol, nlyr)
#resolution : 180, 90 (x, y)
#extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
#coord. ref. : +proj=longlat +datum=WGS84 +no_defs
#source : memory
#name : lyr.1
#min value : 1.2
#max value : 7.1
xmin(r)
#[1] -180
t <- r # duplication
xmin(t) <- -300 # xmin modification of the duplicated SpatRaster
xmin(r) # the original SpatRaster has also been modified
#[1] -300
Is it an error or a choice? It only occurs for some attributes, not all. If it is a choice, what is the way to create an 'independent' copy, or how do you break the link?
This happened because a SpatRaster is just a wrapper around a C++ object. That makes x (below) a shallow copy (i.e. pointing to the same object in memory):
library(terra)
r <- rast()
x <- r
It only matters in some cases, when using replacement methods (your example is no longer affected in the current version of terra). I have also added a copy method that returns a deep copy, that is, a SpatRaster pointing to a different (deep-copied) C++ object.
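As a quick sketch of what the deep copy buys you (the exact function name is an assumption here; depending on the terra version the deep-copy method may be exported as copy() or deepcopy(), so check the package documentation):
library(terra)
r <- rast()
x <- r          # shallow copy: x and r wrap the same C++ object
y <- copy(r)    # deep copy as described above (function name assumed; see note)
xmin(y) <- -300 # modifying the deep copy ...
xmin(r)         # ... leaves the original unchanged: still -180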
For information, the add() function has the same problem (in terra 1.0.11):
logo <- rast(system.file("ex/logo.tif", package="terra"))
nlyr(logo)
# [1] 3
r <- logo
add(r) <- r[[1]]
nlyr(logo)
# [1] 4
There may be a misunderstanding here, quite possibly on my side. The fact that add(r) <- r[[1]] changes r is perfectly normal, as you point out. But the fact that it also changes logo is not at all expected (according to the documentation, the add function is supposed to add a layer to r, not to any other object that does not appear on that script line).
If I understood correctly, and as you explained before, it is because r <- logo doesn't duplicate logo (deep copy) but only creates a pointer (shallow copy) to logo. This choice has important consequences for the use of terra objects, since later modifications of r will also modify the logo object (a "side effect", from the user's point of view). I see at least 3 points here:
1. Unusual behaviour. Perhaps the user should be warned.
"In R semantics, objects are copied by value. This means that modifying the copy leaves the original object intact." Deep or lazy copies are a basic rule of R (https://rlang.r-lib.org/reference/duplicate.html). The user must therefore be well aware that simple assignments in terra (r <- logo) do not copy by value but by reference, otherwise the risk of errors in scripts is high. However, this unusual behaviour is currently not explained in the terra documentation.
2. Shallow copies are more of a programmer-oriented tool.
Once users understand that terra copies are shallow copies, they will probably most often prefer to make deep copies, because shallow copies have limited and uncommon uses for terra users. I don't see many cases where this behaviour would be useful; deep or lazy copies are much more common for users.
3. Heterogeneous behaviour depending on the function.
But that's not the main question. The main problem is that, for the moment, some operations affect both the original object and the copied object, and others don't: add modifies the original object (a side effect), but res doesn't change it:
logo <- rast(system.file("ex/logo.tif", package="terra"))
nlyr(logo)
# [1] 3
res(logo)
# [1] 1 1
r <- logo
add(r) <- r[[1]]
nlyr(logo)
# side effect of add function on the number of layers
# [1] 4
res(r) <- c(10,10)
res(logo)
# no side effect of res function on the resolution
# [1] 1 1
If all these conjectures are right, it would be useful to explain which functions, after an assignment, do or do not affect the original object, otherwise one cannot program reliably.

R: scraping dynamic links with rvest

I'm trying to scrape links to RSS feeds from the Internet Archive that sit under a 'dynamic' calendar using rvest; see this link as an example.
<div>
<div class="captures">
<div class="position" style="width: 20px; height: 20px;">
<div class="measure ">
</div>
</div>
12
</div>
<!-- react-empty: 2310 --></div>
For example,
url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")
doesn't return the links I'm interested in, and using XPath or html_nodes('.captures') returns empty results. Any hints would be very helpful, thanks!
One possibility is to use the wayback package (GL) (GH), which has support for querying the Internet Archive and reading in the HTML of saved pages ("mementos"). You can read a bit more about web archiving terminology (it's a bit arcane IMO) via http://www.mementoweb.org/guide/quick-intro/ & https://mementoweb.org/guide/rfc/ as starter resources.
library(wayback) # devtools::install_git(one of the superscript'ed links above)
library(rvest) # for reading the resulting HTML contents
library(tibble) # mostly for prettier printing of data frames
There are a number of approaches one could take. This is what I tend to do during forensic analysis of online content. YMMV.
First, we get the recorded mementos (basically a short-list of relevant content):
(rss <- get_mementos("http://www.dailyecho.co.uk/news/district/winchester/rss/"))
## # A tibble: 7 x 3
## link rel ts
## <chr> <chr> <dttm>
## 1 http://www.dailyecho.co.uk/news/district/winchester/rss/ original NA
## 2 http://web.archive.org/web/timemap/link/http://www.dailyecho.co… timemap NA
## 3 http://web.archive.org/web/http://www.dailyecho.co.uk/news/dist… timegate NA
## 4 http://web.archive.org/web/20090517035444/http://www.dailyecho.… first me… 2009-05-17 03:54:44
## 5 http://web.archive.org/web/20180712045741/http://www.dailyecho.… prev mem… 2018-07-12 04:57:41
## 6 http://web.archive.org/web/20180812213013/http://www.dailyecho.… memento 2018-08-12 21:30:13
## 7 http://web.archive.org/web/20180812213013/http://www.dailyecho.… last mem… 2018-08-12 21:30:13
The calendar-menu viewer thing at IA is really the "timemap". I like to work with this as it's the point-in-time memento list of all the crawls. It's the second link above so we'll read it in:
(tm <- get_timemap(rss$link[2]))
## # A tibble: 45 x 5
## rel link type from datetime
## <chr> <chr> <chr> <chr> <chr>
## 1 original http://www.dailyecho.co.uk:80/news/d… NA NA NA
## 2 self http://web.archive.org/web/timemap/l… applicatio… Sun, 17 May … NA
## 3 timegate http://web.archive.org NA NA NA
## 4 first memento http://web.archive.org/web/200905170… NA NA Sun, 17 May 20…
## 5 memento http://web.archive.org/web/200908130… NA NA Thu, 13 Aug 20…
## 6 memento http://web.archive.org/web/200911121… NA NA Thu, 12 Nov 20…
## 7 memento http://web.archive.org/web/201001121… NA NA Tue, 12 Jan 20…
## 8 memento http://web.archive.org/web/201007121… NA NA Mon, 12 Jul 20…
## 9 memento http://web.archive.org/web/201011271… NA NA Sat, 27 Nov 20…
## 10 memento http://web.archive.org/web/201106290… NA NA Wed, 29 Jun 20…
## # ... with 35 more rows
The content is in the mementos and there should be as many mementos there as you see in the calendar view. We'll read in the first one:
mem <- read_memento(tm$link)
# Ideally use writeLines(), now, to save this to disk with a good
# filename. Alternatively, stick it in a data frame with metadata
# and saveRDS() it. But, that's not a format others (outside R) can
# use so perhaps do the data frame thing and stream it out as ndjson
# with jsonlite::stream_out() and compress it during save or afterwards.
Then convert it to something we can use programmatically with xml2::read_xml() or xml2::read_html() (RSS is sometimes better parsed as XML):
read_html(mem)
## {xml_document}
## <html>
## [1] <body><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Daily Ec ...
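From there, pulling out the feed links the original question was after is just an XPath query on the parsed memento. A minimal sketch, assuming the memento body is plain RSS 2.0 XML (read_xml() is used instead of read_html() so that <link> is not treated as an HTML void element and its text is kept):
library(xml2)
feed <- read_xml(mem)                        # parse the memento as XML
xml_text(xml_find_all(feed, "//item/link"))  # one URL per feed item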
read_memento() has an as parameter to automagically parse the result but I like to store the mementos locally (as noted in the comments) so as not to abuse the IA servers (i.e. if I ever need to get the data again I don't have to hit their infrastructure).
A big caveat: if you try to get too many resources from the IA in a short period of time you'll get temporarily banned. They have scale, but it's a free service and they (rightfully) try to prevent abuse.
Definitely file issues on the package (pick your favourite source code hosting community to do so; I'll work with either, but prefer GitLab after the Microsoft takeover of GitHub) if anything is unclear or you feel something could be made better. It's not a popular package and I only have occasional need for forensic spelunking, so it "works for me", but I'll gladly try to make it more user-friendly (I just need to know the pain points).

Keras model gives wrong predictions of only 1 class

Background
I used Python and Keras to implement the model of [1].
The model's structure is described in Fig.3 of this paper:
[1] Zheng, Y.: Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks, 2014
Problem
The trained model gives predictions of only 1 class out of the 4 classes, for example [3, 3, 3, ..., 3] (all 3's).
My code at Github
Run main_q02.py
The model is defined in mcdcnn_3.py
Utility functions are defined in utils.py and PAMAP2Utils.py
Dataset Download
The code requires only two files:
PAMAP2_Dataset/Protocol/subject101.dat
PAMAP2_Dataset/Protocol/subject102.dat
About the dataset
The dataset classes are NOT balanced.
class: 0, 1, 2, 3
number of samples (%): 28.76%, 36.18%, 18.42%, 16.64%
Note: computed over all 7 subjects
Does one class dominate? Classes 0 and 1 together account for around 65% of all samples.
class 0: 28.76%
class 1: 36.18%
Additional details
Operating system: Ubuntu 14.04 LTS
Version of python packages:
Theano (0.8.2)
Keras (1.1.0)
numpy (1.13.0)
pandas (0.20.2)
Details of the model (from the paper):
"separate multivariate time series into univariate ones and perform feature learning on each univariate series individually." [1]
"adopt sigmoid function in all activation layers" [1]
"utilize average pooling without overlapping" [1]
use stochastic gradient descent (SGD) for learning
parameters: momentum = 0.9, decay = 0.0005, learning rate = 0.01

igraph - How to calculate the closeness measure in igraph for disconnected graphs

I use igraph in R to calculate graph measures. My graph is built from a PIN and is not a connected graph; it is disconnected.
The closeness method gives a correct result for the connected graph, but the result for the disconnected graph does not look right.
library(igraph)
# Create a graph edge list to test closeness centrality
g <- read.table(text="A B
1 2
2 4
3 4
3 5", header=TRUE)
gadj <- get.adjacency(graph.edgelist(as.matrix(g), directed=FALSE))
igObject <- graph.adjacency(gadj) # convert adjacency matrix to igraph object
gCloseness <- closeness(igObject, weights = NULL) # closeness for each vertex
Output:
[1] 0.1000000 0.1428571 0.1428571 0.1666667 0.1000000
My disconnected graph:
library(igraph)
# Create a graph edge list to test closeness centrality
g <- read.table(text="A B
1 2
3 4
3 5", header=TRUE)
gadj <- get.adjacency(graph.edgelist(as.matrix(g), directed=FALSE))
igObject <- graph.adjacency(gadj) # convert adjacency matrix to igraph object
gCloseness <- closeness(igObject, weights = NULL) # closeness for each vertex
Output:
[1] 0.06250000 0.06250000 0.08333333 0.07692308 0.07692308
Is this output right? And if so, how is it calculated?
Please read the documentation of the closeness function; it clearly states how igraph treats disconnected graphs:
If there is no (directed) path between vertex v and i then the total number of vertices is used in the formula instead of the path length.
The calculation then seems correct to me, although I would say that closeness centrality itself is not well-defined for disconnected graphs, and what igraph is using here is more of a hack (although a pretty standard hack) than a rigorous treatment of the problem. I would refrain from using closeness centrality on disconnected graphs.
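That said, the reported numbers are easy to verify by applying the documented rule by hand. A short sketch (not part of the original answer): replace unreachable distances with the vertex count and take the reciprocal of the row sums of the distance matrix:
library(igraph)
g2 <- graph_from_edgelist(cbind(c(1, 3, 3), c(2, 4, 5)), directed = FALSE)
d <- distances(g2)               # shortest path lengths; Inf where unreachable
d[is.infinite(d)] <- vcount(g2)  # substitute the total number of vertices (5)
1 / rowSums(d)                   # 0.0625 0.0625 0.0833... 0.0769... 0.0769..., matching closeness()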

Multiple regression with lagged time series using libsvm

I'm trying to develop a forecaster for electricity consumption, so I want to perform a regression using daily data for an entire year. My dataset has several features. From Googling, I've found that my problem is a multiple regression problem (please correct me if I am mistaken).
What I want to do is train an SVM for regression with several independent variables and one dependent variable with n lagged days. Here's a sample of my independent variables; I actually have around 10. (We used PCA to determine which variables had some correlation with our problem.)
Day Indep1 Indep2 Indep3
1 1.53 2.33 3.81
2 1.71 2.36 3.76
3 1.83 2.81 3.64
... ... ... ...
363 1.5 2.65 3.25
364 1.46 2.46 3.27
365 1.61 2.72 3.13
And independent variable 1 is actually my dependent variable in the future. So, for example, with p = 2 (lagged days) I would expect my SVM to train on the first 2 time steps of all three independent variables.
Indep1 Indep2 Indep3
1.53 2.33 3.81
1.71 2.36 3.76
And the output value of the dependent variable would be "1.83" (Indep1 at time 3).
My main problem is that I don't know how to set up the training properly. What I was doing is just putting the p lagged values of all features in one array for my "x" variables, and for my "y" variable I'm putting independent variable 1 at time p+1, since I want to predict the next day's power consumption.
Example of training.
x (with p = 2 and 3 independent variables)            y (next day)
[1.53, 2.33, 3.81, 1.71, 2.36, 3.76] [1.83]
I tried making x a two-dimensional array, but when you combine it for several days it becomes a 3D array, which libsvm says it cannot handle.
Perhaps I should change from libsvm to another tool or maybe it's just that I'm training incorrectly.
Thanks for your help,
Aldo.
Let me answer using Python / numpy notation.
Assume the original time series data matrix with columns (Indep1, Indep2, Indep3, ...) is a numpy array data with shape (n_samples, n_variables). Let's generate it randomly for this example:
>>> import numpy as np
>>> n_samples, n_variables = 100, 5
>>> data = np.random.randn(n_samples, n_variables)
>>> data.shape
(100, 5)
If you want to use a window size of 2 time-steps, then the training set can be built as follows:
>>> targets = data[2:, 0] # shape is (n_samples - 2,)
>>> targets.shape
(98,)
>>> features = np.hstack([data[0:-2, :], data[1:-1, :]]) # shape is (n_samples - 2, n_variables * 2)
>>> features.shape
(98, 10)
Now you have your 2D input array + 1D targets that you can feed to libsvm or scikit-learn.
Edit: it might very well be the case that extracting more time-series-oriented features such as moving average, moving min, moving max, moving differences (time-based derivatives of the signal) or STFT would help your SVM model make better predictions.