Error when using initial_time_split with drake - tidymodels

Learning how to use drake with tidymodels.
Something about using rsample's initial_time_split(), rather than just initial_split(), is giving me an error, when I run make(plan). I get the following:
#> > target data
#> > target split_data
#> Error in UseMethod("complement"): no applicable method for 'complement' applied to an object of class "rsplit"
Have really been racking my brain on this one. The function works fine, independently (ie. the following works):
I feel like I am missing something pretty basic.
Here is the full drake process in a single file (so that it is easier to post up on stack overflow).
Thanks in advance for the hints, as to what I'm doing wrong.
library(drake)
library(tidyverse)
library(tidymodels)
###################################################################
generate_data <- function() {
tibble(x = rnorm(1e5), y = rnorm(1e5))
}
split_the_data <- function(data) {
data %>%
initial_time_split()
}
fit_model <- function(data) {
summary(lm(y ~ x, data = data))
}
###################################################################
plan <- drake_plan(
data = generate_data(),
split_data = split_the_data(data),
model = fit_model(training(split_data))
)
###################################################################
make(plan)

This should now be fixed in the current development version (as of 93d60ef41119defc0432cc95d2dd6787e4a00b14). You can install it with
install.packages("remotes")
remotes::install_github("ropensci/drake")
The error happened because drake calls NROW() on every target (for dynamic branching purposes) and apparently NROW() errors on rsplit objects.
library(tidyverse)
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8 ✓ rsample 0.0.7
#> ✓ infer 0.5.3 ✓ tune 0.1.1
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.2
#> ✓ parsnip 0.1.2 ✓ yardstick 0.0.7
#> ── Conflicts ────────────────────────────────────────── tidymodels_conflicts() ──
#> x scales::discard() masks purrr::discard()
#> x dplyr::filter() masks stats::filter()
#> x recipes::fixed() masks stringr::fixed()
#> x dplyr::lag() masks stats::lag()
#> x yardstick::spec() masks readr::spec()
#> x recipes::step() masks stats::step()
generate_data <- function() {
tibble(x = rnorm(1e5), y = rnorm(1e5))
}
split_the_data <- function(data) {
data %>%
initial_time_split()
}
NROW(split_the_data(generate_data()))
#> Error in UseMethod("complement"): no applicable method for 'complement' applied to an object of class "rsplit"
Created on 2020-07-23 by the reprex package (v0.3.0)
Issue tracked in https://github.com/ropensci/drake/issues/1300.

Related

rvest error on form submission "`Form` doesn't contain a `action` attribute"

I am trying to send search requests with rvest, but I get always the same error. I have tried several ways included this solution: https://gist.github.com/ibombonato/11507d776d1042f80ca59cd31509afd3
My code is the following.
library(rvest)
url <- 'https://www.saferproducts.gov/PublicSearch'
cocorahs <- html_session(URL)
form.unfilled <- cocorahs %>% html_node("form") %>% html_form()
form.unfilled[["fields"]][[3]][["value"]] <- "input" ## This is the line which I think should be corrected
form.filled <- form.unfilled %>%
set_values("searchParameter.AdvancedKeyword" = "amazon")
session1 <- session_submit(cocorahs, form.filled, submit = NULL)
# or
session <- submit_form(cocorahs, form.filled)
But I get always the following error:
Error in `submission_build()`:
! `form` doesn't contain a `action` attribute
Run `rlang::last_error()` to see where the error occurred.
I think the way is to edit the attributes of those buttons. Maybe has someone the answer to this. Thanks in advance.
An alternative method with httr2
library(tidyverse)
library(rvest)
library(httr2)
data <- "https://www.saferproducts.gov/PublicSearch" %>%
request() %>%
req_body_form(
"searchParameter.Keyword" = "Amazon"
) %>%
req_perform() %>%
resp_body_html()
tibble(
title = data %>%
html_elements(".document-title") %>%
html_text2(),
report_title = data %>%
html_elements(".info") %>%
html_text2() %>%
str_remove_all("\r") %>%
str_squish()
)
#> # A tibble: 10 × 2
#> title repor…¹
#> <chr> <chr>
#> 1 Self balancing scooter was used off & on for three years. Consumer i… Incide…
#> 2 The consumer stated that when he opened one of the marshmallow roast… Incide…
#> 3 The consumer, 59, stated that he was welding with a brand new auto d… Incide…
#> 4 The consumer reported, that their hover soccer toy caught fire while… Incide…
#> 5 80 yr old male's electric hotplate was set between 1 and 2(of 5) bef… Incide…
#> 6 Amazon Recalls Amazon Basics Desk Chairs Due to Fall and Injury Haza… Recall…
#> 7 The Consumer reported to have been notified by email that the diarrh… Incide…
#> 8 consumer reported about light fixture attached to a photography umbr… Incide…
#> 9 Drive DeVilbiss Healthcare Recalls Adult Portable Bed Rails After Tw… Recall…
#> 10 MixBin Electronics Recalls iPhone Cases Due to Risk of Skin Irritati… Recall…
#> # … with abbreviated variable name ¹​report_title
Created on 2023-01-15 with reprex v2.0.2

Can't tune Naive Bayes' smoothness hyperparameter in tidymodels

I can't seem to set smoothness to tune as a hyperparameter in naive_Bayes(), whereas I'm able to do so with other models such as multinom_reg. What am I doing wrong?
library(tidymodels)
library(reprex)
nb_spec <- naive_Bayes(smoothness = tune()) %>%
set_engine('klaR') %>%
set_mode('classification')
nb_spec %>% extract_parameter_set_dials()
#> Collection of 0 parameters for tuning
#>
#> [1] identifier type object
#> <0 rows> (or 0-length row.names)
multinom_spec <- multinom_reg(penalty = tune(), mixture = tune()) %>%
set_engine('glmnet') %>%
set_mode('classification')
multinom_spec %>% extract_parameter_set_dials()
#> Collection of 2 parameters for tuning
#>
#> identifier type object
#> penalty penalty nparam[+]
#> mixture mixture nparam[+]
Created on 2022-06-08 by the reprex package (v2.0.1)
The naive_Bayes() engines are provided by the discrim package, so you need to load that package to be able to extract the parameter set.
library(tidymodels)
library(discrim)
nb_spec <- naive_Bayes(smoothness = tune()) %>%
set_engine('klaR') %>%
set_mode('classification')
nb_spec %>% extract_parameter_set_dials()
#> Collection of 1 parameters for tuning
#>
#> identifier type object
#> smoothness smoothness nparam[+]

Does TidyModels provide the capability to perform Many Models modeling?

I been using the TidyModels recently and was wondering if it has functionality for Many Models, similar to what was available in Modelr as presented in the Many Models chapter of R For Data Science.
I've looked through the documentation but have not seen this.
The tidymodels framework has more robust and expressive support for the kinds of tasks that modelr allows you to do, such as creating data resamples, piping models, etc. The broom package is part of tidymodels with verbs like tidy() and glance() important parts of the tidymodels approach to modeling, and the rsample package provides tools for resampling.
You may be interested in checking out how to use this kind of approach for bootstrap estimates of model parameters:
library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.6 ✓ recipes 0.1.12
#> ✓ dials 0.0.7 ✓ rsample 0.0.7
#> ✓ dplyr 1.0.0 ✓ tibble 3.0.1
#> ✓ ggplot2 3.3.1 ✓ tune 0.1.0
#> ✓ infer 0.5.2 ✓ workflows 0.1.1.9000
#> ✓ parsnip 0.1.1.9000 ✓ yardstick 0.0.6.9000
#> ✓ purrr 0.3.4
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
library(tidyr)
set.seed(123)
boots <- bootstraps(mtcars, times = 1000, apparent = TRUE)
fit_spline <- function(split) {
data <- analysis(split)
smooth.spline(data$wt, data$mpg, df = 4)
}
boot_models <- boots %>%
mutate(spline = map(splits, fit_spline))
boot_models %>%
sample_n(200) %>%
mutate(aug = map(spline, augment)) %>%
unnest(aug) %>%
ggplot(aes(x, y)) +
geom_line(aes(y = .fitted, group = id), alpha = .2, col = "darkcyan") +
geom_point()
Created on 2020-06-18 by the reprex package (v0.3.0.9001)
If you're interested in creating grids of model parameters, check out the dials package.

R - Scrape a number of URLs and save individually

Disclaimer: I'm not a programmer by trade and my knowledge of R is limited to say the least. I've also already searched Stackoverflow for a solution (but to no avail).
Here's my situation: I need to scrape a series of webpages and save the data (not quite sure in what format, but I'll get to that). Fortunately the pages I need to scrape have a very logical naming structure (they use the date).
The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p
I need to scrape everything from August 1st 2018 (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) until yesterday (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).
So far I've figured out to create a list of dates which can be appended to the base URL using the following:
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
I can append these to the base URL with the following:
url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)
However, that's pretty much as far as I've gotten (not very far, I know!) I assume I need to use a for loop but my own attempts at this have proved futile. Perhaps I'm not approaching this the right way?
In case it's not clear, what I'm trying to do is to visit each URL and save the html as an individual html file (ideally labelled with the relevant date). In truth, I don't need all of the html (just the list of programmes and times) but I can extract that information from the relevant files at a later date.
Any guidance on the best way to approach this would be much appreciated! And if you need any more info, just ask.
Have a look at the rvest package and associated tutorials. E.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.
The messy part is extracting the fields the way you want them.
Here is one possible solution:
library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates,"20%y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)
get_data <- function(url){
html <- tryCatch(read_html(url), error=function(e) NULL)
if(is.null(html)) return(data.table(
date=gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
title=NA, description=NA)) else {
time <- html %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(#class, 'broadcast__info grid 1/4 1/6#bpb2 1/6#bpw')]") %>%
rvest::html_text() %>% gsub(".*([0-9]{2}.[0-9]{2}).*", "\\1", .)
text <- html %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//div[contains(#class, 'programme__body')]") %>%
rvest::html_text() %>%
gsub("[ ]{2,}", " ", .) %>% gsub("[\n|\n ]{2,}", "\n", .) %>%
gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>%
gsub("^\n|\n$", "", .) %>%
str_split_fixed(., "\n", 2) %>%
as.data.table() %>% setnames(., c("title", "description")) %>%
.[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
time = time,
description = gsub("\n", " ", description))] %>%
setcolorder(., c("date", "time", "title", "description"))
text
}
}
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#> date time
#> 1: 2018/08/01 06:00
#> 2: 2018/08/01 09:15
#> 3: 2018/08/01 10:00
#> 4: 2018/08/01 11:00
#> 5: 2018/08/01 11:45
#> ---
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#> title
#> 1: Breakfast—01/08/2018
#> 2: Wanted Down Under—Series 11, Hanson Family
#> 3: Homes Under the Hammer—Series 21, Episode 6
#> 4: Fake Britain—Series 7, Episode 7
#> 5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#> ---
#> 16760: BBC London—Late News, 17/05/2020
#> 16761: Educating Rita
#> 16762: The Real Marigold Hotel—Series 4, Episode 2
#> 16763: Weather for the Week Ahead—18/05/2020
#> 16764: Joins BBC News—18/05/2020
#> description
#> 1: The latest news, sport, business and weather from the BBC's Breakfast team.
#> 2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#> 3: Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#> 4: 7/10 The fake sports memorabilia that cost collectors thousands. (R)
#> 5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#> ---
#> 16760: The latest news, sport and weather from London.
#> 16761: Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762: 2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763: Detailed weather forecast.
#> 16764: BBC One joins the BBC's rolling news channel for a night of news.
Created on 2020-05-18 by the reprex package (v0.3.0)

R: Vector of JSONs to data.frame

I have a vector of JSONs (of the same structure) and transform it to a data.frame. Following example does exactly what I want.
require(jsonlite) # fromJSON()
require(magrittr) # for the pipeline only
require(data.table) # rbindlist()
jsons <- c('{"num":1,"char":"a","list":{"x":1,"y":2}}',
'{"num":2,"char":"b","list":{"x":1,"y":2}}',
'{"num":3,"char":"c","list":{"x":1,"y":2}}')
df <- jsons %>%
lapply(fromJSON) %>%
lapply(as.data.frame.list, stringsAsFactors = F) %>%
rbindlist(fill = T)
Some elements of the JSON are objects, i.e. if I transform it fromJSON() some elements of the list will be lists as well. I cannot use unlist() to each list because I have different variable types so I am using as.data.frame.list() function. This is however too slow to do for each JSON individually. Is there a way how can I do it more effectively?
json <- '{"$schema":"http://json-schema.org/draft-04/schema#","title":"Product set","type":"array","items":{"title":"Product","type":"object","properties":{"id":{"description":"The unique identifier for a product","type":"number"},"name":{"type":"string"},"price":{"type":"number","minimum":0,"exclusiveMinimum":true},"tags":{"type":"array","items":{"type":"string"},"minItems":1,"uniqueItems":true},"dimensions":{"type":"object","properties":{"length":{"type":"number"},"width":{"type":"number"},"height":{"type":"number"}},"required":["length","width","height"]},"warehouseLocation":{"description":"Coordinates of the warehouse with the product","$ref":"http://json-schema.org/geo"}},"required":["id","name","price"]}}'
system.time(
df <- json %>% rep(1000) %>%
lapply(fromJSON) %>%
lapply(as.data.frame.list, stringsAsFactors = F) %>%
rbindlist(fill = T)
) # 2.72
I know that there are plenty of similar questions but most of the answers I saw was about using as.data.frame() or data.frame(). Nobody mentioned the speed. Maybe there is no better solution to this.
I finally found the answer. It will be on CRAN soon.
devtools::install_github("jeremystan/tidyjson")
tidyjson::spread_all()
This function is about 10-times faster than my example above.
Try to collapse all JSONs in the one string. Let's show example of the solution:
require(jsonlite)
require(data.table)
json <- '{"$schema":"http://json-schema.org/draft-04/schema#","title":"Product set","type":"array","items":{"title":"Product","type":"object","properties":{"id":{"description":"The unique identifier for a product","type":"number"},"name":{"type":"string"},"price":{"type":"number","minimum":0,"exclusiveMinimum":true},"tags":{"type":"array","items":{"type":"string"},"minItems":1,"uniqueItems":true},"dimensions":{"type":"object","properties":{"length":{"type":"number"},"width":{"type":"number"},"height":{"type":"number"}},"required":["length","width","height"]},"warehouseLocation":{"description":"Coordinates of the warehouse with the product","$ref":"http://json-schema.org/geo"}},"required":["id","name","price"]}}'
n <- 1000
ex <- rep(json, 1000)
f1 <- function(x) {
res <- lapply(x, fromJSON)
res <- lapply(res, as.data.frame.list, stringsAsFactors = FALSE)
res <- rbindlist(res, fill = TRUE)
return(res)
}
f2 <- function(x) {
res <- fromJSON(paste0("[", paste(x, collapse = ","), "]"), flatten = TRUE)
lst <- sapply(res, is.list)
res[lst] <- lapply(res[lst], function(x) as.data.table(transpose(x)))
res <- flatten(res)
return(res)
}
bench::mark(
f1(ex), f2(ex), min_iterations = 100, check = FALSE
)
#> # A tibble: 2 x 14
#> expression min mean median max `itr/sec` mem_alloc n_gc n_itr #> total_time result memory time
#> <chr> <bch:t> <bch:t> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int> #> <bch:tm> <list> <list> <lis>
#> 1 f1(ex) 2.27s 2.35s 2.32s 2.49s 0.425 0B 5397 100 #> 3.92m <data… <Rpro… <bch…
#> 2 f2(ex) 48.85ms 63.78ms 57.88ms 116.19ms 15.7 0B 143 100 #> 6.38s <data… <Rpro… <bch…
#> # … with 1 more variable: gc <list>