Is there a reason the xgboost code snippet from the usemodels package has one_hot set to TRUE? - tidymodels

Is there a reason the recipe code snippet for xgboost classifier has one_hot = TRUE? This creates "n" dummy variables instead of "n-1". I usually set it to FALSE but just want to make sure I'm not missing something.
Code -
library(dplyr)

data <- mtcars %>%
  as_tibble() %>%
  mutate(cyl = as.factor(cyl))

usemodels::use_xgboost(mpg ~ cyl, data = data)
Output -
xgboost_recipe <-
  recipe(formula = mpg ~ cyl, data = data) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors())

xgboost_spec <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
    loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgboost_workflow <-
  workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)

set.seed(28278)

xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

The idea there is that, as a tree-based model, xgboost can handle all the levels (unlike a linear model) and can actually require more splits to fit well if you don't include all the categories. Read more about this here.
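To see the difference concretely, here is a minimal sketch using recipes directly (the exact dummy-column names may vary by recipes version):

library(recipes)
library(dplyr)

cars <- mtcars %>% mutate(cyl = as.factor(cyl))

# one_hot = TRUE: one indicator per factor level (cyl_X4, cyl_X6, cyl_X8)
recipe(mpg ~ cyl, data = cars) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()

# one_hot = FALSE (the default): n - 1 indicators (cyl_X6, cyl_X8)
recipe(mpg ~ cyl, data = cars) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  names()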
You don't see the same for the ranger random forest because it can handle factors natively.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
cars <- as_tibble(mtcars) %>%
  mutate(cyl = as.factor(cyl))

usemodels::use_ranger(mpg ~ cyl, data = cars)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
#> ranger_recipe <-
#>   recipe(formula = mpg ~ cyl, data = cars)
#>
#> ranger_spec <-
#>   rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
#>   set_mode("regression") %>%
#>   set_engine("ranger")
#>
#> ranger_workflow <-
#>   workflow() %>%
#>   add_recipe(ranger_recipe) %>%
#>   add_model(ranger_spec)
#>
#> set.seed(54153)
#> ranger_tune <-
#>   tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
Created on 2021-04-07 by the reprex package (v2.0.0)

Related

Tidymodels prediction methods giving different results

I'm a bit confused about getting metrics from resamples using tidymodels.
I seem to be getting three different metrics from the same set of resamples, depending on whether I use collect_predictions() %>% metrics() or simply collect_metrics().
Here is a simple example...
library(tidyverse)
library(tidymodels)

starwars_df <- starwars %>% select(name:sex) %>% drop_na()

lasso_linear_reg_glmnet_spec <-
  linear_reg(penalty = .1, mixture = 1) %>%
  set_engine('glmnet')

basic_rec <-
  recipe(mass ~ height + sex + skin_color,
         data = starwars_df) %>%
  step_novel(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors())

sw_wf <- workflow() %>%
  add_recipe(basic_rec) %>%
  add_model(lasso_linear_reg_glmnet_spec)

sw_boots <- bootstraps(starwars_df, times = 50)

resampd <- fit_resamples(
  sw_wf,
  sw_boots,
  control = control_resamples(save_pred = TRUE)
)
The following three lines give different results:
resampd %>% collect_predictions(summarize = TRUE) %>% metrics(mass, .pred)
resampd %>% collect_predictions(summarize = FALSE) %>% metrics(mass, .pred)
resampd %>% collect_metrics()
As an additional question, what would be the best/correct way to get confidence intervals for the rmse in the above example? Here is one way...
individ_metrics <- resampd %>% collect_predictions() %>% group_by(id) %>% rmse(mass, .pred)
confintr::ci_mean(individ_metrics$.estimate)
mean(individ_metrics$.estimate)
Thanks!
The reason that none of those are the same is that they are not aggregated in the same way. It turns out that taking a mean of a set of means doesn't give you the same (right) result as taking the mean of the whole underlying set. If you were to do something like resampd %>% collect_predictions(summarize = TRUE) %>% metrics(mass, .pred), that is like taking a mean of a set of means.
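A tiny numeric illustration of that point (hypothetical numbers; the groups have different sizes, like bootstrap assessment sets do):

g1 <- c(1, 2, 3)  # a group of 3 values, mean 2
g2 <- c(10, 20)   # a group of 2 values, mean 15

mean(c(mean(g1), mean(g2)))  # mean of means: (2 + 15) / 2 = 8.5
mean(c(g1, g2))              # mean of the whole set: 36 / 5 = 7.2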
It turns out that these two things are the same:
## these are the same:
resampd %>%
  collect_predictions(summarize = FALSE) %>%
  group_by(id) %>%
  metrics(mass, .pred)
#> # A tibble: 150 × 4
#>    id          .metric .estimator .estimate
#>    <chr>       <chr>   <chr>          <dbl>
#>  1 Bootstrap01 rmse    standard       16.4
#>  2 Bootstrap02 rmse    standard       23.1
#>  3 Bootstrap03 rmse    standard       31.6
#>  4 Bootstrap04 rmse    standard       17.6
#>  5 Bootstrap05 rmse    standard        9.59
#>  6 Bootstrap06 rmse    standard       25.0
#>  7 Bootstrap07 rmse    standard       16.3
#>  8 Bootstrap08 rmse    standard       35.1
#>  9 Bootstrap09 rmse    standard       25.7
#> 10 Bootstrap10 rmse    standard       25.3
#> # … with 140 more rows
resampd %>% collect_metrics(summarize = FALSE)
#> # A tibble: 100 × 5
#>    id          .metric .estimator .estimate .config
#>    <chr>       <chr>   <chr>          <dbl> <chr>
#>  1 Bootstrap01 rmse    standard      16.4   Preprocessor1_Model1
#>  2 Bootstrap01 rsq     standard       0.799 Preprocessor1_Model1
#>  3 Bootstrap02 rmse    standard      23.1   Preprocessor1_Model1
#>  4 Bootstrap02 rsq     standard       0.193 Preprocessor1_Model1
#>  5 Bootstrap03 rmse    standard      31.6   Preprocessor1_Model1
#>  6 Bootstrap03 rsq     standard       0.608 Preprocessor1_Model1
#>  7 Bootstrap04 rmse    standard      17.6   Preprocessor1_Model1
#>  8 Bootstrap04 rsq     standard       0.836 Preprocessor1_Model1
#>  9 Bootstrap05 rmse    standard       9.59  Preprocessor1_Model1
#> 10 Bootstrap05 rsq     standard       0.860 Preprocessor1_Model1
#> # … with 90 more rows
Created on 2022-08-23 with reprex v2.0.2
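For the confidence-interval follow-up, one option (a sketch, not part of the original answer) is to work from the per-resample estimates that collect_metrics(summarize = FALSE) returns:

per_resample_rmse <- resampd %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "rmse")

# A simple percentile interval across the 50 bootstrap RMSE estimates
quantile(per_resample_rmse$.estimate, probs = c(0.025, 0.975))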

Web scraping with rvest not working from HTML page, table showing NAs - McDonald's

I am trying to scrape data from https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html to build a data frame with all the nutrition values and the allergens drop-down menu (Further information, per 100g, per portion, contained allergens); however, rvest cannot detect the information as a table.
The result doesn't show any of the required values.
library(rvest)
url4 <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"

test <- url4 %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="collapseOne"]/div/div/div/div[1]') %>%
  html_table()

test <- as.data.frame(test)
I also tried this
library(rvest)
library(stringr)
library(tidyr)
url <- "https://www.mcdonalds.com/de/de-de/product/grand-cheese-n-beef-classic-5642.html"
webpage <- read_html(url)
sb_table <- html_nodes(webpage, 'table')
sb <- html_table(sb_table)[[1]]
head(sb)
How could that be done? I'm very new to web scraping and don't know if the HTML tags or the link I'm using are correct.
------ This is the data I want to scrape ---------
You can request the information from their JSON API:
library(tidyverse)
library(httr2)

"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
  request() %>%
  req_perform() %>%
  resp_body_json(simplifyVector = TRUE) %>%
  .$item %>%
  .$nutrient_facts %>%
  .$nutrient %>%
  as_tibble() %>%
  select(4:9)
# A tibble: 10 x 6
      id name                        nutrient_~1 uom   uom_d~2 value
   <int> <chr>                       <chr>       <chr> <chr>   <chr>
 1     1 Serving Size                primary_se~ g     grams   302
 2     2 Brennwert                   energy_kJ   kJ    kiloJo~ 2992
 3     3 Brennwert                   energy_kcal kcal  kilo c~ 716
 4     4 Fett                        fat         g     grams   40
 5     5 davon gesättigte Fettsäuren saturated_~ g     grams   16
 6     6 Kohlenhydrate               carbohydra~ g     grams   44
 7     7 davon Zucker                sugar       g     grams   11
 8     8 Ballaststoffe               fiber       g     grams   3.3
 9     9 Eiweiß                      protein     g     grams   40
10    10 Salz                        salt        g     grams   2.4
# ... with abbreviated variable names 1: nutrient_name_id,
#   2: uom_description
Information on the allergens:
"https://www.mcdonalds.com/dnaapp/itemDetails?country=de&language=de&showLiveData=true&item=201799" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
.$item %>%
.$item_allergen %>%
str_split(pattern = ", ") %>%
getElement(1)
[1] "Milch (einschl. Laktose)"
[2] "Eier"
[3] "Glutenhaltiges Getreide: Weizen (wie Dinkel und Khorasan-Weizen)"
[4] "Senf"
[5] "Sesamsamen"

Error when web scraping in R: Error in UseMethod("xml_find_all") :

I wrote some code to web-scrape air quality data in R. It worked perfectly fine and I had no issues. But now, when I recently reran it, I get an error from the html_nodes() function.
Here is my code:
library(rvest)
library(tidyverse)
library(lubridate)
## Download MOE Location Data
# https://stackoverflow.com/questions/25677035/how-to-create-a-range-of-dates-in-r
## Create a tibble of dates
start_date <- "2021/1/1"
end_date <- "2021/12/31"
dates <- seq(as.Date(start_date), as.Date(end_date), "days")
df <- NULL

for (datex in dates) {
  datef = as.Date(datex, origin = "1970-01-01")
  Day = day(datef)
  Month = month(datef)
  Year = year(datef)
  for (hour in 1:24) {
    url.new <-
      paste(
        "http://www.airqualityontario.com/aqhi/locations.php?start_day=",
        Day,
        "&start_month=",
        Month,
        "&start_year=",
        Year,
        "&my_hour=",
        hour,
        "&pol=36&text_only=1&Submit=Update",
        sep = ""
      )
    download.file(url.new, destfile = "scrapedpage.html", quiet = TRUE)
    simple <- read_html("scrapedpage.html")
    test <- simple %>%
      html_nodes("td") %>%
      html_text()
    test <- as_tibble(test)
    df.temp <-
      as.data.frame(matrix(
        unlist(test, use.names = FALSE),
        ncol = 3,
        byrow = TRUE
      )) %>%
      mutate(date = paste(datef)) %>%
      mutate(hour = hour)
    df <- rbind(df, df.temp)
  }
}
df <- as_tibble(df)
colnames(df) <- c("Station", "Address", "SurfaceConc", "SurfaceDate", "Hour")

MOE_data <- df %>%
  filter(Address != "Bay St. Wellesley St. W.") %>%
  select(-Address) %>%
  mutate(Station = trimws(Station)) %>%
  # filter(str_detect(Station, 'Toronto')) %>%
  mutate(Hour = paste(Hour, ":00:00", sep = "")) %>%
  mutate(Hour = hms::as_hms(Hour)) %>%
  mutate(SurfaceDate = paste(SurfaceDate, Hour)) %>%
  mutate(SurfaceDate = as_datetime(SurfaceDate)) %>%
  select(-Hour)

MOE_data <- as_tibble(MOE_data)

rm(list = setdiff(ls(), "MOE_data"))
# save.image(file='Jan2019_Dec2021.RData')
# save.image(file='Jan2019_Dec2021.RData')
This is the error I get:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "xml_document"
What I don't understand is why it happens only for some values, some of the time. For example, I get an error when hour = 16, but when I rerun it, it may work; it's just not consistent.
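One hypothetical workaround (not from an answer in this thread) for intermittent failures like this is to retry the download-and-parse step a few times before giving up:

# Hypothetical helper: retry fetching and parsing a page, since the
# failures appear to be intermittent rather than systematic.
read_with_retry <- function(url, tries = 3) {
  for (i in seq_len(tries)) {
    page <- tryCatch({
      download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
      read_html("scrapedpage.html")
    }, error = function(e) NULL)
    if (!is.null(page)) return(page)
    Sys.sleep(1)  # brief pause before retrying
  }
  stop("Failed to fetch and parse: ", url)
}

# Inside the loop, the download.file()/read_html() pair would become:
# simple <- read_with_retry(url.new)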

How can I get multiple outputs from single operation?

This code was written to analyze within- and between-species interactions.
In it, I tried (unsuccessfully) to get separate outputs from each analysis.
lapply(data.list, function(x) {
  grp <- factor(x$species)
  window <- ripras(x$utmX, x$utmY)
  pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
  split.grp <- split(pp.grp)
  L <- alltypes(pp.grp, "L")
  LE <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
  return("L", "LE")
})
plot(L[1])
So my question is: how can I get multiple outputs from a single operation?
Thank you so much in advance!
The most common way to handle multiple outputs from a function in R is to put
the results in a list and return that. Hopefully this can inspire you:
f <- function(x) {
  L <- x
  LE <- matrix(x, 2, 2)
  rslt <- list(L = L, LE = LE)
  return(rslt)
}
y <- f(7)
Now y is a list with two elements: L and LE
y
#> $L
#> [1] 7
#>
#> $LE
#>      [,1] [,2]
#> [1,]    7    7
#> [2,]    7    7
Use $ to get a named element (in this case L – same as y[[1]]):
y$L
#> [1] 7
Created on 2019-03-16 by the reprex package (v0.2.1)
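Applied to the lapply() call from the question, that pattern would look something like this (a sketch using the question's own objects; assumes spatstat is loaded):

library(spatstat)

# Return both results from each iteration as a named list
results <- lapply(data.list, function(x) {
  grp <- factor(x$species)
  window <- ripras(x$utmX, x$utmY)
  pp.grp <- ppp(x$utmX, x$utmY, window = window, marks = grp)
  L <- alltypes(pp.grp, "L")
  LE <- alltypes(pp.grp, Lcross, nsim = 100, envelope = TRUE)
  list(L = L, LE = LE)
})

# Then access the pieces for, say, the first dataset:
plot(results[[1]]$L)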

Extract nested JSON from R dataframe without knowing keys

I am trying to extract JSON from a TSV column. The difficulty is the JSON is shallowly nested, and the key values may not be present in every row.
I have a minimal example to illustrate my point.
df <- tibble(index = c(1, 2),
             data = c('{"json_char":"alpha", "json_list1":["x","y"]}',
                      '{"json_char":"beta", "json_list1":["x","y","z"], "json_list2":["a","b","c"]}'))
The desired result:
df <- tibble::tibble(index = list(1, 2),
                     json_char = list("alpha", "beta"),
                     json_list1 = list(list("x", "y"), list("x", "y", "z")),
                     json_list2 = list(NA, list("a", "b", "c")))
After a fair amount of experimentation, I have this function:
extract_json_column <- function(df) {
  df %>%
    magrittr::use_series(data) %>%
    purrr::map(jsonlite::fromJSON) %>%
    purrr::map(purrr::simplify) %>%
    tibble::enframe() %>%
    tidyr::spread("name", "value") %>%
    purrr::flatten_dfr()
}
This gives me the following error: Error in bind_rows_(x, .id) : Argument 2 must be length 3, not 7.
The first row sets the number of columns for the rest of the data frame. Is there any way to avoid that behavior?
I modified your function to the following. I hope this helps.
library(tidyverse)
library(rjson)
extract_json_column <- function(df) {
  df %>%
    rowwise() %>%
    mutate(data = map(data, fromJSON)) %>%
    split(.$index) %>%
    map(~ .$data[[1]]) %>%
    map(~ map_if(., function(x) length(x) != 1, list)) %>%
    map(as_data_frame) %>%
    bind_rows(.id = "index")
}

extract_json_column(df)
# A tibble: 2 x 4
  index json_char json_list1 json_list2
  <chr> <chr>     <list>     <list>
1 1     alpha     <chr [2]>  <NULL>
2 2     beta      <chr [3]>  <chr [3]>
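On more recent tidyverse versions, a shorter route (a sketch, not part of the original answer) is to parse each row and let tidyr::unnest_wider() spread the keys into columns; keys missing from a row simply come out as NA/NULL, so the first row no longer dictates the shape:

library(tidyverse)
library(jsonlite)

df %>%
  mutate(data = map(data, ~ fromJSON(.x, simplifyVector = FALSE))) %>%
  unnest_wider(data)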