Specifying One-hot Encoding on Combinations in Tidymodels - tidymodels

I have a data set with two columns Value and Pattern. Value takes one of two values 0 or 1. The Pattern column is a three-character string. The three-character string or Pattern is derived from 1 of 7 characters (X, 4, 3, 2, 1, 0, T) the characters can repeat so there are (7 x 7 x 7) = 343 possible Patterns.
I would like to use tidymodels to do some classification modeling and was planning on using one-hot encoding on the Pattern column, but have the following questions:
Is there a limit to the number of columns tidymodels can handle?
How do I handle a situation where not all of the patterns are represented in the dataset or that the test dataset has patterns no found in the training dataset.
Is one-hot encoding the best way to go here or does tidymodels offer an alternative better suited to my use case.

{tidymodels} doesn't inherently have a limit on the number of columns it can handle. The bottleneck you are more likely to see would be in the parsnip engines no being able to handle many columns. {tidymodels} would also complain if you are having so many levels in the column you are applying one-hot encoding to, that the resulting data.frame doesn't fit in your RAM.
As long as the Pattern column in the data.frame you are supplying to recipe() is a factor with all the possible levels then step_dummy() will make sure all the levels will be created in both the training data set and testing data set. Since you know all the possible values Pattern can take you should manually specify the levels in the training data set you are supplying to recipe().
library(tidymodels)
patterns <- c("X", "4", "3", "2", "1", "0", "T")
patern_constructor <- function(x) {
paste(sample(patterns, size = 3, replace = FALSE, prob = seq_along(patterns)),
collapse = "")
}
all_patterns <- expand.grid(patterns, patterns, patterns) %>%
tidyr::unite(pattern, sep = "") %>%
pull(pattern)
data_patterns <- tibble(value = factor(sample(c(0,1), 10000, TRUE), c(0, 1)),
pattern = factor(map_chr(seq_len(10000), patern_constructor),
levels = all_patterns))
rec_patterns <- recipe(value ~ pattern, data = data_patterns) %>%
step_dummy(pattern, one_hot = TRUE) %>%
prep()
bake(rec_patterns, new_data = data_patterns)
#> # A tibble: 10,000 x 344
#> value pattern_XXX pattern_X4XX pattern_X3XX pattern_X2XX pattern_X1XX
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 0 0
#> 2 0 0 0 0 0 0
#> 3 1 0 0 0 0 0
#> 4 1 0 0 0 0 0
#> 5 1 0 0 0 0 0
#> 6 0 0 0 0 0 0
#> 7 0 0 0 0 0 0
#> 8 1 0 0 0 0 0
#> 9 1 0 0 0 0 0
#> 10 1 0 0 0 0 0
#> # … with 9,990 more rows, and 338 more variables: pattern_X0XX <dbl>,
#> # pattern_TXX <dbl>, pattern_X4X <dbl>, pattern_X44X <dbl>,
#> # pattern_X34X <dbl>, pattern_X24X <dbl>, pattern_X14X <dbl>, …
bake(rec_patterns, new_data = tibble(value = factor(1),
pattern = factor("000")))
#> # A tibble: 1 x 344
#> value pattern_XXX pattern_X4XX pattern_X3XX pattern_X2XX pattern_X1XX
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0 0 0
#> # … with 338 more variables: pattern_X0XX <dbl>, pattern_TXX <dbl>,
#> # pattern_X4X <dbl>, pattern_X44X <dbl>, pattern_X34X <dbl>,
#> # pattern_X24X <dbl>, pattern_X14X <dbl>, pattern_X04X <dbl>, …
Created on 2021-03-04 by the reprex package (v0.3.0)
This will depend on the distribution of patterns you see in your data. You may have to remove patterns that didn't show up in your training data with step_zv() if the model you are using can't handle zero variance predictors. If you are having a highly skewed distribution you could use step_other() to collapse infrequent levels into one level. The embed package also includes some steps to combine multiple levels into a lower space.

Related

Webscraping does not identify the full HTML link

My problem is very similar to this one. I want to identify all the HTML links in this website so I can then open the link and download the tables.
The problem is that when I create the extract_links functions as pointed out in that answer, I get a list of all the HTMLs, but this are not complete.
To make it more clear:
If you press "Junio" in year "2022" the real HTML is the following:
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=7658
but the HTML that I am recovering from the source of the website lacks the last bit (&ia=7658):
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022
Which does not direct me to the table I want.
The problem is that these numbers do not seem to follow any logic and change between year/month links. Any help on how to retrieve the full HTML links will be greatly appreciated. If you also happen to know how can I retrieve the year/month of the file to add as an extra column that would also be great.
Thanks to the help of #margusl I was able to realize that rvest redirects automatically and that solves my problem.
I am trying to use the following code to loop over different links to obtain the tables, store them in a data frame and then download them:
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
combined_df <- data.frame()
for (yr in yr.list){
for (mes in mes.list) {
root <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21"
# Full link
url <- paste(root,"&m=",mes,"&a=",yr,sep="")
# Parse HTML File
file<-read_html(url, encoding = "latin1")
file<- rvest::html_table(file)
str(file)
# This is the relevant table
table <- as.data.frame(file[[1]])
# in your loop, add the files that you read to the combined_df
combined_df <- rbind(combined_df, table)
}
}
It does not work because the read_html code with the encoding works only for some years, but not for all. for example, when running:
url <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015"
file<-read_html(url, encoding = "latin1")
It does not recover the tables with names/surnames that recovers in the previous months but something else. Why can't this work on all the sub-pages? Is this a encoding problem again?
If you open that last page you had issues with, you'll see that it serves a sort of a submenu with 2 more links - http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015 . Meaning that it's not enough to just generate links for each month & year and extract first table of each page, all those pages should be checked for content and exceptions should be handled.
Though I took somewhat opportunistic approach and it happened to work with URL range defined in question + those few odd samples, but there could be other surprises down the road. Switched to httr for making requests as it allows to collect and monitor response headers, separating content retrieval and parsing also seems to work around encoding issues, at least in this case. First collecting and then parsing also simplifies debugging, you can check if certain responses / headers were different from the rest (i.e. response length being 10x smaller than average or final, redirected, url differs from the rest). And it's easy to change content handling
/ parsing for a small subset of responses, if needed. If you are not sure what rvest has retrieved, you can always save the response to a html file and check it with browser or editor, something like
html <- read_html(url_or_text_content); write(as.character(html), "dump.html")
library(rvest)
library(httr)
library(purrr)
library(dplyr)
library(tidyr)
library(stringr)
yr.list <- seq(2019,2020)
mes.list <- seq(1,12)
# combine mes.list & yr.list
url.params <- expand.grid(mes = mes.list, yr = yr.list)
# few extra samples:
url.params <- rbind(url.params,
list(mes = 6, yr = 2022), # here rvest strugglest with correct encoding
list(mes = 3, yr = 2015) # returns page with sub-categories
)
url.list <- str_glue("http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m={url.params$mes}&a={url.params$yr}")
url.list
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=1&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=2&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2019
#> ...
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=11&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=12&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=6&a=2022
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# url list for input, output is a tibble with all responses (incl. "url", "date",
# "status_code", header details and response body)
fetch_urls <- function(url.list){
# collect all responses to a list with httr, enable verbose, parse responses later
# add progress bar - requests take a while
resp.list = vector(mode = "list", length = length(url.list))
pb <- txtProgressBar(max = length(url.list), style = 3)
for (i in seq_along(url.list)){
resp.list[[i]] <- GET(url.list[i])
setTxtProgressBar(pb,i)
}
close(pb)
# turn responses into tibble to check urls, response sizes and status codes
resp.tibble <- bind_cols(
map_df(resp.list, ~ .[c("url", "date", "status_code")], .id = "req_id"),
map_df(resp.list, headers) %>% rename_with(~ paste0("header_",.x)),
# map_df(resp_follow.list, "times"),
map_chr(resp.list, content, as = "text") %>% tibble(html_doc = .)
)
return(resp.tibble)
}
resp.tibble <- fetch_urls(url.list)
# check resulting table without html_doc column
# View(resp.tibble[-ncol(resp.tibble)])
resp.tibble %>%
select(req_id:status_code,`header_content-length`) %>%
arrange(`header_content-length`)
#> # A tibble: 26 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 14 http://transparencia.uantof.cl/in… 2022-10-31 17:29:12 200 21371
#> 2 26 http://transparencia.uantof.cl/in… 2022-10-31 17:31:45 200 2230
#> 3 24 http://transparencia.uantof.cl/in… 2022-10-31 17:31:21 200 24035
#> 4 21 http://transparencia.uantof.cl/in… 2022-10-31 17:30:42 200 24173
#> 5 20 http://transparencia.uantof.cl/in… 2022-10-31 17:30:29 200 24183
#> 6 23 http://transparencia.uantof.cl/in… 2022-10-31 17:31:08 200 24184
#> 7 22 http://transparencia.uantof.cl/in… 2022-10-31 17:30:55 200 24207
#> 8 18 http://transparencia.uantof.cl/in… 2022-10-31 17:30:04 200 24405
#> 9 16 http://transparencia.uantof.cl/in… 2022-10-31 17:29:38 200 24715
#> 10 7 http://transparencia.uantof.cl/in… 2022-10-31 17:27:32 200 24716
#> # … with 16 more rows, and abbreviated variable names ¹​status_code,
#> # ²​`header_content-length`
# 26. is kind of suspicious:
# 25 http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=76…
# 26 http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# looks like there has been no redirection and its header_content-length is about 10x smaller than for other responses
# checking it more closely reveals that the page includes a "submenu" instead of table(s):
# <p class="subMenu_interiores">
# <b>2015 - Marzo</b>
# ABRIL 2015
# Marzo 2015
# </p>
# lets' collect urls that were not redirected from our tibble and harvest links from stored html:
suburl.list <- resp.tibble %>%
# urls that do NOT include "plantillas_generar_plantilla"
filter(!str_detect(url, "plantillas_generar_plantilla")) %>%
pull(html_doc) %>%
# rvest does not like lists, thus let's map()
map( ~ read_html(.x) %>% html_elements("#columna1_interiores a") %>% html_attr("href")) %>%
unlist() %>%
paste0("http://transparencia.uantof.cl/",.)
suburl.list
#> [1] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=772"
#> [2] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=648"
# fetch content from those submenu urls
subresp.tibble <- fetch_urls(suburl.list)
# sanity check:
subresp.tibble %>%
select(req_id:status_code,`header_content-length`)
#> # A tibble: 2 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 1 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:52 200 25385
#> 2 2 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:59 200 25332
#> # … with abbreviated variable names ¹​status_code, ²​`header_content-length`
# better, sizes align with previous results.
# collect all relevant responses
table_1 <- resp.tibble %>%
filter(str_detect(url, "plantillas_generar_plantilla")) %>%
bind_rows(subresp.tibble) %>%
# extract html (as strings)
pull(html_doc) %>%
# rvest does not like lists, thus let's map(), pluck(1) extracts first table (from each page)
map(~ read_html(.x) %>% html_table() %>% pluck(1)) %>%
# first attempt to bind rows fails, aparently column types differ
# change all non-character columns to character
map (~ mutate(.x, across(!where(is.character),as.character))) %>%
# bind all tables by rows
bind_rows()
# columns vary across tables so number of NA fields in final result is rather high
Final result for 26 pages, a 10,987 × 30 tibble:
table_1
#> # A tibble: 10,987 × 30
#> Nº PLANTA PATERNO MATERNO NOMBRES G TITULO CARGO REGION ASIGN…¹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 DIRECTIVO ABARZA CASTRO JESSIC… 6 ADMIN… JEFE… SEGUN… (1)(8)…
#> 2 2 PROFESIONAL ABASOLO QUINTE… NURY D… 12 EDUCA… PROF… SEGUN… (4)(8)…
#> 3 3 ACADEMICO ACOSTA PENA ROXANA… 11 EDUCA… PROF… SEGUN… (2)(8)…
#> 4 4 AUXILIARES ACOSTA PIZARRO ROBERT… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 5 5 DIRECTIVO AGENO SIERRA ROSELL… 4 MATRO… DIRE… SEGUN… (1)(8)…
#> 6 6 AUXILIARES AGUIRRE LAZO RENE G… 16 LICEN… AUXI… SEGUN… (7)(8)…
#> 7 7 TECNICOS ALAMOS MARIN SERGIO… 13 TECNI… TECN… SEGUN… (5)(8)…
#> 8 8 AUXILIARES ALAYANA CORTES CHRIST… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 9 9 ACADEMICO ALCOTA AGUIRRE PATRIC… 9 ING. … PROF… SEGUN… (2)(8)…
#> 10 10 ADMINISTRATI… ALFARO BARRAZA MARIA … 23 LICEN… ADMI… SEGUN… (6)(8)…
#> # … with 10,977 more rows, 20 more variables: `UNID MONETARIA` <chr>,
#> # `REMUNERACION MENSUAL BRUTA` <chr>, HORAS <chr>, `CANT. HORAS` <chr>,
#> # `MONTO HORAS EXTRAS` <chr>, `FECHA DE INGRESO` <chr>, `F. HASTA` <chr>,
#> # OBSERVACIONES <chr>, GRADO <chr>, ESTAMENTO <chr>,
#> # `Apellido Paterno` <chr>, `Apellido Materno` <chr>, Nombres <chr>,
#> # `Grado ERUA` <chr>, `CALIFICACION PROFESIONAL O FORMACION` <chr>,
#> # `CARGO O FUNCION` <chr>, `R BRUTA` <chr>, `Horas Extras` <chr>, …
Created on 2022-10-31 with reprex v2.0.2

tidy() function cant process last_fit() obejcts

Functions like last_fit() from the tune package produces last_fit objects which are large nested lists containing the fit results. I tried to transform them into data.frames using the tidy() function from the broom package but this resulted in the following error:
MRE :
library(tidymodels)
library(tidyverse)
data <- mtcars
model_default<-
parsnip::boost_tree(
mode = "regression"
) %>%
set_engine('xgboost',objective = 'reg:squarederror')
wf <- workflow() %>%
add_model(model_default) %>%
add_recipe(recipe(mpg~.,data))
lf <- last_fit(wf,split)
tidy_lf <- tidy(lf)
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
is.atomic(x) is not TRUE
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
5: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
6: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
7: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
Question : How can I use tidy() with an last_fit() output?
The object that last_fit() creates is a tibble (containing metrics, predictions, etc), not a model that can be tidied. You can use extract_workflow() to extract out the fitted workflow from the object created by last_fit(), and this object can be tidied:
library(tidymodels)
car_split <- initial_split(mtcars)
wf <- workflow() %>%
add_model(linear_reg()) %>%
add_recipe(recipe(mpg ~ ., mtcars))
lf <- last_fit(wf, car_split)
lf
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [24/8]> train/test split <tibble> <tibble> <tibble [8 × 4]> <workflow>
lf %>%
extract_workflow() %>%
tidy()
#> # A tibble: 11 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -33.6 36.0 -0.935 0.367
#> 2 cyl -0.0296 1.34 -0.0221 0.983
#> 3 disp 0.0252 0.0269 0.934 0.367
#> 4 hp -0.00539 0.0319 -0.169 0.868
#> 5 drat -0.167 2.54 -0.0659 0.948
#> 6 wt -5.69 2.79 -2.04 0.0623
#> 7 qsec 3.32 1.76 1.89 0.0820
#> 8 vs -4.40 3.80 -1.16 0.268
#> 9 am 2.54 2.67 0.950 0.360
#> 10 gear 2.69 2.28 1.18 0.259
#> 11 carb -0.0486 1.11 -0.0439 0.966
Created on 2022-03-23 by the reprex package (v2.0.1)

Turning data frame observations into list of lists

I have a data frame in R with three columns.
lhs
rhs
conviction
example <- data.frame(lhs=c('phones', 'phones', 'phones',
'shoes', 'shoes', 'shoes'),
rhs=c('chargers', 'headphones', 'shoes',
'shirts', 'pants', 'socks'),
conviction=c(1.376, 1.259, 1.087,
1.295, 1.083, 0.978))
Here's a look at the output.
What I want to be able to do is turn this into a data frame with one column per item in lhs and a list of lists as the second column with the format [[rhs, conviction],[rhs,conviction]]
Something like this:
The end goal of all of this is to have a nested JSON file.
Final JSON should resemble this:
Thanks for any help.
You can use tidyverse to nest part of the data frame. This will still leave you with a nested tibble column. To convert this column to a list, you can use map and lapply like so
library(tidyverse)
ans <- example %>%
nest(-lhs) %>%
mutate(data = map(data, ~lapply(1:nrow(.x), function(i) .x[i,]))) %>%
rename(rhs = data)
Here's what the rhs column looks like
ans$rhs
# [[1]]
# [[1]][[1]]
# # A tibble: 1 x 2
# rhs conviction
# <fctr> <dbl>
# 1 chargers 1.376
# [[1]][[2]]
# # A tibble: 1 x 2
# rhs conviction
# <fctr> <dbl>
# 1 headphones 1.259
# [[1]][[3]]
# # A tibble: 1 x 2
# rhs conviction
# <fctr> <dbl>
# 1 shoes 1.087
# [[2]]
# [[2]][[1]]
# # A tibble: 1 x 2
# rhs conviction
# <fctr> <dbl>
# 1 shirts 1.295
# [[2]][[2]]
# # A tibble: 1 x 2
# rhs conviction
# <fctr> <dbl>
# 1 pants 1.083
# [[2]][[3]]
# # A tibble: 1 x 2
# rhs conviction
# <fctr> <dbl>
# 1 socks 0.978
EDIT to return specific format of output
I realized you still get a list of tibbles with the above answer, to convert to a list of vectors, use the following (unlist added)
mutate(data = map(data, ~lapply(1:nrow(.x), function(i) unlist(.x[i,]))))
To get the requisite JSON structure, you really need a list, as a data.frame can't give you the nested structure you need. Using a bit of dplyr, you can summarise each set of grouped data into a data.frame, using the rhs as names for each value of conviction. Setting the names of the resulting list to the values of rhs and converting to JSON, you get
library(dplyr)
example <- data.frame(lhs=c('phones', 'phones', 'phones', 'shoes', 'shoes', 'shoes'),
rhs=c('chargers', 'headphones', 'shoes', 'shirts', 'pants', 'socks'),
conviction=c(1.376, 1.259, 1.087, 1.295, 1.083, 0.978))
example %>%
group_by(lhs) %>%
summarise(rest = list(as.data.frame(t(setNames(conviction, rhs))))) %>%
{ setNames(.$rest, .$lhs) } %>%
jsonlite::toJSON(pretty = TRUE)
#> {
#> "phones": [
#> {
#> "chargers": 1.376,
#> "headphones": 1.259,
#> "shoes": 1.087
#> }
#> ],
#> "shoes": [
#> {
#> "shirts": 1.295,
#> "pants": 1.083,
#> "socks": 0.978
#> }
#> ]
#> }

How to unpack nested JSON type column in a dataframe with R(plus RegEx issue)

I'm very new to R,and I'm currently stuck on this problem:
so I imported a JSON file and already***convert it to a dataframe***, now I need to return rows under condition:
As you can see in the picture, I have a column recording hours(payload.hours)
My GOAL is to find out the hours that meet: 1. sunday 2. time ealier than 10AM.
I tried several ways but somehow it even doesn't come close at all... I havent dealt with such nested form before... so I have to seek your idea&help...
e.g. one element in payload.hours column
payload.hours
...
[530] "{\"monday\":[[\"10:30\",\"16:00\"]],\"tuesday\":[[\"10:30\",\"16:00\"]],\"wednesday\":[[\"10:30\",\"16:00\"]],\"thursday\":[[\"10:30\",\"16:00\"]],\"friday\":[[\"10:30\",\"16:00\"]],\"saturday\":[[\"10:30\",\"16:00\"]],\"sunday\":[[\"10:30\",\"16:00\"]]}"
this is what I used for unpacking the nested lists in "hours" column...but it doesn't work...
library(ndjson)
json<- ndjson::stream_in("#localpath")
#successfully converted json to a dataframe...but each element in payload.hours column remains nested like above.
lapply(json$payload.hours, jsonlite::fromJSON)
#continue unwarp nested jason column BUT RESULT Error in if (is.character(txt) && length(txt) == 1 && nchar(txt, type = "bytes") < :missing value where TRUE/FALSE needed
Another approach I tried(FOR A LONG TIME) is RegEx
hrs<-json1$payload.hours #select column hours into hrs
tme<-"sunday{1}.{8}[0-9]{1}\"" # ???(not sure about this...seruously)...? match string with sunday and after 8characters..aka find preceding digit{1} when meet ":"
iftme<-grepl(tme,hrs) #set logical factor T/F if matches
checkhrs<-hrs[iftme] #check if open hours are correct
checkhrs
And this seems to work...but I am not sure why...(YES.IDK WHY)...so if anyone could explain to me that would be great!
This is the original json file:
https://drive.google.com/open?id=0B-jU6pp4pjS4Smg2RGpHSTlvN2c
This is RegEx output...seems right...but I am not sure about my expression..LOL
Unpacking JSON can be a lot of work, particularly if it is deeply nested. Most JSON reading packages (jsonlite, RJSONIO, etc.) can turn data into something close to a data.frame, but fixing the structure requires an understanding that the reader functions don't have. Since JSON most nearly corresponds to R's lists, cleaning up data coming from JSON typically involves a lot of lapply and its variants. Here I'll use purrr, which has many useful variants and helper functions and works neatly with dplyr.
library(tidyverse)
# Read data
json <- jsonlite::stream_in(file('~/Downloads/jsondata.json'))
# Initial cleanup to proper data.frame
json <- json$payload %>% map_df(simplify_all) %>% dmap(simplify) %>%
mutate(uuid = json$uuid, # re-add uuid subset out at beginning
# Convert hours to a list column of data.frames
hours = hours %>% map_if(negate(is.na), jsonlite::fromJSON) %>%
map(~map_df(.x, as_data_frame, .id = 'day')),
# Add Boolean variable for whether Sunday opening hours are before 10a. Subset,
open_sun_before_10 = hours %>% map(~.x %>% filter(day == 'sunday') %>% .[[2]]) %>%
map(as.POSIXct, format = '%H:%M') %>% # convert to datetime,
map(~.x < as.POSIXct('10:00', format = '%H:%M')) %>% # compare to 10a
map_lgl(~ifelse(length(.x) == 0, NA, .x))) # and cleanup.
Whereas stream_in returned a data.frame with two columns (one very deeply nested), the columns are now less nested. There are still JSON structures in some of the untouched columns, though, which will have to be addressed if you want to use the data.
json
#> # A tibble: 538 × 42
#> existence_full geo_virtual latitude
#> <dbl> <chr> <chr>
#> 1 1.000000 ["56.9459720|-2.1971226|20|within_50m|4"] 56.945972
#> 2 1.000000 ["56.237480|-5.073578|20|within_50m|4"] 56.237480
#> 3 1.000000 ["51.483872|-0.606820|100|rooftop|2"] 51.483872
#> 4 1.000000 ["57.343233|-2.191955|100|rooftop|4"] 57.343233
#> 5 1.000000 ["53.225815|-4.094775|20|within_50m|4"] 53.225815
#> 6 1.000000 ["58.9965740|-3.1882195|20|within_50m|4"] 58.996574
#> 7 1.000000 ["57.661419|-2.520144|100|rooftop|4"] 57.661419
#> 8 1.000000 ["51.642727|-3.934845|20|within_50m|4"] 51.642727
#> 9 0.908251 <NA> <NA>
#> 10 1.000000 ["56.510558|-5.401638|100|rooftop|2"] 56.510558
#> # ... with 528 more rows, and 39 more variables: locality <chr>,
#> # `_records_touched` <chr>, address <chr>, email <chr>,
#> # existence_ml <dbl>, domain_aggregate <chr>, name <chr>,
#> # search_tags <list>, admin_region <chr>, existence <dbl>,
#> # category_labels <list>, post_town <chr>, region <chr>,
#> # review_count <chr>, geocode_level <chr>, tel <chr>, placerank <int>,
#> # longitude <chr>, placerank_ml <dbl>, fax <chr>,
#> # category_ids_text_search <chr>, website <chr>, status <chr>,
#> # geocode_confidence <chr>, postcode <chr>, category_ids <list>,
#> # country <chr>, `_geocode_quality` <chr>, hours_display <chr>,
#> # hours <list>, neighborhood <list>, languages <chr>,
#> # address_extended <chr>, status_closed <chr>, po_box <chr>,
#> # name_variants <list>, yext_id <chr>, uuid <chr>,
#> # open_sun_before_10 <lgl>
And the columns created:
json %>% select(hours, open_sun_before_10)
#> # A tibble: 538 × 2
#> hours open_sun_before_10
#> <list> <lgl>
#> 1 <tibble [1 × 2]> NA
#> 2 <tibble [1 × 2]> NA
#> 3 <tibble [7 × 3]> FALSE
#> 4 <tibble [1 × 2]> NA
#> 5 <tibble [7 × 3]> FALSE
#> 6 <tibble [1 × 2]> NA
#> 7 <tibble [1 × 2]> NA
#> 8 <tibble [6 × 3]> NA
#> 9 <tibble [1 × 2]> NA
#> 10 <tibble [7 × 3]> TRUE
#> # ... with 528 more rows

convert HTML Character Entity Encoding in R

Is there a way in R to convert HTML Character Entity Encodings?
I would like to convert HTML character entities like
& to & or
> to >
For Perl exists the package HTML::Entities which could do that, but I couldn't find something similar in R.
I also tried iconv() but couldn't get satisfying results. Maybe there is also a way using the XML package but I haven't figured it out yet.
Unescape xml/html values using xml2 package:
unescape_xml <- function(str){
xml2::xml_text(xml2::read_xml(paste0("<x>", str, "</x>")))
}
unescape_html <- function(str){
xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
Examples:
unescape_xml("3 < x & x > 9")
# [1] "3 < x & x > 9"
unescape_html("€ 2.99")
# [1] "€ 2.99"
Update: this answer is outdated. Please check the answer below based on the new xml2 pkg.
Try something along the lines of:
# load XML package
library(XML)
# Convenience function to convert html codes
html2txt <- function(str) {
xpathApply(htmlParse(str, asText=TRUE),
"//body//text()",
xmlValue)[[1]]
}
# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn&apos;t"
# converted string
html2txt(x)
[1] "isn't"
UPDATE: Edited the html2txt() function so it applies to more situations
While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and therefore slow if applied to a large number of characters. In addition, it only works with a character vector of length one and one has to use sapply for a longer character vector.
To demonstrate this, I first create a large character vector:
set.seed(123)
strings <- c("abcd", "& &apos; >", "&", "€ <")
many_strings <- sample(strings, 10000, replace = TRUE)
And apply the function:
unescape_html <- function(str) {
xml2::xml_text(xml2::read_html(paste0("<x>", str, "</x>")))
}
system.time(res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE))
## user system elapsed
## 2.327 0.000 2.326
head(res)
## [1] "& ' >" "€ <" "& ' >" "€ <" "€ <" "abcd"
It is much faster if all the strings in the character vector are combined into a single, large string, such that read_html() and xml_text() need only be used once. The strings can then easily be separated again using strsplit():
unescape_html2 <- function(str){
html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
parsed <- xml2::xml_text(xml2::read_html(html))
strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
system.time(res2 <- unescape_html2(many_strings))
## user system elapsed
## 0.011 0.000 0.010
identical(res, res2)
## [1] TRUE
Of course, you need to be careful that the string that you use to combine the various strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, you will introduce an error, when the large string is split again in the end.
library(xml2)
xml_text(read_html(charToRaw("& >")))
gives:
[1] "& >"
Based on Stibu's answer, I went to benchmark the functions.
# first create large vector as in Stibu's answer
set.seed(123)
strings <- c("abcd", "& &apos; >", "&", "€ <")
many_strings <- sample(strings, 10000, replace = TRUE)
# then benchmark the functions by Stibu and Jeroen
bench::mark(
textutils::HTMLdecode(many_strings),
map_chr(many_strings, unescape_html),
unescape_html2(many_strings)
)
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 textutils::HTMLdecode(many_strings) 855.02ms 855.02ms 1.17 329.18MB 10.5 1 9 855.02ms <chr … <Rpro… <bch…
2 map_chr(many_strings, unescape_html) 1.09s 1.09s 0.919 6.79MB 5.51 1 6 1.09s <chr … <Rpro… <bch…
3 unescape_html2(many_strings) 4.85ms 5.13ms 195. 581.48KB 0 98 0 503.63ms <chr … <Rpro… <bch…
# … with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Here I vectorize Jeroen's unescape_html function by purrr::map_chr operator. So far, this just confirms Stibu's claim that the unescape_html2 is indeed many times faster! It is even way faster than textutils::HTMLdecode function.
But I also found that the xml version could be even faster.
unescape_xml2 <- function(str){
html <- paste0("<x>", paste0(str, collapse = "#_|"), "</x>")
parsed <- xml2::xml_text(xml2::read_xml(html))
strsplit(parsed, "#_|", fixed = TRUE)[[1]]
}
However, this function fails when dealing with the many_strings object (maybe because read_xml can not read Euro symbol. So I have to try a different way for benchmarking.
library(tidyverse)
library(rvest)
entity_html <- read_html("https://dev.w3.org/html5/html-author/charref")
entity_mapping <- entity_html %>%
html_node(css = "table") %>%
html_table() %>%
rename(text = X1,
named = X2,
hex = X3,
dec = X4,
desc = X5) %>%
as_tibble
s2 <- entity_mapping %>% pull(dec) # dec can be replaced by hex or named
bench::mark(
textutils::HTMLdecode(s2),
map_chr(s2, unescape_xml),
map_chr(s2, unescape_html),
unescape_xml2(s2),
unescape_html2(s2)
)
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 textutils::HTMLdecode(s2) 191.7ms 194.9ms 5.16 64.1MB 10.3 3 6 582ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s2, unescape_xml) 73.8ms 80.9ms 11.9 1006.9KB 5.12 7 3 586ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s2, unescape_html) 162.4ms 163.7ms 5.83 1006.9KB 5.83 3 3 514ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s2) 459.2µs 473µs 2034. 37.9KB 2.00 1017 1 500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s2) 590µs 607.5µs 1591. 37.9KB 2.00 796 1 500ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
We can also try on hex ones.
> bench::mark(
+ # gsubreplace_mapping(s2, entity_mapping),
+ # gsubreplace_local(s2),
+ textutils::HTMLdecode(s3),
+ map_chr(s3, unescape_xml),
+ map_chr(s3, unescape_html),
+ unescape_xml2(s3),
+ unescape_html2(s3)
+ )
# A tibble: 5 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 textutils::HTMLdecode(s3) 204.2ms 212.3ms 4.72 64.1MB 7.87 3 5 636ms <chr … <Rprofm… <bch:… <tibb…
2 map_chr(s3, unescape_xml) 76.4ms 80.2ms 11.8 1006.9KB 5.04 7 3 595ms <chr … <Rprofm… <bch:… <tibb…
3 map_chr(s3, unescape_html) 164.6ms 165.3ms 5.80 1006.9KB 5.80 3 3 518ms <chr … <Rprofm… <bch:… <tibb…
4 unescape_xml2(s3) 487.4µs 500.5µs 1929. 74.5KB 2.00 965 1 500ms <chr … <Rprofm… <bch:… <tibb…
5 unescape_html2(s3) 611.1µs 627.7µs 1574. 40.4KB 0 788 0 501ms <chr … <Rprofm… <bch:… <tibb…
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Here the xml version is even more faster than the html version.