How do I convert a tibble to an HTML table in R tidyverse?

I'm looking for a way to convert the result of a pipeline manipulation into a table so that it can be rendered as an HTML table in R Markdown.
Sample data:
Category <- sample(1:6, 394400, replace=TRUE)
Category <- factor(Category,
levels = c(1,2,3,4,5,6),
labels = c("First",
"Second",
"Third",
"Fourth",
"Fifth",
"Sixth"))
data <- data.frame(Category)
Then I build a frequency table using the pipeline:
Table <- data %>%
group_by(Category) %>%
summarise(N= n(), Percent = n()/NROW(data)*100) %>%
mutate(C.Percent = cumsum(Percent))
Which gives me this nice little summary table here:
# A tibble: 6 × 4
Category N Percent C.Percent
<fctr> <int> <dbl> <dbl>
1 First 65853 16.69701 16.69701
2 Second 66208 16.78702 33.48403
3 Third 65730 16.66582 50.14985
4 Fourth 65480 16.60243 66.75228
5 Fifth 65674 16.65162 83.40390
6 Sixth 65455 16.59610 100.00000
However, if I try to convert that to a table and then to HTML, R tells me it cannot coerce Table to a table. The same happens with data frames.
Does anyone know a way, as I'd quite like to customise the appearance of the output?

There are several packages for that. Here are some:
knitr::kable(Table)
htmlTable::htmlTable(Table)
ztable::ztable(as.data.frame(Table))
DT::datatable(Table)
stargazer::stargazer(Table, type = "html")
Each of these has different customization options.
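As a minimal sketch of the first option, assuming knitr is installed: kable() with format = "html" returns the table markup as text, and arguments such as digits and col.names cover basic customisation. The seed, the smaller sample size and the column labels below are illustrative choices, not part of the question, and the frequency table is rebuilt in base R only to keep the example self-contained.

```r
library(knitr)

set.seed(1)  # illustrative seed, so the sketch is reproducible
Category <- factor(sample(1:6, 1000, replace = TRUE),
                   levels = 1:6,
                   labels = c("First", "Second", "Third",
                              "Fourth", "Fifth", "Sixth"))

# Frequency table equivalent to the dplyr pipeline in the question
counts <- table(Category)
freq <- data.frame(
  Category = names(counts),
  N        = as.integer(counts),
  Percent  = as.numeric(counts) / length(Category) * 100
)
freq$C.Percent <- cumsum(freq$Percent)

# Render as HTML; digits and col.names are two of kable's customisation options
html <- kable(freq, format = "html", digits = 2,
              col.names = c("Category", "N", "%", "Cumulative %"))
cat(as.character(html), sep = "\n")
```

In an R Markdown document rendered to HTML, calling kable(freq) in a chunk is normally enough, since knitr picks the output format automatically.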

Related

R web scraping with Rselenium and rvest

I need to scrape this webpage so that I end up with a data.frame like this:
value01 value02 id
SECTION I LIVE ANIMALS ANIMAL PRODUCTS sectionI
CHAPTER 1 LIVE ANIMALS chap0100000000
0101 Live horses, asses, mules and hinnies : (TN701) 0101000000-1
- Horses : 0101210000-2
0101 21 - - Pure-bred breeding animals (NC018) 0101210000-80
0101 29 - - Other : 0101290000-3
0101 29 10 - - - For slaughter 0101291000-80
0101 29 90 - - - Other 0101299000-80
0101 30 - Asses 0101300000-80
To obtain the first two rows of value01 and value02 I use:
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.section') %>% html_table())[2])
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.chapter') %>% html_table())[2])
To obtain the rest of the values of value01 and value02 I use the following (the values still need cleaning afterwards, and I think there is a better way to obtain the data):
remDr$getPageSource()[[1]] %>% read_html() %>% html_element(xpath = '//*[@id="div_description"]') %>% html_table()
So my problem now is to get the id column of the data.frame I want and to put it all together. Any advice on how to proceed from here to achieve my goal?
The code needed to run the previous examples:
suppressMessages(suppressWarnings(library(RSelenium)))
suppressMessages(suppressWarnings(library(rvest)))
rD <- rsDriver(browser = 'firefox', port = 6000L, verbose = FALSE)
remDr <- rD[['client']]
remDr$navigate('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Domain=TARIC&Offset=0&ShowMatchingGoods=false&callbackuri=CBU-1&SimDate=20220719')
It is not quite clear to me what you want to scrape exactly from that page, but this is how you can get the data I think you are after.
pg <- remDr$getPageSource()[[1]]
doc <- xml2::read_html(pg)
# first two lines
rvest::html_elements(doc, '#sectionI table , .chapter') |>
rvest::html_table()
# get the data from each further line
lines <- rvest::html_elements(doc, ".evenLine")
data <- rvest::html_table(lines)
ids <- rvest::html_attrs(lines) |> sapply(function(x) x[1])
You'll need to clean the scraped data to your liking.
If this is not what you are looking for, you should clarify your question further.
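To put the pieces together into one data.frame, a sketch along these lines may help. It runs on a small inline HTML fragment rather than the live page, and the markup (the code and desc classes, the id attribute on each .evenLine element) is an assumption made for illustration; the real page's structure should be checked before reusing the selectors.

```r
library(rvest)

# Hypothetical markup loosely modelled on the page; not the real source
html <- paste0(
  '<div id="div_description">',
  '<div class="evenLine" id="0101210000-80">',
  '<span class="code">0101 21</span>',
  '<span class="desc">Pure-bred breeding animals</span></div>',
  '<div class="evenLine" id="0101290000-3">',
  '<span class="code">0101 29</span>',
  '<span class="desc">Other</span></div>',
  '</div>'
)

doc   <- read_html(html)
lines <- html_elements(doc, ".evenLine")

# html_element() returns one match per line, so the columns stay aligned
result <- data.frame(
  value01 = html_text(html_element(lines, ".code")),
  value02 = html_text(html_element(lines, ".desc")),
  id      = html_attr(lines, "id"),
  stringsAsFactors = FALSE
)
result
```

Using html_attr(lines, "id") asks for the attribute by name, which is a little more robust than taking the first attribute of each node if the attribute order ever changes.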

How can I filter out numbers from an html table in R?

I am currently working on a forecasting model and to do this I would like to import data from an HTML website into R and save the values-part of the data set into a new list.
I have used the following approach in R:
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
removeNodes(getNodeSet(document,"//*/comment()"))
doc.tables<-readHTMLTable(document)
# show BID/ASK block:
doc.tables[2]
Which (doc.tables[2]) gives me in this case the result:
$`NULL`
Bid 0,765
1 Ask 0,80
How can I filter out the numbers (0,765 & 0,80) of the table, to save them into a list?
The issue is that 0,765 is actually the name of a column in your data frame,
your data frame being doc.tables[[2]].
You can grab the name by calling names(doc.tables[[2]])[2]
and store it in a variable, e.g. name <- names(doc.tables[[2]])[2].
Then you can grab the 0,80 with doc.tables[[2]][[2]]; store that in a variable too if you like.
The final code should look like: my_list <- list(name, doc.tables[[2]][[2]])
Here is a way with rvest, not package XML.
The code below uses two more packages, stringr and readr, to extract the values and their names.
library(httr)
library(rvest)
library(dplyr)
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
tbl <- page %>%
html_elements("tr") %>%
html_text() %>%
.[3:4] %>%
stringr::str_replace_all(",", ".")
tibble(name = stringr::str_extract(tbl, "Ask|Bid"),
value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-26 by the reprex package (v2.0.1)
Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.
library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
page %>%
html_elements("tr") %>%
html_text() %>%
.[3:4] %>%
str_replace_all(",", ".") %>%
tibble(name = str_extract(., "Ask|Bid"),
value = readr::parse_number(.)) %>%
.[-1]
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-27 by the reprex package (v2.0.1)
This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.
library(httr)
library(XML)
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)
# Extract column from BID/ASK table
# Extract the value column from the BID/ASK table
BidAsk = doc.tables[[2]][, 2]
# Replace the decimal comma with a decimal point
BidAsk = gsub(",", ".", BidAsk)
# Convert to numeric
BidAsk = as.numeric(BidAsk)
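The decimal-comma conversion that all of these answers rely on can be tried without network access. The sketch below uses a hand-made data frame standing in for doc.tables[[2]] as read with header=FALSE; the column names V1 and V2 are assumed for illustration.

```r
# Stand-in for the parsed BID/ASK table (values copied from the question)
bid_ask <- data.frame(V1 = c("Bid", "Ask"),
                      V2 = c("0,765", "0,80"),
                      stringsAsFactors = FALSE)

# Swap the decimal comma for a point, then convert to numeric
values <- as.numeric(sub(",", ".", bid_ask$V2, fixed = TRUE))

# Store the numbers in a named list, as asked in the question
prices <- setNames(as.list(values), bid_ask$V1)
str(prices)
```

A named list keeps each number next to its label, so prices$Bid and prices$Ask can be used directly in later calculations.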

Issue loading HTML Table into R

I want to load the table at the bottom of the following webpage into R, either as a dataframe or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function in the XML package
library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table
However, this returns an empty list and gives me the following warning:
Warning message: XML content does not seem to be XML: ''
I also tried adapting code I found in the post "Scraping html tables into R data frames using the XML package". This worked for 5 of the 6 tables on the page, but just returned the header row and one row with values from the header row for the 6th table, which is the one I am interested in. Code below:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those in the header row
head(applicanttable)
Any insights would be greatly appreciated! For reference, I have also consulted the following posts that appear to have similar goals, but could not find a solution there:
Scraping html tables into R data frames using the XML package
Extracting html table from a website in R
The data is dynamically pulled from a nested JavaScript array, within a script tag when JavaScript runs in the browser. This doesn't happen when you use rvest to retrieve the non-rendered content (as seen in view-source).
You can regex out the appropriate nested array and then re-construct the table by splitting out the rows, adding the appropriate headers and performing some data manipulations on various columns e.g. some columns contain html which needs to be parsed to obtain the desired value.
As some columns, e.g. Name, contain values which could be interpreted as file paths when passed to read_html, I use htmltidy to ensure they are handled as valid HTML.
N.B. If you use RSelenium then the page will render and you can just grab the table direct without reconstructing it.
TODO:
There are still some data type manipulations you could choose to apply to a few columns.
There is some more logic to be applied to ensure only Name is returned in Name column. Take the case of df$Name[10], this returns "Character and fitness issues" instead of Anxiousboy, due to the required value actually sitting in element.nextSibling.nextSibling of the p tag which is actually selected. These, infrequent, edge cases, need some additional logic built in. In this case, you might test for a particular string being returned then resort to re-parsing with an xpath expression.
R:
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
library(htmltidy)
#> Warning: package 'htmltidy' was built under R version 4.0.3
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
get_value <- function(input) {
value <- tidy_html(input) %>%
read_html() %>%
html_node("a, p, span") %>%
html_text(trim = T)
result <- ifelse(is.na(value), input, value)
return(result)
}
tidy_result <- function(result) {
return(gsub("<.*", "", result))
}
page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")
s <- page %>% toString()
headers <- page %>%
html_nodes("#applicants-table th") %>%
html_text(trim = T)
s <- stringr::str_extract(s, regex("DataTable\\(\\{\n\\s+data:(.*\\n\\]\\n\\])", dotall = T)) %>%
gsub("\n", "", .)
rows <- stringr::str_extract_all(s, regex("(\\[.*?\\])", dotall = T))[[1]] %>% as.list()
df <- sapply(rows, function(x) {
stringr::str_match_all(x, "'(.*?)'")[[1]][, 2]
}) %>%
t() %>%
as_tibble(.name_repair = "unique")
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...
names(df) <- headers
df <- df %>%
rowwise() %>%
mutate(across(c("Name", "GRE", "URM", "$$$$"), .f = get_value)) %>%
mutate_at(c("Result"), tidy_result)
write.csv(df, "Yale Applications.csv")
Created on 2021-06-23 by the reprex package (v0.3.0)
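The core idea, pulling a nested JavaScript array out of script text with a regex and rebuilding the rows, can be sketched in base R on a made-up string. The script content, the names and the column headers below are invented for illustration and are much simpler than the real page, which also embeds HTML inside some cells.

```r
# Hypothetical script text with a nested array of single-quoted values
script <- "var t = $('#x').DataTable({\n data: [\n['Alice', '170', 'Accepted'],\n['Bob', '165', 'Rejected']\n]\n});"

# Match each innermost [...] group, i.e. one table row per match
rows <- regmatches(script,
                   gregexpr("\\[[^\\[\\]]*\\]", script, perl = TRUE))[[1]]

# Within each row, pull out the single-quoted fields and drop the quotes
fields <- lapply(rows, function(r) {
  m <- regmatches(r, gregexpr("'[^']*'", r))[[1]]
  gsub("'", "", m, fixed = TRUE)
})

# Bind the rows into a data frame and attach the (assumed) headers
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- c("Name", "LSAT", "Result")
df
```

This only works while every row has the same number of fields and no field contains a quote or bracket; cells that embed HTML, as on the real page, need the extra parsing step shown in the answer.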

How to export an R object to HTML file without writing first to .Rmd?

I want to print the content of R objects and save it as a rendered HTML file. To this end, I find the pander package useful. However, I don't know how to go from the markdown pander::pander() generates to an HTML file in a single code execution, without intermediate save to .Rmd file. This question comes after a different question I posted about getting a png export for the same process.
I'm going to use the same example I used in the other post.
Example
Let's say that we have the mtcars data, and we want to extract some information about it:
Number of rows in the data
Average of mpg
The factor levels available in cyl
Regression summary for predicting mpg ~ cyl
To this end, I'll compute each of the above and assign them to objects. Finally, I'll bundle all the info in a list object.
library(dplyr)
library(tibble)
library(broom)
number_of_rows <- nrow(mtcars)
mpg_mean <- mean(mtcars$mpg)
cyl_levels <- mtcars %>% select(cyl) %>% unique() %>% remove_rownames()
model_summary <- lm(mpg ~ cyl, mtcars) %>% broom::tidy()
my_data_summary <- lst(number_of_rows,
mpg_mean,
cyl_levels,
model_summary)
> my_data_summary
## $number_of_rows
## [1] 32
## $mpg_mean
## [1] 20.09062
## $cyl_levels
## cyl
## 1 6
## 2 4
## 3 8
## $model_summary
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 37.9 2.07 18.3 8.37e-18
## 2 cyl -2.88 0.322 -8.92 6.11e-10
So how can I print the contents of my_data_summary as an HTML file?
I thought that pander::pander() is a good candidate because it accepts the list object in its entirety.
library(pander)
> pander(my_data_summary)
* **number_of_rows**: _32_
* **mpg_mean**: _20.09_
* **cyl_levels**:
-----
cyl
-----
6
4
8
-----
* **model_summary**:
------------------------------------------------------------
term estimate std.error statistic p.value
------------- ---------- ----------- ----------- -----------
(Intercept) 37.88 2.074 18.27 8.369e-18
cyl -2.876 0.3224 -8.92 6.113e-10
------------------------------------------------------------
<!-- end of list -->
But now I'm stuck. How could I progress from pander(my_data_summary) to an .html file in my directory? I want an .html that when opened, looks like this:
Is there a way to write a single code that goes directly from my_data_summary to an .html export without generating an intermediate file?
Thanks!
EDIT 1
@Konrad commented that it's impossible to progress from pander's markdown output directly. So if we ignore pander, is there any way to export the contents of my_data_summary to an HTML file directly?
EDIT 2
I made some progress with pander.
info_as_chr_vec <-
my_data_summary %>%
pander::pander_return() %>%
paste(., collapse = "\n")
info_as_chr_vec %>%
pander::Pandoc.brew(text = ., output = ~path/to/html/output/file, convert = "html")
Which writes an html file. This seems OK:
But I dislike the unnecessary table of contents and the footer, and I don't know how to change the overall style of the tables. I'd much rather have something plainer. Obviously, I could edit the HTML after the fact, but the whole point is to get the desired HTML file directly from the original execution of the R code.
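One possible direction, which is not from the original thread, is to skip markdown entirely and assemble plain HTML by hand: knitr::kable(format = "html") for the tabular elements and simple paragraph tags for the scalars. The sketch below assumes knitr is installed; the to_html helper, the base-R rebuild of the summary list and the temporary output path are all illustrative.

```r
library(knitr)

# Base-R version of the summary list from the question
summary_list <- list(
  number_of_rows = nrow(mtcars),
  mpg_mean       = mean(mtcars$mpg),
  cyl_levels     = data.frame(cyl = unique(mtcars$cyl)),
  model_summary  = as.data.frame(coef(summary(lm(mpg ~ cyl, mtcars))))
)

# Hypothetical helper: a heading plus either an HTML table or a paragraph
to_html <- function(name, value) {
  body <- if (is.data.frame(value)) {
    as.character(kable(value, format = "html", digits = 3))
  } else {
    paste0("<p>", format(value, digits = 4), "</p>")
  }
  paste0("<h3>", name, "</h3>\n", paste(body, collapse = "\n"))
}

sections <- unlist(Map(to_html, names(summary_list), summary_list),
                   use.names = FALSE)

# Wrap in a bare page and write it out; tempfile() is a placeholder path
out_file <- tempfile(fileext = ".html")
writeLines(c("<!DOCTYPE html>", "<html><body>", sections, "</body></html>"),
           out_file)
```

Because the page is built from plain strings, there is no table of contents or footer, and a style block could be added to the wrapper if custom CSS is wanted.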

R loops with JSON API Source

I'm trying to get book price data from an API (http://www.knigoed.info/api-prices.html) based on ISBN.
The idea is to pass a vector of ISBNs to a function and get back a data frame with all available info (or at least a data frame with prices from different vendors).
isbns<- c("9785170922789", "9785170804801", "9785699834174", "9785699717255", "9785170869237")
getISBNprice <- function(ISBN, source="http://www.knigoed.info/api/Prices?code=") {
pathA <- source
for (i in 1:length(ISBN)) {
ISB <- ISBN[i]
AAA <- paste(pathA, ISB, "&sortPrice=DESC&country=RU", sep="")
document <- fromJSON(AAA, flatten = FALSE)
dfp <- document$prices
dfp <- cbind(dfp,ISB )
# dfp <- cbind(dfp,BookID=document$bookId)
# dfp <- cbind(dfp,Title=document$title)
# dfp <- cbind(dfp,Author=document$author)
# dfp <- cbind(dfp,Publisher=document$publisher)
# dfp <- cbind(dfp,Series=document$series)
# dfp <- cbind(dfp,Picture=document$picture)
if (!exists("AAAA")) {AAAA<- dfp} else {bind_rows(AAAA, dfp) }
}
AAAA
}
But the function produces these warnings:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
3: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
4: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
It's easiest to make a list from the start, which will make simplifying later easier. The purrr package can make working with lists much easier, though the usages here can be replaced with base's lapply and mapply/Map if you prefer.
library(purrr)
# Paste is vectorized, so make a list of URLs all at once.
# `httr` can make a URL out of a list of named parameters, if it's more convenient.
results <- paste0("http://www.knigoed.info/api/Prices?code=",
isbns,
"&sortPrice=DESC&country=RU") %>%
# Iterate over vector of URLs, using fromJSON to pull and parse the request.
# map, like lapply, will put the results into a list.
map(jsonlite::fromJSON, flatten = FALSE)
# Grab "prices" element of each top-level list element
results %>% map('prices') %>%
# Iterate in parallel (like mapply/Map) over prices and isbns, making a data.frame of
# each. map2_df will coerce the resulting list of data.frames to a single data.frame.
map2_df(isbns, ~data.frame(isbn = .y, .x, stringsAsFactors = FALSE)) %>%
# For pretty printing
tibble::as_tibble()
## # A tibble: 36 x 10
## isbn shopId name domain
## <chr> <chr> <chr> <chr>
## 1 9785170922789 29 Магистр booka.ru
## 2 9785170922789 3 Лабиринт labirint.ru
## 3 9785170922789 20 LitRes.ru litres.ru
## 4 9785170804801 29 Магистр booka.ru
## 5 9785170804801 2 Read.ru read.ru
## 6 9785170804801 3 Лабиринт labirint.ru
## 7 9785170804801 63 Эксмо eksmo.ru
## 8 9785170804801 1 OZON.ru ozon.ru
## 9 9785170804801 4 My-shop.ru my-shop.ru
## 10 9785170804801 1 OZON.ru ozon.ru
## # ... with 26 more rows, and 6 more variables: url <chr>, available <lgl>, downloadable <lgl>,
## # priceValue <dbl>, priceSuffix <chr>, year <int>
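The same list-first pattern can be sketched in base R without calling the API. Note that in the question's loop the result of bind_rows() is never assigned, so building one data frame per ISBN and combining them once at the end sidesteps that as well. The fake_prices function below is a hypothetical stand-in for jsonlite::fromJSON(url)$prices, and its values are invented.

```r
# Hypothetical stand-in for the API call; returns a small prices table
fake_prices <- function(isbn) {
  data.frame(shopId     = c("1", "2"),
             priceValue = c(100, 120),
             stringsAsFactors = FALSE)
}

isbns <- c("9785170922789", "9785170804801")

# One data frame per ISBN, each tagged with the ISBN it belongs to
per_isbn <- lapply(isbns, function(isbn) {
  data.frame(isbn = isbn, fake_prices(isbn), stringsAsFactors = FALSE)
})

# Combine once at the end instead of growing a result inside the loop
all_prices <- do.call(rbind, per_isbn)
all_prices
```

Keeping strings as character (stringsAsFactors = FALSE) avoids the "Unequal factor levels" warnings, since there are no factor levels to reconcile when the pieces are combined.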