I consider the bsplus package relevant when developing dynamic web pages. I use R Markdown in RStudio.
However, I find it particularly tricky to integrate bsplus functions with R outputs.
Let's look at an example with the bs_accordion function, using the mtcars dataset:
head <- head(mtcars)
tail <- tail(mtcars)
bs_accordion(id = "Data: mtcars") %>%
  bs_append(title = "Head of mtcars", content = head) %>%
  bs_append(title = "Tail of mtcars", content = tail)
I would like to display R outputs in the accordion, showing the data frames head and tail.
Currently, it only displays the first numeric row of head.
Is there any way to include R code within the content argument of the bsplus functions?
That way, we could display R results dynamically.
This should work for your example. You have to create a datatable somehow; just including the data frame won't render it as a table.
Note: I changed the id of the accordion to Data-mtcars. Using whitespace, ":" or ";" in the id will disable the collapsing.
library(shiny)
library(bsplus)
library(DT)

# Data frames from the question, defined here so the app is self-contained
head <- head(mtcars)
tail <- tail(mtcars)

ui <- fluidPage(
  bs_accordion(id = "Data-mtcars") %>%
    bs_set_opts(panel_type = "primary", use_heading_link = TRUE) %>%
    bs_append(title = "Head of mtcars", content = DT::dataTableOutput("table1")) %>%
    bs_set_opts(panel_type = "primary", use_heading_link = TRUE) %>%
    bs_append(title = "Tail of mtcars", content = DT::dataTableOutput("table2"))
)

server <- function(input, output) {
  output$table1 <- DT::renderDataTable({
    head
  })
  output$table2 <- DT::renderDataTable({
    tail
  })
}

shinyApp(ui, server)
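If the goal is a static R Markdown document rather than a Shiny app, a minimal sketch (assuming, per the bsplus documentation, that bs_append() accepts raw HTML as content) is to pass a pre-rendered HTML table instead of a DT output:

library(bsplus)
library(htmltools)

# Hedged sketch: kable() renders the data frame to an HTML string, and HTML()
# marks it as raw HTML so bs_append() embeds it instead of escaping it.
bs_accordion(id = "Data-mtcars") %>%
  bs_append(title = "Head of mtcars",
            content = HTML(knitr::kable(head(mtcars), format = "html"))) %>%
  bs_append(title = "Tail of mtcars",
            content = HTML(knitr::kable(tail(mtcars), format = "html")))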
I am web scraping a website in Jordan. The first page I'm scraping is https://alrai.com/search?date-from=2004-09-21&pgno=1.
I'm trying to make R run through each date and then each nested link that takes you to the other pages (pgno=1,2,3, etc.). The for loop works when I only use it to obtain the links on 2004-09-21, but I need to be able to move up through the dates.
I thought wrapping another for loop around the first one to cycle through dates would work. But as it stands, the code only returns the 10 elements on the first page and doesn't even go through the other page numbers.
library(rvest)

for (i in seq_along(days)) {
  for (pagenumber in seq(from = 1, to = 10, by = 1)) {
    link = paste("https://alrai.com/search?date-from=", (days[i]), "&pgno=",
                 pagenumber, sep = "")
    page = read_html(link)
  }
}
readlink <- read_html(link)
text_title <- readlink %>%
  html_elements(".font-700") %>%
  html_text2()
article_links <- readlink %>%
  html_elements(".font-700") %>%
  html_attr("href")
Scraping the first 5 pages with purrr::map_dfr (without a loop). Note that in your loop, link and page are overwritten on every iteration and nothing is stored, so only the last link built is actually scraped after the loops finish; that is why you only ever see one page of results.
library(tidyverse)
library(rvest)

scraper <- function(page) {
  site <- str_c("https://alrai.com/search?date-from=2004-09-21&pgno=",
                page) %>%
    read_html()
  tibble(title = site %>%
           html_elements(".font-700") %>%
           html_text2())
}
map_dfr(1:5, scraper)
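To cover the other requirement, several dates as well as pages, the same function can take the date as a second argument and be mapped over a date-page grid. A hedged sketch, assuming days is a character vector of dates in the same format the URL expects:

library(tidyverse)
library(rvest)

scraper <- function(day, page) {
  # Build the search URL for one date and one page number
  site <- str_c("https://alrai.com/search?date-from=", day,
                "&pgno=", page) %>%
    read_html()
  tibble(date = day,
         title = site %>%
           html_elements(".font-700") %>%
           html_text2())
}

days <- c("2004-09-21", "2004-09-22")  # example dates (assumption)
# crossing() builds every date/page combination; pmap_dfr() maps by column name
crossing(day = days, page = 1:10) %>%
  pmap_dfr(scraper)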
I managed to scrape one page from a newspaper archive, following the explanations here.
Now I am trying to automate the process so that one run of the code accesses a list of pages.
Making a list of URLs was easy, as the newspaper's archive follows a consistent link pattern:
https://en.trend.az/archive/2021-XX-XX
The problem is with writing a loop to scrape data such as the title, date, time, and category. For simplicity, I tried to work only with article headlines from 2021-09-30 to 2021-10-02.
library(RSelenium)
library(rvest)
library(stringr)

## Setting data frames
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
list_of_url <- character() # or str_c()

## Generating subpage list
for (i in format(seq(d1, d2, by = "days"), format = "%Y-%m-%d")) {
  list_of_url[i] <- str_c("https://en.trend.az", "/archive/", i)
  # Launching browser
  driver <- rsDriver(browser = c("firefox")) # Version 93.0 (64-bit)
  remDr <- driver[["client"]]
  remDr$errorDetails
  remDr$navigate(list_of_url[i])
  remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
  webElem <- remDr$findElement("css", "body")
  # scrolling to the end of the webpage, to load all articles
  for (j in 1:25) {
    Sys.sleep(2)
    webElem$sendKeysToElement(list(key = "end"))
  }
  page <- read_html(remDr$getPageSource()[[1]])
  # Scraping article headlines
  get_headline <- page %>%
    html_nodes('.category-article') %>% html_nodes('.article-title') %>%
    html_text()
  get_time <- str_sub(get_time, start = -5)
  length(get_time)
}
In total, the length should have been 157 + 166 + 140 = 463. In fact, I did not manage to collect all the data even from one page (length(get_time) = 126).
I assumed that after the first set of commands in the loop I would obtain three remDr objects, one for each of the 3 dates specified, but they were not recognised independently later on.
Because of that, I tried to start a second loop inside the initial one, before or after page <-, with
for (remDr0 in remDr) {
  page <- read_html(remDr0$getPageSource()[[1]])
  # substituted all remDr-s below with remDr0
OR
page <- read_html(remDr$getPageSource()[[1]])
for (page0 in page)
  # substituted all page-s below with page0
However, these attempts ended with different errors.
I would appreciate the help of specialists, as this is my first time using R for such purposes.
I hope it will be possible to correct the existing loop I made, or perhaps you can even suggest a shorter pathway, for example by writing a function.
A slight broadening of the approach, which can also be extended to scraping multiple categories.
library(RSelenium)
library(dplyr)
library(rvest)
Specify the date period:
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt <- seq(d1, d2, by = "days") # contains the date sequence

# launch browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
`get_headline`: function for the newspaper headlines
get_headline <- function(x) {
  link <- paste0('https://en.trend.az/archive/', x)
  remDr$navigate(link)
  remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
  webElem <- remDr$findElement("css", "body")
  # scrolling to the end of the webpage, to load all articles
  for (i in 1:25) {
    Sys.sleep(1)
    webElem$sendKeysToElement(list(key = "end"))
  }
  headlines <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes('.category-article') %>% html_nodes('.article-title') %>%
    html_text()
  return(headlines)
}
`get_time`: function for the time of publishing
get_time <- function(x) {
  link <- paste0('https://en.trend.az/archive/', x)
  remDr$navigate(link)
  remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
  webElem <- remDr$findElement("css", "body")
  # scrolling to the end of the webpage, to load all articles
  for (i in 1:25) {
    Sys.sleep(1)
    webElem$sendKeysToElement(list(key = "end"))
  }
  # Addressing the selector for the time on the website
  time <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes('.category-article') %>% html_nodes('.article-date') %>%
    html_text() %>%
    str_sub(start = -5)
  return(time)
}
`get_number`: numbering of all articles from one page/day
get_number <- function(x) {
  link <- paste0('https://en.trend.az/archive/', x)
  remDr$navigate(link)
  remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
  webElem <- remDr$findElement("css", "body")
  # scrolling to the end of the webpage, to load all articles
  for (i in 1:25) {
    Sys.sleep(1)
    webElem$sendKeysToElement(list(key = "end"))
  }
  # Addressing the selector for the headlines on the website
  headline <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes('.category-article') %>% html_nodes('.article-title') %>%
    html_text()
  number <- seq_along(headline)
  return(number)
}
`get_data_table`: collection of all functions into a tibble
get_data_table <- function(x) {
  # Extract the basic information from the HTML
  headline <- get_headline(x)
  time <- get_time(x)
  headline_number <- get_number(x)
  # Combine into a tibble
  combined_data <- tibble(Num = headline_number,
                          Article = headline,
                          Time = time)
  return(combined_data)
}
Use lapply to loop through all the dates in dt:
df <- lapply(dt, get_data_table)
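Note that each helper re-navigates and re-scrolls the same page, so every date is loaded three times; merging the three extractions into one function would cut the run time roughly by two thirds. If you then want a single data frame rather than a list, and want to shut down the Selenium session cleanly, a minimal sketch building on the objects above:

# Name the list by date so bind_rows() can carry the date along as a column
names(df) <- as.character(dt)
all_days <- dplyr::bind_rows(df, .id = "Date")

# Close the browser and stop the Selenium server started by rsDriver()
remDr$close()
driver$server$stop()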
I would like to parse the addresses of all stores on the following website:
https://www.carrefour.fr/magasin/region/, looping through the regions. So I'm starting, for example, with the region "auvergne-rhone-alpes-84", hence the full URL https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84. Note that I can add more regions afterwards; I just want to make it work with one for now.
library(httr)
library(rvest)

carrefour <- "https://www.carrefour.fr/magasin/region/"
addresses_vector <- c()

for (current_region in c("auvergne-rhone-alpes-84")) {
  current_region_url <- paste(carrefour, current_region, "/", sep = "")
  x <- GET(url = current_region_url)
  html_doc <- read_html(x) %>%
    html_nodes("[class = 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2']")
  addresses_vector <- c(addresses_vector, html_doc %>%
                          rvest::html_nodes('body') %>%
                          xml2::xml_find_all(".//div[contains(@class, 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2')]") %>%
                          rvest::html_text())
}
I also tried x %>% read_html() %>% rvest::html_nodes(xpath = "/html/body/main/div[1]/div/div[2]/div[2]/ol/li[1]/div/div[1]/div[2]/div[2]") %>% rvest::html_text() (copying the whole XPath by hand), x %>% read_html() %>% html_nodes("div.ds-body-text.ds-store-card__details--content.ds-body-text--size-m.ds-body-text--color-standard-2") %>% html_text(), and several other ways, but I always get a character(0) element returned.
Any help is appreciated!
You could write a couple of custom functions to help, then use purrr to map the store-data function over the output of the first helper function.
First, extract the region URLs along with the region names and region ids, and store these in a tibble. This is the first helper function, get_regions.
Then use another function, get_store_info, to extract the store info from each region URL. The info sits in an attribute of a div tag; in the browser it is extracted dynamically when JavaScript runs, but with rvest you can read the attribute directly and parse it as JSON.
Apply the function that extracts the store info over the list of region URLs and region ids.
If you use map2_dfr to pass both the region id and the region link to the function that extracts the store data, you keep the region id to join the result of the map2_dfr back onto the region tibble generated earlier.
Then do some column cleaning, e.g. drop the ones you don't want.
library(rvest)
library(purrr)
library(dplyr)
library(readr)
library(stringr) # needed for str_match() below
library(jsonlite)

get_regions <- function() {
  url <- "https://www.carrefour.fr/magasin"
  page <- read_html(url)
  regions <- page %>% html_nodes(".store-locator-footer-list__item > a")
  t <- tibble(
    region = regions %>% html_text(trim = TRUE),
    link = regions %>% html_attr("href") %>% url_absolute(url),
    region_id = NA_integer_
  ) %>% mutate(region_id = str_match(link, "-(\\d+)$")[, 2] %>%
                 as.integer())
  return(t)
}
get_store_info <- function(region_url, r_id) {
  region_page <- read_html(region_url)
  store_data <- region_page %>%
    html_node("#store-locator") %>%
    html_attr(":context-stores") %>%
    parse_json(simplifyVector = TRUE) %>%
    as_tibble()
  store_data$region_id <- r_id
  return(store_data)
}
region_df <- get_regions()
store_df <- map2_dfr(region_df$link, region_df$region_id, get_store_info)
final_df <- inner_join(region_df, store_df, by = 'region_id') # now clean columns within this.
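A hedged sketch of that clean-up step, under the assumption that the :context-stores JSON contains columns such as name and address (hypothetical names; inspect names(store_df) for the real ones):

final_df <- final_df %>%
  select(region, region_id, name, address) # 'name' and 'address' are assumed column names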
EDIT: Because I set the table format option globally (as a function), I have to set either latex_options or bootstrap_options in the kable_styling() call. I was using bootstrap_options, which wasn't being read by the LaTeX output. My work-around is to make the tables twice, once in a chunk for HTML and once in a chunk for LaTeX. Not great, but it works if I click the Knit button and choose Knit to PDF. However, it still throws the original error when I run it in the Shiny app.
I have created a test version (MiniTest) of my project. What I need is a Shiny app with a tab that produces an HTML file for a user-chosen (reactive) Country and provides an Excel download (I have that working, so I kept it out of this example) and a PDF download. I knit an .Rmd which chooses the format and allows for parameterization. (The Shiny part was set up by someone else, from whom I took over this project when they left before finishing it.)
I use kable and kableExtra to create and format tables, as I heard it works for both HTML and LaTeX output. The HTML is more or less as I want it. I can knit either HTML or PDF and it runs, BUT from within the Shiny app, only the HTML portion works. I think I have narrowed the PDF issue down to column_spec crashing the download. If I comment out the column_spec lines in t01 and t02, the Download PDF runs, but I need that formatting. I'm sorry, but I've lost track of all the sites I have searched.
In global.R, I set:
countries <- c("ABC", "DEF", "GHI", "JKL")
In the .Rmd, I have YAML set up (with two-space indents for Country and output types):
params:
  Country: ABC
output:
  pdf_document: default
  html_document: default
Relevant .Rmd chunks and inline code include:
knitr::opts_chunk$set(echo = FALSE)
options(knitr.table.format = function() {
  if (knitr::is_latex_output()) "latex" else "html"
})
library(shiny)
library(htmlwidgets)
library(shinythemes)
library(shinydashboard)
library(shinyjs)
library(shinycssloaders)
library(markdown)
library(tidyr)
library(tidyverse)
library(janitor)
library(kableExtra)
options(scipen = 999)
mini <- mtcars %>%
  tibble::rownames_to_column(var = "car") %>%
  mutate(Country = c(rep("ABC", 8), rep("DEF", 8), rep("GHI", 8), rep("JKL", 8)))
## https://bookdown.org/yihui/rmarkdown-cookbook/font-color.html
colorize <- function(x, color) {
  if (knitr::is_latex_output()) {
    ## hack: setting color = 'blue' instead of a hex code, since "#" breaks the LaTeX code
    sprintf("\\textcolor{%s}{%s}", color = 'blue', x) ## works, but isn't the right blue
  } else if (knitr::is_html_output()) {
    sprintf("<span style='color: %s;'>%s</span>", color, x)
  } else x
}
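As a hedged aside, colorize() is meant to be called inline in the .Rmd body, following the cookbook pattern it is taken from, e.g.:

Some `r colorize("highlighted words", "blue")` in a sentence.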
## make two tables with `kable` and `kableExtra`
new_title <- paste0("Dynamically Changing Country Name in column", params$Country)
t01 <- mini %>%
  filter(Country == params$Country) %>%
  select(car, mpg:hp) %>%
  rename({{new_title}} := car) %>%
  kable(align = c("l", "c", "c", "c", "c")) %>%
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:3, width = "5em") %>%
  row_spec(0, color = "#2A64AB") %>%
  row_spec(6, bold = TRUE)
t02_title <- paste0(params$Country, " Table with Dollar Signs in Var Names")
t02 <- mini %>%
  filter(Country == params$Country) %>%
  select(car, drat, wt) %>%
  mutate(car = case_when(car == "Mazda RX4" ~ "Mazda RX4 (US\\$)*", TRUE ~ as.character(car))) %>%
  ## want to blank out the column names - removing them entirely would be best, but it fails
  kable(align = c("l", "r", "c"), escape = TRUE, col.names = c("", "", "")) %>%
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "condensed")) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, width = "10em") %>%
  footnote(general = "*Never smart to start with an asterisk, but here we are", general_title = "")
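A hedged aside on the comment about blanking out the column names: recent knitr versions accept col.names = NULL in kable(), which removes the header row entirely rather than printing empty names (check that your knitr version supports this):

mini %>%
  select(car, drat, wt) %>%
  kable(col.names = NULL) # drops the header row entirely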
## make two charts with `ggplot2`
chart1 <- mini %>%
  filter(Country == params$Country) %>%
  select(car, mpg:hp) %>%
  ggplot2::ggplot(mapping = aes(x = mpg)) +
  geom_col(aes(y = `cyl`, fill = "cyl"), color = "black")
c1_title <- paste0("Some fab title here for ", params$Country)
chart2 <- mini %>%
  filter(Country == params$Country) %>%
  select(car, vs:carb) %>%
  ggplot2::ggplot(mapping = aes(x = carb)) +
  geom_col(aes(y = `gear`, fill = "gear"), color = "black")
c2_title <- paste0("Another chart, ", params$Country)
## make a "tiny" LaTeX environment that is only generated for LaTeX output, with chunk setting `include = knitr::is_latex_output()`.
knitr::asis_output('\n\n\\begin{tiny}')
## Table 1
t01
I expect a PDF to pop up, but instead a Save File box pops up asking to save "DownloadPDF" with no file extension. The ui.R is supposed to name it "FactCountryName.pdf", where "CountryName" comes from the Country the user chose in the drop-down list. Regardless of whether I choose Save (nothing happens) or Cancel, R throws the following error:
```
! LaTeX Error: Illegal character in array arg.
```
If I comment out the line `column_spec(1, bold = TRUE) %>%`, the error changes to:
```
! Use of \@array doesn't match its definition.
\new@ifnextchar ...served@d = #1\def \reserved@a {
#2}\def \reserved@b {#3}\f...
l.74 ...m}|>{\centering\arraybackslash}p{5em}|c|c}
```
Please help!
It turns out that using the Knit button in RStudio automatically loads the required LaTeX packages, such as booktabs. Running the file from the Shiny app was not loading all the packages needed. All I had to do was explicitly list the extra packages in the YAML (which I found by looking at the .tex file produced from the PDF via the Knit button).
---
params:
  Country: ABC
header-includes:
  - \usepackage{booktabs}
  - \usepackage{longtable}
  - \usepackage{array}
  - \usepackage{multirow}
  - \usepackage{wrapfig}
  - \usepackage{float}
  - \usepackage{colortbl}
  - \usepackage{pdflscape}
  - \usepackage{tabu}
  - \usepackage{threeparttable}
  - \usepackage{threeparttablex}
output:
  pdf_document:
    keep_tex: true
  html_document: default
---
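As a side note on the duplicate-chunk work-around mentioned in the EDIT above: kable_styling() accepts bootstrap_options and latex_options in the same call and applies whichever matches the current output format, so a single chunk can serve both targets. A minimal sketch along the lines of t01:

t01 <- mini %>%
  filter(Country == params$Country) %>%
  select(car, mpg:hp) %>%
  kable(align = c("l", "c", "c", "c", "c")) %>%
  kable_styling(full_width = FALSE, position = "left",
                bootstrap_options = c("striped", "condensed"), # applied to HTML output
                latex_options = c("striped", "hold_position")) %>% # applied to LaTeX output
  column_spec(1, bold = TRUE)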
I've been working on a web-scraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process, and it uploads all the legislative documents to its website. I've been crawling over all the pages going back to 2008. Right now I'm parsing the information into a data frame, and I've hit an issue that I have not been able to resolve so far.
If we look at the DOM, we can see that most of the objects are named div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my data frame, I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my patterns match more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
library(RCurl)
library(rvest)

final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim = TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
Maybe you can help me with that.
My understanding is that you want to extract the text of the webpage, since the "tingdok-normal" objects relate to the text. I was able to get the text of the webpage with the following code, which also identifies the position of the first regex hit for each of the patterns to match.
library(pagedown)
library(pdftools)
library(stringr)

pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
                       "C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
Here is another approach:
library(RDCOMClient)
library(stringr)
library(rvest)

url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
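For the narrower question actually asked, keeping only the first match per pattern, no PDF detour or browser automation is needed; subsetting the grep() result works directly on the objects from the question. A minimal sketch, reusing normal and tomatch from above:

# grep() returns every element matching any of the patterns; [1] keeps the first
first_hit <- grep(paste(tomatch, collapse = "|"), normal, value = TRUE)[1]
first_hit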