Scraping several webpages from a website (newspaper archive) using RSelenium

I managed to scrape one page from a newspaper archive by following the explanations here.
Now I am trying to automate the process so that a single script goes through a list of pages.
Building the list of URLs was easy because the archive links all follow one pattern:
https://en.trend.az/archive/2021-XX-XX
The problem is writing a loop that scrapes the title, date, time and category of each article. For simplicity, I tried to work only with article headlines from 2021-09-30 to 2021-10-02.
## Setting data frames
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
list_of_url <- character() # or str_c()
## Generating subpage list
for (i in format(seq(d1, d2, by="days"), format="%Y-%m-%d")) {
list_of_url[i] <- str_c ("https://en.trend.az", "/archive/", i)
# Launching browser
driver <- rsDriver(browser = c("firefox")) #Version 93.0 (64-bit)
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(list_of_url[i])
remDr0$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
page <- read_html(remDr$getPageSource()[[1]])
# Scraping article headlines
get_headline <- page %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
get_time <- str_sub(get_time, start= -5)
length(get_time)
}
}
In total the length should have been 157+166+140=463. In fact, I did not manage to collect all the data even from one page (length(get_time) = 126).
I assumed that after the first set of commands in the loop I had obtained three remDr objects, one for each of the 3 dates specified, but they were not recognised independently later on.
Because of that, I tried to start a second loop inside the initial one, before or after the page <- line, with
for (remDr0 in remDr) {
page <- read_html(remDr0$getPageSource()[[1]])
# substituted all remDr-s below with remDr0
OR
page <- read_html(remDr$getPageSource()[[1]])
for (page0 in page)
# substituted all page-s below with page0
However, these attempts ended with different errors.
I would appreciate the help of specialists, as this is my first time using R for such purposes.
I hope it is possible to correct the loop I wrote, or maybe even to suggest a shorter route, for example by writing a function.
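
The URL pattern described above can be turned into a vector of links without a loop. The following is a minimal sketch under that assumption; the helper name archive_urls is hypothetical and not part of the original code.

```r
# Build one archive URL per day; format() on a Date yields "YYYY-MM-DD".
# archive_urls() is a hypothetical helper name used for illustration.
archive_urls <- function(from, to, base = "https://en.trend.az/archive/") {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  urls <- paste0(base, format(days, "%Y-%m-%d"))
  names(urls) <- format(days, "%Y-%m-%d")  # name each URL by its date
  urls
}

list_of_url <- archive_urls("2021-09-30", "2021-10-02")
print(list_of_url)
```

Because the result is a plain named character vector, it can be passed to lapply() or a for loop, and a single browser session started once outside the loop can be reused for every URL.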

Slight broadening for scraping multiple categories
library(RSelenium)
library(dplyr)
library(rvest)
library(stringr) # str_sub() is used in get_time()
Specify the date period
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt = seq(d1, d2, by="days")#contains the date sequence
#launch browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
### `get_headline` Function for newspaper headlines
get_headline = function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
headlines = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
headlines
return(headlines)
}
get_time Function for the time of publishing
get_time <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selector of time on the website
time <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-date') %>%
html_text() %>%
str_sub(start= -5)
time
return(time)
}
Numbering of all articles from one page/day
get_number <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of headlines on the website
headline <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
number <- seq_along(headline) # also correct when no headlines are found
return(number)
}
Collection of all functions into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
headline <- get_headline(x)
time <- get_time(x)
headline_number <- get_number(x)
# Combine into a tibble
combined_data <- tibble(Num = headline_number,
Article = headline,
Time = time)
}
Used lapply to loop through all the dates in dt
df = lapply(dt, get_data_table)
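
The three functions above each navigate to and scroll the same page, so every date is loaded three times. As a hedged alternative, the page source could be fetched once and parsed in a single step. The sketch below only covers the parsing part; parse_archive_page is a hypothetical name, and the demo markup is hand-written to mimic the selectors used above rather than copied from the real site.

```r
library(rvest)

# Parse one archive page that has already been loaded and scrolled.
# `html` is the string returned by remDr$getPageSource()[[1]].
parse_archive_page <- function(html, day) {
  page <- read_html(html)
  articles <- html_nodes(page, ".category-article")
  # html_node() (singular) returns one result per article, so the
  # columns stay aligned even if a field is missing in some article.
  headline <- html_text(html_node(articles, ".article-title"), trim = TRUE)
  stamp <- html_text(html_node(articles, ".article-date"), trim = TRUE)
  data.frame(
    Date = rep(as.character(day), length(headline)),
    Num = seq_along(headline),
    Article = headline,
    Time = substr(stamp, nchar(stamp) - 4, nchar(stamp)),  # last 5 characters
    stringsAsFactors = FALSE
  )
}

# Hand-written stand-in for a real page source (hypothetical markup).
demo_html <- '<html><body>
  <div class="category-article"><span class="article-title">First</span>
    <span class="article-date">30 September 2021 09:15</span></div>
  <div class="category-article"><span class="article-title">Second</span>
    <span class="article-date">30 September 2021 10:40</span></div>
</body></html>'

demo <- parse_archive_page(demo_html, "2021-09-30")
print(demo)
```

With a live session, the same function could be called once per date on remDr$getPageSource()[[1]] after the scrolling loop, and the per-day data frames combined with do.call(rbind, ...).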

Related

R: Webscraping double loop does not go through the dates

I am webscraping a website in Jordan. The first page I'm scraping is https://alrai.com/search?date-from=2004-09-21&pgno=1.
I'm trying to make R run through each date and then each nested link that takes you to the other pages (pgno=1,2,3 etc). The for loop works when I only use it to obtain the links for 2004-09-21, but I need to be able to move forward through the dates.
I thought wrapping the first for loop in another one that cycles through the dates would work. But the code as it stands only returns the 10 elements on the first page and doesn't even go through the other page numbers.
for (i in seq_along(days)){
for (pagenumber in seq(from = 1, to = 10, by = 1)){
link = paste("https://alrai.com/search?date-from=",(days[i]), "&pgno=",
pagenumber, sep = "")
page = read_html(link)
}
}
readlink <- read_html(link)
text_title <- readlink %>%
html_elements(".font-700") %>%
html_text2()
article_links <- readlink %>%
html_elements(".font-700") %>%
html_attr("href")

Scraping the first 5 pages with purrr::map_dfr (without loop).
library(tidyverse)
library(rvest)
scraper <- function(page) {
site <- str_c("https://alrai.com/search?date-from=2004-09-21&pgno=",
page) %>%
read_html()
tibble(title = site %>%
html_elements(".font-700") %>%
html_text2())
}
map_dfr(1:5, scraper)
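
To cover several dates as well as several pages, one option is to build every date/page combination first and then scrape URL by URL. This is a minimal sketch, assuming the URL pattern from the question and a fixed range of 10 pages per day; build_search_urls is a hypothetical helper.

```r
# Build every date/page combination up front.
# build_search_urls() is a hypothetical helper, not part of the original code.
build_search_urls <- function(days, pages) {
  # expand.grid() varies its first argument fastest: all pages of day 1,
  # then all pages of day 2, and so on.
  grid <- expand.grid(pgno = pages, day = as.character(days),
                      stringsAsFactors = FALSE)
  grid$url <- paste0("https://alrai.com/search?date-from=", grid$day,
                     "&pgno=", grid$pgno)
  grid
}

days <- seq(as.Date("2004-09-21"), as.Date("2004-09-23"), by = "day")
urls <- build_search_urls(days, 1:10)
print(head(urls$url, 3))
```

Each element of urls$url can then be handed to a one-page scraper like the one above, so the parsing happens once per URL instead of once after the loops have finished.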

R: Inferring a Common Merge Key

I am trying to webscrape a site to get addresses for a set of names (part A) along with the longitude and latitudes (part B). I don't know how to do this all together, so I did this in two parts:
# part A
library(tidyverse)
library(rvest)
library(httr)
library(XML)
# Define function to scrape 1 page
get_info <- function(page_n) {
cat("Scraping page ", page_n, "\n")
page <- paste0("https://www.mywebsite/",
page_n, "?extension") %>% read_html
tibble(title = page %>%
html_elements(".title a") %>%
html_text2(),
adress = page %>%
html_elements(".marker") %>%
html_text2(),
page = page_n)
}
# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)
# Check dimensions
dim(df_1)
[1] 90
Here is part B:
# Recognize pattern in websites
part1 = "https://www.mywebsite/"
part2 = c(0:55)
part3 = "extension"
temp = data.frame(part1, part2, part3)
# Create list of websites
temp$all_websites = paste0(temp$part1, temp$part2, temp$part3)
# Scrape
df_2 <- list()
for (i in 1:10)
{tryCatch({
url_i <-temp$all_websites[i]
page_i <-read_html(url_i)
b_i = page_i %>% html_nodes("head")
listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
df_2[[i]] <- listanswer_i
print(listanswer_i)
}, error = function(e){})
}
# Extract long/lat from results
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 = data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
df_2 = df_2 %>% filter(X1 != "LatLngBounds();")
> dim(df_2)
[1] 86 3
We can see that df_1 and df_2 have a different number of rows - but also, there is no common merge key between df_1 and df_2. How can I re-write my code in such a way that I can create a merge key between df_1 and df_2 such that I can merge the common records between these files together?

I am not sure multiple requests to the same URIs are needed. Some lat long values are not listed on either the results pages or the result-specific linked webpage, e.g. Toronto Beaches Dentist from the current page 2 results has no lat long shown on either page 2 or the website-specific page. In these cases, you may choose to fill the blanks using another service which returns lat long based on an address.
You can re-write your function and alter your regex patterns to produce 2 dataframes which can be joined and the resultant dataframe returned. With the appropriate regex changes, as given below, you can use the address column to join the 2 dataframes. I dislike a key which is an address but it does appear to be internally consistent across the result page. I have used a left join to return all rows from the dentist listings i.e. the practice business names.
library(tidyverse)
library(rvest)
urls <- sprintf("https://www.dentistsearch.ca/search-doctor/%i?category=0&services=0&province=55&city=&k=", 1:10)
pages <- lapply(urls, read_html)
get_dentist_info <- function(page) {
page_text <- page %>% html_text()
address_keys <- page_text %>%
str_match_all('marker_\\d+\\.set\\("content", "(.*?)"\\);') %>%
.[[1]] %>%
.[, 2]
lat_long <- page_text %>%
str_match_all("LatLng\\((.*)\\);(?![\\s\\S]+myOptions)") %>%
.[[1]] %>%
.[, 2]
lat_lon <- tibble(address = address_keys, lat_long = lat_long) %>%
separate(lat_long, into = c("lat", "long"), sep = ", ") %>%
mutate(lat = as.numeric(lat), long = as.numeric(long))
practice_info <- tibble(
title = page %>% html_elements(".title > a") %>% html_text(trim = TRUE),
address = page %>% html_elements(".marker") %>% html_text()
)
dentist_info <- left_join(practice_info, lat_lon, by = "address")
return(dentist_info)
}
all_dentist_info <- map_dfr(pages, get_dentist_info)
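
The effect of the left join on the address key can be seen on a small example. The sketch below uses base R merge() on invented toy data (the clinic names, addresses and coordinates are all hypothetical) to show that every listing is kept and that listings without coordinates end up with NA.

```r
# Hypothetical toy data standing in for the two scraped tables.
practice_info <- data.frame(
  title = c("Clinic A", "Clinic B", "Clinic C"),
  address = c("1 Main St", "2 Oak Ave", "3 Pine Rd"),
  stringsAsFactors = FALSE
)
lat_lon <- data.frame(
  address = c("1 Main St", "3 Pine Rd"),
  lat = c(43.65, 43.70),
  long = c(-79.38, -79.40),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every listing; unmatched addresses get NA coordinates.
joined <- merge(practice_info, lat_lon, by = "address", all.x = TRUE)
print(joined)

# Listings that would still need geocoding from another service.
missing_coords <- joined$title[is.na(joined$lat)]
print(missing_coords)
```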

Obtaining data from NCBI gene database with R

Rentrez package
I was exploring the rentrez package in RStudio (Version 1.1.442) on a lab computer running Linux (Ubuntu 20.04.2), following this manual.
However, when I later ran the same code on my laptop under Windows 8 Pro (RStudio 2021.09.0):
library (rentrez)
entrez_dbs()
entrez_db_searchable("gene")
#res <- entrez_search (db = "gene", term = "(Vibrio[Organism] OR vibrio[All Fields]) AND (16s[All Fields]) AND (rna[All Fields]) AND (owensii[All Fields] OR navarrensis[All Fields])", retmax = 500, use_history = TRUE)
I cannot get rid of this error, even after closing the session or reinstalling the rentrez package:
Error in curl::curl_fetch_memory(url, handle = handle) : schannel:
next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE
(0x80090326) - This error usually occurs when a fatal SSL/TLS alert is
received (e.g. handshake failed).
This is the main problem that I faced.
RSelenium package
Later I decided to access the pages containing details about the genes and their sequences in FASTA format by modifying code that I had used previously. It uses the rvest and RSelenium packages, and the results were perfect.
# Specifying a webpage
url <- "https://www.ncbi.nlm.nih.gov/gene/66940694" # the trailing number is the gene ID
library(rvest)
library(RSelenium)
# Opening a browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(url)
# Clicked outside in an empty space next to the FASTA button and copied a full xPath (redirecting to a FASTA data containing webpage)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of a webpage: left it from the old code for the case of a long gene
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Let's get gene FASTA, for example
page <- read_html(remDr$getPageSource()[[1]])
fasta <- page %>%
html_nodes('pre') %>%
html_text()
print(fasta)
Output: ">NZ_QKKR01000022.1:c3037-151 Vibrio paracholerae strain
2016V-1111 2016V-1111_ori_contig_18, whole genome shotgun
sequence\nGGT...
The code also worked well for obtaining other details about the gene, such as its accession number, position, organism, etc.
Looping of the process for several gene IDs
Later I tried to change the code to get the same information for several gene IDs at once, following the explanations I got here for another project of mine.
# Specifying a list of gene IDs
res_id <- c('57838769','61919208','66940694')
dt <- res_id # <lapply> looping function refused to work if an argument had a different name rather than <dt>
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
## Writing a function of GET_FASTA dependent on GENE_ID (x)
get_fasta <- function(x){
link = paste0('https://www.ncbi.nlm.nih.gov/gene/',x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
... there is a continuation below, but an error appeared here, saying that the same XPath, which had been used successfully before, could not be found.
Error: Summary: NoSuchElement Detail: An element could not be located
on the page using the given search parameters. class:
org.openqa.selenium.NoSuchElementException Further Details: run
errorDetails method
I tried deleting /a[2] so that the XPath ends in /html/.../p, as that had worked in the initial code, but then an error appeared again later on.
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of FASTA on the website
fasta <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('pre') %>%
html_text()
fasta
return(fasta)
}
## Writing a function of GET_ACC_NUM dependent on GENE_ID (x)
get_acc_num <- function(x){
link = paste0( 'https://www.ncbi.nlm.nih.gov/gene/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p')$clickElement()
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of ACC_NUM on the website
acc_num <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.itemid') %>%
html_text() %>%
str_sub(start= -17)
acc_num
return(acc_num)
}
## Collecting all FUNCTION into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
fasta <- get_fasta(x)
acc_num <- get_acc_num(x)
# Combine into a tibble
combined_data <- tibble( Acc_Number = acc_num,
FASTA = fasta)
}
## Running FUNCTION for all x
df <- lapply(dt, get_data_table)
head(df)
I also tried to write the code
only with rvest,
to write the loop with for (i in res_id) {},
to introduce two different xPaths ending with /html/.../p/a[2] or .../p using if () {} else {}
but the results were even more confusing.
I am learning R while working on tasks like this, so any suggestions and criticism are welcome.

The node pre is not a valid one. We have to look for the value inside a class, an id, etc.
webElem$sendKeysToElement(list(key = "end")): you don't need this command, as there is no need to scroll the page.
Below is code to get the gene sequences.
First we have to get the links to the gene sequences, which we do with rvest.
library(rvest)
library(xml2) # xml_child() and xml_attrs() come from xml2
library(dplyr)
res_id <- c('57838769','61919208','66940694')
link = vector()
for(i in res_id){
url = paste0('https://www.ncbi.nlm.nih.gov/gene/', i)
df = url %>%
read_html() %>%
html_node('.note-link')
link1 = xml_attrs(xml_child(df, 3))[["href"]]
link1 = paste0('https://www.ncbi.nlm.nih.gov', link1)
link = rbind(link, link1)
}
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_VARQ01000103.1?report=fasta&from=64&to=2616&strand=true"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true"
After obtaining the links, we get the gene sequences, which we do with RSelenium. I tried to do it with rvest but couldn't get the sequence.
Launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
Function to get the sequence
get_seq = function(link){
remDr$navigate(link)
Sys.sleep(5)
df = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes(xpath = '//*[@id="viewercontent1"]') %>%
html_text()
return(df)
}
df = lapply(link, get_seq)
Now we have list df with all the info.
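
Each element of df is the raw FASTA text, with the header line and the sequence in one string. A small helper can separate the two. This is a sketch under the assumption that each string holds a single record whose first line starts with ">"; split_fasta is a hypothetical name, and the demo record is invented.

```r
# Split a FASTA record held in one string into header and sequence.
# split_fasta() is a hypothetical helper; it assumes a single record.
split_fasta <- function(txt) {
  lines <- strsplit(trimws(txt), "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]            # drop empty lines
  stopifnot(startsWith(lines[1], ">"))     # a FASTA header starts with ">"
  list(
    header = sub("^>", "", lines[1]),
    sequence = paste(trimws(lines[-1]), collapse = "")
  )
}

# Invented short record; real sequences are much longer.
rec <- split_fasta(">NZ_EXAMPLE.1:1-12 toy record\nGGTTAC\nCATGCA\n")
print(rec$header)
print(nchar(rec$sequence))
```

Applied to the scraped results, this could look like lapply(df, function(x) split_fasta(x[[1]])), provided each page returned exactly one record.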

Web scraping in R with Selenium to click new pages

I am trying to visit the different pages of this dynamic website (https://es.gofundme.com/s?q=covid). My intention is to open each project listed by this search engine. There are 12 projects per page.
Once each of these projects has been opened and the desired information obtained (if I manage to get it), the script should continue to the next page. That is, after collecting the 12 projects on page 1, it should collect the 12 projects on page 2, and so on.
How can this be done? Any help would be much appreciated. Thanks!
This is my code:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of
library(purrr) # for 'map_chr' to get reply
library(tidyr) #extract_numeric(years)
library(stringr)
df_0<-data.frame(project=character(),
name=character(),
location=character(),
dates=character(),
objective=character(),
collected=character(),
donor=character(),
shares=character(),
follow=character(),
comments=character(),
category=character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) donors, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project=project,
name= gsub("\\Organizador.*","",info[7]),
location=str_remove(location[7], "Organizador"),
dates = dates,
collected = unlist(strsplit(money, " "))[1],
objective = unlist(strsplit(money, " "))[8],
donor = popularity[1],
shares = popularity[2],
follow = popularity[3],
comments = extract_numeric(comments),
category = category[17],
stringsAsFactors = F)

The page does a POST request that you can mimic/simplify. To keep dynamic you need to first grab an api key and application id from a source js file, then pass those in the subsequent POST request.
In the following I simply extract the urls from each request. I set the querystring for the POST to have the max of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function across the page numbers, extracting urls from the POST response for each; altering the page param.
You end up with a list of urls for all the projects you can then visit to extract info from; or, potentially make xmlhttp requests to.
N.B. The code could be refactored a little to tidy it up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- seq_len(max(num_pages - 1, 0)) # pages 1..(num_pages - 1); page 0 was already fetched above
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)

@David Perea, see this page for a comparison of scraping methods, including Selenium. The method proposed by QHarr is very good, but it doesn't use Selenium and also requires good knowledge of HTTP.

Scrape Web Page When Selector Does Not Update URL

I am trying to scrape this webpage (https://nc.211counts.org) for a given region and time ('Onslow', 'Yesterday' for example). I want to pull all of the information from that top left table (COVID, Housing, etc through Other). Unfortunately, the URL does not update when the filters are selected. I have been following the tutorial here but cannot find a way to pull in the position of the region names I need to scrape for. Since the html_nodes function is returning empty, I think there is something to the mapping that is off.
What am I missing here?
# docker run -d -p 4445:4444 selenium/standalone-chrome
# docker ps
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("https://nc.211counts.org")
remDr$screenshot(display = TRUE)
nc211 <- xml2::read_html(remDr$getPageSource()[[1]])
str(nc211)
body_nodes <- nc211 %>%
html_node('body') %>%
html_children()
body_nodes
body_nodes %>%
html_children()
rank <- nc211 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//span[contains(@class, 'col-lg-12 chosen-select')]") %>%
rvest::html_text()
# this returns empty
nc211 %>%
rvest::html_nodes("#region") %>%
rvest::html_children() %>%
rvest::html_text()
# guessing at an element number to see what happens
element<- remDr$findElement(using = 'css selector', "#region > option:nth-child(1)")
element$clickElement()

Content is dynamically updated through xhr POST requests when you make your selections and press Search. You can use the network tab to analyse these requests and reproduce them without resorting to selenium (as an alternative). You will need to pick up the param options from the initial page.
Below I show you how to make a request for a particular zipcode and also how to find out all the zip codes and their corresponding param ids to use in request. The latter needs to come from the initial url.
library(httr)
library(rvest)
data = list(
'id' = '{"ids":["315"]}', # zip 27006 is id 315 seen in value attribute of checkbox node
'timeIntervalId' = '18',
'centerId' = '7',
'type' = 'Z'
)
#post request that page makes using your filter selections e.g. zip code
r <- httr::POST(url = 'https://nc.211counts.org/dashBoard/barChart', body = data)
page <- read_html(r)
categories <- page %>% html_nodes(".categoriesDiv .toolTipSubCategory, #totalLabel") %>% html_text
colNodes <- page %>% html_nodes(".categoriesDiv .value")
percentages <- colNodes %>% html_attr('data-percentage')
counts <- colNodes %>% html_attr('data-value')
df <- as.data.frame(cbind(categories, percentages, counts))
print(df)
#Lookups e.g. zip codes. Taken from initial url
initial_page <- read_html('https://nc.211counts.org/')
ids <- initial_page %>% html_nodes('.zip [value]') %>% html_attr('value')
zips <- initial_page %>% html_nodes('.zip label') %>% html_text() %>% trimws()
print(ids[match('27006', zips)])
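
Once the ids are known, the request body can be generated instead of typed by hand. The sketch below builds the same list as in the request above; make_body is a hypothetical helper, the default values are simply the ones used in that request, and whether the endpoint accepts more than one id per call is an assumption based only on the array shape of the id field.

```r
# Build the form body for the barChart request for one or more ids.
# make_body() is a hypothetical helper; defaults mirror the request above.
make_body <- function(ids, time_interval_id = "18", center_id = "7", type = "Z") {
  quoted <- paste0('"', ids, '"', collapse = ",")
  list(
    id = paste0('{"ids":[', quoted, ']}'),   # JSON string, e.g. {"ids":["315"]}
    timeIntervalId = as.character(time_interval_id),
    centerId = as.character(center_id),
    type = type
  )
}

body_one <- make_body("315")
print(body_one$id)
```

The result can be passed as the body argument of httr::POST() in the same way as the hand-written list, for example inside lapply() over the looked-up ids.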
