When I try this code, I get an error. I want to scrape the data from multiple pages. But when I try the script down below, I get an error that says:
Error in eval_tidy (xs [[j]], mask): object 'titles' not found
library(tidyverse)
library(rvest)
# function to scrape all elements also missing elements
scrape_css <- function(css, group, html_page) {
txt <- html_page %>%
html_nodes(group) %>%
lapply(
. %>%
html_nodes(css) %>%
html_text() %>%
ifelse(identical(., character(0)), NA, .)
) %>%
unlist()
return(txt)
}
# Get all elements from 1 page
get_one_page <- function(url) {
html <- read_html(url)
titles <- scrape_css(
".recipe-card_title__1oIb-",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
minutes <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(1)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
callories <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(2)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
}
return(tibble(titles = titles, minutes = minutes, callories = callories))
url <- ("https://www.ah.nl/allerhande/recepten/R-L1473207825981/suikerbewust")
appie <- get_one_page(url)
Two things:
return should be inside the curly braces at the end of the function
return(tibble(titles = titles, minutes = minutes, callories = callories))
}
You forgot to update the variable name such that you have
html <- read_html(url)
when you need
html_page <- read_html(url)
This does not per se answer #Laila's question, but I ran into the
Error in eval_tidy(xs[[j]], mask) : object '' not found
error, when I did sth similar to
tibble(
a = 1,
b = ,
c = NA
)
So basically, forgot to enter a value for a tibble column. But it is a rather uninformative error message. I am leaving this for the future me (and others) as a reference, as there are not many posts related to this error.
Related
I am trying to webscrape a site to get addresses for a set of names (part A) along with the longitude and latitudes (part B). I don't know how to do this all together, so I did this in two parts:
# part A
library(tidyverse)
library(rvest)
library(httr)
library(XML)
# Define function to scrape 1 page
get_info <- function(page_n) {
cat("Scraping page ", page_n, "\n")
page <- paste0("https://www.mywebsite/",
page_n, "?extension") %>% read_html
tibble(title = page %>%
html_elements(".title a") %>%
html_text2(),
adress = page %>%
html_elements(".marker") %>%
html_text2(),
page = page_n)
}
# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)
# Check dimensions
dim(df_1)
[1] 90
Here is part B:
# Recognize pattern in websites
part1 = "https://www.mywebsite/"
part2 = c(0:55)
part3 = "extension"
temp = data.frame(part1, part2, part3)
# Create list of websites
temp$all_websites = paste0(temp$part1, temp$part2, temp$part3)
# Scrape
df_2 <- list()
for (i in 1:10)
{tryCatch({
url_i <-temp$all_websites[i]
page_i <-read_html(url_i)
b_i = page_i %>% html_nodes("head")
listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
df_2[[i]] <- listanswer_i
print(listanswer_i)
}, error = function(e){})
}
# Extract long/lat from results
lat_long = grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 = data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
df_2 = df_2 %>% filter(X1 != "LatLngBounds();")
> dim(df_2)
[1] 86 3
We can see that df_1 and df_2 have a different number of rows - but also, there is no common merge key between df_1 and df_2. How can I re-write my code in such a way that I can create a merge key between df_1 and df_2 such that I can merge the common records between these files together?
I am not sure multiple requests to the same URIs are needed. There are some lat long values not listed either on the results pages or on the result specific linked webpage e.g.Toronto Beaches Dentist from current page 2 results has no lat long shown on either page 2 or the website specific page. In these cases, you may choose to fill the blanks using another service which returns lat long based on an address.
You can re-write your function and alter your regex patterns to produce 2 dataframes which can be joined and the resultant dataframe returned. With the appropriate regex changes, as given below, you can use the address column to join the 2 dataframes. I dislike a key which is an address but it does appear to be internally consistent across the result page. I have used a left join to return all rows from the dentist listings i.e. the practice business names.
library(tidyverse)
library(rvest)
urls <- sprintf("https://www.dentistsearch.ca/search-doctor/%i?category=0&services=0&province=55&city=&k=", 1:10)
pages <- lapply(urls, read_html)
get_dentist_info <- function(page) {
page_text <- page %>% html_text()
address_keys <- page_text %>%
str_match_all('marker_\\d+\\.set\\("content", "(.*?)"\\);') %>%
.[[1]] %>%
.[, 2]
lat_long <- page_text %>%
str_match_all("LatLng\\((.*)\\);(?![\\s\\S]+myOptions)") %>%
.[[1]] %>%
.[, 2]
lat_lon <- tibble(address = address_keys, lat_long = lat_long) %>%
separate(lat_long, into = c("lat", "long"), sep = ", ") %>%
mutate(lat = as.numeric(lat), long = as.numeric(long))
practice_info <- tibble(
title = page %>% html_elements(".title > a") %>% html_text(trim = T),
address = page %>% html_elements(".marker") %>% html_text()
)
dentist_info <- left_join(practice_info, lat_lon, by = "address")
return(dentist_info)
}
all_dentist_info <- map_dfr(pages, get_dentist_info)
I managed to scrape one page from a newspaper archive according to explanations here.
Now I am trying to automatise the process to access a list of pages by running one code.
Making a list of URLs was easy as the newspaper's archive has a similar pattern of links:
https://en.trend.az/archive/2021-XX-XX
The problem is with writing a loop to scrape such data as title, date, time, category. For simplicity, I tried to work only with article headlines from 2021-09-30 to 2021-10-02.
## Setting data frames
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
list_of_url <- character() # or str_c()
## Generating subpage list
for (i in format(seq(d1, d2, by="days"), format="%Y-%m-%d")) {
list_of_url[i] <- str_c ("https://en.trend.az", "/archive/", i)
# Launching browser
driver <- rsDriver(browser = c("firefox")) #Version 93.0 (64-bit)
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(list_of_url[i])
remDr0$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
page <- read_html(remDr$getPageSource()[[1]])
# Scraping article headlines
get_headline <- page %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
get_time <- str_sub(get_time, start= -5)
length(get_time)
}
}
In total length should have been 157+166+140=463. In fact, I did not manage to collect all data even from one page (length(get_time) = 126)
I considered that after the first set of commands in the loop, I obtained three remDr for the 3 dates specified, but they were not recognised later independently.
Because of that I tried to initiate a second loop inside the initial one before or after page <- by
for (remDr0 in remDr) {
page <- read_html(remDr0$getPageSource()[[1]])
# substituted all remDr-s below with remDr0
OR
page <- read_html(remDr$getPageSource()[[1]])
for (page0 in page)
# substituted all page-s below with page0
However, these attempts ended with different errors.
I would appreciate the help of specialists as it is my first time using R for such purposes.
Hope it will be possible to correct the existing loop that I made or maybe even suggest a shorter pathway, by making a function, for example.
Slight broadening for scraping multiple categories
library(RSelenium)
library(dplyr)
library(rvest)
Mention the date period
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt = seq(d1, d2, by="days")#contains the date sequence
#launch browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
### `get_headline` Function for newspaper headlines
get_headline = function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
headlines = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
headlines
return(headlines)
}
get_time Function for the time of publishing
get_time <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selector of time on the website
time <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-date') %>%
html_text() %>%
str_sub(start= -5)
time
return(time)
}
Numbering of all articles from one page/day
get_number <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of headlines on the website
headline <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
number <- seq(1:length(headline))
return(number)
}
Collection of all functions into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
headline <- get_headline(x)
time <- get_time(x)
headline_number <- get_number(x)
# Combine into a tibble
combined_data <- tibble(Num = headline_number,
Article = headline,
Time = time)
}
Used lapply to loop through all the dates in dt
df = lapply(dt, get_data_table)
I am new in web scraping with r and I am trying to get a daily updated object which is probably not text. The url is
here and I want to extract the daily situation table in the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with html and css so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in that case indicate "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes( source , '.cmp-gridStat__item-container' ) %>% html_node( '.number' ) %>% html_text() %>% toString()
We can convert the text obtained from Daily situation update using vroom package
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elemnts
The pout of html_elemnts (html_nodes) is,
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node)` is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see html_nodes returns all value associated with the nodes whereashtml_node only returns the first node. Thus, the former fetches you all the nodes which is really helpful.
html_text vs html_text2
The html_text2retains the breaks in strings usually \n and \b. These are helpful when working with strings.
More info is in rvest documentation,
https://cran.r-project.org/web/packages/rvest/rvest.pdf
There is probably a much more elegant way to do this efficiently, but when I need brute force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and lookahead regex to get the exact piece of data I need. It basically takes the form of "?<=text_right_before).+?(?=text_right_after)
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html<-content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to extract.
If you want as a DataFrame (resembling as per webpage) you can write a user defined function, with the help of pdftools, to reconstruct the table from the pdf. Bit more effort but as you already have other answers covering using rvest thought I'd have a look at this. I looked at tabularize but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without the need to parse the pdf publication I use e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
data <- pdf_text(filename)
text <- data[[1]] %>% strsplit(split = "\n{2,}")
web_data <- text[[1]][3:12]
df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
unlist() %>%
matrix(nrow = 10, ncol = 5, byrow = T) %>%
as_tibble()
colnames(df) <- text[[1]][2] %>%
strsplit(split = "\\s{2,}") %>%
map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
unlist()
title <- text[[1]][1] %>%
strsplit(split = "\n") %>%
unlist() %>%
tail(1) %>%
gsub("\\s+", " ", .) %>%
gsub(" TOTAL", "", .)
colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
colnames(df)[1] <- "Metric"
clean_col <- function(x) {
gsub("\\s+|,", "", x) %>% as.numeric()
}
clean_col2 <- function(x) {
gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
}
df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
Metric = clean_col2(Metric)
)
return(df)
}
View(get_latest_daily_df(filename))
Output:
Alternate:
If you simply want to pull items then process you could extract each column as an item in a list. Replace br elements such that the content within those end up in a comma separated list:
library(rvest)
library(magrittr)
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
html_elements("p") %>%
html_text(trim = T) %>%
gsub("\n\\s{2,}", " ", .)
%>%
stri_remove_empty())
Rentrez package
I was discovering rentrez package in RStudio (Version 1.1.442) on a lab computer in Linux (Ubuntu 20.04.2) according to this manual.
However, later when I wanted to run the same code on my laptop in Windows 8 Pro (RStudio 2021.09.0 )
library (rentrez)
entrez_dbs()
entrez_db_searchable("gene")
#res <- entrez_search (db = "gene", term = "(Vibrio[Organism] OR vibrio[All Fields]) AND (16s[All Fields]) AND (rna[All Fields]) AND (owensii[All Fields] OR navarrensis[All Fields])", retmax = 500, use_history = TRUE)
I can not get rid of this error, even after closing the session or reinstalling rentrez package
Error in curl::curl_fetch_memory(url, handle = handle) : schannel:
next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE
(0x80090326) - This error usually occurs when a fatal SSL/TLS alert is
received (e.g. handshake failed).
This is the main problem that I faced.
RSelenium package
Later I decided to address pages containing details about the genes and their sequences in FASTA format modifying a code that I have previously used. It uses rvest and rselenium packages and the results were perfect.
# Specifying a webpage
url <- "https://www.ncbi.nlm.nih.gov/gene/66940694" # the last 9 numbers is gene id
library(rvest)
library(RSelenium)
# Opening a browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(url)
# Clicked outside in an empty space next to the FASTA button and copied a full xPath (redirecting to a FASTA data containing webpage)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of a webpage: left it from the old code for the case of a long gene
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
# Let's get gene FASTA, for example
page <- read_html(remDr$getPageSource()[[1]])
fasta <- page %>%
html_nodes('pre') %>%
html_text()
print(fasta)
Output: ">NZ_QKKR01000022.1:c3037-151 Vibrio paracholerae strain
2016V-1111 2016V-1111_ori_contig_18, whole genome shotgun
sequence\nGGT...
The code worked well to obtain other details about the gene like its accession number, position, organism and etc.
Looping of the process for several gene IDs
Later I tried to change the code to get simultaneously the same information for several gene IDs following the explanations I got here for the other project of mine.
# Specifying a list of gene IDs
res_id <- c('57838769','61919208','66940694')
dt <- res_id # <lapply> looping function refused to work if an argument had a different name rather than <dt>
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
## Writing a function of GET_FASTA dependent on GENE_ID (x)
get_fasta <- function(x){
link = paste0('https://www.ncbi.nlm.nih.gov/gene/',x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
... there is a continuation below but an error was appearing here, saying that the same xPath, which was successfully used before, can not be found.
Error: Summary: NoSuchElement Detail: An element could not be located
on the page using the given search parameters. class:
org.openqa.selenium.NoSuchElementException Further Details: run
errorDetails method
I tried to delete /a[2] to get /html/.../p at the end of the xPath as it was working in the initial code, but an error was appearing later again.
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of FASTA on the website
fasta <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('pre') %>%
html_text()
fasta
return(fasta)
}
## Writing a function of GET_ACC_NUM dependent on GENE_ID (x)
get_acc_num <- function(x){
link = paste0( 'https://www.ncbi.nlm.nih.gov/gene/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p')$clickElement()
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of ACC_NUM on the website
acc_num <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.itemid') %>%
html_text() %>%
str_sub(start= -17)
acc_num
return(acc_num)
}
## Collecting all FUNCTION into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
fasta <- get_fasta(x)
acc_num <- get_acc_num(x)
# Combine into a tibble
combined_data <- tibble( Acc_Number = acc_num,
FASTA = fasta)
}
## Running FUNCTION for all x
df <- lapply(dt, get_data_table)
head(df)
I also tried to write the code
only with rvest,
to write the loop with for (i in res_id) {},
to introduce two different xPaths ending with /html/.../p/a[2] or .../p using if () {} else {}
but the results were even more confusing.
I am studying R coding while working on such tasks, so any suggestions and critics are welcome.
The node pre is not a valid one. We have to look for value inside class or 'id` etc.
webElem$sendKeysToElement(list(key = "end") you don't need this command as there is no necessity yo scroll the page.
Below is code to get you the sequence of genes.
First we have to get the links to sequence of genes which we do it by rvest
library(rvest)
library(dplyr)
res_id <- c('57838769','61919208','66940694')
link = vector()
for(i in res_id){
url = paste0('https://www.ncbi.nlm.nih.gov/gene/', i)
df = url %>%
read_html() %>%
html_node('.note-link')
link1 = xml_attrs(xml_child(df, 3))[["href"]]
link1 = paste0('https://www.ncbi.nlm.nih.gov', link1)
link = rbind(link, link1)
}
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_VARQ01000103.1?report=fasta&from=64&to=2616&strand=true"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true"
After obtaining the links we shall get the sequence of genes which we do it by RSelenium. I tried to do it with rvest but couldn't get the sequence.
Launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
Function to get the sequence
get_seq = function(link){
remDr$navigate(link)
Sys.sleep(5)
df = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes(xpath = '//*[#id="viewercontent1"]') %>%
html_text()
return(df)
}
df = lapply(link, get_seq)
Now we have list df with all the info.
I am trying to enter the different pages of this dynamic web (https://es.gofundme.com/s?q=covid). In this search engine, my intention is to enter each project. There are 12 projects per page.
Once you have entered each of these projects and have obtained the desired information (that is, if I get it), I want you to continue to the next page. That is, once you have obtained the 12 projects on page 1, you must obtain the 12 projects on page 2 and so on.
How can it be done? You help me a lot. Thanks!
This is my code:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of
library(purrr) # for 'map_chr' to get reply
library(tidyr) #extract_numeric(years)
library(stringr)
df_0<-data.frame(project=character(),
name=character(),
location=character(),
dates=character(),
objective=character(),
collected=character(),
donor=character(),
shares=character(),
follow=character(),
comments=character(),
category=character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project=project,
name= gsub("\\Organizador.*","",info[7]),
location=str_remove(location[7], "Organizador"),
dates = dates,
collected = unlist(strsplit(money, " "))[1],
objective = unlist(strsplit(money, " "))[8],
donor = popularity[1],
shares = popularity[2],
follow = popularity[3],
comments = extract_numeric(comments),
category = category[17],
stringsAsFactors = F)
The page does a POST request that you can mimic/simplify. To keep dynamic you need to first grab an api key and application id from a source js file, then pass those in the subsequent POST request.
In the following I simply extract the urls from each request. I set the querystring for the POST to have the max of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function across the page numbers, extracting urls from the POST response for each; altering the page param.
You end up with a list of urls for all the projects you can then visit to extract info from; or, potentially make xmlhttp requests to.
N.B. Code can be re-factored a little as tidy up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- c(1:num_pages-1)
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)
#David Perea, see this page for differentiation of scraping methods, including Selenium. The method proposed by QHarr is very good, but doesn't use Selenium and also requires good knowledge of HTTP.