rvest how to get last page number in r language - html

I'm learning web scraping and want to create an example for myself.
https://www.goodreads.com/search?page=1&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93
I want to scrape last page number which is 100 by using above url. I tried several different codes, but they are not working well.
url %>%
read_html(x) %>%
html_nodes('div.leftContainer') %>%
html_nodes('a[href^="/search?page=100&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93"]') %>%
html_text()
I used html_nodes to get text '100' but it failed. I want to use length() and as.integer() to get the number.
I would like to know how to get the value of last page number.

You should be able to use nth-last-of-type to get penultimate href containing page
library(rvest)
url <- 'https://www.goodreads.com/search?page=1&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93'
last_page <- read_html(url) %>% html_node('[href*=page]:nth-last-child(2)') %>% html_text() %>% as.integer()

Below another possible solution:
library(RSelenium)
remDr <- rsDriver(port=4555L,browser = "firefox")
remoteDriver<- remDr[["client"]]
url <- "https://www.goodreads.com/search?page=1&qid=ckDrIeoJ2c&query=harry+potter&tab=books&utf8=%E2%9C%93"
remoteDriver$navigate(url)
#gets the last number of page
last_page<-remoteDriver$findElement(using = 'xpath', value = '/html/body/div[2]/div[3]/div[1]/div[2]/div[2]/div[3]/div/a[10]')$getElementText()
print(last_page)
[[1]]
[1] "100"

Related

What should the "i"/category be in this for loop and how can I ensure it is in my working directory?

I am running a web-scraping project and running into some difficulty using the urls for search results from an initial scrape to scrape information from the search results themselves.
My first loop provides the back halves of the urls I need, after the / (for example, yelp.com/abd - I have abd), which I have in a nested list. However, when I summarize that nested list, like so:
profile_url_lst <- list()
for(page_num in 1:73){
main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
html_content <- read_html(main_url)
profile_urls <- html_content %>% html_nodes("body")%>% html_children() %>% html_children() %>% .[2] %>% html_children() %>%
html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>%
html_attr("href")
profile_url_lst[[page_num]] <- profile_urls
Sys.sleep(2)
}
profile_url_lst
profiles <- cbind(profile_urls)
profiles
I only receive the urls from the last page of results.
I pasted the domain name to those urls with paste0, which worked fine, but I then encounter another problem. When I use the variable name in a for loop, R returns "variable name is not in your working directory).
complete_urls <- paste0('https://www.theeroticreview.com', profiles)
complete <- cbind(complete_urls)
complete
TED_lst <- list()
for(complete_urls in 1:73) {
html_content1 <- read_html('complete_urls')
TED <- html_content1 %>% html_nodes("'") %>% html_text()
TED_lst[i] <- TEDs
Sys.sleep(2)
How do I paste the domain name to all the collected urls and bind them, and what should the category be in the for loop?
Assuming you intend to read_html from each url within complete_urls you want to avoid overwriting that variable by using it as the loop variable; as well as referencing it as a string literal. You could instead seq_along the items and index in. Here I print rather than read_html
complete_urls <- c('A', 'B')
for(i in seq_along(complete_urls)){
print(complete_urls[[i]])
}
It is probably better to write a custom function to apply to each url and pass that into a tidyverse function/possibly something where you can take advantage of parallel|async running.

Scrape Web Page When Selector Does Not Update URL

I am trying to scrape this webpage (https://nc.211counts.org) for a given region and time ('Onslow', 'Yesterday' for example). I want to pull all of the information from that top left table (COVID, Housing, etc through Other). Unfortunately, the URL does not update when the filters are selected. I have been following the tutorial here but cannot find a way to pull in the position of the region names I need to scrape for. Since the html_nodes function is returning empty, I think there is something to the mapping that is off.
What am I missing here?
# docker run -d -p 4445:4444 selenium/standalone-chrome
# docker ps
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("https://nc.211counts.org")
remDr$screenshot(display = TRUE)
nc211 <- xml2::read_html(remDr$getPageSource()[[1]])
str(nc211)
body_nodes <- nc211 %>%
html_node('body') %>%
html_children()
body_nodes
body_nodes %>%
html_children()
rank <- nc211 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//span[contains(#class, 'col-lg-12 chosen-select')]") %>%
rvest::html_text()
# this returns empty
nc211 %>%
rvest::html_nodes("#region") %>%
rvest::html_children() %>%
rvest::html_text()
# guessing at an element number to see what happens
element<- remDr$findElement(using = 'css selector', "#region > option:nth-child(1)")
element$clickElement()
Content is dynamically updated through xhr POST requests when you make your selections and press Search. You can use the network tab to analyse these requests and reproduce them without resorting to selenium (as an alternative). You will need to pick up the param options from the initial page.
Below I show you how to make a request for a particular zipcode and also how to find out all the zip codes and their corresponding param ids to use in request. The latter needs to come from the initial url.
library(httr)
library(rvest)
data = list(
'id' = '{"ids":["315"]}', # zip 27006 is id 315 seen in value attribute of checkbox node
'timeIntervalId' = '18',
'centerId' = '7',
'type' = 'Z'
)
#post request that page makes using your filter selections e.g. zip code
r <- httr::POST(url = 'https://nc.211counts.org/dashBoard/barChart', body = data)
page <- read_html(r)
categories <- page %>% html_nodes(".categoriesDiv .toolTipSubCategory, #totalLabel") %>% html_text
colNodes <- page %>% html_nodes(".categoriesDiv .value")
percentages <- colNodes %>% html_attr('data-percentage')
counts <- colNodes %>% html_attr('data-value')
df <- as.data.frame(cbind(categories, percentages, counts))
print(df)
#Lookups e.g. zip codes. Taken from initial url
initial_page <- read_html('https://nc.211counts.org/')
ids <- initial_page %>% html_nodes('.zip [value]') %>% html_attr('value')
zips <- initial_page %>% html_nodes('.zip label') %>% html_text() %>% trimws()
print(ids[match('27006', zips)])

Cannot find numbers of pages of website in web scraping

I want to take number of pages from web site. I try to do it like on tutorial. I used this function:
get_last_page <- function(html){
pages_data <- html %>%
# The '.' indicates the class
html_nodes('.pagination-page') %>%
# Extract the raw text as a list
html_text()
# The second to last of the buttons is the one
pages_data[(length(pages_data)-1)] %>%
# Take the raw string
unname() %>%
# Convert to number
as.numeric()
}
first_page <- read_html(url)
(latest_page_number <- get_last_page(first_page))
for website
url <-'http://www.trustpilot.com/review/www.amazon.com'
it works fine.When I tried to do it with
url <-'https://energybase.ru/en/oil-gas-field/index'
I got integer(0).
I change
html_nodes('.pagination-page')
to
html_nodes('.html_nodes('data-page')')
And failed.
How can I change my code to make it works fine?
I think you have to go about this a little differently here.
The energybase.ru URL isn't organized quite the same way as the TrustPilot URL.
For our purposes here, we're interested in the fact that the last page has its own node .last. From there, you just have to extract the value of the data-page attribute and increment it by 1.
library("rvest")
library("magrittr")
url <- 'https://energybase.ru/en/oil-gas-field/index'
read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attr("data-page") %>% as.numeric()+1
# [1] 21
Edit: note, you can always intercept the piping at html_children() (by adding a %>% html_attrs() to it) to find out what attributes are available at your disposal there.
You could use the rel=last attribute=value node and extract the number from the href
library("rvest")
library("magrittr")
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
number_of_pages <- str_match_all(pg %>% html_node("[rel=last]") %>% html_attr("href"),'page=(\\d+)')[[1]][,2] %>% as.numeric()
Or, there are a number of ways you could calculate it given that there are more pages than pagination visibile. One way is to get the total count from the appropriate li in the drop down and divide by the results per page count.
library(rvest)
library(magrittr)
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
total_sites <- strtoi(pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount'), base = 0L)
# or use: total_sites <- pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount') %>% as.numeric()
sites_per_page <- length(pg %>% html_nodes('.index-list-item'))
number_of_pages <- ceiling(total_sites/sites_per_page)

HTML content not showing when using html_nodes from rvest

I'm trying to get a specific number in webpages of: https://ideas.repec.org/. More specifically, I'm looking for the number of search results like this:IDEAS' search results
However, when I'm applying the following code, I get an empty string:
library(rvest)
x <- GET("https://ideas.repec.org/cgi-bin/htsearch?form=extended&wm=wrd&dt=range&ul=&q=labor&cmd=Search%21&wf=4BFF&s=R&db=01%2F01%2F1950&de=31%2F12%2F1950")
webpage <- read_html(x)
hits_html <- html_nodes(webpage, xpath = '//*[#id="content-block"]/p')
hits <- html_text(hits_html)
hits
[1] ""
You could regex it out from the appropriate node. This does assume a constant before and after string and case. You could make also case insensitive with (?i)found\\s+(\\d+)\\s+results.
library(rvest)
library(stringr)
page = read_html("https://ideas.repec.org/cgi-bin/htsearch?form=extended&wm=wrd&dt=range&ul=&q=labor&cmd=Search%21&wf=4BFF&s=R&db=01%2F01%2F1950&de=31%2F12%2F1950")
r = page %>% html_node("#content-block") %>% html_text() %>%toString()
x <- str_match_all(r,'Found\\s+(\\d+)\\s+results')
print(x[[1]][,2])

More efficient way to scrape ratings values from TripAdvisor using R

I am trying to scrape the ratings from TripAdvisor. So far, I have managed to extract the HTML nodes, turn them into character strings, extract the string that represents the numeric I need then converted it to the correct number, finally dividing it by 10 to get the correct value it is demonstrating.
library(rvest)
url <- "https://www.tripadvisor.co.uk/Attraction_Review-g1466790-d547811-Reviews-Royal_Botanic_Gardens_Kew-Kew_Richmond_upon_Thames_Greater_London_England.html"
ratings_too_big <- url %>%
read_html() %>%
html_nodes("#REVIEWS .ui_bubble_rating") %>%
as.character() %>%
substr(38,39) %>%
as.numeric()
ratings_too_big/10
This is without doubt very messy - what's a cleaner, more efficient way to do this? I have also tried Hadley Wickham's example shown here:
library(rvest)
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"
reviews <- url %>%
read_html() %>%
html_nodes("#REVIEWS .innerBubble")
rating <- reviews %>%
html_node(".rating .rating_s_fill") %>%
html_attr("alt") %>%
gsub(" of 5 stars", "", .) %>%
as.integer()
This was not successful, as no data was returned (there appears to be nothing in the HTML node ".rating .rating_s_fill"). I am new scraping and css identifiers, so apologies if the answer is obvious.