How to parse addresses from a website by specifying a class in R?

I would like to parse addresses of all stores on the following website:
https://www.carrefour.fr/magasin/region/ looping through the regions, starting for example with the region "auvergne-rhone-alpes-84", hence the full url = https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84. Note that I can add more regions afterwards; I just want to make it work with one for now.
library(httr)
library(rvest)

carrefour <- "https://www.carrefour.fr/magasin/region/"
addresses_vector <- c()

for (current_region in c("auvergne-rhone-alpes-84")) {
  current_region_url <- paste(carrefour, current_region, "/", sep = "")
  x <- GET(url = current_region_url)
  html_doc <- read_html(x) %>%
    html_nodes("[class='ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2']")
  addresses_vector <- c(addresses_vector, html_doc %>%
    rvest::html_nodes("body") %>%
    xml2::xml_find_all(".//div[contains(@class, 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2')]") %>%
    rvest::html_text())
}
I also tried x %>% read_html() %>% rvest::html_nodes(xpath = "/html/body/main/div[1]/div/div[2]/div[2]/ol/li[1]/div/div[1]/div[2]/div[2]") %>% rvest::html_text() (copying the full XPath by hand), as well as x %>% read_html() %>% html_nodes("div.ds-body-text.ds-store-card__details--content.ds-body-text--size-m.ds-body-text--color-standard-2") %>% html_text(), and several other ways, but I always get character(0) back.
Any help is appreciated!

You could write a couple of custom functions to help, then use purrr to map the store-data function over inputs taken from the output of the first helper function.
First, extract the region urls, region names and region ids, and store these in a tibble. This is the first helper function, get_regions.
Then use another function, get_store_info, to extract the store info from each region url. That info sits in an attribute of a div tag; in the browser it is pulled out dynamically when JavaScript runs, but with rvest you can read the attribute and parse its JSON directly.
Apply the function that extracts the store info over the list of region urls and region ids.
If you use map2_dfr to pass both the region link and the region id to the store-data function, you then have the region id as a key for joining the map2_dfr result back to the region tibble generated earlier.
Then do some column cleaning, e.g. drop the columns you don't want.
library(rvest)
library(purrr)
library(dplyr)
library(stringr)
library(readr)
library(jsonlite)

get_regions <- function() {
  url <- "https://www.carrefour.fr/magasin"
  page <- read_html(url)
  regions <- page %>% html_nodes(".store-locator-footer-list__item > a")
  t <- tibble(
    region = regions %>% html_text(trim = TRUE),
    link = regions %>% html_attr("href") %>% url_absolute(url),
    region_id = NA_integer_
  ) %>%
    mutate(region_id = str_match(link, "-(\\d+)$")[, 2] %>% as.integer())
  return(t)
}
get_store_info <- function(region_url, r_id) {
  region_page <- read_html(region_url)
  store_data <- region_page %>%
    html_node("#store-locator") %>%
    html_attr(":context-stores") %>%
    parse_json(simplifyVector = TRUE) %>%
    as_tibble()
  store_data$region_id <- r_id
  return(store_data)
}
region_df <- get_regions()
store_df <- map2_dfr(region_df$link, region_df$region_id, get_store_info)
final_df <- inner_join(region_df, store_df, by = 'region_id') # now clean columns within this.
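As a possible final tidy-up (a sketch only: the exact field names inside the :context-stores JSON aren't shown here, so the column names below are assumptions), you could drop the helper columns and surface any address-like fields:

# Hedged cleaning example; any_of() silently skips columns that don't exist
final_df_clean <- final_df %>%
  select(-link) %>%
  select(region, region_id, any_of(c("address", "city", "postal_code")), everything())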

Related

R: Inferring a Common Merge Key

I am trying to web scrape a site to get addresses for a set of names (part A) along with the longitudes and latitudes (part B). I don't know how to do this all at once, so I did it in two parts:
# Part A
library(tidyverse)
library(rvest)
library(httr)
library(XML)

# Define function to scrape one page
get_info <- function(page_n) {
  cat("Scraping page ", page_n, "\n")
  page <- paste0("https://www.mywebsite/", page_n, "?extension") %>% read_html()
  tibble(
    title = page %>% html_elements(".title a") %>% html_text2(),
    adress = page %>% html_elements(".marker") %>% html_text2(),
    page = page_n
  )
}
# Apply function to pages 1:10
df_1 <- map_dfr(1:10, get_info)
# Check dimensions
dim(df_1)
[1] 90
Here is part B:
# Recognize pattern in websites
part1 <- "https://www.mywebsite/"
part2 <- c(0:55)
part3 <- "extension"
temp <- data.frame(part1, part2, part3)

# Create list of websites
temp$all_websites <- paste0(temp$part1, temp$part2, temp$part3)

# Scrape
df_2 <- list()
for (i in 1:10) {
  tryCatch({
    url_i <- temp$all_websites[i]
    page_i <- read_html(url_i)
    b_i <- page_i %>% html_nodes("head")
    listanswer_i <- b_i %>% html_text() %>% strsplit("\\n")
    df_2[[i]] <- listanswer_i
    print(listanswer_i)
  }, error = function(e) {})
}

# Extract long/lat from results
lat_long <- grep("LatLng", unlist(df_2[]), value = TRUE)
df_2 <- data.frame(str_match(lat_long, "LatLng(\\s*(.*?)\\s*);"))
df_2 <- df_2 %>% filter(X1 != "LatLngBounds();")
> dim(df_2)
[1] 86 3
We can see that df_1 and df_2 have a different number of rows - but also, there is no common merge key between df_1 and df_2. How can I re-write my code in such a way that I can create a merge key between df_1 and df_2 such that I can merge the common records between these files together?
I am not sure multiple requests to the same URIs are needed. There are some lat/long values not listed either on the results pages or on the result-specific linked webpage, e.g. Toronto Beaches Dentist from the current page 2 results has no lat/long shown on either page 2 or the website-specific page. In these cases, you may choose to fill in the blanks using another service which returns lat/long based on an address.
You can re-write your function and alter your regex patterns to produce two dataframes which can be joined, with the resulting dataframe returned. With the appropriate regex changes, as given below, you can use the address column to join the two dataframes. I dislike using an address as a key, but it does appear to be internally consistent across the result pages. I have used a left join to return all rows from the dentist listings, i.e. the practice business names.
library(tidyverse)
library(rvest)

urls <- sprintf("https://www.dentistsearch.ca/search-doctor/%i?category=0&services=0&province=55&city=&k=", 1:10)
pages <- lapply(urls, read_html)

get_dentist_info <- function(page) {
  page_text <- page %>% html_text()
  address_keys <- page_text %>%
    str_match_all('marker_\\d+\\.set\\("content", "(.*?)"\\);') %>%
    .[[1]] %>%
    .[, 2]
  lat_long <- page_text %>%
    str_match_all("LatLng\\((.*)\\);(?![\\s\\S]+myOptions)") %>%
    .[[1]] %>%
    .[, 2]
  lat_lon <- tibble(address = address_keys, lat_long = lat_long) %>%
    separate(lat_long, into = c("lat", "long"), sep = ", ") %>%
    mutate(lat = as.numeric(lat), long = as.numeric(long))
  practice_info <- tibble(
    title = page %>% html_elements(".title > a") %>% html_text(trim = TRUE),
    address = page %>% html_elements(".marker") %>% html_text()
  )
  dentist_info <- left_join(practice_info, lat_lon, by = "address")
  return(dentist_info)
}

all_dentist_info <- map_dfr(pages, get_dentist_info)
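As noted above, a few practices may still lack coordinates after the join. A small follow-up sketch (nothing assumed about which geocoding service you would use) to flag those rows:

# Rows still missing lat/long, candidates for a separate geocoding service
missing_coords <- all_dentist_info %>%
  filter(is.na(lat) | is.na(long))
missing_coords$address  # addresses to pass to a geocoder of your choice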

Web scraping in R with Selenium to click new pages

I am trying to visit the different pages of this dynamic website (https://es.gofundme.com/s?q=covid). In this search engine, my intention is to enter each project. There are 12 projects per page.
Once each of these projects has been entered and the desired information obtained (that is, if I manage to get it), the scraper should continue to the next page. That is, once it has obtained the 12 projects on page 1, it must obtain the 12 projects on page 2 and so on.
How can this be done? Any help is much appreciated. Thanks!
This is my code:
# Loading the rvest package
library(rvest)
library(magrittr)  # for the '%>%' pipe symbol
library(RSelenium) # to get the loaded html of the dynamic page
library(purrr)     # for 'map_chr' to get reply
library(tidyr)     # extract_numeric(years)
library(stringr)

df_0 <- data.frame(project = character(),
                   name = character(),
                   location = character(),
                   dates = character(),
                   objective = character(),
                   collected = character(),
                   donor = character(),
                   shares = character(),
                   follow = character(),
                   comments = character(),
                   category = character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project = project,
                          name = gsub("\\Organizador.*", "", info[7]),
                          location = str_remove(location[7], "Organizador"),
                          dates = dates,
                          collected = unlist(strsplit(money, " "))[1],
                          objective = unlist(strsplit(money, " "))[8],
                          donor = popularity[1],
                          shares = popularity[2],
                          follow = popularity[3],
                          comments = extract_numeric(comments),
                          category = category[17],
                          stringsAsFactors = FALSE)
The page does a POST request that you can mimic/simplify. To keep it dynamic, you need to first grab an API key and application id from a source js file, then pass those in the subsequent POST requests.
In the following I simply extract the urls from each request. I set the querystring for the POST to have the maximum of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function across the page numbers, extracting urls from the POST response for each and altering the page param each time.
You end up with a list of urls for all the projects, which you can then visit to extract info from, or potentially make xmlhttp requests to.
N.B. The code could be refactored a little as a tidy-up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x) {
  df <- map_dfr(x, .f = as_tibble) %>%
    select(c('url')) %>%
    unique() %>%
    mutate(url = paste0('https://es.gofundme.com/f/', url))
  return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][, 2]
api_key <- matches[[1]][, 3]
headers <- c(
  'User-Agent' = 'Mozilla/5.0',
  'content-type' = 'application/x-www-form-urlencoded',
  'Referer' = 'https://es.gofundme.com/'
)

params <- list(
  'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
  'x-algolia-api-key' = api_key,
  'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- seq_len(num_pages - 1) # remaining pages after the initial page 0 request
df2 <- map_dfr(pages, function(page_num) {
  data <- paste0(post_body, page_num, '"}]}')
  res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers = headers), query = params, body = data) %>% content()
  temp_df <- get_df(res$results[[1]]$hits)
})
df <- rbind(df, df2)
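As a follow-on sketch (not part of the original answer), you could then visit each collected project url with rvest and pull, for example, the campaign title using the ".a-campaign-title" selector taken from the question's own code; the selector and field choice are assumptions:

library(rvest)
library(purrr)

titles <- map_chr(df$url, function(u) {
  # read each project page and extract its title
  page <- read_html(u)
  page %>% html_node(".a-campaign-title") %>% html_text(trim = TRUE)
})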
@David Perea, see this page for differentiation of scraping methods, including Selenium. The method proposed by QHarr is very good, but doesn't use Selenium and also requires good knowledge of HTTP.

What should the "i"/category be in this for loop and how can I ensure it is in my working directory?

I am running a web-scraping project and running into some difficulty using the urls for search results from an initial scrape to scrape information from the search results themselves.
My first loop provides the back halves of the urls I need, i.e. the part after the / (for example, from yelp.com/abd I have abd), which I store in a nested list. However, when I collect that nested list, like so:
profile_url_lst <- list()
for (page_num in 1:73) {
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  profile_urls <- html_content %>% html_nodes("body") %>% html_children() %>% html_children() %>% .[2] %>% html_children() %>%
    html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>%
    html_attr("href")
  profile_url_lst[[page_num]] <- profile_urls
  Sys.sleep(2)
}
profile_url_lst
profiles <- cbind(profile_urls)
profiles
I only receive the urls from the last page of results.
I pasted the domain name onto those urls with paste0, which worked fine, but I then encounter another problem. When I use the variable name in a for loop, R returns "variable name is not in your working directory".
complete_urls <- paste0('https://www.theeroticreview.com', profiles)
complete <- cbind(complete_urls)
complete
TED_lst <- list()
for (complete_urls in 1:73) {
  html_content1 <- read_html('complete_urls')
  TED <- html_content1 %>% html_nodes("'") %>% html_text()
  TED_lst[i] <- TEDs
  Sys.sleep(2)
}
How do I paste the domain name to all the collected urls and bind them, and what should the category be in the for loop?
Assuming you intend to read_html from each url within complete_urls, you want to avoid overwriting that variable by using it as the loop variable, and avoid referencing it as a string literal. You could instead seq_along the items and index in. Here I print rather than read_html:
complete_urls <- c('A', 'B')
for (i in seq_along(complete_urls)) {
  print(complete_urls[[i]])
}
It is probably better to write a custom function to apply to each url and pass that into a tidyverse mapping function, or possibly something where you can take advantage of parallel/async running; a minimal sketch of that pattern follows.
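For instance (a sketch only: the ".review-text" selector and the tibble columns are placeholders, not taken from the site):

library(rvest)
library(purrr)

scrape_profile <- function(url) {
  page <- read_html(url)
  tibble::tibble(
    url = url,
    # ".review-text" is a hypothetical selector; replace it with the real one
    text = page %>% html_nodes(".review-text") %>% html_text() %>% paste(collapse = " ")
  )
}

TED_df <- map_dfr(complete_urls, scrape_profile)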

Scrape Web Page When Selector Does Not Update URL

I am trying to scrape this webpage (https://nc.211counts.org) for a given region and time ('Onslow', 'Yesterday', for example). I want to pull all of the information from the top-left table (COVID, Housing, etc. through Other). Unfortunately, the URL does not update when the filters are selected. I have been following the tutorial here but cannot find a way to pull in the position of the region names I need to scrape for. Since html_nodes is returning empty, I think there is something off in the mapping.
What am I missing here?
# docker run -d -p 4445:4444 selenium/standalone-chrome
# docker ps
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("https://nc.211counts.org")
remDr$screenshot(display = TRUE)
nc211 <- xml2::read_html(remDr$getPageSource()[[1]])
str(nc211)
body_nodes <- nc211 %>%
  html_node('body') %>%
  html_children()
body_nodes
body_nodes %>%
  html_children()
rank <- nc211 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//span[contains(@class, 'col-lg-12 chosen-select')]") %>%
  rvest::html_text()
# this returns empty
nc211 %>%
  rvest::html_nodes("#region") %>%
  rvest::html_children() %>%
  rvest::html_text()
# guessing at an element number to see what happens
element <- remDr$findElement(using = 'css selector', "#region > option:nth-child(1)")
element$clickElement()
Content is dynamically updated through xhr POST requests when you make your selections and press Search. You can use the browser's network tab to analyse these requests and reproduce them without resorting to Selenium (as an alternative). You will need to pick up the param options from the initial page.
Below I show how to make a request for a particular zip code, and also how to find out all the zip codes and their corresponding param ids to use in the request. The latter need to come from the initial url.
library(httr)
library(rvest)

data <- list(
  'id' = '{"ids":["315"]}', # zip 27006 is id 315, seen in the value attribute of the checkbox node
  'timeIntervalId' = '18',
  'centerId' = '7',
  'type' = 'Z'
)

# POST request that the page makes using your filter selections e.g. zip code
r <- httr::POST(url = 'https://nc.211counts.org/dashBoard/barChart', body = data)
page <- read_html(r)
categories <- page %>% html_nodes(".categoriesDiv .toolTipSubCategory, #totalLabel") %>% html_text()
colNodes <- page %>% html_nodes(".categoriesDiv .value")
percentages <- colNodes %>% html_attr('data-percentage')
counts <- colNodes %>% html_attr('data-value')
df <- as.data.frame(cbind(categories, percentages, counts))
print(df)
#Lookups e.g. zip codes. Taken from initial url
initial_page <- read_html('https://nc.211counts.org/')
ids <- initial_page %>% html_nodes('.zip [value]') %>% html_attr('value')
zips <- initial_page %>% html_nodes('.zip label') %>% html_text() %>% trimws()
print(ids[match('27006', zips)])
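If it helps, here is a small hedged wrapper (assuming the same endpoint and the other params shown above hold for any zip code) that combines the lookup with the POST:

get_zip_page <- function(zip, time_interval_id = '18', center_id = '7') {
  initial_page <- read_html('https://nc.211counts.org/')
  ids <- initial_page %>% html_nodes('.zip [value]') %>% html_attr('value')
  zips <- initial_page %>% html_nodes('.zip label') %>% html_text() %>% trimws()
  body <- list(
    'id' = sprintf('{"ids":["%s"]}', ids[match(zip, zips)]),
    'timeIntervalId' = time_interval_id,
    'centerId' = center_id,
    'type' = 'Z'
  )
  httr::POST('https://nc.211counts.org/dashBoard/barChart', body = body) %>% read_html()
}
# e.g. page <- get_zip_page('27006'), then re-use the node extraction shown above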

Cannot find numbers of pages of website in web scraping

I want to get the number of pages of a website. I tried to do it as in a tutorial, using this function:
get_last_page <- function(html) {
  pages_data <- html %>%
    # The '.' indicates the class
    html_nodes('.pagination-page') %>%
    # Extract the raw text as a list
    html_text()
  # The second to last of the buttons is the one
  pages_data[(length(pages_data) - 1)] %>%
    # Take the raw string
    unname() %>%
    # Convert to number
    as.numeric()
}
first_page <- read_html(url)
(latest_page_number <- get_last_page(first_page))
For the website
url <- 'http://www.trustpilot.com/review/www.amazon.com'
it works fine. When I tried it with
url <- 'https://energybase.ru/en/oil-gas-field/index'
I got integer(0).
I changed
html_nodes('.pagination-page')
to
html_nodes('.html_nodes('data-page')')
and it failed.
How can I change my code to make it work?
I think you have to go about this a little differently here.
The energybase.ru URL isn't organized quite the same way as the TrustPilot URL.
For our purposes here, we're interested in the fact that the last page has its own node .last. From there, you just have to extract the value of the data-page attribute and increment it by 1.
library("rvest")
library("magrittr")
url <- 'https://energybase.ru/en/oil-gas-field/index'
read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attr("data-page") %>% as.numeric()+1
# [1] 21
Edit: note, you can always intercept the piping at html_children() (by adding a %>% html_attrs() to it) to find out what attributes are available at your disposal there.
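For example, purely for inspection, that tip looks like this:

read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attrs()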
You could use the rel=last attribute=value node and extract the number from the href
library("rvest")
library("magrittr")
library("stringr")

pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
number_of_pages <- str_match_all(pg %>% html_node("[rel=last]") %>% html_attr("href"), 'page=(\\d+)')[[1]][, 2] %>% as.numeric()
Or, there are a number of ways you could calculate it, given that there are more pages than the pagination makes visible. One way is to get the total count from the appropriate li in the dropdown and divide it by the results-per-page count.
library(rvest)
library(magrittr)
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
total_sites <- strtoi(pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount'), base = 0L)
# or use: total_sites <- pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount') %>% as.numeric()
sites_per_page <- length(pg %>% html_nodes('.index-list-item'))
number_of_pages <- ceiling(total_sites/sites_per_page)