(R) Webscraping Error: arguments imply differing number of rows: 1, 0

I am working with the R programming language.
In a previous question (R: Webscraping Pizza Shops - "read_html" not working?), I learned how to scrape the names and addresses of pizza stores from YellowPages (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada). Here is the code for scraping a single page:
library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>%
    read_html()

  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
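For a single page this can be called directly, for example:

pizza_page <- scraper("https://www.yellowpages.ca/search/si/2/pizza/Canada")
head(pizza_page)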
I then tried to make a LOOP that will repeat this for all 391 pages:
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results = list()
for (i in 1:391)
{
url_i = paste0(a,i,b)
s_i = data.frame(scraper(url_i))
ss_i = data.frame(i,s_i)
print(ss_i)
list_results[[i]] <- ss_i
}
final = do.call(rbind.data.frame, list_results)
My Problem: I noticed that after the 60th page, I get the following error:
Error in data.frame(i, s_i) :
arguments imply differing number of rows: 1, 0
In addition: Warning message:
In for (i in seq_along(specs)) { :
closing unused connection
To investigate, I went to the 60th page (https://www.yellowpages.ca/search/si/60/pizza/Canada) and noticed that you cannot click beyond this page.
My Question: Is there something I can do differently to try and move past the 60th page, or is there some internal limitation within YellowPages that is preventing me from scraping further?
Thanks!

This is a limit on Yellow Pages that prevents you from continuing to the next page. A solution is to assign the return value of scraper, check its number of rows, and if it is 0, break out of the for loop.
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results <- list()
for (i in 1:391) {
url_i = paste0(a,i,b)
s <- scraper(url_i, i)
message(paste("page number:", i, "\trows:", nrow(s)))
if(nrow(s) > 0L) {
s_i <- as.data.frame(s)
ss_i <- data.frame(i, s_i)
} else {
message("empty page, bailing out...")
break
}
list_results[[i]] <- ss_i
}
final <- do.call(rbind.data.frame, list_results)
dim(final)
# [1] 2100 3
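For context, the original error comes from data.frame(i, s_i): once Yellow Pages stops returning listings, s_i has 0 rows and cannot be combined with the length-1 i. A slightly more compact variant of the same early-exit idea (an untested sketch reusing scraper(), a and b from above) lets dplyr::bind_rows() do the binding:

library(dplyr)

list_results <- list()
for (i in 1:391) {
  s <- scraper(paste0(a, i, b))
  if (nrow(s) == 0) break            # no listings returned: assume we are past the last page
  list_results[[i]] <- mutate(s, page = i)
}
final <- bind_rows(list_results)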

Related

R: Webscraping double loop does not go through the dates

I am web scraping a website in Jordan. The first page I'm scraping is https://alrai.com/search?date-from=2004-09-21&pgno=1.
I'm trying to make R run through each date and then each nested link that takes you to the other pages (pgno=1,2,3 etc.). The for loop works when I only use it to obtain the links for 2004-09-21, but I need to be able to move up in dates.
I thought wrapping another for loop around the first one to cycle through the dates would work, but the code as it is only returns the 10 elements on the first page and doesn't even go through the other page numbers.
for (i in seq_along(days)) {
  for (pagenumber in seq(from = 1, to = 10, by = 1)) {
    link = paste("https://alrai.com/search?date-from=", (days[i]), "&pgno=",
                 pagenumber, sep = "")
    page = read_html(link)
  }
}

readlink <- read_html(link)

text_title <- readlink %>%
  html_elements(".font-700") %>%
  html_text2()

article_links <- readlink %>%
  html_elements(".font-700") %>%
  html_attr("href")
Scraping the first 5 pages with purrr::map_dfr (without a loop):
library(tidyverse)
library(rvest)

scraper <- function(page) {
  site <- str_c("https://alrai.com/search?date-from=2004-09-21&pgno=",
                page) %>%
    read_html()

  tibble(title = site %>%
           html_elements(".font-700") %>%
           html_text2())
}

map_dfr(1:5, scraper)
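If the dates also need to vary, one option (an untested sketch; the days vector of date strings is assumed to exist, as in the question) is to build every date/page combination first and then map over both columns:

library(tidyverse)
library(rvest)

scrape_date_page <- function(date, page) {
  site <- str_c("https://alrai.com/search?date-from=", date,
                "&pgno=", page) %>%
    read_html()

  tibble(
    date  = date,
    page  = page,
    title = site %>% html_elements(".font-700") %>% html_text2()
  )
}

days    <- c("2004-09-21", "2004-09-22")         # example dates
combos  <- expand_grid(date = days, page = 1:10) # every date/page pair
results <- map2_dfr(combos$date, combos$page, scrape_date_page)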

Error in eval_tidy(xs[[j]], mask): object 'titles' not found

I want to scrape data from multiple pages, but when I run the script below, I get an error that says:
Error in eval_tidy(xs[[j]], mask) : object 'titles' not found
library(tidyverse)
library(rvest)

# function to scrape all elements, including missing ones
scrape_css <- function(css, group, html_page) {
  txt <- html_page %>%
    html_nodes(group) %>%
    lapply(
      . %>%
        html_nodes(css) %>%
        html_text() %>%
        ifelse(identical(., character(0)), NA, .)
    ) %>%
    unlist()
  return(txt)
}
# Get all elements from 1 page
get_one_page <- function(url) {
  html <- read_html(url)
  titles <- scrape_css(
    ".recipe-card_title__1oIb-",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  minutes <- scrape_css(
    ".recipe-card-properties_property__2tGuH:nth-child(1)",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  callories <- scrape_css(
    ".recipe-card-properties_property__2tGuH:nth-child(2)",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
}
return(tibble(titles = titles, minutes = minutes, callories = callories))
url <- ("https://www.ah.nl/allerhande/recepten/R-L1473207825981/suikerbewust")
appie <- get_one_page(url)
Two things:
1. return should be inside the curly braces, at the end of the function:
  return(tibble(titles = titles, minutes = minutes, callories = callories))
}
2. You forgot to update the variable name, so you have
html <- read_html(url)
when you need
html_page <- read_html(url)
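Putting both fixes together, get_one_page() would look roughly like this (same selectors as in the question):

get_one_page <- function(url) {
  html_page <- read_html(url)   # name now matches the argument used below

  titles <- scrape_css(
    ".recipe-card_title__1oIb-",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  minutes <- scrape_css(
    ".recipe-card-properties_property__2tGuH:nth-child(1)",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )
  callories <- scrape_css(
    ".recipe-card-properties_property__2tGuH:nth-child(2)",
    ".recipe-grid-lane_recipeCardColumn__2ILMo",
    html_page
  )

  # return() is now inside the function body
  return(tibble(titles = titles, minutes = minutes, callories = callories))
}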
This does not per se answer @Laila's question, but I ran into the
Error in eval_tidy(xs[[j]], mask) : object '' not found
error when I did something similar to:
tibble(
  a = 1,
  b = ,
  c = NA
)
So basically, I forgot to enter a value for a tibble column, and the error message is rather uninformative. I am leaving this here as a reference for future me (and others), as there are not many posts related to this error.

Inconsistent error in RStudio in Web-scraping script: "arguments imply differing number of rows: 31, 30"

(edited to make question and problem clearer)
I am running a script (R) to scrape Goodreads reviews. Recently, I have been getting the following error message for some of the pages I'm trying to scrape:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 31, 30
The numbers at the end may change, e.g. 30, 33.
What seems strange to me is that the error is not constant: it only occurs for some of the pages I'm trying to scrape, although the script itself remains the same. Example: scraping the reviews of The Handmaid's Tale (https://www.goodreads.com/book/show/38447.The_Handmaid_s_Tale?ac=1&from_search=true&qid=ZGrzc7AfLN&rank=1) causes an error (32, 30), but scraping the reviews of Typhoon Kingdom (https://www.goodreads.com/book/show/52391186-typhoon-kingdom) causes no problems.
Full script:
library(rJava)
library(data.table)
library(dplyr)
library(magrittr)
library(rvest)
library(RSelenium)
library(lubridate)
library(stringr)
library(purrr)
options(stringsAsFactors = F) #needed to prevent errors when merging data frames
#Paste the Goodreads Url
url <- "https://www.goodreads.com/book/show/6101138-wolf-hall"
englishOnly = F #If FALSE, all languages are chosen
#Set your browser settings
#Do NOT use Firefox!
rD <- rsDriver(browser = "chrome", chromever = "83.0.4103.39")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)
bookTitle = unlist(remDr$getTitle())
finalData = data.frame()
# Main loop going through the website pages
morePages = T
pageNumber = 1
while (morePages) {

  #Select reviews in correct language
  selectLanguage = if (englishOnly) {
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[@value='']")
  } else {
    selectLanguage = remDr$findElement("xpath", "//select[@id='language_code']/option[1]")
  }
  selectLanguage$clickElement()
  Sys.sleep(3)

  #Expand all reviews
  expandMore <- remDr$findElements("link text", "...more")
  sapply(expandMore, function(x) x$clickElement())

  #Extracting the reviews from the page
  reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
  reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
  reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()})
  reviews.text <- unlist(reviews.list)

  #Get the review IDs from all the links
  reviewId = reviews.html %>% str_extract("/review/show/\\d+")
  reviewId = reviewId[!is.na(reviewId)] %>% str_extract("\\d+")

  #Some reviews have only a rating and no text, so we process them separately
  onlyRating = unlist(map(1:length(reviews.text), function(i) str_detect(reviews.text[i], "^\\\n\\\n")))

  #Full reviews
  if (sum(!onlyRating) > 0) {
    filterData = reviews.text[!onlyRating]
    fullReviews = purrr::map_df(seq(1, length(filterData), by = 2), function(i) {
      review = unlist(strsplit(filterData[i], "\n"))
      data.frame(
        date = mdy(review[2]),          #date
        username = str_trim(review[5]), #user
        rating = str_trim(review[9]),   #overall
        comment = str_trim(review[12])  #comment
      )
    })

    #Add review text to full reviews
    fullReviews$review = unlist(purrr::map(seq(2, length(filterData), by = 2), function(i) {
      str_trim(str_remove(filterData[i], "\\s*\\n\\s*\\(less\\)"))
    }))
  } else {
    fullReviews = data.frame()
  }

  #Partial reviews (only rating)
  if (sum(onlyRating) > 0) {
    filterData = reviews.text[onlyRating]
    partialReviews = purrr::map_df(1:length(filterData), function(i) {
      review = unlist(strsplit(filterData[i], "\n"))
      data.frame(
        date = mdy(review[9]),          #date
        username = str_trim(review[4]), #user
        rating = str_trim(review[8]),   #overall
        comment = "",
        review = ""
      )
    })
  } else {
    partialReviews = data.frame()
  }

  finalData = rbind(finalData, cbind(reviewId, rbind(fullReviews, partialReviews)))

  #Go to next page if possible
  nextPage = remDr$findElements("xpath", "//a[@class='next_page']")
  if (length(nextPage) > 0) {
    message(paste("PAGE", pageNumber, "Processed - Going to next"))
    nextPage[[1]]$clickElement()
    pageNumber = pageNumber + 1
    Sys.sleep(2)
  } else {
    message(paste("PAGE", pageNumber, "Processed - Last page"))
    morePages = FALSE
  }
}
#end of the main loop
#Replace missing ratings by 'not rated'
finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)
#Stop server
rD[["server"]]$stop()
#Write results
write.csv(finalData, paste0(bookTitle, ".csv"), row.names = F)
message("FINISHED!")
I've removed some parts of the code to find out where the problem comes from and it seems to me that it must be caused by this piece of code that extracts the review-IDs:
#Get the review ID's from all the links
reviewId = reviews.html %>% str_extract("/review/show/\\d+")
reviewId = reviewId[!is.na(reviewId)] %>% str_extract("\\d+")
When I remove this piece of code and change finalData = rbind(finalData, cbind(reviewId, rbind(fullReviews, partialReviews))) to finalData = rbind(finalData, fullReviews, partialReviews), the script runs without problems and without causing any errors. However, I really need to be able to extract these review-IDs to properly anonymise my data, so leaving it out is not an option.
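For illustration, the mismatch is easy to reproduce with made-up values: cbind() fails as soon as the vector of IDs is not the same length as the number of review rows.

ids  <- c("101", "102", "103")        # 3 review IDs extracted from the links
revs <- data.frame(rating = c(4, 5))  # but only 2 parsed reviews
cbind(ids, revs)
# Error in data.frame(..., check.names = FALSE) :
#   arguments imply differing number of rows: 3, 2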
I've tried to replace that part of the code with this, as it should also be able to scrape the review ID (but please correct me if I'm wrong):
#Get the review ID's from all the links
reviewId = reviews.html %>% str_extract("review_\\d+")
reviewId = reviewId[!is.na(reviewId)] %>% str_extract("\\d+")
This did not solve the problem and caused the same error, though with two differences: (1) the error has completely different numbers (Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 30), and (2) the error now occurs for every single URL instead of only some, so I've actually managed to make it worse.
As I don't have a lot of experience working with R (or scripts in general), this is about as far as my knowledge and solution skills stretch. I'm especially confused because the error only occurs for some URLs and not for the others. If you want to try it, you can simply run the full script as it is (it has a url for a book that caused an error). There is no need to change anything for a test run, except perhaps your chromever.
Does anyone know what causes this error and how it might be solved? Concrete steps would be very appreciated. Thank you!

R - Issue with the DOM of the Danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process and uploads all legislative documents to its website. I've been crawling over all pages starting from 2008. Right now I'm parsing the information into a dataframe, and I'm having an issue that I have not been able to resolve so far.
If we look at the DOM, we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my dataframe, I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my patterns match more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
library(RCurl)
library(rvest)
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim = TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag",
             "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
Maybe you can help me with that.
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" objects are related to the text. I was able to get the text of the webpage with the following code, which also identifies the position of the first "regex hit" of each pattern to match.
library(pagedown)
library(pdftools)
library(stringr)

pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
                       "C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)

list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex;
  # to locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
Here is another approach:
library(RDCOMClient)
library(stringr)
library(rvest)

url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"

IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)

doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)

list_Position <- list()
list_Text <- list()

for (i in 1:nb_Tomatch) {
  # Locates the first hit of the regex;
  # to locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
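If the goal is only to keep the first match, a plain rvest approach (no PDF or COM detour) could also work; this is an untested sketch built from the selectors and patterns in the question:

library(rvest)

final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
p <- read_html(final.url)

normal <- p %>%
  html_nodes("div.tingdok-normal > span") %>%
  html_text(trim = TRUE)

tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag",
             "Vedtaget regeringsforslag", "Vedtaget privat forslag")

# grep() returns every hit; [1] keeps only the first one
type <- grep(paste(tomatch, collapse = "|"), normal, value = TRUE)[1]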

Trying to parse IMDb but the links are different each time I open the site

I am trying to get links to all pages with popular feature films on IMDb. There is no problem with the first 2000 pages, since they have exactly the same "body", for example:
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=99951&title_type=feature
Each page contains 50 links to movies, so the start parameter in the URL says that the page holds the links to movies from start to start + 50.
The problem is with the pages after the one with parameter 99951. At the end of each of their URLs there is an extra part, like &tok=0f97, for example:
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=100051&title_type=feature&tok=13c9
So when I try to parse such a page to get the links to all 50 movies (I use R for this), I get nothing.
Here is the code I use to parse the pages (it works for the first 2000 pages):
makeListOfUrls <- function() {
  howManyPages <- round(318485 / 50)
  urlStart <- "http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature"
  linksList <- list()
  for (i in 1:howManyPages) {
    j <- 50 * (i - 1) + 1
    print(j)
    startNew <- paste("start=", j, sep = "")
    urlNew <- stri_replace_all_regex(urlStart, "start=1", startNew)
    titleLinks <- getLinks(urlNew)
    ## I get empty character for sites 2001 and next !!!
    linksList[[i]] <- makeLongPath(titleLinks)
  }
  vector <- combineList(linksList)
  return(vector)
}

getLinks <- function(url) {
  allLinks <- getHTMLLinks(url, xpQuery = "//@href")
  titleLinks <- allLinks[stri_detect_regex(allLinks, "^/title/tt[0-9]+/$")]
  # there are no links for movies for the pages after 2000 (titleLinks is empty)
  titleLinks <- titleLinks[!duplicated(titleLinks)]
  return(titleLinks)
}

makeLongPath <- function(links) {
  longPaths <- paste("http://www.imdb.com", links, sep = "")
  return(longPaths)
}

combineList <- function(UrlList) {
  n <- length(UrlList)
  if (n == 1) {
    return(UrlList)
  } else {
    tmpV <- UrlList[[1]]
    for (i in 2:n) {
      cV <- c(tmpV, UrlList[[i]])
      tmpV <- cV
    }
    return(tmpV)
  }
}
So, is there any way to access these sites?