Data from page 2 same as from page 1 when scraping HTML

I am trying to scrape all event links from https://www.tapology.com/fightcenter.
I already have quite some experience in web scraping with R, but in this case I am stuck.
I am able to scrape page 1; however, when I request the second page's URL, I still obtain the data from the first page, as if I were being redirected back automatically.
I have tried various snippets found here on the forum, but something is still wrong.
First page
url = "https://www.tapology.com/fightcenter"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched = as.data.frame(matched[[1]], stringsAsFactors = F)
Second page
url <- "https://www.tapology.com/fightcenter_events?page=2"
html <- paste(readLines(url), collapse = "\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched <- as.data.frame(matched[[1]], stringsAsFactors = FALSE)
Results are identical.
Could you please help me to solve this?
Thank you

Content is added dynamically via XHR. You can use httr (as mentioned in the other answer) and add your headers. You also need to alter the page param that goes in the URL during a loop/sequence. Below is an example of a single request for a different page (I just extract the fight links of person 1 vs person 2 to show it is reading from that page). You could turn this into a function that returns the info of interest inside your loop, or perhaps use purrr to map the info across to an existing structure; a sketch of that follows the code.
require(httr)
require(rvest)
require(magrittr)
require(stringr)

# Headers that mark the request as an AJAX (XHR) call
headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
  'X-Requested-With' = 'XMLHttpRequest'
)

# Page number to request
params = list(
  'page' = '2'
)

r <- httr::GET(url = 'https://www.tapology.com/fightcenter_events', httr::add_headers(.headers = headers), query = params)

# The response is jQuery code; pull out the HTML string passed to html("...")
x <- str_match_all(content(r, as = "text"), 'html\\("(.*>)')
y <- gsub('"', "'", gsub('\\\\', '', x[[1]][, 2]))

# Parse that HTML and extract the fight links
z <- read_html(y) %>% html_nodes(".billing a") %>% html_attr("href")
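For instance, a minimal sketch of wrapping the request in a function and looping over several pages; `get_event_links` is just a made-up name, the page range and the one-second pause are assumptions, and `headers` is the vector defined above:

get_event_links <- function(page) {
  r <- httr::GET(
    url = 'https://www.tapology.com/fightcenter_events',
    httr::add_headers(.headers = headers),
    query = list(page = as.character(page))
  )
  x <- str_match_all(content(r, as = "text"), 'html\\("(.*>)')
  y <- gsub('"', "'", gsub('\\\\', '', x[[1]][, 2]))
  read_html(y) %>% html_nodes(".billing a") %>% html_attr("href")
}

# e.g. the first five pages, with a short pause between requests
all_links <- unlist(lapply(1:5, function(p) { Sys.sleep(1); get_event_links(p) }))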

You're getting redirected back because the website checks the headers you are sending. To get the correct data, you need to set these headers:
Accept: text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
Also, this request doesn't return the HTML of the webpage but jQuery code, which updates the list on the website dynamically.
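A minimal sketch of sending those headers with httr (the page number is just an example; the returned payload still needs the jQuery parsing shown in the answer above):

library(httr)

r <- GET(
  "https://www.tapology.com/fightcenter_events",
  add_headers(
    Accept = "text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01",
    `X-Requested-With` = "XMLHttpRequest"
  ),
  query = list(page = "2")
)

# The body is a jQuery snippet rather than a full HTML page
substr(content(r, as = "text"), 1, 200)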

I have been able to extract the first three links of the first three pages with the following code:
library(RSelenium)

shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.tapology.com/fightcenter")

list_Matched <- list()

# We get results from the first 3 pages
for(i in 1 : 3)
{
  print(i)

  if(i != 1)
  {
    # Press the "next" button
    web_Elem_Link <- remDr$findElement("class name", "next")
    web_Elem_Link$clickElement()
  }

  list_Link_Page <- list()
  Sys.sleep(3)

  # Get the first three links of the page ...
  for(j in 1 : 3)
  {
    web_Elem_Link <- tryCatch(remDr$findElement("xpath", paste0('//*[@id="content"]/div[4]/section[', j, ']/div/div[1]/div[1]/span[1]/a')),
                              error = function(e) NA)

    if(is.na(web_Elem_Link))
    {
      web_Elem_Link <- remDr$findElement("xpath", paste0('//*[@id="content"]/div[3]/section[', j, ']/div/div[1]/div[1]/span[1]/a'))
    }

    # Click the link, record the URL we land on, then go back
    web_Elem_Link$clickElement()
    Sys.sleep(3)
    list_Link_Page[[j]] <- remDr$getCurrentUrl()
    remDr$goBack()
    Sys.sleep(3)
  }

  list_Matched[[i]] <- list_Link_Page
}
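As a side note, clicking every link and navigating back is slow; one could read the href attributes directly instead. A rough sketch (the ".billing a" selector is borrowed from the httr answer above and is an assumption about the page structure):

# Collect the href of each event link on the current page without clicking through
elems <- remDr$findElements("css selector", ".billing a")
links <- vapply(elems, function(e) e$getElementAttribute("href")[[1]], character(1))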

Related

(R) Webscraping Error : arguments imply differing number of rows: 1, 0

I am working with the R programming language.
In a previous question (R: Webscraping Pizza Shops - "read_html" not working?), I learned how to scrape the names and addresses of Pizza Stores from YellowPages (e.g. https://www.yellowpages.ca/search/si/2/pizza/Canada). Here is the code for how to scrape a single page:
library(tidyverse)
library(rvest)

scraper <- function(url) {
  page <- url %>%
    read_html()

  tibble(
    name = page %>%
      html_elements(".jsListingName") %>%
      html_text2(),
    address = page %>%
      html_elements(".listing__address--full") %>%
      html_text2()
  )
}
I then tried to make a LOOP that will repeat this for all 391 pages:
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results = list()
for (i in 1:391)
{
url_i = paste0(a,i,b)
s_i = data.frame(scraper(url_i))
ss_i = data.frame(i,s_i)
print(ss_i)
list_results[[i]] <- ss_i
}
final = do.call(rbind.data.frame, list_results)
My Problem: I noticed that after the 60th page, I get the following error:
Error in data.frame(i, s_i) :
arguments imply differing number of rows: 1, 0
In addition: Warning message:
In for (i in seq_along(specs)) { :
closing unused connection
To investigate, I went to the 60th page (https://www.yellowpages.ca/search/si/60/pizza/Canada) and noticed that you cannot click beyond this page.
My Question: Is there something that I can do differently to try and move past the 60th page, or is there some internal limitation within YellowPages that is preventing me from scraping further?
Thanks!
This is a limit on the Yellow Pages site that prevents you from continuing to the next page. A solution is to assign the return value of scraper and check the number of rows; if it is 0, break out of the for loop.
a = "https://www.yellowpages.ca/search/si/"
b = "/pizza/Canada"
list_results <- list()
for (i in 1:391) {
url_i = paste0(a,i,b)
s <- scraper(url_i, i)
message(paste("page number:", i, "\trows:", nrow(s)))
if(nrow(s) > 0L) {
s_i <- as.data.frame(s)
ss_i <- data.frame(i, s_i)
} else {
message("empty page, bailing out...")
break
}
list_results[[i]] <- ss_i
}
final <- do.call(rbind.data.frame, list_results)
dim(final)
# [1] 2100 3

R - Issue with the DOM of the Danish parliament (webscraping)

I've been working on a web scraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process and uploads all legislative documents to its website. I've been crawling over all pages starting in 2008. Right now I'm parsing the information into a data frame, and I've run into an issue that I haven't been able to resolve so far.
If we look at the DOM, we can see that most of the objects are named div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my data frame, I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my patterns match more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique (grep(paste(tomatch, collapse="|"), results, value = TRUE))
Maybe you can help me with that.
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" objects are related to the text. I was able to get the text of the webpage with the following code, which also identifies the position of the first "regex hit" for each of the patterns to match.
library(pagedown)
library(pdftools)
library(stringr)

# Print the page to PDF, then read the text back in
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
                       "C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)

list_Position <- list()
list_Text <- list()

for(i in 1 : nb_Tomatch)
{
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
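If only the matched text is needed (rather than the positions), note that stringr::str_extract() already returns just the first match within each string, so the loop above could be condensed to something like:

# One entry per pattern; each entry holds the first match found on each page of text
list_Text <- lapply(tomatch, function(p) stringr::str_extract(text, p))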
Here is another approach:
library(RDCOMClient)
library(stringr)
library(rvest)

url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"

# Drive Internet Explorer via COM and grab the rendered text of the page
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()

tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)

list_Position <- list()
list_Text <- list()

for(i in 1 : nb_Tomatch)
{
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}

Avoid getting "glued" words with R webscraping

When I use either of the two following blocks of code, I get "glued" words, by which I mean words that are not separated by a space but should be, and this is a problem. In the original HTML, it seems they're separated by a <b> tag, and I'm not able to handle this. The two blocks do the same thing in different ways.
library(XML)
library(RCurl)

# Block 1 ---------
url <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
u <- readLines(url)
h <- htmlTreeParse(file = u,
                   asText = TRUE,
                   useInternalNodes = TRUE,
                   encoding = "utf-8")
song <- getNodeSet(doc = h, path = "//article", fun = xmlValue)

# Block 2 ---------
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
h <- htmlParse(getURL(u))
song <- xpathSApply(h, path = "//article", fun = xmlValue)
Which returns something like:
[1] "Sometimes I feelLike I don't have a partnerSometimes I feelLike my only friendIs the city I live inThe city of angelsLonely as I amTogether we cryI drive on her streets'Cause she's my companionI walk through her hills'Cause she knows who I amShe sees my good deedsAnd she kisses me windyI never worryNow that is a lieI don't ever wanna feelLike I did that dayBut take me to the place I loveTake me all the wayIt's hard to believeThat there's nobody out thereIt's hard to believeThat I'm all aloneAt...
I was able to retrieve the words with the following code:
library(RSelenium)

shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
remDr$screenshot(display = TRUE, useViewer = TRUE)
page_Content <- remDr$getPageSource()[[1]]

list_Text_Song <- list()

# Read the song text paragraph by paragraph
for(i in 1 : 30)
{
  print(i)
  web_Obj <- tryCatch(remDr$findElement("xpath", paste0("//*[@id='js-lyric-cnt']/article/div[2]/div[2]/p[", i, "]")), error = function(e) NA)
  list_Text_Song[[i]] <- tryCatch(web_Obj$getElementText(), error = function(e) NA)
}

list_Text_Song <- unlist(list_Text_Song)
list_Text_Song <- list_Text_Song[!is.na(list_Text_Song)]
The words are not glued with this approach.
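For what it's worth, a lighter-weight alternative sketch without Selenium: rvest's html_text2() renders <br> tags as line breaks, so the words also come out separated (the "article p" selector is an assumption about where the lyrics sit on the page):

library(rvest)

page <- read_html("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
# html_text2() keeps line breaks for <br>, unlike xmlValue, so the words are not glued
song <- page %>% html_nodes("article p") %>% html_text2()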

How to read a <li> table in a webpage

I have debugged the program many times, trying to get a result like the following:
url 研究所知识库列表
/handle/1471x/1 力学研究所
/handle/1471x/8865 半导体研究所
However, no matter what parameters I use, the result is not correct. The content of this table is part of the basis of my further analysis, and I am quite anxious about it. I'm sincerely looking forward to your help.
## download community-list --- the 1st level of IR Grid
# loading webpage and analyzing
library(XML)

community_url <- "http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)

# get table specs
tableNodes <- getNodeSet(com_parsed, "//table")
com_tb <- readHTMLTable(tableNodes[[8]], header = TRUE)

# get external links
xpath <- "//a/@href"
getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)
It is unclear exactly what you want your end result to look like, but if you modify your XPath statements a bit to take advantage of the DOM structure, you can get something like this:
library(XML)

community_url <- "http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)

# Header text of the table, link targets, and display text of each community link
list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
hrefs <- xpathSApply(com_parsed, '//li[@class="communityLink"]//@href', function(x) unname(x))
display_text <- xpathSApply(com_parsed, '//li[@class="communityLink"]//a', xmlValue)

table_data <- cbind(display_text, hrefs)
colnames(table_data) <- c(list_header, "url")
table_data

Trying to parse IMDb but the links are different each time I open the site

I am trying to get links to all pages of popular feature films on IMDb. There is no problem with the first 2000 pages since their URLs have exactly the same "body", for example:
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=99951&title_type=feature
Each page contains 50 links to movies, so the start parameter in the URL says that this page holds the links to movies from start to start + 50.
The problem is with the pages after the one with parameter 99951. At the end of each of their URLs there is an extra part like &tok=0f97, for example
http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=100051&title_type=feature&tok=13c9
So when I try to parse such a page to get the links for all 50 movies (I use R for this), I get nothing.
The code I use to parse the pages (it works on the first 2000 pages):
library(stringi)  # stri_replace_all_regex, stri_detect_regex
library(XML)      # getHTMLLinks

makeListOfUrls <- function() {
  howManyPages <- round(318485 / 50)
  urlStart <- "http://www.imdb.com/search/title?at=0&sort=moviemeter,asc&start=1&title_type=feature"
  linksList <- list()
  for (i in 1:howManyPages) {
    j <- 50 * (i - 1) + 1
    print(j)
    startNew <- paste("start=", j, sep = "")
    urlNew <- stri_replace_all_regex(urlStart, "start=1", startNew)
    titleLinks <- getLinks(urlNew)
    ## I get an empty character vector for pages 2001 and onwards !!!
    linksList[[i]] <- makeLongPath(titleLinks)
  }
  vector <- combineList(linksList)
  return(vector)
}

getLinks <- function(url) {
  allLinks <- getHTMLLinks(url, xpQuery = "//@href")
  titleLinks <- allLinks[stri_detect_regex(allLinks, "^/title/tt[0-9]+/$")]
  # there are no links for movies for the pages after 2000 (titleLinks is empty)
  titleLinks <- titleLinks[!duplicated(titleLinks)]
  return(titleLinks)
}

makeLongPath <- function(links) {
  longPaths <- paste("http://www.imdb.com", links, sep = "")
  return(longPaths)
}

combineList <- function(UrlList) {
  n <- length(UrlList)
  if (n == 1) {
    return(UrlList)
  } else {
    tmpV <- UrlList[[1]]
    for (i in 2:n) {
      cV <- c(tmpV, UrlList[[i]])
      tmpV <- cV
    }
    return(tmpV)
  }
}
So, is there any way to access these pages?
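One direction worth sketching (not a verified solution): rather than constructing the deep URLs by hand, scrape the "Next" pagination link from each page, since it already carries the changing tok value. A rough sketch, assuming the page exposes a "Next" anchor; getNextPageUrl is a hypothetical helper, not part of the code above:

library(XML)
library(RCurl)

# Hypothetical helper: returns the href of the "Next" page link, or NA if none is found
getNextPageUrl <- function(url) {
  doc <- htmlParse(getURL(url))
  # The anchor text is an assumption about the page markup; adjust as needed
  next_href <- xpathSApply(doc, "//a[contains(., 'Next')]/@href")
  if (length(next_href) == 0) return(NA_character_)
  # The href may be relative and would need to be resolved against http://www.imdb.com
  unname(next_href[[1]])
}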