How do I webscrape .dpbox table using selectorgadget with R (rvest)? - html

I've been trying to webscrape data from a specific website using selectorgadget in R. For example, I successfully webscraped from http://www.dotabuff.com/heroes/abaddon/matchups before. Usually, I just click on the tables I want using the selectorgadget Chrome extension and put the CSS Selection result into the code as follows.
urlx <- "http://www.dotabuff.com/heroes/abaddon/matchups"
rawData <- html_text(html_nodes(read_html(urlx),"td:nth-child(4) , td:nth-child(3), .cell-xlarge"))
In this case, the html_nodes function does return a whole bunch of nodes (340)
{xml_nodeset (340)}
However, when I try to webscrape off http://www.dotapicker.com/heroes/Abaddon using selectorgadget, which turns out to be this code:
urlx <- "http://www.dotapicker.com/heroes/abaddon"
rawData <- html_text(html_nodes(read_html(urlx),".ng-scope:nth-child(1) .ng-scope .ng-binding"))
Unfortunately, no nodes actually show up after the html_nodes function is called, and I get the result
{xml_nodeset (0)}
I feel like this has something to do with the nesting of the table in a drop down box (compared to previously, the table was right on the webpage itself) but I'm not sure how to get around it.
Thank you and I appreciate any help!

It seems this page loads some of its data dynamically via XHR. In Chrome you can check this by opening Inspect and then the Network tab. If you do, you will see a number of JSON files being loaded. You can request those JSON files directly and then parse them to extract the information you need. Here is a quick example:
library(httr)
library(jsonlite)
# Request each JSON file directly, read the body as text, then parse it
heroinfo_json <- GET("http://www.dotapicker.com/assets/json/data/heroinfo.json")
heroinfo_flat <- fromJSON(content(heroinfo_json, as = "text", encoding = "UTF-8"))
winrates_json <- GET("http://www.dotapicker.com/assets/dynamic/winrates10d.json")
winrates_flat <- fromJSON(content(winrates_json, as = "text", encoding = "UTF-8"))
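To illustrate why the selector finds nothing while the JSON route works, here is a small offline sketch in Python (standard library only). Both the HTML string and the JSON string are made-up stand-ins, not the real responses from dotapicker.com, and the field names "hero" and "winrate" are assumptions for illustration.

```python
import json
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


# Hypothetical stand-in for the HTML the server sends before any script runs.
static_html = (
    "<html><body><div ng-view></div>"
    '<script src="app.js"></script></body></html>'
)

# Hypothetical stand-in for one of the JSON responses seen in the Network tab.
json_body = '[{"hero": "Abaddon", "winrate": 0.55}, {"hero": "Axe", "winrate": 0.51}]'

# The classes picked by selectorgadget only exist after the scripts have run,
# so a parser that sees only the static HTML finds no matching elements.
counter = ClassCounter("ng-binding")
counter.feed(static_html)
print(counter.count)

# The JSON body, by contrast, can be parsed directly.
rows = json.loads(json_body)
winrates = {row["hero"]: row["winrate"] for row in rows}
print(winrates["Abaddon"])
```

The point is only that read_html sees the first string, while the browser (and selectorgadget) sees the page after the second kind of response has been merged in.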

Related

rvest - find html-node with last page number

I'm learning web scraping and created a little exercise for myself to scrape all titles of a recipe site: https://pinchofyum.com/recipes?fwp_paged=1. (I got inspired by this post: https://www.kdnuggets.com/2017/06/web-scraping-r-online-food-blogs.html).
I want to scrape the value of the last page number, which is (at time of writing) number 64. You can find the number of pages at the bottom. I see that this is stored as "a.facetwp-page last", but for some reason cannot access this node. I can see that the page number values are stored as 'data-page', but I'm unable to get this value through 'html_attrs'.
I believe the parent node is "div.facetwp-pager" and I can access that one as follows:
library(rvest)
pg <- read_html("https://pinchofyum.com/recipes")
html_nodes(pg, "div.facetwp-pager")
But this is as far as I get. I guess I'm missing something small, but cannot figure out what it is. I know about RSelenium, but I would like to know if and how to get that last page value (64) with rvest.
Scraping with rvest sometimes fails, especially when the page is generated dynamically with JavaScript (I wasn't able to scrape this value with rvest either). In those cases you can use the RSelenium package. I was able to get the element you want like this:
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) #specify browser type you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pinchofyum.com/recipes?fwp_paged=1") # navigates to webpage
webElem <- remDr$findElement(using = "css selector", ".last") #find desired element
txt <- webElem$getElementText() # returns the element's visible text
#> txt
#>[[1]]
#>[1] "64"
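Once the rendered markup is available (for example from the Selenium session above), the data-page attribute mentioned in the question can be read as well. The sketch below is in Python with the standard library only, and the pager markup is an assumption reconstructed from the class names in the question, not a copy of the real page. Note that an element with both classes is matched by the CSS selector "a.facetwp-page.last" (no space), not "a.facetwp-page last".

```python
from html.parser import HTMLParser


class LastPageFinder(HTMLParser):
    """Find the <a> carrying both 'facetwp-page' and 'last' and read data-page."""

    def __init__(self):
        super().__init__()
        self.last_page = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and "facetwp-page" in classes and "last" in classes:
            self.last_page = int(attr["data-page"])


# Hypothetical rendered pager, assumed from the class names in the question.
rendered_pager = (
    '<div class="facetwp-pager">'
    '<a class="facetwp-page active" data-page="1">1</a>'
    '<a class="facetwp-page" data-page="2">2</a>'
    '<a class="facetwp-page last" data-page="64">64</a>'
    "</div>"
)

finder = LastPageFinder()
finder.feed(rendered_pager)
print(finder.last_page)  # 64 for this sample markup
```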

Chrome preview differs from download

I inspect the following page:
https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2=
or
https://www.dm-jobs.com/Germany/search/?q=&sortColumn=referencedate&sortDirection=desc&searchby=location&d=15.
As far as I understand, the data can be retrieved via a GET/POST request, found in the raw HTML source, or generated by JavaScript code.
But on this page I can't manage to find the source.
Chrome's Network tab indicates that the data (here, the job listings on the page) are in a Doc(ument) [see the screenshot, Doc tab], and when I look at the Preview tab it is empty. But if I look at the Response tab, the data can be seen.
Desired Output:
The target language is R, but that is not really relevant here. I would be happy enough to understand how the data is generated, so a Selenium approach or similar is not desired. I would rather understand how the data is generated and how it could be extracted via POST/GET, JS or the raw source.
What I tried:
library(httr)
library(rvest)
url <- "https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2="
src <- read_html(url)
src %>% html_nodes(xpath = "//*[contains(text(), 'Filialmitarbeiter')]")
as.character(src) %>% grep(pattern = "Filialmitarbeiter")
get <- GET(url)
content(get)
content(get$content)
Target Outputs:
e.g.
Filialmitarbeiter (w/m/d) 15-30 Std./Wo. Bad Reichenhall, DE, 83435 30.08.2019
Filialmitarbeiter (w/m/d) 6-8 Std./Wo. Neuenburg am Rhein, DE, 79395 30.08.2019
Führungsnachwuchs Filialleitung (w/m/d) Vechta, DE, 49377 30.08.2019
There are two cookies of importance that must be picked up from the initial landing page. You can use html_session to capture these dynamically and then pass them on in a subsequent request to the page you want results from (this worked for me, at least). I wrote some stuff about session objects here.
The 3 cookies seen are:
cookies = c(
'rmk12' = '1',
'JSESSIONID' = 'some_value',
'cookie_j2w' = 'some_other_value'
)
You can find these plus the headers by using the network tab to monitor the web-traffic when attempting to view the job listings.
You can experiment with removing headers and cookies and you will discover that only the second and third cookies are required and no headers. However, the cookies passed must be captured in a prior request to the url as shown below. Session is the traditional way to do this.
R
library(rvest)
library(magrittr)
start_link = 'https://www.dm-jobs.com/Germany/?locale=de_DE'
next_link <- 'https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2='
jobs <- html_session(start_link) %>%
jump_to(.,next_link) %>%
html_nodes('.jobTitle-link') %>%
html_text()
print(jobs)
Py
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    # First request: the landing page sets the cookies on the session
    r = s.get('https://www.dm-jobs.com/Germany/?locale=de_DE')
    cookies = s.cookies.get_dict()  # just to demo which cookies are captured
    print(cookies)  # just to demo which cookies are captured
    # Second request: the session sends the captured cookies automatically
    r = s.get('https://www.dm-jobs.com/Germany/search/?searchby=location&createNewAlert=false&q=&locationsearch=&geolocation=&optionsFacetsDD_customfield4=&optionsFacetsDD_customfield3=&optionsFacetsDD_customfield2=')
    soup = bs(r.content, 'lxml')
    print(len(soup.select('.jobTitle-link')))
Reading:
html_session
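What a session does behind the scenes can be sketched offline: it stores the cookies from the first response and replays them on the next request. The Python snippet below uses only the standard library. The Set-Cookie strings reuse the placeholder values from the cookie listing above, and the "Path=/" attributes are assumed for illustration rather than captured from the site.

```python
from http.cookies import SimpleCookie

# Placeholder Set-Cookie headers, standing in for the landing page's response.
set_cookie_headers = [
    "rmk12=1; Path=/",
    "JSESSIONID=some_value; Path=/",
    "cookie_j2w=some_other_value; Path=/",
]

# A session object keeps a cookie jar like this between requests.
jar = SimpleCookie()
for header in set_cookie_headers:
    jar.load(header)

# Only the second and third cookies were found to be required.
required = ("JSESSIONID", "cookie_j2w")
cookie_header = "; ".join(f"{name}={jar[name].value}" for name in required)
print(cookie_header)
```

Sending cookie_header as the Cookie request header on the search URL is, in effect, what jump_to and requests.Session do for you.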

Read all html tables from tennis players activity page

I would like to read all html tables containing Federer's results from this website: http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity
and store the data in one single data frame. One way I figured out was using the rvest package, but as you may notice, my code only works for a specific number of tournaments. Is there any way I can read all relevant tables with one command? Thank you for your help!
library(rvest)
Url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
x <- vector("list", 4) # pre-allocate one slot per table
for (i in 1:4) {
  results <- Url %>%
    read_html() %>%
    html_nodes(xpath = paste0("//table[@class='mega-table'][", i, "]")) %>%
    html_table()
  results <- results[[1]]
  x[[i]] <- results
}
Your code above was close to a final solution. One downside is that the read_html call sits inside the for loop, which re-downloads the page on every iteration and greatly slows processing. In the future, read the page into a variable once and then process it node by node as necessary.
In this solution, I read the web page into the variable "page" and then extract the table nodes whose class is mega-table. From there, html_table returns a list of the tables of interest, and do.call with rbind binds them into a single data frame.
library(rvest)
url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
page<- read_html(url)
tablenodes<-html_nodes(page, "table.mega-table")
tables<-html_table(tablenodes)
#numoftables<-length(tables)
df<-do.call(rbind, tables)
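The same idea (parse the page once, select every table by class, then stack the rows) can be sketched in Python with the standard library. The page string and its cell values are invented placeholders, not real ATP data, and TableCollector is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the rows of every <table> that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.tables = []       # one list of rows per matching table
        self.in_table = False  # True while inside a matching table
        self.row = None        # cells of the row being read
        self.cell = None       # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and self.class_name in classes:
            self.in_table = True
            self.tables.append([])
        elif self.in_table and tag == "tr":
            self.row = []
        elif self.row is not None and tag in ("td", "th"):
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.tables[-1].append(self.row)
            self.row = None
        elif tag == "table":
            self.in_table = False


# Placeholder page: two tables of interest and one unrelated table.
page = """
<table class="mega-table"><tr><td>Tournament A</td><td>W</td></tr></table>
<table class="other"><tr><td>ignored</td></tr></table>
<table class="mega-table">
  <tr><td>Tournament B</td><td>W</td></tr>
  <tr><td>Tournament C</td><td>SF</td></tr>
</table>
"""

collector = TableCollector("mega-table")
collector.feed(page)  # the page is parsed once, not once per table

# Equivalent in spirit to do.call(rbind, tables): stack all rows together.
combined = [row for table in collector.tables for row in table]
print(len(collector.tables), len(combined))
```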

rvest HTML table scrape is empty

I'm trying to scrape a table, but the results are empty. While html_nodes("table") recognizes two tables, specifying with "[1]" is just the headers for the table, and "[2]" is a data frame of the correct dimensions, containing only NA's. I've used this same code for tables in other places from the same site, and they work fine. In fact, this code worked for me until this week. There are no errors output when running this code and I can't seem to find anything unusual about the HTML file itself (no missing tags or anything), but I'm not terribly familiar with HTML files so I could be missing something. Here is my code:
Vegas_lines_html <- read_html("https://rotogrinders.com/schedules/nba")
Vegas_lines <- Vegas_lines_html %>%
html_nodes("table") %>%
.[[1]] %>% # here index 1 is only headers, index 2 is the data frame full of NA's
html_table()

R - Extracting Tables From Websites Using XML Package

I am trying to replicate the method used in a previous answer here Scraping html tables into R data frames using the XML package for my own work but cannot get the data to extract. The website I am using is:
http://www.footballfanalytics.com/articles/football/euro_super_league_table.html
I just wish to extract a table of each team name and their current rating score. My code is as follows:
library(XML)
theurl <- "http://www.footballfanalytics.com/articles/football/euro_super_league_table.html"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
This produces the error message
Error in tables[[which.max(n.rows)]] :
attempt to select less than one element
Could anyone suggest a solution please? Is there something in this particular site causing this not to work? Or is there a better alternative method I can try? Thanks
It seems the data is loaded via JavaScript from a separate XML file. Try reading that file directly:
library(XML)
theurl <- "http://www.footballfanalytics.com/xml/esl/esl.xml"
doc <- xmlParse(theurl)
cbind(team = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
points = xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue))
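For comparison, the same extraction can be sketched in Python with xml.etree.ElementTree. The element names follow the XPath expressions above (StatsData/Teams/Team with Name and Points children). The team names and point values in the sample string are invented placeholders, and treating Points as numeric is an assumption.

```python
import xml.etree.ElementTree as ET

# Placeholder document shaped like the paths used in the XPath queries above.
sample_xml = """
<StatsData>
  <Teams>
    <Team><Name>Team A</Name><Points>1900</Points></Team>
    <Team><Name>Team B</Name><Points>1850</Points></Team>
  </Teams>
</StatsData>
""".strip()

root = ET.fromstring(sample_xml)  # root is the <StatsData> element

# One (team, points) pair per <Team> node, like the cbind of the two columns.
table = [
    (team.findtext("Name"), float(team.findtext("Points")))
    for team in root.findall("./Teams/Team")
]
print(table)
```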