rvest emails scraping on page with complex node structure (html nodes)

I am trying to collect names and email addresses from this page: "https://www.gu.se/en/about/find-staff?affiliation_types=Teaching%20staff&hits=2744". I am having a hard time figuring out the correct way to select the nodes. For example, I do the following to select people's names, but it selects the wrong nodes.
Thank you in advance for your help.
library(rvest)
library(tidyverse)
r<-read_html("https://www.gu.se/en/about/find-staff?affiliation_types=Teaching%20staff&hits=2744")
people_name <- r %>%
html_nodes("a span") %>%
html_text()

As @QHarr mentioned in the comments, the data on the page is generated dynamically. The HTML you get from read_html does not yet contain the data you need. You could use RSelenium, but in this case I think rvest is better.
If you look at the Developer Tools in Chrome, you will see that loading the page triggers several subsequent requests. One of them goes to the URL @QHarr mentioned and returns a JSON string with all the data, which JavaScript then uses to populate the page.
So you can request this URL directly and parse the JSON string to get the data (this is much lighter than using RSelenium). This does not always work, because you may need to set state variables in the request or make a complicated POST request. In this case, though, a simple GET request is enough.
The JSON response is a nested list, so you need to inspect it and identify where the data for each person is stored.
Here is my code:
library(rvest)
library(dplyr)
url.1 <- 'https://www.gu.se/api/search/rest/apps/gu/searchers/person_en?q=*&sort=relevance&affiliation_types=Teaching+staff&hits=2744'
# get json string and parse it to list using jsonlite::fromJSON
json.content <- read_html(url.1) %>% html_node('body') %>% html_text() %>%
  jsonlite::fromJSON(simplifyVector = FALSE)
# the list of people is in json.content$documentList$documents (also a nested list)
# use plyr::ldply to get info from each person and combine into a dataframe
df.staff <- plyr::ldply(json.content$documentList$documents,
                        .fun = function(x) {
                          name  <- x$title
                          aff   <- ifelse(length(x$affiliations[[1]]$affiliation_name) > 0,
                                          x$affiliations[[1]]$affiliation_name,
                                          NA)
                          email <- ifelse(length(x$affiliations[[1]]$email[[1]]) > 0,
                                          x$affiliations[[1]]$email[[1]],
                                          NA)
                          dept  <- ifelse(length(x$affiliations[[1]]$organization) > 0,
                                          x$affiliations[[1]]$organization,
                                          NA)
                          data.frame(name = name,
                                     affiliation = aff,
                                     email = email,
                                     department = dept)
                        })
head(df.staff)
#  name                affiliation     email                            department
#1 Zareen Abbas        SENIOR LECTURER zareen.abbas@gu.se               Department of Chemistry & Molecular Biology
#2 Yehia Abd Alrahman  POSTDOCTOR      yehia.abd.alrahman@gu.se         Formal Methods
#3 Afrah Abdulla       SENIOR LECTURER afrah.abdulla@gu.se              Unit for General Didactics and Pedagogic Work
#4 Behjat Omer Abdulla LECTURER        behjat.o.a@akademinvaland.gu.se  The Crafts and Fine Art Unit
#5 Frida Abel          Docent          frida.abel@gu.se                 Department of Laboratory Medicine
#6 Andreas Martin Abel SENIOR LECTURER abela@chalmers.se                Computer Science (CS)
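
To try the extraction logic without hitting the server, here is a minimal offline sketch. The JSON string is made up for illustration (it only mimics the documentList$documents nesting described above and is not real API output), and first_or_na is a hypothetical helper name.

```r
library(jsonlite)

# Made-up JSON that mimics the nesting described above (not real API output)
json.txt <- '{
  "documentList": {
    "documents": [
      {"title": "Person A",
       "affiliations": [{"affiliation_name": "LECTURER",
                         "email": ["person.a@example.org"],
                         "organization": "Department X"}]},
      {"title": "Person B",
       "affiliations": [{"affiliation_name": "POSTDOCTOR",
                         "organization": "Department Y"}]}
    ]
  }
}'

json.content <- fromJSON(json.txt, simplifyVector = FALSE)

# Hypothetical helper: first element of a field, or NA if it is missing or empty
first_or_na <- function(x) if (length(x) > 0) x[[1]] else NA_character_

rows <- lapply(json.content$documentList$documents, function(x) {
  aff <- x$affiliations[[1]]
  data.frame(name        = x$title,
             affiliation = first_or_na(aff$affiliation_name),
             email       = first_or_na(aff$email),
             department  = first_or_na(aff$organization),
             stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single data frame
df.demo <- do.call(rbind, rows)
df.demo
```

The second person has no email field, so that cell ends up as NA instead of raising an error, which is the same job the ifelse(length(...) > 0, ..., NA) checks do in the answer.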

Related

R (rvest) Web Scraping Multiple Pages

I am looking to scrape the results from the Philly DA Democratic Primary race. I want to scrape the ward-division results from the website. I need the ward-division number (e.g. 01-01), the name of the candidate (e.g. LARRY KRASNER), and the percent each candidate received. For this website, there are 86 pages of results at the ward-division level:
https://results.philadelphiavotes.com/ResultsSW.aspx?type=CTY&map=CTY#page-1
Using the SelectorGadget tool, the CSS for each are as follows:
ward-division numbers = ".precinct-results-orangebox-title h1"
name of candidates= ".precinct-results-databox1 h1"
percent results= "#Datawrapper 16DEM .bar-percent"
When I tried to initially scrape the website data, I used the following code:
library(rvest)
# Read in the data
daresults <- read_html("https://results.philadelphiavotes.com/ResultsSW.aspx?type=CTY&map=CTY#page-1")
# Ward-division numbers
warddiv <- daresults %>%
  html_nodes(".precinct-results-orangebox-title h1") %>%
  html_text()
And I received a response of
character(0)
Any help on cleaning up the code and creating a loop to scrape all 86 pages would be appreciated. Thanks.

It looks like the data is served as JSON. According to the Network tab in your browser's developer tools, the files are located here:
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=16&osn=16&county=04&party=DEM&LanguageID=1
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=17&osn=17&county=04&party=REP&LanguageID=1
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=18&osn=18&county=04&party=DEM&LanguageID=1
https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=19&osn=19&county=04&party=REP&LanguageID=1
Use jsonlite or another package to read each file and parse it into a data frame.
For example:
url<-"https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData?type=CTY&category=PREC&raceID=16&osn=16&county=04&party=DEM&LanguageID=1"
jsonlite::fromJSON(url)
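
Since the four URLs above differ only in raceID, osn and party, they can be generated rather than typed out. This is only a sketch: the fetch step is left commented out because it needs network access, and the structure of the response is not shown in the answer, so it would have to be inspected before extracting any fields.

```r
# Race IDs and parties taken from the four URLs listed above
races <- data.frame(raceID = 16:19,
                    party  = c("DEM", "REP", "DEM", "REP"),
                    stringsAsFactors = FALSE)

base.url <- "https://phillyresws.azurewebsites.us/ResultsAjax.svc/GetMapData"

# Build one URL per race; osn equals raceID in every URL above
urls <- sprintf("%s?type=CTY&category=PREC&raceID=%d&osn=%d&county=04&party=%s&LanguageID=1",
                base.url, races$raceID, races$raceID, races$party)
urls

# Fetch and inspect (requires network access):
# results <- lapply(urls, jsonlite::fromJSON)
# str(results[[1]], max.level = 2)
```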

rvest - find html-node with last page number

I'm learning web scraping and created a little exercise for myself to scrape all titles of a recipe site: https://pinchofyum.com/recipes?fwp_paged=1. (I got inspired by this post: https://www.kdnuggets.com/2017/06/web-scraping-r-online-food-blogs.html).
I want to scrape the value of the last page number, which is (at time of writing) number 64. You can find the number of pages at the bottom. I see that this is stored as "a.facetwp-page last", but for some reason cannot access this node. I can see that the page number values are stored as 'data-page', but I'm unable to get this value through 'html_attrs'.
I believe the parent node is "div.facetwp-pager" and I can access that one as follows:
library(rvest)
pg <- read_html("https://pinchofyum.com/recipes")
html_nodes(pg, "div.facetwp-pager")
But this is as far as I get. I guess I'm missing something small, but cannot figure out what it is. I know about RSelenium, but I would like to know if and how to get that last page value (64) with rvest.

Sometimes scraping with rvest doesn't work, especially when the webpage is generated dynamically with JavaScript (I wasn't able to scrape this info with rvest either). In those cases, you can use the RSelenium package. I was able to scrape your desired element like this:
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) #specify browser type you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pinchofyum.com/recipes?fwp_paged=1") # navigates to webpage
webElem <- remDr$findElement(using = "css selector", ".last") #find desired element
txt <- webElem$getElementText() # returns the element's visible text
#> txt
#>[[1]]
#>[1] "64"
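
Once the rendered HTML is available (for example from remDr$getPageSource()[[1]] in RSelenium), rvest can read the data-page attribute mentioned in the question. The sketch below runs on a hand-written fragment; the markup is an assumption based on the class names in the question, not copied from the site.

```r
library(rvest)

# Hand-written stand-in for the rendered pager (assumed markup)
pager.html <- '<div class="facetwp-pager">
  <a class="facetwp-page active" data-page="1">1</a>
  <a class="facetwp-page" data-page="2">2</a>
  <a class="facetwp-page last" data-page="64">&gt;&gt;</a>
</div>'

pg <- read_html(pager.html)

# "a.facetwp-page.last" matches an <a> that carries both classes
last.page <- pg %>%
  html_nodes("a.facetwp-page.last") %>%
  html_attr("data-page") %>%
  as.integer()
last.page
```

Note the dot between the two class names: the element has two classes, so the selector is "a.facetwp-page.last", whereas a space would look for a descendant element named last.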

Read all html tables from tennis players activity page

I would like to read all html tables containing Federer's results from this website: http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity
and store the data in one single data frame. One way I figured out was using the rvest package, but as you may notice, my code only works for a specific number of tournaments. Is there any way I can read all relevant tables with one command? Thank you for your help!
library(rvest)
Url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
x <- vector("list", 4)
for (i in 1:4) {
  results <- Url %>%
    read_html() %>%
    html_nodes(xpath = paste0("//table[@class='mega-table'][", i, "]")) %>%
    html_table()
  results <- results[[1]]
  x[[i]] <- results
}

Your code was close to a final solution. One downside was having the read_html call inside the for loop, which downloads the page on every iteration and greatly slows down processing. It is better to read the page into a variable once and then process it node by node as necessary.
In this solution, I read the web page into the variable "page" and then extract the table nodes with class "mega-table". From there, html_table returns a list of the tables of interest, and do.call with rbind binds them together into one data frame.
library(rvest)
url <- "http://www.atpworldtour.com/en/players/roger-federer/f324/player-activity"
page<- read_html(url)
tablenodes<-html_nodes(page, "table.mega-table")
tables<-html_table(tablenodes)
#numoftables<-length(tables)
df<-do.call(rbind, tables)
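
The same pattern can be checked offline on a small hand-written page. The tables and player names below are made up for illustration; the point is only that the class selector picks up every matching table regardless of how many there are, and that rbind needs all of them to share the same columns.

```r
library(rvest)

# Hand-written page: two "mega-table" tables with identical columns, one unrelated table
page <- read_html('<html><body>
  <table class="mega-table">
    <tr><th>Round</th><th>Opponent</th></tr>
    <tr><td>Finals</td><td>Player A</td></tr>
    <tr><td>Semi-Finals</td><td>Player B</td></tr>
  </table>
  <table class="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table class="mega-table">
    <tr><th>Round</th><th>Opponent</th></tr>
    <tr><td>Round of 16</td><td>Player C</td></tr>
  </table>
</body></html>')

# One list element per matching table; the "other" table is skipped
tables <- html_table(html_nodes(page, "table.mega-table"))
length(tables)

# Row-bind all tables into a single data frame
df <- do.call(rbind, tables)
df
```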

Extracting all (possible) optional date values from web page [R]

In this URL string, the "toDate=1399849199999" part is a UNIX time expressed in milliseconds, which is used to select the Premier League table for a particular day.
In this case, the UNIX time corresponds to 11 May 2014.
as.POSIXlt (1399849199999/1000, tz = "GMT", origin = "1970-01-01")
I would like to retrieve all possible UNIX time values for a particular month. For url provided here, those 6 values are stored in webpage source code and it looks like this:
<select name="toDate" id="date" class="selectToSlider" widget="selectToSlider" labels="18" tooltip="false" wrapperClass="selectToSliderWrapper selectToSliderMatchDate"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399330799999">MON 05</option><option value="1399417199999" selected="selected">TUE 06</option><option value="1399503599999">WED 07</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>
Previously I extracted such information with regular expressions, but that was a pain in the neck and I would like to do it in an easier way.
I would appreciate it if someone could provide code (preferably with the steps explained) that extracts those values using a web scraping package in R, preferably XML. I tried it myself but was unsuccessful.

We can try using XML package to parse the html from the link you provided, then extract the specific information required (out of the whole html) using xpath:
library(XML)
EPL.URL <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399849199999&tableView=CURRENT_STANDINGS"
EPL.doc <- htmlParse(EPL.URL)
xpathSApply(EPL.doc, "//optgroup[@label='results']/option", xmlGetAttr, "value")

rvest makes this pretty easy. Look for the "option" nodes, then grab the "value" attributes.
library("rvest")
h <- read_html('<select name="toDate" id="date" class="selectToSlider" widget="selectToSlider" labels="18" tooltip="false" wrapperClass="selectToSliderWrapper selectToSliderMatchDate"><optgroup label="results"><option value="1399157999999">SAT 03</option><option value="1399244399999">SUN 04</option><option value="1399330799999">MON 05</option><option value="1399417199999" selected="selected">TUE 06</option><option value="1399503599999">WED 07</option><option value="1399849199999">SUN 11</option></optgroup><optgroup label="fixtures"></optgroup></select>')
h %>% html_nodes("option") %>% html_attr("value")
[1] "1399157999999" "1399244399999" "1399330799999"
[4] "1399417199999" "1399503599999" "1399849199999"
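
The extracted values are character strings holding milliseconds, so they still need converting before they can be used as dates. A short sketch in base R, reusing the division by 1000 from the question:

```r
# Values as returned by html_attr("value") above (character strings)
to.date <- c("1399157999999", "1399244399999", "1399330799999",
             "1399417199999", "1399503599999", "1399849199999")

# Milliseconds -> seconds, then to a date-time in GMT
stamps <- as.POSIXct(as.numeric(to.date) / 1000, origin = "1970-01-01", tz = "GMT")

# Keep only the calendar date
format(stamps, "%Y-%m-%d")
# [1] "2014-05-03" "2014-05-04" "2014-05-05" "2014-05-06" "2014-05-07" "2014-05-11"
```

These dates line up with the option labels in the HTML (SAT 03 through SUN 11).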

R - Extracting Tables From Websites Using XML Package

I am trying to replicate the method used in a previous answer ("Scraping html tables into R data frames using the XML package") for my own work, but cannot get the data to extract. The website I am using is:
http://www.footballfanalytics.com/articles/football/euro_super_league_table.html
I just wish to extract a table of each team name and their current rating score. My code is as follows:
library(XML)
theurl <- "http://www.footballfanalytics.com/articles/football/euro_super_league_table.html"
tables <- readHTMLTable(theurl)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
This produces the error message
Error in tables[[which.max(n.rows)]] :
attempt to select less than one element
Could anyone suggest a solution please? Is there something in this particular site causing this not to work? Or is there a better alternative method I can try? Thanks

It seems the data is loaded via JavaScript from a separate XML file. Try reading that file directly:
library(XML)
theurl <- "http://www.footballfanalytics.com/xml/esl/esl.xml"
doc <- xmlParse(theurl)
cbind(team = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
points = xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue))
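
To see how the XPath expressions map onto the document, here is a small offline sketch. The XML string is made up and only follows the /StatsData/Teams/Team layout used in the answer; the real file may contain more fields.

```r
library(XML)

# Made-up XML that follows the /StatsData/Teams/Team layout used above
xml.txt <- '<StatsData>
  <Teams>
    <Team><Name>Team A</Name><Points>100</Points></Team>
    <Team><Name>Team B</Name><Points>95</Points></Team>
  </Teams>
</StatsData>'

# asText = TRUE tells xmlParse the input is XML content, not a file name or URL
doc <- xmlParse(xml.txt, asText = TRUE)

# A data frame keeps the points numeric, whereas cbind would give a character matrix
ratings <- data.frame(
  team   = xpathSApply(doc, "/StatsData/Teams/Team/Name", xmlValue),
  points = as.numeric(xpathSApply(doc, "/StatsData/Teams/Team/Points", xmlValue)),
  stringsAsFactors = FALSE
)
ratings
```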
