Scraping in R, cannot get "onclick" attribute

I'm scraping the NFL website with R. R might not be the best tool for this, but that is not my question here.
I can usually get everything I want, but for the first time I have run into a problem.
In the present case I want to get info from, let's say, this page:
http://www.nfl.com/player/j.j.watt/2495488/profile
The info I want to get is the "Draft" entry on that profile page.
Using xpathSApply(parsedPage, xmlGetAttr, name = "onclick") I get only NULL, and I do not see why.
I could retrieve the information elsewhere in the code and then paste pieces together to recover the address, but I find it much easier and clearer to get it in one go.
How can I get this using R (or possibly C)? I do not know much about JavaScript, and I would be happy to avoid it.
Thanks in advance for the help.

The reason is that there are no "onclick" attributes in the source code. See (in Chrome):
view-source:http://www.nfl.com/player/j.j.watt/2495488/profile
The onclick attributes are added via JavaScript, so you need a parser that executes the JS.
In R you can use RSelenium for that, as follows:
require(RSelenium)
RSelenium::startServer()                  # start the local Selenium server
remDr <- remoteDriver()
remDr$open()                              # open a browser session
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- remDr$getPageSource()              # page HTML after the JS has run
require(rvest)
doc <- read_html(doc[[1]])
doc %>% html_nodes(".HOULink") %>% xml_attr("onclick")
remDr$close()
# shut down the Selenium server
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")
For me this resulted in:
[1] "s_objectID=\"http://www.nfl.com/teams/houstontexans/profile?team=HOU_1\";return this.s_oc?this.s_oc(e):true"
[2] "s_objectID=\"http://www.houstontexans.com/_2\";return this.s_oc?this.s_oc(e):true"
[3] "s_objectID=\"http://www.nfl.com/gamecenter/2015122004/2015/REG15/texans#colts/watch_1\";return this.s_oc?this.s_oc(e):true"
...
You can also use a headless browser like PhantomJS; see https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
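A rough sketch of that headless variant, following the vignette (hedged: phantom() shipped with the RSelenium versions current at the time):
pJS <- phantom()                          # start a phantomjs process
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- read_html(remDr$getPageSource()[[1]])
doc %>% html_nodes(".HOULink") %>% xml_attr("onclick")
remDr$close()
pJS$stop()                                # stop the phantomjs process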

Related

rvest webscraping with placeholder counts that update daily

I'm trying to scrape global daily counts of cases and deaths from JHU: https://coronavirus.jhu.edu/
The counts are visible when I use the web inspector, but when I try to access them with the following code, all I can find are placeholders:
library(rvest)
url <- "https://coronavirus.jhu.edu/"
website <- read_html(url)
cases <- website %>%
  html_nodes(css = "figure")
cases
produces the following:
{xml_nodeset (4)}
[1] <figure><figcaption>Global Confirmed</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[2] <figure><figcaption>Global Deaths</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[3] <figure><figcaption>U.S. Confirmed</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[4] <figure><figcaption>U.S. Deaths</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
So I can access these figures, but all that's stored in them is "Loading..." where the actual count appears on the site and in the web inspector. I'm new to this, so I appreciate any help you can give me. Thank you!
The numbers you're interested in are filled in by querying another source, so the page initially loads with the "Loading..." you're seeing. You can see this in action if you refresh the page: at first there's only "Loading..." in the boxes, which is later replaced. The trick here is to find the source that supplies that information and request it, rather than the page itself. Here, the page that https://coronavirus.jhu.edu/ pulls from is https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json, so we can query that directly.
url <- "https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json"
website = read_html(url)
website %>%
html_element(xpath = "//p") %>%
html_text() %>%
jsonlite::fromJSON() %>%
as.data.frame()
returning a neat data frame of
generated updated cases.global cases.US deaths.global deaths.US
1 1.637691e+12 1.637688e+12 258453277 47902038 5162675 772588
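As an aside, since the endpoint serves plain JSON, the HTML round-trip can likely be skipped entirely; jsonlite reads straight from a URL:
jsonlite::fromJSON("https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json")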
You were off to a good start by using the web inspector, but finding sources is a little trickier and usually requires Chrome's "Network" tab instead of the default "Elements" tab. In this case, I was able to find the source by slowing down the request speed (enabling throttling to "Slow 3G") and then watching for network activity that occurred after the initial page load. The only major update came from the URL I've suggested above:
(In the screenshot, the green bar in the top left was the original page loading; the second, lower green bar, highlighted in blue, was the next major update.)
I could then access that URL directly in Chrome (copy/paste it into the address bar) to see the raw JSON.
As an additional note, because you're web scraping I'd recommend the polite package, which checks and obeys the website's scraping rules. A quick check (minimal sketch below) reports the following for this site:
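library(polite)
bow("https://coronavirus.jhu.edu/")  # printing the returned session shows the robots.txt rules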
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent

Using RSelenium to web scrape Google Scholar

I'm trying to develop an academic network using information available from Google Scholar. Part of this involves scraping data from a pop-up window (I'm not actually sure what kind of window it is; it doesn't seem to be a regular window or an iframe) that is produced by clicking on an article title on an individual scholar's page.
I've been using RSelenium to perform this task. Below is the code I've developed so far for interacting with Google Scholar.
#Libraries----
library(RSelenium)
#Functions----
#Convenience function for simplifying data generated from .$findElements()
unPack <- function(x, opt = "text"){
  unlist(sapply(x, function(x){ x$getElementAttribute(opt) }))
}
#Analysis----
#Start up the server for Chrome.
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "http://scholar.google.com/citations?user=sc3TX6oAAAAJ&hl=en&oi=ao"
#Open the site.
remDr$navigate(siteAdd)
#Create a list of all the article titles
cite100Elem <- remDr$findElements(using = "css selector", value = "a.gsc_a_at")
cite100 <- unPack(cite100Elem)
#Start scraping the first article. I will create some kind of loop for all
# articles later.
#This opens the pop-up window with additional data I'm interested in.
citeTitle <- cite100[1]
citeElem <- remDr$findElement(using = 'link text', value = citeTitle)
citeElem$clickElement()
Here's where I get stuck. Looking at the underlying webpage with Chrome's developer tools, I can see the first bit of information I'm interested in, the authors of the article, which is associated with the following HTML:
<div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>
This suggests that I should be able to do something like:
#Extract all the information about the article.
articleElem <- remDr$findElements(value = '//*[#class="gsc_vcd_title"]')
articleInfo <- unPack(articleElem)
However, this solution doesn't seem to work; it returns NULL.
I'm hoping that someone out there has an R-based solution, because I know very little about JavaScript.
Lastly, if I search the text returned by the following code (which parses the page I'm currently on):
htmlOut <- XML::htmlParse(remDr$getPageSource()[[1]])
htmlOut
I can't find the CSS class associated with "gsc_vcd_title", which suggests to me that the page I'm interested in has a more complicated structure that I haven't quite figured out yet.
Any insights you have would be very welcome. Thanks!
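(A hedged sketch of one thing to try, not a verified fix: the pop-up appears to be rendered into the same page's DOM, the XPath above would need @class rather than #class, and the values in the quoted HTML carry the class gsc_vcd_value, so waiting briefly after the click and then selecting by that class may work:)
Sys.sleep(2)  # give the pop-up time to render
articleElem <- remDr$findElements(using = "css selector", value = ".gsc_vcd_value")
articleInfo <- unPack(articleElem)  # reuses the helper defined above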

Using R to download URL by linkname in a search function

I want to scrape information from this page for each month with a few parameters, download all returned articles, and look for some information in them.
Scraping works fine with a CSS selector, for example getting the article names:
library(rvest)
library(stringr)  # for str_replace_all() and str_trim() below
browseURL("http://www.sueddeutsche.de/news")
#headings Jan 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, css = ".entrylist__title")
headings1 <- html_text(headings_nodes1)
headings1 <- str_replace_all(headings1, "\\n|\\t|\\r", "") %>% str_trim()
head(headings1)
headings1
But now I want to download the articles for every entrylist__link that the search returns (for example here).
How can I do that? I followed advice here, because the URLs aren't regular and end in a different number for each article, but it doesn't work.
Somehow I'm not able to get the entrylist__link information from the href attribute.
I think getting all the links together in a vector is the biggest problem.
Can someone give me suggestions on how to get this to work?
Thank you in advance for any help.
If you right-click on the page and choose Inspect (I'm using the Chrome web browser), you can see the underlying markup in more detail. I was able to pull all the links under the headings:
library(rvest)
browseURL("http://www.sueddeutsche.de/news")
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, ".entrylist__link, a")
html_links <- html_attr(headings_nodes1, "href")
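Note that the selector ".entrylist__link, a" matches every anchor on the page, not just the entry-list links; if you only want the article links, a tighter selector should do it (a sketch, assuming the links themselves carry the entrylist__link class):
headings_nodes1 <- html_nodes(url_parsed1, "a.entrylist__link")
html_links <- html_attr(headings_nodes1, "href")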

Using R-selenium to scrape data from an aspx webpage

I am pretty new to R and Selenium, so hopefully I can express my question clearly.
I want to scrape some data off a website (.aspx), and I need to type a chemical code to pull up the information on the next page (using RSelenium to enter input and click elements). So far I have been able to build a short script that gets me through the first step, i.e. it pulls up the correct page. But I have had a lot of trouble finding a good way to scrape the data (the chemical information in the table) off that page, mainly because the website keeps the same .aspx address for every chemical I search instead of assigning a new URL. I plan to overcome this and then build a loop so I can scrape more information automatically. Does anyone have thoughts on how I should get the data after the click? I need the chemical information table on the second page.
Thanks heaps in advance!
Here is the code I have written so far; the next step is to scrape the table from the results page.
library("RSelenium")
checkForServer()  # these server helpers shipped with older RSelenium versions
startServer()
mybrowser <- remoteDriver()
mybrowser$open()
mybrowser$navigate("http://limitvalue.ifa.dguv.de/")
mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox <- mybrowser$findElement(using = 'css selector', "#Tbox_cas")
wxbox$sendKeysToElement(list("64-19-7"))
wxbutton <- mybrowser$findElement(using = 'css selector', "#Butsearch")
wxbutton$clickElement()
First of all, your tool choice is wrong: you don't need a full browser for this, because the flow can be reproduced with plain HTTP requests.
Secondly, in your case the request flow is:
1. POST to the "permanent" URL
2. a 302 redirect to a new URL, which is http://limitvalue.ifa.dguv.de/WebForm_ueliste2.aspx in your case
3. GET the new URL
A sketch of that flow follows below.
Thirdly, what's the ultimate output you are after? Whether automating this is worthwhile really depends on how much data you need; otherwise, do the task by hand.
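A minimal sketch of the POST-then-GET flow with httr and rvest (hedged: the field names Tbox_cas and Butsearch are borrowed from the question's CSS selectors and may not match the form's actual name attributes, and ASP.NET pages usually also require their hidden __VIEWSTATE/__EVENTVALIDATION fields, which are copied over from the form here):
library(httr)
library(rvest)
# fetch the search form and copy over its hidden ASP.NET state fields
start  <- GET("http://limitvalue.ifa.dguv.de/")
form   <- read_html(content(start, "text"))
hidden <- html_nodes(form, "input[type='hidden']")
fields <- setNames(as.list(html_attr(hidden, "value")), html_attr(hidden, "name"))
fields[["Tbox_cas"]]  <- "64-19-7"   # assumption: the input's name matches its id
fields[["Butsearch"]] <- "Search"    # hypothetical submit-button value
# POST the form; httr follows the 302 to WebForm_ueliste2.aspx automatically
res <- POST("http://limitvalue.ifa.dguv.de/", body = fields, encode = "form")
html_table(read_html(content(res, "text")), fill = TRUE)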

R, rvest and selectorGadget on Facebook

I've got a problem with rvest on Facebook. I've web-scraped many things with R before, so I understand how, for example, html_nodes works. I always use SelectorGadget and everything works. This time SelectorGadget doesn't work on the Facebook site, so I have to work with the raw HTML.
Let's say I've got this site https://www.facebook.com/avanti/posts/1017920341583065 and I want to extract the article title ('Karnawałowe stylizacje F&F'). How can I do it?
I've tried so far:
library("rvest")
link_fb <- "http://www.fb.com/103052579736517_1017920341583065"
html_strony <- read_html(link_fb)
html_text(html_nodes(html_strony, "mbs _6m6"))
but it doesn't work. I'd be really grateful for any help.
PS: I need the title as it appears in the post itself, not after clicking the link, because it could be different there.
I think you should use the Facebook API to download content and information from Facebook: see the Rfacebook R package and the Facebook API docs: https://developers.facebook.com/
You can also write your own R-to-Facebook-API connection with the httr package. Good luck!
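A bare-bones sketch of such an httr call against the Graph API (hedged: the field list is illustrative, you need your own valid access token, and the post ID is taken from the question):
library(httr)
token <- "YOUR_ACCESS_TOKEN"  # hypothetical placeholder; create a real token at developers.facebook.com
res <- GET("https://graph.facebook.com/v2.8/103052579736517_1017920341583065",
           query = list(fields = "message,link,name", access_token = token))
content(res)  # parsed response; the attached link's title typically appears in "name"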