Using R to download URL by linkname in a search function - html

I want to scrape information from this page for each month with a few parameters, download all returned articles, and look for some information.
Scraping works fine with a CSS selector, for example getting the article titles:
library(rvest)
library(stringr)   # for str_replace_all() / str_trim()

browseURL("http://www.sueddeutsche.de/news")

# headings, January 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")

headings_nodes1 <- html_nodes(url_parsed1, css = ".entrylist__title")
headings1 <- html_text(headings_nodes1)
headings1 <- str_replace_all(headings1, "\\n|\\t|\\r", "") %>% str_trim()
head(headings1)
headings1
But now I want to download the articles for every entrylist__link that the search returns (for example here).
How can I do that? I followed the advice here, because the URLs aren't regular and end in a different number for each article, but it doesn't work.
Somehow I'm not able to get the entrylist__link information from the href attribute.
I think getting all the links together in a vector is the biggest problem.
Can someone give me suggestions on how to get this to work?
Thank you in advance for any help.

If you right click on the page and click inspect (I'm using the Chrome web browser), you can see more detail in the underlying HTML. I was able to pull all the links under the headings:
library(rvest)
browseURL("http://www.sueddeutsche.de/news")
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
# ".entrylist__link, a" matches the entry-list links as well as every other <a> tag on the page
headings_nodes1 <- html_nodes(url_parsed1, ".entrylist__link, a")
html_links <- html_attr(headings_nodes1, "href")
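From there, one way to fetch each article is to keep only the hrefs that actually point at articles and read them in a loop. A minimal sketch (the grepl() filter and the "p" selector are assumptions, so inspect an article page to confirm what you need):
# keep only hrefs that look like article URLs (assumption: they contain "sueddeutsche.de")
article_links <- unique(html_links[grepl("sueddeutsche\\.de", html_links)])

# download each article and collapse its paragraph text into one string
articles <- lapply(article_links, function(link) {
  Sys.sleep(1)   # small pause between requests
  page <- read_html(link)
  paste(html_text(html_nodes(page, "p")), collapse = " ")
})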

Related

rvest webscraping with placeholder counts that update daily

I'm trying to scrape global daily counts of cases and deaths from JHU: https://coronavirus.jhu.edu/
The counts show up when I use the web inspector, but when I try to use the following code to access them, all I can find are placeholders:
library(rvest)

url <- "https://coronavirus.jhu.edu/"
website <- read_html(url)
cases <- website %>%
  html_nodes(css = "figure")
cases
produces the following:
{xml_nodeset (4)}
[1] <figure><figcaption>Global Confirmed</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[2] <figure><figcaption>Global Deaths</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[3] <figure><figcaption>U.S. Confirmed</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
[4] <figure><figcaption>U.S. Deaths</figcaption><p class="FeaturedStats_stat-placeholder__1Dax8">Loading...</p></figure>
So I can access these, but all that's stored in them is "Loading..." where the actual count appears on the site and in the web inspector. I'm new to this, so I appreciate any help you can give me. Thank you!
The numbers you're interested in are updated by querying another source, so the page initially loads with the "Loading..." that you're seeing. You can see this in action if you refresh the page: at first there's only "Loading..." in the boxes, which is filled in later. The trick here is to find the source that supplies that information and request that, rather than the page itself. Here, the page that https://coronavirus.jhu.edu/ pulls from is https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json, so we can query that directly.
url <- "https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json"
website = read_html(url)
website %>%
html_element(xpath = "//p") %>%
html_text() %>%
jsonlite::fromJSON() %>%
as.data.frame()
returning a neat data frame of
generated updated cases.global cases.US deaths.global deaths.US
1 1.637691e+12 1.637688e+12 258453277 47902038 5162675 772588
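Incidentally, since that endpoint serves plain JSON, a slightly shorter route (a sketch that skips the read_html() step entirely) is to hand the URL straight to jsonlite:
library(jsonlite)

# fromJSON() reads directly from a URL, so no HTML parsing step is needed
stats <- fromJSON("https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json")
as.data.frame(stats)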
You were off to a good start by using the web inspect, but finding sources is a little trickier and usually requires using Chrome's "Network" tab instead of the default "Elements". In this case, I was able to find the source by slowing down the request speed (enabling throttling to "Slow 3G") and then watching for network activity that occurred after the initial page load. The only major update came from the URL I've suggested above:
(Screenshot of the Network tab: the green bar in the top left was the original page loading; the second, lower green bar, highlighted in blue, was the next major update.)
which I could then access directly in Chrome (copy/paste URL) to see the raw JSON.
As an additional note, because you're web scraping, I'd recommend the polite package to obey the website's rules for scraping; the check I ran is included below:
robots.txt: 1 rules are defined for 1 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
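A minimal sketch of that check with polite (the user agent string is just an example):
library(polite)

# introduce yourself to the host and read its robots.txt rules
session <- bow("https://jhucoronavirus.azureedge.net/jhucoronavirus/homepage-featured-stats.json",
               user_agent = "my-scraper")
session   # printing the session shows the robots.txt summary above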

How to extract the table for each subject from the URL link in R - Webscraping

I'm trying to scrape the table for each subject:
This is the main link: https://htmlaccess.louisville.edu/classSchedule/setupSearchClassSchedule.cfm?error=0
I have to select each subject and click search, which takes me to the link https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm
Each subject gives a different table. For the subject Accounting, I tried to get the table as shown below; I used the SelectorGadget Chrome extension to get the node string for html_nodes:
library(rvest)
library(tidyr)
library(dplyr)
library(ggplot2)

url <- "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm"
df <- read_html(url)
tot <- df %>%
  html_nodes('table+ table td') %>%
  html_text()
But it didn't work:
## show
tot
character(0)
Is there a way to get the tables for each subject with R code?
Your problem is that the site requires a web form to be submitted - that's what happens when you click the "Search" button on the page. Without submitting that form, you won't be able to access the data. This is evident if you attempt to navigate to the link you're trying to scrape: punch it into your favorite web browser and you'll see that there are no tables at all at "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm". No wonder nothing shows up!
Fortunately, you can submit web forms with R. It requires a little bit more code, however. My favorite package for this is httr, which partners nicely with rvest. Here's the code that will submit a form using httr and then proceed with the rest of your code.
library(rvest)
library(dplyr)
library(httr)

request_body <- list(
  term = "4212",
  subject = "ACCT",
  catalognbr = "",
  session = "none",
  genEdCat = "none",
  writingReq = "none",
  comBaseCat = "none",
  sustainCat = "none",
  starttimedir = "0",
  starttimehour = "08",
  startTimeMinute = "00",
  endTimeDir = "0",
  endTimeHour = "22",
  endTimeMinute = "00",
  location = "any",
  classstatus = "0",
  Search = "Search"
)

resp <- httr::POST(
  url = paste0("https://htmlaccess.louisville.edu/class",
               "Schedule/searchClassSchedule.cfm"),
  encode = "form",
  body = request_body)

httr::status_code(resp)

df <- httr::content(resp)
tot <- df %>%
  html_nodes("table+ table td") %>%
  html_text() %>%
  matrix(ncol = 17, byrow = TRUE)
On my machine, that returns a nicely formatted matrix with the expected data. Now, the challenge was figuring out what the heck to put in the request body. For this, I use Chrome's "inspect" tool (right click on a webpage, hit "inspect"). On the "Network" tab of that side panel, you can track what information is being sent by your browser. If I start on the main page and keep that side tab up while I "search" for accounting, I see that the top hit is "searchClassSchedule.cfm" and open that up by clicking on it. There, you can see all the form fields that were submitted to the server and I simply copied those over into R manually.
Your job will be to figure out what shortened name the rest of the departments use! "ACCT" seems to be the one for "Accounting". Once you've got those names in a vector you can loop over them with a for loop or lapply statement:
dept_abbrevs <- c("ACCT", "AIRS")
lapply(dept_abbrevs, function(abbrev){
  ...code from above...
  ...after defining request_body...
  request_body$subject <- abbrev
  ...rest of the code...
})
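Spelled out a little more, here is a sketch that simply wraps the code from the answer above in a function (the two abbreviations are just examples):
get_dept_schedule <- function(abbrev) {
  body <- request_body
  body$subject <- abbrev                 # swap in the department abbreviation
  resp <- httr::POST(
    url = paste0("https://htmlaccess.louisville.edu/class",
                 "Schedule/searchClassSchedule.cfm"),
    encode = "form",
    body = body)
  httr::content(resp) %>%
    html_nodes("table+ table td") %>%
    html_text() %>%
    matrix(ncol = 17, byrow = TRUE)
}

dept_abbrevs <- c("ACCT", "AIRS")
schedules <- lapply(dept_abbrevs, get_dept_schedule)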

Scraping Weblinks out of a Website using rvest

I'm new to R and web scraping. I'm currently scraping a real-estate website (https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search), but I can't manage to scrape the links of the specific offers.
When using the code below, I get every link on the website, and I'm not quite sure how I can filter it so that it only scrapes the links of the 20 estate offers. Maybe you can help me.
Viewing the source code / inspecting the elements hasn't helped me so far.
library(rvest)

immo_webp <- read_html("https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search")
url <- immo_webp %>%
  html_nodes("a") %>%
  html_attr("href")
You can target the article tags and then construct the URLs from the data-obid attribute by concatenating with a base string:
library(rvest)
library(magrittr)

base <- 'https://www.immobilienscout24.de/expose/'
urls <- lapply(read_html("https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search") %>%
                 html_nodes('article') %>%
                 html_attr('data-obid'),
               function(url){ paste0(base, url) })
print(urls)
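From there, one way to follow each offer is to read the exposé pages in a loop; a minimal sketch (the "h1" selector for the listing title is an assumption, so verify it with the element inspector):
# visit each exposé URL and grab its title (selector is a guess)
offer_titles <- lapply(urls, function(u) {
  Sys.sleep(1)   # small pause between requests
  read_html(u) %>%
    html_node("h1") %>%
    html_text(trim = TRUE)
})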

Using RSelenium to web scrape Google Scholar

I'm trying to develop an academic network using information available from Google Scholar. Part of this involves scraping data from a pop-up window (not actually sure what kind of window it is - it doesn't seem to be a regular window or an iframe) that is produced from clicking on an article title on an individual scholar's page.
I've been using RSelenium to perform this task. Below is the code I've developed so far for interacting with Google Scholar.
#Libraries----
library(RSelenium)
#Functions----
#Convenience function for simplifying data generated from .$findElements()
unPack <- function(x, opt = "text"){
  unlist(sapply(x, function(x){ x$getElementAttribute(opt) }))
}
#Analysis----
#Start up the server for Chrome.
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "http://scholar.google.com/citations?user=sc3TX6oAAAAJ&hl=en&oi=ao"
#Open the site.
remDr$navigate(siteAdd)
#Create a list of all the article titles
cite100Elem <- remDr$findElements(using = "css selector", value = "a.gsc_a_at")
cite100 <- unPack(cite100Elem)
#Start scraping the first article. I will create some kind of loop for all
# articles later.
#This opens the pop-up window with additional data I'm interested in.
citeTitle <- cite100[1]
citeElem <- remDr$findElement(using = 'link text', value = citeTitle)
citeElem$clickElement()
Here's where I get stuck. Looking at the underlying webpage using Chrome's Developer tools, I can see that the first bit of information I'm interested in, the authors of the article, is associated with the following HTML:
<div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>
This suggests that I should be able to do something like:
#Extract all the information about the article.
articleElem <- remDr$findElements(value = '//*[@class="gsc_vcd_title"]')
articleInfo <- unPack(articleElem)
However, this solution doesn't seem to work; it returns a value of "NULL".
I'm hoping that someone out there has an R-based solution, because I know very little about JavaScript.
Last, if I search the resulting text from the following code (which parses the page I'm currently on):
htmlOut <- XML::htmlParse(remDr$getPageSource()[[1]])
htmlOut
I can't find the CSS class associated with "gsc_vcd_title", which suggests to me that the page I'm interested in has a more complicated structure that I haven't quite figured out yet.
Any insights you have would be very welcome. Thanks!
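For what it's worth, one hedged thing to try after clickElement() is to give the pop-up a moment to render and then query the class from the HTML snippet above (this assumes the modal is injected into the same DOM, which may not hold):
Sys.sleep(2)   # wait for the pop-up to render (2 seconds is an arbitrary guess)
valueElems <- remDr$findElements(using = "css selector", value = "div.gsc_vcd_value")
unPack(valueElems)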

Scraping in R, cannot get "onclick" attribute

I'm scraping the NFL website with R. R might not be the best tool for this, but that is not my question here.
I can usually get everything I want, but for the first time I've run into a problem.
In the present case I want to get info from, let's say, this page:
http://www.nfl.com/player/j.j.watt/2495488/profile
The info I want to get is in the "Draft" section.
Using xpathSApply(parsedPage, xmlGetAttr, name="onclick") I get only NULL... and I do not understand why.
I could retrieve the information elsewhere in the code and then paste it together to recover the address, but I find it much easier and clearer to get it all at once.
How can I get this using R (or possibly C)? I do not know much about JavaScript and would be happy to avoid it.
Thanks in advance for the help.
The reason is that there are no "onclick" attributes in the source code. See (in Chrome):
view-source:http://www.nfl.com/player/j.j.watt/2495488/profile
The onclick attributes are added via JavaScript. Because of that, you need a parser that executes the JS.
In R you can use RSelenium for that as follows:
require(RSelenium)
RSelenium::startServer()       # start the Selenium server (available in older RSelenium versions)
remDr <- remoteDriver()
remDr$open()                   # open a browser session
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- remDr$getPageSource()   # the HTML after the JS has run

require(rvest)
doc <- read_html(doc[[1]])
doc %>% html_nodes(".HOULink") %>% xml_attr("onclick")

remDr$close()
# shut down the Selenium server
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")
For me this resulted in:
[1] "s_objectID=\"http://www.nfl.com/teams/houstontexans/profile?team=HOU_1\";return this.s_oc?this.s_oc(e):true"
[2] "s_objectID=\"http://www.houstontexans.com/_2\";return this.s_oc?this.s_oc(e):true"
[3] "s_objectID=\"http://www.nfl.com/gamecenter/2015122004/2015/REG15/texans#colts/watch_1\";return this.s_oc?this.s_oc(e):true"
...
You can also use a headless browser like PhantomJS; see https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
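A minimal sketch of the headless route, assuming rsDriver() can provision a phantomjs binary on your machine (untested here):
library(RSelenium)
library(rvest)

rDp <- rsDriver(browser = "phantomjs")   # headless phantomjs session
remDr <- rDp[["client"]]
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- read_html(remDr$getPageSource()[[1]])
doc %>% html_nodes(".HOULink") %>% html_attr("onclick")
remDr$close()
rDp[["server"]]$stop()                   # stop the server when done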