Find absolute URL given a relative href using R - html

I'm new to HTML, but I'm playing with a script to download all PDF files that a given webpage links to (for fun and to avoid boring manual work), and I can't find where in the HTML document I should look for the data that completes relative paths. I know it must be possible, since my web browser can do it.
Example: I'm trying to scrape lecture notes linked from this page on ocw.mit.edu using the R package rvest. Looking at the raw HTML, or accessing the href attribute of the nodes, I only get relative paths:
library(rvest)

url <- paste0("https://ocw.mit.edu/courses/",
              "electrical-engineering-and-computer-science/",
              "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Read webpage and extract all links
links_all <- read_html(url) %>%
  html_nodes("a") %>%
  html_attr("href")

# Extract only hrefs ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)

links_pdf[1]
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"

The easiest solution I have found so far is the url_absolute(x, base) function from the xml2 package. For the base parameter, use the URL of the page you retrieved the source from.
This seems less error-prone than trying to extract the base URL from the address with a regexp.
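As a minimal sketch putting that together with the code from the question (the "pdfs/" destination directory is just an illustrative choice; everything else reuses calls shown above or base R):
library(rvest)
library(xml2)

url <- paste0("https://ocw.mit.edu/courses/",
              "electrical-engineering-and-computer-science/",
              "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

links_all <- read_html(url) %>%
  html_nodes("a") %>%
  html_attr("href")

links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)

# Resolve each relative href against the URL the page was read from
links_abs <- url_absolute(links_pdf, base = url)

# Download every PDF into a local "pdfs/" directory
dir.create("pdfs", showWarnings = FALSE)
for (u in links_abs) {
  download.file(u, destfile = file.path("pdfs", basename(u)), mode = "wb")
}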

Related

Webscraping Links on a Page

I have this website over here: https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434%2C-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD
Using R, within the <div class = "cardcon"> section, I am trying to extract the hyperlink for each individual house on this page:
As an example, the desired output would be:
https://www.realtor.ca/real-estate/25054113/4918-lafontaine-hanmer
https://www.realtor.ca/real-estate/25054111/77-shady-shores-drive-w-winnipeg-waterside-estates
etc.
In a previous question (Webscraping R: no applicable method for 'read_xml' applied to an object of class "list"), I learned how to use the API for this website, but this was producing problems.
Instead, I would like to try and learn how to extract the links (e.g. webscraping) directly from this website without using the API.
As an example, I tried to do this with the following code:
library(rvest)
library(httr)
library(XML)

url <- "https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434%2C-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"

# making the http request
resource <- GET(url)

# parsing the response as HTML
parse <- htmlParse(resource)

# scraping all the href attributes
links <- xpathSApply(parse, path = "//a", xmlGetAttr, "href")

page <- read_html(links)
Error in UseMethod("read_xml") :
  no applicable method for 'read_xml' applied to an object of class "list"
But I am not sure how to proceed with this - can someone please help me out?
Thank you!
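For reference, the error at the end arises because read_html() expects a single URL or document rather than the whole list of hrefs. A minimal sketch of handling them one at a time (assuming the listing links even appear in the static HTML, which is doubtful on this JavaScript-rendered map page; the grep pattern is only an illustrative filter based on the desired output above) would be:
# Keep only hrefs that look like individual listings, then read each page separately
listing_links <- grep("^https://www.realtor.ca/real-estate/", unlist(links), value = TRUE)
pages <- lapply(listing_links, read_html)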

How to extract the table for each subject from the URL link in R - Webscraping

I'm trying to scrape the table for each subject:
This is the main link: https://htmlaccess.louisville.edu/classSchedule/setupSearchClassSchedule.cfm?error=0
I have to select each subject and click Search, which takes me to the link https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm
Each subject gives a different table. For the subject Accounting, I tried to get the table as shown below; I used the SelectorGadget Chrome extension to get the node string for html_nodes:
library(rvest)
library(tidyr)
library(dplyr)
library(ggplot2)

url <- "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm"
df <- read_html(url)
tot <- df %>%
  html_nodes('table+ table td') %>%
  html_text()
But it didn't work:
## show
tot
character(0)
Is there a way to get the tables for each subject in a code with R?
Your problem is that the site requires a web form to be submitted - that's what happens when you click the "Search" button on the page. Without submitting that form, you won't be able to access the data. This is evident if you navigate to the link you're trying to scrape: punch it into your favorite web browser and you'll see that there are no tables at all at "https://htmlaccess.louisville.edu/classSchedule/searchClassSchedule.cfm". No wonder nothing shows up!
Fortunately, you can submit web forms with R. It requires a little more code, however. My favorite package for this is httr, which partners nicely with rvest. Here's the code that submits the form with httr and then proceeds with the rest of your code.
library(rvest)
library(dplyr)
library(httr)

request_body <- list(
  term = "4212",
  subject = "ACCT",
  catalognbr = "",
  session = "none",
  genEdCat = "none",
  writingReq = "none",
  comBaseCat = "none",
  sustainCat = "none",
  starttimedir = "0",
  starttimehour = "08",
  startTimeMinute = "00",
  endTimeDir = "0",
  endTimeHour = "22",
  endTimeMinute = "00",
  location = "any",
  classstatus = "0",
  Search = "Search"
)

resp <- httr::POST(
  url = paste0("https://htmlaccess.louisville.edu/class",
               "Schedule/searchClassSchedule.cfm"),
  encode = "form",
  body = request_body)

httr::status_code(resp)

df <- httr::content(resp)
tot <- df %>%
  html_nodes("table+ table td") %>%
  html_text() %>%
  matrix(ncol = 17, byrow = TRUE)
On my machine, that returns a nicely formatted matrix with the expected data. The challenge was figuring out what to put in the request body. For this, I use Chrome's "Inspect" tool (right-click on a webpage, hit "Inspect"). On the "Network" tab of that side panel, you can track what information your browser sends. If I start on the main page and keep that panel open while I "search" for Accounting, the top hit is "searchClassSchedule.cfm"; opening it by clicking on it shows all the form fields that were submitted to the server, and I simply copied those over into R manually.
Your job will be to figure out the shortened names the rest of the departments use! "ACCT" seems to be the one for "Accounting". Once you've got those names in a vector, you can loop over them with a for loop or an lapply statement:
dept_abbrevs <- c("ACCT", "AIRS")
lapply(dept_abbrevs, function(abbrev){
  # ...code from above...
  # ...after defining the request body...
  request_body$subject <- abbrev
  # ...rest of the code...
})
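Spelled out as a self-contained sketch (only "ACCT" is confirmed above as the Accounting code; any other abbreviation, including "AIRS", needs to be checked against the site's form):
library(rvest)
library(dplyr)
library(httr)

fetch_subject <- function(abbrev) {
  # Same form fields as above, with the subject swapped in
  request_body <- list(
    term = "4212",
    subject = abbrev,
    catalognbr = "",
    session = "none",
    genEdCat = "none",
    writingReq = "none",
    comBaseCat = "none",
    sustainCat = "none",
    starttimedir = "0",
    starttimehour = "08",
    startTimeMinute = "00",
    endTimeDir = "0",
    endTimeHour = "22",
    endTimeMinute = "00",
    location = "any",
    classstatus = "0",
    Search = "Search"
  )
  resp <- httr::POST(
    url = paste0("https://htmlaccess.louisville.edu/class",
                 "Schedule/searchClassSchedule.cfm"),
    encode = "form",
    body = request_body
  )
  # Parse the response and reshape the table cells as before
  httr::content(resp) %>%
    html_nodes("table+ table td") %>%
    html_text() %>%
    matrix(ncol = 17, byrow = TRUE)
}

dept_abbrevs <- c("ACCT", "AIRS")  # "AIRS" is an unverified example abbreviation
results <- lapply(dept_abbrevs, fetch_subject)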

Scraping Weblinks out of a Website using Rvest

I'm new to R and web scraping. I'm currently scraping a real estate website (https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search), but I can't manage to scrape the links of the specific offers.
When using the code below, I get every link attached to the website, and I'm not quite sure how I can filter it so that it only scrapes the links of the 20 estate offers. Maybe you can help me.
Viewing the source code / inspecting the elements didn't help me so far...
url <- immo_webp %>%
  html_nodes("a") %>%
  html_attr("href")
You can target the article tags and then construct the URLs from the data-obid attribute by concatenating with a base string:
library(rvest)
library(magrittr)

base <- 'https://www.immobilienscout24.de/expose/'

urls <- lapply(read_html("https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search") %>%
                 html_nodes('article') %>%
                 html_attr('data-obid'),
               function(url) { paste0(base, url) })

print(urls)
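As a small follow-up (not part of the original answer): since paste0() is vectorised, the same URLs can also be collected directly into a character vector instead of a list:
library(rvest)

# Scrape the listing ids and build the expose URLs in one vectorised step
listing_ids <- read_html("https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search") %>%
  html_nodes('article') %>%
  html_attr('data-obid')

urls <- paste0('https://www.immobilienscout24.de/expose/', listing_ids)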

How to scrape data within a input tag from an iframe using R

I'm trying to scrape data from a property portal for an academic project. The data I'm interested in is the price trend, and it's in an iframe. I want to get the data for the upper, average and lower ranges. This data is stored in an input tag. I'm trying to scrape it by referring to the parent class and then to the input tag, but I can't get to the data.
There are many iframes I need to scrape, but one of them is this:
The code I've tried is below, but I don't get the desired result.
#Specifying the url of the iframe to be scraped
url <- 'https://www.99acres.com/do/pricetrends?building_id=0&loc_id=12400&prop_type=1&pref=S&bed_no=0&w=600&h=350'
#Reading the HTML code from the website
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
webpage <- read_html("scrapedpage.html")
PriceTrend_data_html <- html_nodes(webpage,'.ptplay input')
PriceTrend_data_html
It would be of immense help if someone could guide me here.
I was able to solve it on my own after some research, so I'm posting it here in case anyone else encounters the same issue in the future. I could not read the HTML file with read_html() when downloading it via download.file(), so I had to download the file manually and then work on it.
Since the data was within the input tag only, I scraped the relevant attribute of the input tag (selected by its id) and got the data I wanted. Here is the piece of code that worked for me.
url <- read_html("scrapedpage_chart.html")

average_prices <- html_attr(html_nodes(url, "#priceTrendVariables"), "median")
average_prices <- gsub(pattern = 'null', replacement = 'NA', x = average_prices)
average_prices <- unlist(strsplit(average_prices, split = ","))
average_prices <- as.numeric(average_prices)
average_prices

Using R to download URL by linkname in a search function

I want to scrape information from this page for each month with a few parameters, download all returned articles and look for some information in them.
Scraping works fine with css selector, for example getting the article names:
library(rvest)
library(stringr)

browseURL("http://www.sueddeutsche.de/news")

# headings Jan 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, css = ".entrylist__title")
headings1 <- html_text(headings_nodes1)
headings1 <- str_replace_all(headings1, "\\n|\\t|\\r", "") %>% str_trim()

head(headings1)
headings1
But now I want to download the articles for every entrylist__link that the search returns (for example here).
How can I do that? I followed advice here, because the URLs aren't regular and have different numbers for each article at the end, but it doesn't work.
Somehow I'm not able to get the entrylist__link information via the href attribute.
I think getting all the links together in a vector is the biggest problem.
Can someone give me suggestions on how to get this to work?
Thank you in advance for any help.
If you right-click on the page and click Inspect (I'm using the Chrome web browser), you can see more detail of the underlying HTML. I was able to pull all the links under the headings:
library(rvest)
browseURL("http://www.sueddeutsche.de/news")
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, ".entrylist__link, a")
html_links <- html_attr(headings_nodes1, "href")
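To then actually fetch each linked article, a minimal follow-up sketch (assuming the hrefs come back as absolute http(s) URLs; relative ones would first need to be resolved with xml2::url_absolute()) could be:
# Keep only absolute http(s) links and drop duplicates
article_links <- unique(html_links[grepl("^http", html_links)])

# Read every linked page; extracting text from them depends on the article page structure
article_pages <- lapply(article_links, read_html)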