Webscraping Links on a Page - html

I have this website over here: https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434%2C-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD
Using R, within the <div class = "cardcon"> section, I am trying to extract the hyperlink for each individual house on this page:
As an example, the desired output would be:
https://www.realtor.ca/real-estate/25054113/4918-lafontaine-hanmer
https://www.realtor.ca/real-estate/25054111/77-shady-shores-drive-w-winnipeg-waterside-estates
etc.
In a previous question (Webscraping R: no applicable method for 'read_xml' applied to an object of class "list"), I learned how to use the API for this website, but this was producing problems.
Instead, I would like to learn how to extract the links directly from this website (i.e. by web scraping) without using the API.
As an example, I tried to do this with the following code:
library(rvest)
library(httr)
library(XML)
url<-"https://www.realtor.ca/map#ZoomLevel=4&Center=58.695434%2C-96.000000&LatitudeMax=72.60462&LongitudeMax=-26.39063&LatitudeMin=35.66836&LongitudeMin=-165.60938&Sort=6-D&PropertyTypeGroupID=1&PropertySearchTypeId=1&TransactionTypeId=2&Currency=CAD"
# making http request
resource <- GET(url)
# converting all the data to HTML format
parse <- htmlParse(resource)
# scraping all the href attributes
links <- xpathSApply(parse, path="//a", xmlGetAttr, "href")
page <-read_html(links)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
But I am not sure how to proceed with this - can someone please help me out?
Thank you!
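A note on the error itself: read_html() expects a single URL, a string of HTML, or an httr response, whereas links is the list returned by xpathSApply(), so the call fails. The following is a minimal sketch of extracting hrefs from an already parsed page. It assumes the listings sit in anchors inside div.cardcon (the class name is taken from the question) and uses invented markup in place of the live page. Everything after the # in the URL is a fragment that the browser never sends to the server, and the listings are most likely filled in by JavaScript, so a plain GET may well return a page without these anchors.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page; only the class "cardcon" and
# the two listing paths come from the question, the markup is invented.
html <- '<div class="cardcon">
  <a href="/real-estate/25054113/4918-lafontaine-hanmer">Listing 1</a>
  <a href="/real-estate/25054111/77-shady-shores-drive-w-winnipeg-waterside-estates">Listing 2</a>
</div>'

# Parse once, then query the parsed document (do not call read_html() on the links)
page <- read_html(html)

links <- page %>%
  html_nodes("div.cardcon a") %>%   # anchors inside the card container
  html_attr("href")                 # character vector of href values

# Turn site-relative paths into full URLs
full_links <- xml2::url_absolute(links, "https://www.realtor.ca/")
print(full_links)
```

If the real page turns out to be rendered by JavaScript, the same selector logic can be applied to page source obtained from a browser-driven tool instead.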

Related

Scraping Weblinks out of an Website using Rvest

I'm new to R and web scraping. I'm currently scraping a real estate website (https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search), but I can't manage to scrape the links of the specific offers.
When using the code below, I get every link on the page, and I'm not sure how to filter it so that it only returns the links of the 20 property offers. Maybe you can help me.
Viewing the source code and inspecting the elements hasn't helped me so far.
url <- immo_webp %>%
html_nodes("a") %>%
html_attr("href")
You can target the article tags and then construct the URLs by concatenating a base string with each tag's data-obid attribute:
library(rvest)
library(magrittr)
base = 'https://www.immobilienscout24.de/expose/'
urls <- lapply(read_html("https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search")%>%
html_nodes('article')%>%
html_attr('data-obid'), function (url){paste0(base, url)})
print(urls)
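Since paste0() is vectorised, the same idea can be sketched without lapply(), which yields a plain character vector rather than a list. The markup below is invented so the example runs offline. Only the article tag, the data-obid attribute and the base string come from the answer above, and the ids are made up.

```r
library(rvest)

# Invented markup standing in for the search results page
html <- '<ul>
  <li><article data-obid="111">Offer A</article></li>
  <li><article data-obid="222">Offer B</article></li>
</ul>'

base <- "https://www.immobilienscout24.de/expose/"

ids <- read_html(html) %>%
  html_nodes("article") %>%
  html_attr("data-obid")

# paste0() recycles the base string over every id
urls <- paste0(base, ids)
print(urls)
```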

How to scrape data within a input tag from an iframe using R

I'm trying to scrape data from a property portal for an academic project. The data I'm interested in is the price trends, which sit in an iframe. I want to get the data for the upper, average and lower range. This data is stored in an input tag. I'm trying to scrape it by referring to the parent class and then to the input tag, but I can't get to the data.
There are many iframes which I need to scrape but one of them is this
The code I've tried is below, but I don't get the desired result.
#Specifying the url of the iframe to be scraped
url <- 'https://www.99acres.com/do/pricetrends?building_id=0&loc_id=12400&prop_type=1&pref=S&bed_no=0&w=600&h=350'
#Reading the HTML code from the website
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
webpage <- read_html("scrapedpage.html")
PriceTrend_data_html <- html_nodes(webpage,'.ptplay input')
PriceTrend_data_html
It would be of immense help if someone could guide me here.
I was able to solve it on my own after some research, so I am posting it here in case anyone else encounters the same issue in future. I could not read the HTML file with read_html() after downloading it with download.file(), so I had to download the file manually and then work on it.
Since the data was stored only in the input tag, I selected the input tag by its id and read its attribute, which gave me the data I wanted. Here is the piece of code that worked for me.
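library(rvest)
# Parse the manually saved copy of the iframe page
url <- read_html("scrapedpage_chart.html")
# The values are stored as a comma-separated string in the "median" attribute
average_prices <- html_attr(html_nodes(url, "#priceTrendVariables"), "median")
# Turn the literal string 'null' into 'NA' so that as.numeric() yields NA
average_prices <- gsub(pattern = 'null',replacement = 'NA',x = average_prices)
average_prices <- unlist(strsplit(average_prices,split = ","))
average_prices <- as.numeric(average_prices)
average_prices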
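For anyone who wants to try the attribute approach without the saved file, here is a small offline sketch. It assumes the input tag carries a comma-separated median attribute as described above. The markup and numbers are invented, and the missing values are handled by replacing "null" entries after splitting rather than with gsub().

```r
library(rvest)

# Invented markup: only the id "priceTrendVariables", the class "ptplay" and
# the attribute name "median" come from the post; the numbers are made up.
html <- '<div class="ptplay"><input type="hidden" id="priceTrendVariables" median="4500,null,4700"></div>'

raw_value <- read_html(html) %>%
  html_nodes("#priceTrendVariables") %>%
  html_attr("median")

parts <- unlist(strsplit(raw_value, split = ","))
parts[parts == "null"] <- NA       # mark missing entries explicitly
average_prices <- as.numeric(parts)
print(average_prices)
```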

Using R to download URL by linkname in a search function

I want to scrape information from this page for each month with a few parameters, download all returning articles and look for some information.
Scraping works fine with css selector, for example getting the article names:
library(rvest)
library(stringr) # provides str_replace_all() and str_trim()
browseURL("http://www.sueddeutsche.de/news")
#headings Jan 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, css = ".entrylist__title")
headings1 <- html_text(headings_nodes1)
headings1 <- str_replace_all(headings1, "\\n|\\t|\\r", "") %>% str_trim()
head(headings1)
headings1
But now I want to download the articles for every entrylist__link that the search returns (for example here).
How can I do that? I followed the advice here, because the URLs aren't regular and end in a different number for each article, but it doesn't work.
Somehow I'm not able to get the entrylist__link elements together with their href values.
I think getting all the links together in a vector is the biggest problem.
Can someone give me suggestions on how to get this to work?
Thank you in advance for any help.
If you right-click on the page and click Inspect (I'm using the Chrome web browser), you can see more detail about the underlying HTML. I was able to pull all the links under the headings:
library(rvest)
browseURL("http://www.sueddeutsche.de/news")
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, ".entrylist__link, a")
html_links <- html_attr(headings_nodes1, "href")
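Note that ".entrylist__link, a" is a union selector, so it matches every anchor on the page as well as the entry links. A minimal sketch of restricting the result to the article links only, assuming they are anchors carrying the class entrylist__link; the markup and the example.org URLs below are invented so the code runs offline:

```r
library(rvest)

# Invented markup: only the class names come from the thread above
html <- '<div>
  <a class="entrylist__link" href="https://example.org/politik/article-1"><em class="entrylist__title">First</em></a>
  <a class="entrylist__link" href="https://example.org/politik/article-2"><em class="entrylist__title">Second</em></a>
  <a href="/impressum">Imprint</a>
</div>'

article_links <- read_html(html) %>%
  html_nodes("a.entrylist__link") %>%   # only anchors with the entry class
  html_attr("href")

print(article_links)

# With real URLs, each article could then be fetched in turn, for example:
# articles <- lapply(article_links, read_html)
```

The result is a plain character vector, so it can be passed to lapply() or a loop to download each article.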

Find absolute html path given relative href using R

I'm new to HTML but am playing with a script to download all PDF files that a given webpage links to (for fun and to avoid boring manual work). I can't find where in the HTML document I should look for the data that completes relative paths. I know it is possible, since my web browser can do it.
Example: I am trying to scrape the lecture notes linked to on this page from ocw.mit.edu using the R package rvest. Whether I look at the raw HTML or access the href attribute of the nodes, I only get relative paths:
library(rvest)
url <- paste0("https://ocw.mit.edu/courses/",
"electrical-engineering-and-computer-science/",
"6-006-introduction-to-algorithms-fall-2011/lecture-notes/")
# Read webpage and extract all links
links_all <- read_html(url) %>%
html_nodes("a") %>%
html_attr("href")
# Extract only href ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)
links_pdf[1]
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"
The easiest solution that I have found as of today is using the url_absolute(x, base) function of the xml2 package. For the base parameter, you use the url of the page you retrieved the source from.
This seems less error prone than trying to extract the base url of the address via regexp.
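A minimal sketch of that call, using the page URL and the relative href from the question as inputs; the variable names are only for illustration:

```r
library(xml2)

# URL of the page the links were scraped from (taken from the question)
base <- paste0("https://ocw.mit.edu/courses/",
               "electrical-engineering-and-computer-science/",
               "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Relative href as returned by html_attr() in the question
rel <- paste0("/courses/electrical-engineering-and-computer-science/",
              "6-006-introduction-to-algorithms-fall-2011/",
              "lecture-videos/mit6_006f11_lec01.pdf")

# Resolve the relative path against the page URL, as a browser would
abs_link <- url_absolute(rel, base)
print(abs_link)
```

url_absolute() is vectorised over its first argument, so the whole links_pdf vector can be passed in one call.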

Scraping in R, cannot get "onclick" attribute

I'm scraping the NFL website with R. R might not be the best to do this but that is not my question here.
I can usually get everything I want, but this time I have run into a problem.
In the present case I want to get info from, let's say, this page:
http://www.nfl.com/player/j.j.watt/2495488/profile
The info I want to get is there
Draft
Using xpathSApply(parsedPage, xmlGetAttr, name="onclick") I get only NULL, and I do not understand why.
I could retrieve the information elsewhere in the code and then paste it together to recover the address, but I find it much easier and clearer to get it in one go.
How can I get this using R, or possibly C? I do not know much about JavaScript, so I would be happy to avoid it.
Thanks in advance for the help.
The reason is that there are no "onclick"-attributes in the sourcecode: See (in Chrome)
view-source:http://www.nfl.com/player/j.j.watt/2495488/profile
The onclick-attributes are added via javascript. Because of that you need a parser that executes the JS.
In R you can use RSelenium for that as follows:
require(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.nfl.com/player/j.j.watt/2495488/profile")
doc <- remDr$getPageSource()
require(rvest)
doc <- read_html(doc[[1]])
doc %>% html_nodes(".HOULink") %>% html_attr("onclick")
remDr$close()
#shutdown
browseURL("http://localhost:4444/selenium-server/driver/?cmd=shutDownSeleniumServer")
For me this resulted in:
[1] "s_objectID=\"http://www.nfl.com/teams/houstontexans/profile?team=HOU_1\";return this.s_oc?this.s_oc(e):true"
[2] "s_objectID=\"http://www.houstontexans.com/_2\";return this.s_oc?this.s_oc(e):true"
[3] "s_objectID=\"http://www.nfl.com/gamecenter/2015122004/2015/REG15/texans#colts/watch_1\";return this.s_oc?this.s_oc(e):true"
...
You can also use a headless browser like PhantomJS; see https://cran.r-project.org/web/packages/RSelenium/vignettes/RSelenium-headless.html
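Once the onclick values are available, the address can be pulled out of the s_objectID part with a regular expression. The sketch below works on two of the strings printed above. It assumes that the trailing "_1", "_2" is a counter appended by the tracking script rather than part of the URL, which is a guess based on the output and not something the page documents.

```r
# Two of the onclick strings from the output above
onclick <- c(
  "s_objectID=\"http://www.nfl.com/teams/houstontexans/profile?team=HOU_1\";return this.s_oc?this.s_oc(e):true",
  "s_objectID=\"http://www.houstontexans.com/_2\";return this.s_oc?this.s_oc(e):true"
)

# Keep only what sits between the quotes after s_objectID=
object_id <- sub('^s_objectID="([^"]*)".*$', "\\1", onclick)

# Assumption: drop the trailing "_<number>" suffix to recover the plain address
urls <- sub("_[0-9]+$", "", object_id)
print(urls)
```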