Scraping web links out of a website using rvest - html

I'm new to R and web scraping. I'm currently scraping a real estate website (https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search), but I can't manage to scrape the links of the individual offers.
With the code below, I get every link on the page, and I'm not sure how to filter the result so that it contains only the links of the 20 real estate offers. Maybe you can help me.
Viewing the source code and inspecting the elements hasn't helped so far.
url <- immo_webp %>%
html_nodes("a") %>%
html_attr("href")

You can target the article tags and then construct the URLs from the data-obid attribute by concatenating it with a base string:
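library(rvest)
library(magrittr)
# Base URL that every offer ID is appended to
base <- 'https://www.immobilienscout24.de/expose/'
page <- read_html("https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Rheinland-Pfalz/Koblenz?enteredFrom=one_step_search")
# Each offer sits in an <article> tag; its ID is in the data-obid attribute
ids <- page %>%
  html_nodes('article') %>%
  html_attr('data-obid')
# paste0() is vectorised, so no lapply() is needed
urls <- paste0(base, ids)
print(urls)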
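To try the same idea without hitting the live site, here is a minimal sketch on a small made-up HTML snippet. The markup and the IDs 111 and 222 are assumptions for illustration only, not the real page. It also drops article tags that carry no data-obid attribute, which would otherwise produce URLs ending in "NA".

```r
library(rvest)

# Hypothetical stand-in for the search results page (not the real markup)
page <- read_html('<html><body>
  <a href="/impressum">Imprint</a>
  <article data-obid="111"><a href="/expose/111">Flat A</a></article>
  <article data-obid="222"><a href="/expose/222">Flat B</a></article>
  <article><p>Advertisement without an offer ID</p></article>
</body></html>')

base <- "https://www.immobilienscout24.de/expose/"

# Read the data-obid attribute of every <article> tag
ids <- page %>%
  html_nodes("article") %>%
  html_attr("data-obid")

# html_attr() returns NA where the attribute is missing, so filter those out
ids <- ids[!is.na(ids)]

urls <- paste0(base, ids)
print(urls)
```

Only the two articles with an ID end up in urls; the imprint link and the article without an ID are ignored.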

Related

How do you identify CSS Selectors when inspecting HTML code?

I'm trying to web scrape housing data using the rvest package, but I'm having difficulty identifying html nodes. My (generic) R code is as follows:
housing_wp <- read_html("webpage")
address <- housing_wp %>%
html_nodes("a.unhideListingLink") %>%
html_text()
address
The html code inspected is in the following link that contains the data I am trying to input into a table in R. What am I doing wrong?
HTML

Rvest to scrape Linkedin profile

I am using rvest to scrape my LinkedIn profile.
But I am stuck at the experience section.
The XPath below is meant to scrape the experience section, but it returns an empty node set (0 nodes).
Test <- read %>%
html_nodes(xpath = '//*[@id="experience-section"]')
Thanks!!
Hello, instead of scraping you can use their API.
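For reference, an element can be matched by its id either with XPath (written @id) or with a CSS selector (written #id). The sketch below shows both on a made-up snippet; the HTML is an assumption for illustration and says nothing about what LinkedIn actually serves. On a live page an empty result can also simply mean the element is not present in the HTML that was downloaded.

```r
library(rvest)

# Hypothetical markup, only used to show the two selector syntaxes
doc <- read_html('<html><body>
  <section id="experience-section"><h2>Experience</h2></section>
</body></html>')

# XPath: attributes are addressed with @
by_xpath <- html_nodes(doc, xpath = '//*[@id="experience-section"]')

# CSS: ids are addressed with #
by_css <- html_nodes(doc, "#experience-section")

length(by_xpath)
length(by_css)
```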

How to scrape data within a input tag from an iframe using R

I'm trying to scrape data from a property portal for an academic project. The data I'm interested in is the price trends, which sit in an iframe. I want the data for the upper, average, and lower range. This data is stored in an input tag. I'm trying to reach it by referring to the parent class and then to the input tag, but I can't get to the data.
There are many iframes I need to scrape, but one of them is this
The code I've tried is below, but I don't get the desired result.
#Specifying the url of the iframe to be scraped
url <- 'https://www.99acres.com/do/pricetrends?building_id=0&loc_id=12400&prop_type=1&pref=S&bed_no=0&w=600&h=350'
#Reading the HTML code from the website
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
webpage <- read_html("scrapedpage.html")
PriceTrend_data_html <- html_nodes(webpage,'.ptplay input')
PriceTrend_data_html
It would be of immense help if someone could guide me here.
I was able to solve it on my own after some research, so I'm posting it here in case anyone else runs into the same issue. I could not read the HTML file with read_html() when I downloaded it with download.file(), so I had to download the file manually and then work on it.
Since the data was inside the input tag, I selected the input tag by its id, read its attribute, and got the data I wanted. Here is the code that worked for me.
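library(rvest)
webpage <- read_html("scrapedpage_chart.html")
# The values are stored as a comma-separated string in the "median" attribute
average_prices <- html_attr(html_nodes(webpage, "#priceTrendVariables"), "median")
# Turn the literal string 'null' into 'NA' so it converts cleanly
average_prices <- gsub(pattern = 'null', replacement = 'NA', x = average_prices)
average_prices <- unlist(strsplit(average_prices, split = ","))
average_prices <- as.numeric(average_prices)
average_prices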
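The same steps can be tried on an inline snippet. This is a minimal sketch: the input tag and its values are invented for illustration, and only the id priceTrendVariables and the attribute name median are taken from the code above.

```r
library(rvest)

# Hypothetical input tag; the numbers are made up
doc <- read_html('<html><body>
  <input type="hidden" id="priceTrendVariables" median="4100,null,4250,4300">
</body></html>')

# Read the comma-separated string from the attribute
raw <- html_attr(html_nodes(doc, "#priceTrendVariables"), "median")

# Split into single values and mark the "null" entries as missing
parts <- unlist(strsplit(raw, split = ","))
parts[parts == "null"] <- NA

average_prices <- as.numeric(parts)
average_prices
```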

Using R to download URL by linkname in a search function

I want to scrape information from this page for each month with a few parameters, download all the articles returned, and look for some information in them.
Scraping works fine with a CSS selector, for example for getting the article names:
library(rvest)
library(stringr)  # provides str_replace_all() and str_trim() used below
browseURL("http://www.sueddeutsche.de/news")
#headings Jan 2015
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, css = ".entrylist__title")
headings1 <- html_text(headings_nodes1)
headings1 <- str_replace_all(headings1, "\\n|\\t|\\r", "") %>% str_trim()
head(headings1)
headings1
But now I want to download the articles for every entrylist__link that the search returns (for example here).
How can I do that? I followed the advice here, because the URLs aren't regular and end in a different number for each article, but it doesn't work.
Somehow I'm not able to get the entrylist__link elements together with their href information.
I think collecting all the links in a vector is the biggest problem.
Can someone give me suggestions on how to get this to work?
Thank you in advance for any help.
If you right-click on the page and click Inspect (I'm using the Chrome web browser), you can see more detail on the underlying HTML. I was able to pull all the links under the headings:
library(rvest)
browseURL("http://www.sueddeutsche.de/news")
url_parsed1 <- read_html("http://www.sueddeutsche.de/news?search=Fl%C3%BCchtlinge&sort=date&dep%5B%5D=politik&typ%5B%5D=article&sys%5B%5D=sz&catsz%5B%5D=alles&time=2015-01-01T00%3A00%2F2015-12-31T23%3A59&startDate=01.01.2015&endDate=31.01.2015")
headings_nodes1 <- html_nodes(url_parsed1, ".entrylist__link, a")
html_links <- html_attr(headings_nodes1, "href")
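One thing to keep in mind with the selector above: in CSS a comma means "either of these", so ".entrylist__link, a" matches every a tag on the page as well as the entry links. If only the article links are wanted, a compound selector such as "a.entrylist__link" may be closer to the goal. The sketch below compares the two on made-up markup; the HTML and the paths are assumptions, not the real sueddeutsche.de page.

```r
library(rvest)

# Hypothetical result list with one navigation link and two article links
doc <- read_html('<html><body>
  <a href="/nav">Navigation</a>
  <a class="entrylist__link" href="/politik/artikel-1.123">Article 1</a>
  <a class="entrylist__link" href="/politik/artikel-2.456">Article 2</a>
</body></html>')

# Comma = union: every <a> plus everything with class entrylist__link
all_links <- html_attr(html_nodes(doc, ".entrylist__link, a"), "href")

# Compound selector: only <a> tags that carry the class
article_links <- html_attr(html_nodes(doc, "a.entrylist__link"), "href")

all_links
article_links
```

The resulting character vector can then be looped over, for example with lapply(), to read each article.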

Find absolute html path given relative href using R

I'm new to HTML but playing with a script to download all PDF files that a given webpage links to (for fun and to avoid boring manual work), and I can't find where in the HTML document I should look for the data that completes relative paths. I know it is possible, since my web browser can do it.
Example: I'm trying to scrape the lecture notes linked on this page from ocw.mit.edu using the R package rvest. Whether I look at the raw HTML or access the href attribute of the "a" nodes, I only get relative paths:
library(rvest)
url <- paste0("https://ocw.mit.edu/courses/",
"electrical-engineering-and-computer-science/",
"6-006-introduction-to-algorithms-fall-2011/lecture-notes/")
# Read webpage and extract all links
links_all <- read_html(url) %>%
html_nodes("a") %>%
html_attr("href")
# Extract only href ending in "pdf"
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)
links_pdf[1]
[1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"
The easiest solution I have found so far is the url_absolute(x, base) function from the xml2 package. For the base parameter, use the URL of the page you retrieved the source from.
This seems less error-prone than trying to extract the base URL from the address with a regular expression.
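A minimal sketch of that call, using the page URL and the relative path from the question. The variable names page_url, link_rel, and link_abs are chosen here for illustration.

```r
library(xml2)

# URL of the page the links were scraped from
page_url <- paste0("https://ocw.mit.edu/courses/",
                   "electrical-engineering-and-computer-science/",
                   "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Relative href as returned by html_attr()
link_rel <- paste0("/courses/electrical-engineering-and-computer-science/",
                   "6-006-introduction-to-algorithms-fall-2011/",
                   "lecture-videos/mit6_006f11_lec01.pdf")

# Resolve the relative path against the page URL
link_abs <- url_absolute(link_rel, page_url)
link_abs
```

Because the href starts with a slash, it is resolved against the host of page_url rather than against the lecture-notes directory. url_absolute() is vectorised over its first argument, so the whole links_pdf vector can be passed in one call.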