Trying to scrape a business website URL from a webpage with R - html

I am attempting to scrape the website link:
https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
from the Contact for Appointment Popup on
https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984 using RStudio.
I have tried
page <- read_html("https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984")
Website <- html_attr(html_nodes(page, xpath = '//*[@id="appointment-vendors-categories"]/div/div[3]/div[1]/span/a'), "href")
This is my output: character(0).
The desired output is: https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
I successfully scraped info from the contact section at the bottom of the page using the code below, but the same method doesn't seem to be working for the link.
Name_of_Vendor2 <- substr((page %>% html_nodes("h3") %>% html_text),18,70)
Phone_of_Vendor <- html_text(html_nodes(page, xpath = "//div[@class = 'contact-info--900c8 body1--711dc']/span[2]"))
Address_of_Vendor <- html_text(html_nodes(page, xpath = "//div[@class = 'contact-info--900c8 body1--711dc']/span[1]"))

I took a look at the downloaded HTML by writing it to a file with rvest::write_xml(page, file="temp.html"). I then searched for the URL and found it in one of the last script tags as part of a JSON object, where the key for the URL is websiteUrl. So I selected that script tag and ran a regular expression on its content to get the URL:
scripts <- html_nodes(page,xpath = '//script[#type="text/javascript"]')
script <- html_text(scripts[[length(scripts)-1]])
stringr::str_extract(script,'(?<="websiteUrl":").+?(?=")')
#> [1] "http://www.AnElegantAffairBridal.com?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot"

Related

Loop through multiple links from an Excel file, open and download the corresponding webpages

I downloaded from MediaCloud an Excel file with 1719 links to different newspaper articles. I am trying to use R to loop through each link, open it and download all the corresponding online articles in a single searchable file (HTML, CSV, TXT, PDF - doesn't matter) that I can read and analyze later.
I went through all similar questions on Stack Overflow and a number of tutorials for downloading files and managed to assemble this code (I am very new to R):
express <-read.csv("C://Users//julir//Documents//Data//express.csv")
library(curl)
for (express$url in 2:1720)
destfile <- paste0("C://Users//julir//Documents//Data//results.csv")
download.file(express$url, destfile, method = "auto", quiet = TRUE, cacheOK=TRUE)
Whenever I try to run it though I get the following error:
Error in download.file(express$url, destfile = express$url, method = "auto", : 'url' must be a length-one character vector
I tried also this alternative method suggested online:
library(httr)
url <- express$url
response <- GET(express$url)
html_document <- content(response, type = "text", encoding = "UTF-8")
But I get the same mistake:
Error in parse_url(url) : length(url) == 1 is not TRUE
So I guess there is a problem with how the URLs are stored, but I can't understand how to fix it.
I am also not certain about the downloading process. I would ideally want all the text on each HTML page; it seems impractical to use selectors and rvest in this case, but I might be very wrong.
You need to loop through the URLs and read/parse each one individually. You are essentially passing a whole vector of URLs into one request, which is why you see that error.
I don't know your content/urls, but here's an example of how you would approach this:
library(xml2)
library(jsonlite)
library(dplyr)
library(stringi)
df <- data.frame(page_n = 1:5, urls = sprintf('https://www.politifact.com/factchecks/list/?page=%s', 1:5))
result_info <- lapply(df$urls, function(i){
  raw <- read_html(i)
  a_tags <- raw %>% xml_find_all(".//a[contains(@href,'factchecks/2021')]")
  urls <- xml2::url_absolute(xml_attr(a_tags, "href"), xml_url(raw))
  titles <- xml_text(a_tags) %>% stri_trim_both()
  data.frame(title = titles, links = urls)
}) %>% rbind_pages()
result_info %>% head()
title: Says of UW-Madison, "It cost the university $50k (your tax dollars) to remove" a rock considered by some a symbol of racism.
links: https://www.politifact.com/factchecks/2021/aug/14/rachel-campos-duffy/no-taxpayer-funds-were-not-used-remove-rock-deemed/

title: “Rand Paul’s medical license was just revoked!”
links: https://www.politifact.com/factchecks/2021/aug/13/facebook-posts/no-rand-pauls-medical-license-wasnt-revoked/

title: Every time outgoing New York Gov. Andrew Cuomo “says the firearm industry ‘is immune from lawsuits,’ it's false.”
links: https://www.politifact.com/factchecks/2021/aug/13/elise-stefanik/refereeing-andrew-cuomo-elise-stefanik-firearm-ind/

title: The United States' southern border is "basically open" and is "a super spreader event.”
links: https://www.politifact.com/factchecks/2021/aug/13/gary-sides/north-carolina-school-leader-repeats-false-claims-/

title: There is a “0.05% chance of dying from COVID.”
links: https://www.politifact.com/factchecks/2021/aug/13/tiktok-posts/experts-break-down-numbers-catching-or-dying-covid/

title: The Biden administration is “not even testing these people” being released by Border Patrol into the U.S.
links: https://www.politifact.com/factchecks/2021/aug/13/ken-paxton/biden-administration-not-even-testing-migrants-rel/
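Applied to the question's own data, and assuming the MediaCloud export really has a url column as the original code implies, a minimal untested sketch would read each link in turn and collect the page text in one searchable CSV:
library(xml2)

express <- read.csv("C://Users//julir//Documents//Data//express.csv",
                    stringsAsFactors = FALSE)

# Loop over the url column (column name assumed from the question's code);
# tryCatch keeps one dead link from stopping the whole run.
articles <- lapply(seq_along(express$url), function(i) {
  tryCatch({
    page <- read_html(express$url[i])
    data.frame(url  = express$url[i],
               text = xml_text(xml_find_first(page, "//body")),
               stringsAsFactors = FALSE)
  }, error = function(e) NULL)
})

result <- do.call(rbind, articles)
write.csv(result, "C://Users//julir//Documents//Data//articles.csv", row.names = FALSE)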

Web Scrape an Image with rvest R

I'm having a problem when trying to scrape an image from this page. My code is as follows:
library(rvest)
url <- read_html("https://covid-19vis.cmm.uchile.cl/chart")
m <- '/html/body/div/div/div[4]/main/div/div/div/div/div/div[2]/div[1]'
grafico_cmm <- html_node(url, xpath = m) %>% html_attr('src')
When I run the above code, the result is NA. Does someone know how can I scrape the plot or maybe the data from the page?
Thanks a lot
It is not an image, it is an interactive chart. For an image, you would need to scrape the data points, re-create the chart, and then convert it to an image. The XPath is also invalid.
The data comes from an API call. I checked the values against the chart and this is the correct endpoint.
library(jsonlite)
data <- jsonlite::read_json('https://covid-19vis.cmm.uchile.cl/api/data?scope=0&indicatorId=57', simplifyVector = T)
The chart needs some tidying but here is a basic plot of the r values:
data$date <- as.Date(data$date)
library(ggplot2)
ggplot(data = data, aes(x = date, y = value, colour = 'red')) +
  geom_line() +
  scale_color_discrete(name = "R Efectivo", labels = c("Chile"))
tail(data)
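If an image file is what is ultimately needed, the ggplot object can simply be written to disk with ggsave(); the file name and dimensions below are arbitrary choices.
# Save the re-created chart as a PNG instead of scraping an image
p <- ggplot(data = data, aes(x = date, y = value, colour = 'red')) +
  geom_line() +
  scale_color_discrete(name = "R Efectivo", labels = c("Chile"))
ggsave("r_efectivo_chile.png", plot = p, width = 8, height = 4, dpi = 150)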

R package rvest() working, but not giving the result of filling out the HTML form

I'm stumped, not sure what I'm doing wrong (using rvest()):
url <- "http://library.uaf.edu"
a_session <- session(url)
a_form <- html_form(a_session)
filled_form <- html_form_set(a_form[[1]],ebscohostsearchtext="SQL")
#
a_returned_info <- html_form_submit(filled_form)
returned_page <- rawToChar(a_returned_info$content)
big_pile_of_html <- read_html(returned_page)
#writeLines(text=returned_page,con=file(FILENAMEHERE))
PROBLEM: When I look at the output from the last line in a browser, I see the original page (the one obtained from url), not the page that should be produced when "SQL" is submitted to the search field. Otherwise the output is reasonable HTML. Also, the status code for a_returned_info is 200, which should indicate success.
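One thing worth trying, sketched here and untested: submit the filled form through the session itself, so the session's cookies travel with the request and the form's action URL is resolved against the page it came from. This assumes the EBSCO search box does not depend on JavaScript.
library(rvest)

url <- "http://library.uaf.edu"
a_session <- session(url)
a_form <- html_form(a_session)
filled_form <- html_form_set(a_form[[1]], ebscohostsearchtext = "SQL")

# session_submit() submits the form within the existing session,
# carrying its cookies along with the request.
result_session <- session_submit(a_session, filled_form)
result_title <- html_element(result_session, "title") %>% html_text()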

How to trigger a file download using R

I am trying to use R to trigger a file download on this site: http://www.regulomedb.org. Basically, an ID, e.g. rs33914668, is entered in the form and Submit is clicked. Then, on the new page, clicking Download in the bottom left corner triggers a file download.
I have tried rvest with the help from other posts.
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("http://www.regulomedb.org")
POST(
  url = "http://www.regulomedb.org",
  body = list(
    data = "rs33914668"
  ),
  encode = "form"
) -> res
pg <- content(res, as="parsed")
By checking pg, I think I am still on the first page, not on http://www.regulomedb.org/results. (I don't know how to inspect pg other than reading it line by line.) So I cannot reach the download button, and I cannot figure out why it does not move on to the next page.
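As an aside, a quick way to see where a POST actually landed, instead of reading pg line by line, is to print the page title and the first few links of the parsed document; this sketch only uses functions already loaded in the snippet above.
# Inspect the parsed response: page title and a few link targets
html_text(html_node(pg, "title"))
head(html_attr(html_nodes(pg, "a"), "href"))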
By learning from some other posts, I managed to download the file a different way, without the POST call.
library(httr)
library(rvest)
library(RCurl)
session <- html_session("http://www.regulomedb.org")
form <- html_form(session)[[1]]
filledform <- set_values(form, `data` = "rs33914668")
session2 <- submit_form(session, filledform)
form2 <- html_form(session2)[[1]]
filledform2 <- set_values(form2)
thesid <- filledform2[["fields"]][["sid"]]$value
theurl <- paste0('http://www.regulomedb.org/download/',thesid)
download.file(theurl,destfile="test.bed",method="libcurl")
In filledform2, I found the sid. Using www.regulomedb.org/download/:sid, I can download the file.
I am new to HTML and even to R, and don't even know what sid is. Although I made it work, I am not satisfied with the code, so I hope some experienced users can provide better alternative solutions or improve my current one. Also, what is wrong with the POST/rvest method?
url <- "http://www.regulomedb.org/"
library(rvest)
page <- html_session(url)
download_page <- rvest:::request_POST(page, url = "http://www.regulomedb.org/results",
                                      body = list("data" = "rs33914668"),
                                      encode = 'form')

# This is a unique id generated based on your query
sid <- html_nodes(download_page, css = '#download > input[type="hidden"]:nth-child(8)') %>% html_attr('value')

# This is a UNIX timestamp
download_token <- as.numeric(as.POSIXct(Sys.time()))
download_page1 <- rvest:::request_POST(download_page, url = "http://www.regulomedb.org/download",
                                       body = list("format" = "bed",
                                                   "sid" = sid,
                                                   "download_token_value_id" = download_token),
                                       encode = 'form')

writeBin(download_page1$response$content, "regulomedb_result.bed")
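As for what was wrong with the original POST/rvest attempt: the code above sends the query to /results, while the question posts it to the site root, which just returns the landing page again. A hedged, untested adjustment of the original attempt would be:
library(httr)

# Post the form data to the /results endpoint (as the session-based code
# above does), not to the site root.
res <- POST(
  url = "http://www.regulomedb.org/results",
  body = list(data = "rs33914668"),
  encode = "form"
)
pg <- content(res, as = "parsed")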

Navigate to a link using html session in R

I am trying to navigate to a link on a website. All the links work except for one single link. Here are the results.
> mcsession<-html_session("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
> mcsession<-mcsession %>% follow_link("Previous Years »")
Error: No links have text 'Previous Years »'
In addition: Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
> mcsession<-mcsession %>% follow_link("Balance Sheet")
Navigating to /financials/tataconsultancyservices/balance-sheetVI/TCS#TCS
Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
Any idea why this happens so?
It is not a normal link, it is JavaScript. I don't know of a way of doing it with rvest, but you could use RSelenium, which basically automates a normal browser window. It is slower than scraping directly, but you can automate just about anything that you can do by hand. This works for me (using Chrome on Windows 10)...
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD$client
remDr$navigate("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
firstpage <- remDr$getPageSource() #you can use this to get the first table
#(1)
webElem1 <- remDr$findElement(using = 'partial link text', value = "Previous Years")
webElem1$clickElement()
nextpage <- remDr$getPageSource() #you can use this to get the next page for previous years
#repeat from #(1) to go back another page etc
remDr$closeall() #when you have finished.
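Once the page source is back in R it can be parsed with rvest as usual. getPageSource() returns a one-element list whose first element is the raw HTML, so something like the following (untested here) should pull the balance-sheet tables from each page grabbed above:
library(rvest)

# firstpage / nextpage come from remDr$getPageSource() above
page_tables <- read_html(nextpage[[1]]) %>% html_nodes("table") %>% html_table()
head(page_tables[[1]])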