Navigate to a link using html_session in R

I am trying to navigate to a link on a website. All the links work except for one single link. Here are the results.
> mcsession<-html_session("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
> mcsession<-mcsession %>% follow_link("Previous Years »")
Error: No links have text 'Previous Years »'
In addition: Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
> mcsession<-mcsession %>% follow_link("Balance Sheet")
Navigating to /financials/tataconsultancyservices/balance-sheetVI/TCS#TCS
Warning message:
In grepl(i, text, fixed = TRUE) : input string 316 is invalid UTF-8
Any idea why this happens?

It is not a normal link; it is driven by JavaScript. I don't know of a way to follow it with rvest, but you could use RSelenium, which automates a real browser window. It is slower than scraping directly, but you can automate just about anything that you can do by hand. This works for me (using Chrome on Windows 10):
library(RSelenium)
rD <- rsDriver(port=4444L,browser="chrome")
remDr <- rD$client
remDr$navigate("http://www.moneycontrol.com/financials/tataconsultancyservices/balance-sheetVI/TCS#TCS")
firstpage <- remDr$getPageSource() #you can use this to get the first table
#(1)
webElem1 <- remDr$findElement(using = 'partial link text', value = "Previous Years")
webElem1$clickElement()
nextpage <- remDr$getPageSource() #you can use this to get the next page for previous years
#repeat from #(1) to go back another page etc
remDr$closeall() #when you have finished.
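The value returned by getPageSource() is a list whose first element is the page HTML as a single string, so it can be handed to rvest for parsing. A minimal sketch, assuming the data of interest sits in an HTML table; the inline HTML and its values are a made-up stand-in for firstpage or nextpage, not the real page:

```r
library(rvest)

# Made-up stand-in for remDr$getPageSource(); the real call also returns a list
# whose first element is the page HTML as a single string.
page_source <- list(
  "<html><body><table>
     <tr><th>Item</th><th>Value</th></tr>
     <tr><td>Row A</td><td>1</td></tr>
     <tr><td>Row B</td><td>2</td></tr>
   </table></body></html>"
)

# Parse the HTML string and convert every <table> into a data frame
tables <- read_html(page_source[[1]]) %>%
  html_nodes("table") %>%
  html_table()

first_table <- tables[[1]]
print(first_table)
```

With the real session, page_source[[1]] would be replaced by firstpage[[1]] or nextpage[[1]].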

Related

Loop through multiple links from an Excel file, open and download the corresponding webpages

I downloaded from MediaCloud an Excel file with 1719 links to different newspaper articles. I am trying to use R to loop through each link, open it and download all the corresponding online articles in a single searchable file (HTML, CSV, TXT, PDF - doesn't matter) that I can read and analyze later.
I went through all similar questions on Stack Overflow and a number of tutorials for downloading files and managed to assemble this code (I am very new to R):
express <-read.csv("C://Users//julir//Documents//Data//express.csv")
library(curl)
for (express$url in 2:1720)
destfile <- paste0("C://Users//julir//Documents//Data//results.csv")
download.file(express$url, destfile, method = "auto", quiet = TRUE, cacheOK=TRUE)
Whenever I try to run it though I get the following error:
Error in download.file(express$url, destfile = express$url, method = "auto", : 'url' must be a length-one character vector
I tried also this alternative method suggested online:
library(httr)
url <- express$url
response <- GET(express$url)
html_document <- content(response, type = "text", encoding = "UTF-8")
But I get the same mistake:
Error in parse_url(url) : length(url) == 1 is not TRUE
So I guess there is a problem with how the URLs are stored - but I can't understand how to fix it.
I am also not certain about the downloading process. Ideally I want all the text on each HTML page, so it seems impractical to use selectors and rvest in this case, but I might be very wrong.
You need to loop through the URLs and read/parse each one individually. You are essentially passing a whole vector of URLs into one request, which is why you see that error.
I don't know your content/urls, but here's an example of how you would approach this:
library(xml2)
library(jsonlite)  # rbind_pages()
library(dplyr)     # %>%
library(stringi)   # stri_trim_both()
df <- data.frame(page_n = 1:5, urls = sprintf('https://www.politifact.com/factchecks/list/?page=%s', 1:5))
result_info <- lapply(df$urls, function(i){
  raw <- read_html(i)
  # keep only the anchors whose href points at a 2021 fact-check
  a_tags <- raw %>% xml_find_all(".//a[contains(@href,'factchecks/2021')]")
  urls <- xml2::url_absolute(xml_attr(a_tags, "href"), xml_url(raw))
  titles <- xml_text(a_tags) %>% stri_trim_both()
  data.frame(title = titles, links = urls)
}) %>% rbind_pages()
result_info %>% head()
title
links
Says of UW-Madison, "It cost the university $50k (your tax dollars) to remove" a rock considered by some a symbol of racism.
https://www.politifact.com/factchecks/2021/aug/14/rachel-campos-duffy/no-taxpayer-funds-were-not-used-remove-rock-deemed/
“Rand Paul’s medical license was just revoked!”
https://www.politifact.com/factchecks/2021/aug/13/facebook-posts/no-rand-pauls-medical-license-wasnt-revoked/
Every time outgoing New York Gov. Andrew Cuomo “says the firearm industry ‘is immune from lawsuits,’ it's false.”
https://www.politifact.com/factchecks/2021/aug/13/elise-stefanik/refereeing-andrew-cuomo-elise-stefanik-firearm-ind/
The United States' southern border is "basically open" and is "a super spreader event.”
https://www.politifact.com/factchecks/2021/aug/13/gary-sides/north-carolina-school-leader-repeats-false-claims-/
There is a “0.05% chance of dying from COVID.”
https://www.politifact.com/factchecks/2021/aug/13/tiktok-posts/experts-break-down-numbers-catching-or-dying-covid/
The Biden administration is “not even testing these people” being released by Border Patrol into the U.S.
https://www.politifact.com/factchecks/2021/aug/13/ken-paxton/biden-administration-not-even-testing-migrants-rel/
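Applied to the original question, the same idea (one request per URL) might look like the sketch below. The helper name download_all, the "articles" output folder, and the file naming scheme are illustrative assumptions; express$url is the column from the question.

```r
# Hypothetical helper: download each URL to its own file, one request per URL.
# Returns a logical vector marking which downloads succeeded.
download_all <- function(urls, out_dir) {
  urls <- as.character(urls)  # guard against factor columns from read.csv()
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  vapply(seq_along(urls), function(i) {
    destfile <- file.path(out_dir, sprintf("article_%04d.html", i))
    tryCatch({
      # download.file() returns 0 on success
      status <- suppressWarnings(download.file(urls[i], destfile, quiet = TRUE))
      status == 0
    }, error = function(e) FALSE)  # a dead link should not stop the loop
  }, logical(1))
}

# Usage with the data from the question (path shortened here):
# express <- read.csv("express.csv", stringsAsFactors = FALSE)
# ok <- download_all(express$url, "articles")
# express$url[!ok]  # links that could not be downloaded
```

Writing one file per article keeps a failed download from overwriting earlier results, which is what happens when every iteration writes to the same results.csv.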

R package rvest() working, but not giving the result of filling out the HTML form

I'm stumped, not sure what I'm doing wrong (using rvest()):
url <- "http://library.uaf.edu"
a_session <- session(url)
a_form <- html_form(a_session)
filled_form <- html_form_set(a_form[[1]],ebscohostsearchtext="SQL")
#
a_returned_info <- html_form_submit(filled_form)
returned_page <- rawToChar(a_returned_info$content)
big_pile_of_html <- read_html(returned_page)
#writeLines(text=returned_page,con=file(FILENAMEHERE))
PROBLEM: When I look at the output from the last line in a browser, I see the original page (obtained by 'url'), and not the page that should be created when the SQL is submitted to the field. Otherwise, output is reasonable HTML. Also, the status code for a_returned_info is 200, which should be a success.

Trying to scrape business website from webpage with R

I am attempting to scrape the website link:
https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
from the Contact for Appointment Popup on
https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984 using R Studio.
I have tried
page <- read_html("https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984")
Website <- html_attr(html_nodes(page,xpath = '//*[@id="appointment-vendors-categories"]/div/div[3]/div[1]/span/a'),"href")
This is my output: character(0).
The desired output is: https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
I successfully scraped info from the contact section at the bottom of the page using the code below, but the same method doesn't seem to be working for the link.
Name_of_Vendor2 <- substr((page %>% html_nodes("h3") %>% html_text),18,70)
Phone_of_Vendor <- html_text(html_nodes(page, xpath = "//div[@class = 'contact-info--900c8 body1--711dc']/span[2]"))
Address_of_Vendor <- html_text(html_nodes(page, xpath = "//div[@class = 'contact-info--900c8 body1--711dc']/span[1]"))
I took a look at the downloaded HTML by writing it to a file with xml2::write_xml(page, file="temp.html"). Searching that file for the URL, I found it in one of the last script tags as part of a JSON object, under the key websiteUrl. So I selected that script tag (the second-to-last one) and ran a regular expression on its content to get the URL:
scripts <- html_nodes(page,xpath = '//script[@type="text/javascript"]')
script <- html_text(scripts[[length(scripts)-1]])
stringr::str_extract(script,'(?<="websiteUrl":").+?(?=")')
#> [1] "http://www.AnElegantAffairBridal.com?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot"
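To see why that pattern isolates the URL, here is a minimal, self-contained sketch. The script string and the example.com address are made up for illustration; only the websiteUrl key comes from the page.

```r
library(stringr)

# Made-up stand-in for the text of the <script> tag
script <- 'window.data = {"name":"Example Vendor","websiteUrl":"http://www.example.com?utm_source=demo","phone":"000-0000"};'

# (?<="websiteUrl":")  lookbehind: the match must directly follow the key and its opening quote
# .+?                  lazy match: take as few characters as possible
# (?=")                lookahead: stop just before the next double quote
website <- str_extract(script, '(?<="websiteUrl":").+?(?=")')
print(website)
```

A pattern like this would break if the URL ever contained an escaped double quote, so it is best treated as a quick extraction rather than a general JSON parser.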

How to trigger a file download using R

I am trying to use R to trigger a file download on this site: http://www.regulomedb.org. Basically, an ID (e.g., rs33914668) is entered in the form and Submit is clicked. Then, on the new page, clicking Download in the bottom left corner triggers a file download.
I have tried rvest with the help from other posts.
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("http://www.regulomedb.org")
POST(
  url = "http://www.regulomedb.org",
  body = list(
    data = "rs33914668"
  ),
  encode = "form"
) -> res
pg <- content(res, as="parsed")
By checking pg, I think I am still on the first page, not the http://www.regulomedb.org/results. (I don't know how to check pg list other than reading it line by line). So, I cannot reach the download button. I cannot figure out why it cannot jump to the next page.
By learning from some other posts, I managed to download the file without using rvest.
library(httr)
library(rvest)
library(RCurl)
session <- html_session("http://www.regulomedb.org")
form <- html_form(session)[[1]]
filledform <- set_values(form, `data` = "rs33914668")
session2 <- submit_form(session, filledform)
form2 <- html_form(session2)[[1]]
filledform2 <- set_values(form2)
thesid <- filledform2[["fields"]][["sid"]]$value
theurl <- paste0('http://www.regulomedb.org/download/',thesid)
download.file(theurl,destfile="test.bed",method="libcurl")
In filledform2, I found the sid. Using www.regulomedb.org/download/:sid, I can download the file.
I am new to html or even R, and don't even know what sid is. Although I made it, I am not satisfied with the coding. So, I hope some experienced users can provide better, alternative solutions, or improve my current solution. Also, what is wrong with the POST/rvest method?
url <- "http://www.regulomedb.org/"
library(rvest)
page <- html_session(url)
# request_POST is an unexported rvest function (hence the :::)
download_page <- rvest:::request_POST(page, url = "http://www.regulomedb.org/results",
                                      body = list("data" = "rs33914668"),
                                      encode = 'form')
# This is a unique id generated based on your query
sid <- html_nodes(download_page, css = '#download > input[type="hidden"]:nth-child(8)') %>% html_attr('value')
# This is a UNIX time
download_token <- as.numeric(as.POSIXct(Sys.time()))
download_page1 <- rvest:::request_POST(download_page, url = "http://www.regulomedb.org/download",
                                       body = list("format" = "bed",
                                                   "sid" = sid,
                                                   "download_token_value_id" = download_token),
                                       encode = 'form')
writeBin(download_page1$response$content, "regulomedb_result.bed")
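The nth-child(8) selector above depends on the exact order of the form's inputs. A slightly more robust sketch, assuming the hidden field is named sid (as filledform2[["fields"]][["sid"]] in the question suggests), selects it by name instead. The HTML below is a made-up stand-in for the results page:

```r
library(rvest)

# Made-up stand-in for the results page returned by the first POST
results_html <- '<html><body>
  <form id="download" action="/download" method="post">
    <input type="hidden" name="format" value="bed">
    <input type="hidden" name="sid" value="abc123">
  </form>
</body></html>'

# Select the hidden input by its name attribute rather than by position
sid <- read_html(results_html) %>%
  html_nodes('#download input[name="sid"]') %>%
  html_attr("value")
print(sid)
```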

Putting hyperlinks into an HTML table in R

I am a biologist trying to do computer science for research, so I may be a bit naïve. I would like to make a table containing information from a data frame, with a hyperlink in one of the columns. I imagine this needs to be an HTML document (?). I found this post describing how to put a hyperlink into a data frame and write it as an HTML file using googleVis. I would like to use this approach (it is the only one I know and seems to work well), except that I would like to replace the actual URL with a description. The real motivation is that I would like to include many of these hyperlinks, and the links have long addresses, which are difficult to read.
To be verbose, I essentially want to do what I did here where we read 'here' but 'here' points to
http:// stackoverflow.com/questions/8030208/exporting-table-in-r-to-html-with-hyperlinks
From your previous question, you can have another list which contains the titles of the URL's:
url=c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles=c('NY Times', 'CNN', 'Weather')
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x = gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
Building on Jack's answer but consolidating from different threads:
library(googleVis)
library(R2HTML)
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- data.frame(a=c(1,2,3), b=c(4,5,6), url=url)
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x <- gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
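paste() separates its arguments with spaces and shQuote() chooses a quoting style meant for shells, so the generated tags carry some stray whitespace. As a sketch of an alternative, sprintf() builds the same anchors with explicit double quotes; the data frame mirrors the one above:

```r
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')

foo <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), stringsAsFactors = FALSE)
# Each cell shows the title but links to the full address
foo$url <- sprintf('<a href="%s">%s</a>', url, urlTitles)
print(foo$url)

# The table is then rendered as before (requires googleVis):
# x <- googleVis::gvisTable(foo, options = list(allowHTML = TRUE))
# plot(x)
```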