Web Scrape an Image with rvest in R

I'm having a problem when trying to scrape an image from this page. My code is as follows:
library(rvest)
url <- read_html("https://covid-19vis.cmm.uchile.cl/chart")
m <- '/html/body/div/div/div[4]/main/div/div/div/div/div/div[2]/div[1]'
grafico_cmm <- html_node(url, xpath = m) %>% html_attr('src')
When I run the above code, the result is NA. Does anyone know how I can scrape the plot, or maybe the underlying data, from the page?
Thanks a lot

It's not an image; it's an interactive chart. For an image, you would need to scrape the data points, re-create the chart yourself, and then convert it to an image. The XPath is also invalid.
The data comes from an API call. I checked the values against the chart and this is the correct endpoint.
library(jsonlite)
data <- jsonlite::read_json('https://covid-19vis.cmm.uchile.cl/api/data?scope=0&indicatorId=57', simplifyVector = TRUE)
The chart needs some tidying, but here is a basic plot of the R values:
data$date <- as.Date(data$date)

library(ggplot2)

ggplot(data = data, aes(x = date, y = value, colour = 'red')) +
  geom_line() +
  scale_color_discrete(name = "R Efectivo", labels = c("Chile"))

tail(data)
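If a static image is needed (as suggested above), the recreated plot can be written to disk. A minimal sketch using ggplot2::ggsave(), where the file name and dimensions are arbitrary choices:

p <- ggplot(data = data, aes(x = date, y = value, colour = 'red')) +
  geom_line() +
  scale_color_discrete(name = "R Efectivo", labels = c("Chile"))

# Save the recreated chart as a PNG (file name and size chosen only for illustration)
ggsave("r_efectivo_chile.png", plot = p, width = 8, height = 5, dpi = 150)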

Related

Using R code to scrape data from a webpage into an Excel file

I have written code in R which is supposed to retrieve certain information from a website and import it into an Excel file. I have used it on one website and it works, but for this particular website it has an issue: it returns NA values in Excel, and I don't know why.
library(tidyverse)
library(rvest)
library(stringr)
library(rebus)
library(lubridate)
library(xlsx)
library(reader)
setwd("C:/Users/user/Desktop/Tenders")
getwd()
ran = seq(300100, 300000, -1)
result = data.frame(matrix(nrow = length(ran), ncol = 1))
colnames(result) <- c("111")

for (i in ran) {
  url <- paste0("http://tenders.procurement.gov.ge/public/?go=", i)
  download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
  content <- read_html("scrapedpage.html")

  # 111
  status = content %>% html_nodes("#print_area tr:nth-child(1) td + td") %>% html_text()
  status[length(status) == 0] <- NA
  status = as.data.frame(status)
  status = (if (nrow(status) > 1) {
    a = as.matrix(paste(unlist(status), collapse = " "))
  } else {
    as.matrix(status)
  })

  result[i, 1] = status
}

s = as.data.frame(ran)
final = result[-c(1:s[nrow(s), ]), ]

# Excel
write.xlsx(final, "C:/Users/user/Desktop/Tenders.xlsx", sheetName = "111")
I am using the SelectorGadget tool, a Chrome extension for identifying the HTML parts the code should use to gather the information (for example, in the code above it is "#print_area tr:nth-child(1) td + td", which is the first entry in the link).
Can someone help me find out what the issue might be?
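One way to narrow this down might be to test the selector on a single tender page outside the loop, reading the URL directly instead of going through download.file, and checking whether the node set comes back empty. A sketch (the ID 300100 is just the first value of ran):

library(rvest)
test_url <- "http://tenders.procurement.gov.ge/public/?go=300100"
test_page <- read_html(test_url)
test_status <- test_page %>%
  html_nodes("#print_area tr:nth-child(1) td + td") %>%
  html_text()
length(test_status)  # 0 means the selector matched nothing on this page
test_status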

R package rvest() working, but not giving the result of filling out the HTML form

I'm stumped and not sure what I'm doing wrong (using rvest):
url <- "http://library.uaf.edu"
a_session <- session(url)
a_form <- html_form(a_session)
filled_form <- html_form_set(a_form[[1]],ebscohostsearchtext="SQL")
#
a_returned_info <- html_form_submit(filled_form)
returned_page <- rawToChar(a_returned_info$content)
big_pile_of_html <- read_html(returned_page)
#writeLines(text=returned_page,con=file(FILENAMEHERE))
PROBLEM: When I look at the output from the last line in a browser, I see the original page (obtained from 'url') and not the page that should be returned when the SQL term is submitted to the search field. Otherwise, the output is reasonable HTML. Also, the status code for a_returned_info is 200, which should indicate success.
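One thing worth trying (a sketch, assuming a recent rvest where session_submit() is available) is submitting the filled form through the session object rather than with html_form_submit(), so cookies and the form's action URL are handled within the same browsing state:

library(rvest)
a_session <- session("http://library.uaf.edu")
a_form <- html_form(a_session)[[1]]
filled_form <- html_form_set(a_form, ebscohostsearchtext = "SQL")

# Submit within the session so the response stays tied to the same browsing state
result_session <- session_submit(a_session, filled_form)
result_session$url  # URL the submission actually ended up at

returned_page <- rawToChar(result_session$response$content)
big_pile_of_html <- read_html(returned_page)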

Trying to scrape business website from webpage with R

I am attempting to scrape the website link:
https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
from the Contact for Appointment Popup on
https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984 using RStudio.
I have tried
page <- read_html("https://www.theknot.com/marketplace/an-elegant-affair-cedar-falls-ia-537984")
Website <- html_attr(html_nodes(page, xpath = '//*[@id="appointment-vendors-categories"]/div/div[3]/div[1]/span/a'), "href")
This is my output: character(0).
The desired output is: https://www.anelegantaffairbridal.com/?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot
I successfully scraped info from the contact section at the bottom of the page using the code below, but the same method doesn't seem to be working for the link.
Name_of_Vendor2 <- substr((page %>% html_nodes("h3") %>% html_text),18,70)
Phone_of_Vendor <- html_text(html_nodes(page, xpath = "//div[@class = 'contact-info--900c8 body1--711dc']/span[2]"))
Address_of_Vendor <- html_text(html_nodes(page, xpath = "//div[@class = 'contact-info--900c8 body1--711dc']/span[1]"))
After taking a look at the downloaded HTML by writing it to a file with xml2::write_xml(page, file = "temp.html"), I searched for the URL and found it in one of the last script tags as a JSON object; the key for the URL is websiteUrl. So I selected that script tag and ran a regular expression on its content to get the URL:
scripts <- html_nodes(page,xpath = '//script[#type="text/javascript"]')
script <- html_text(scripts[[length(scripts)-1]])
stringr::str_extract(script,'(?<="websiteUrl":").+?(?=")')
#> [1] "http://www.AnElegantAffairBridal.com?utm_source=theknot.com&utm_medium=referral&utm_campaign=theknot"
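As a side note, the same regular-expression trick can be wrapped in a small helper to pull other quoted values out of that script blob. extract_json_value below is a hypothetical name, and the sketch assumes simple "key":"value" pairs with no escaped quotes:

# Hypothetical helper (illustration only): extract the quoted value for a given
# key from the embedded JSON text, assuming plain "key":"value" pairs
extract_json_value <- function(script_text, key) {
  pattern <- paste0('(?<="', key, '":").+?(?=")')
  stringr::str_extract(script_text, pattern)
}

extract_json_value(script, "websiteUrl")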

How to trigger a file download using R

I am trying to use R to trigger a file download on this site: http://www.regulomedb.org. Basically, an ID, e.g. rs33914668, is entered in the form and Submit is clicked. Then, on the new page, clicking Download in the bottom-left corner triggers a file download.
I have tried rvest with help from other posts.
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("http://www.regulomedb.org")
POST(
  url = "http://www.regulomedb.org",
  body = list(
    data = "rs33914668"
  ),
  encode = "form"
) -> res
pg <- content(res, as = "parsed")
By checking pg, I think I am still on the first page, not http://www.regulomedb.org/results. (I don't know how to check the pg object other than reading it line by line.) So I cannot reach the download button, and I cannot figure out why it does not move on to the next page.
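As an aside, a few quick checks make it easier to see which page came back than reading pg line by line. A sketch using the objects defined above:

status_code(res)                   # HTTP status of the POST response
res$url                            # URL the server actually responded from
html_text(html_node(pg, "title"))  # title of the parsed page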
By learning from some other posts, I managed to download the file without using rvest.
library(httr)
library(rvest)
library(RCurl)
session <- html_session("http://www.regulomedb.org")
form <- html_form(session)[[1]]
filledform <- set_values(form, `data` = "rs33914668")
session2 <- submit_form(session, filledform)
form2 <- html_form(session2)[[1]]
filledform2 <- set_values(form2)
thesid <- filledform2[["fields"]][["sid"]]$value
theurl <- paste0('http://www.regulomedb.org/download/',thesid)
download.file(theurl,destfile="test.bed",method="libcurl")
In filledform2, I found the sid. Using www.regulomedb.org/download/:sid, I can download the file.
I am new to HTML and fairly new to R, and don't even know what sid is. Although I got it to work, I am not satisfied with the code, so I hope some experienced users can provide better alternative solutions or improve my current one. Also, what is wrong with the POST/rvest method?
url<-"http://www.regulomedb.org/"
library(rvest)
page<-html_session(url)
download_page<-rvest:::request_POST(page,url="http://www.regulomedb.org/results",
body=list("data"="rs33914668"),
encode = 'form')
#This is a unique id on generated based on your query
sid<-html_nodes(download_page,css='#download > input[type="hidden"]:nth-child(8)') %>% html_attr('value')
#This is a UNIX time
download_token<-as.numeric(as.POSIXct(Sys.time()))
download_page1<-rvest:::request_POST(download_page,url="http://www.regulomedb.org/download",
body=list("format"="bed",
"sid"=sid,
"download_token_value_id"=download_token ),
encode = 'form')
writeBin(download_page1$response$content, "regulomedb_result.bed")
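As a quick sanity check that the saved file contains BED-style rows rather than an error page, it may help to peek at the first few lines (sketch):

head(readLines("regulomedb_result.bed"))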

Scraping html table and its href Links in R

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
Above gives a nice dataframe. However, the last column only contains the text "Pass", when I would like to have the link associated with it. I have tried to use the following to add the links, but they do not correspond to the correct row:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about HTML, CSS, etc., so I am not sure what I am missing to accomplish this properly.
Thanks!!
You're looking for a elements inside table cells (td), and you want the value of the href attribute. Here's one way, which will return a vector with all the URLs for the PDF downloads:
qi_webpage %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")