Scraping an HTML table and its href links in R

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
The above gives a nice data frame. However, the last column only contains the text "Pass", when I would like to have the link associated with it. I have tried the following to add the links, but they do not correspond to the correct row:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about HTML, CSS, etc., so I am not sure what I am missing to accomplish this properly.
Thanks!!

You're looking for a elements inside table cells (td), and then you want the value of the href attribute. So here's one way, which returns a vector with all the URLs for the PDF downloads:
qi_webpage %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")
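If you want those URLs attached back onto the table, and assuming each row of the table contributes exactly one link (so the vector lines up with the rows of qi), you can combine the two in one step. A minimal sketch:
pdf_links <- qi_webpage %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")

# Attach the links to the parsed table; this assumes length(pdf_links) == nrow(qi)
qi$MSTLink <- pdf_links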

Saving the Text from a News Article in R?

I found a post that shows how to save the text from a website: "Is there a simple way in R to extract only the text elements of an HTML page?"
I tried one of the answers provided there and it seems to work quite well. For example:
library(htm2txt)
url_1 <- 'https://en.wikipedia.org/wiki/Alan_Turing'
text_1 <- gettxt(url_1)
url_2 <- 'https://www.bbc.com/future/article/20220823-how-auckland-worlds-most-spongy-city-tackles-floods'
text_2 <- gettxt(url_2)
All the text from the article appears, but so does a lot of "extra text" which does not have any meaning. For example:
p. 40/03B\n• ^ a or identifiers\n• Articles with GND identifiers\n• Articles with ICCU identifiers\n•
Is there some standard way to only keep the actual text from these articles? Or does this depend too much on the individual structure of the website and no "one size fits all" solution exists for such a problem?
Perhaps there is some method of doing this in R that only recognizes the "actual text"?
Thank you!
You can cross-reference the words from the HTML page with a dictionary from qdapDictionaries, so only real English words are kept, but this method does keep words that aren't exclusively from the article (e.g., the word "jump" from "Jump to navigation").
library(tidyverse)
library(htm2txt)
library(quanteda)
library(qdapDictionaries)
data(DICTIONARY)
text <- 'https://en.wikipedia.org/wiki/Alan_Turing' %>% gettxt() %>% corpus()
text <- tokens(text, remove_punct = TRUE, remove_numbers = TRUE)
text <- tokens_select(text, DICTIONARY$word)
text <- data.frame(text = sapply(text, as.character), stringsAsFactors = FALSE) %>%
  group_by(text1 = tolower(text)) %>%
  table() %>%
  as.data.frame() %>%
  rename(word = text1) %>%
  rename(frequency = Freq)
head(text)
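If you mainly want the article body rather than every text node on the page, another option is to pull only the paragraph elements with rvest instead of filtering against a dictionary. This is just a sketch; which selector isolates the "actual text" depends on each site's structure:
library(rvest)

# Keep only <p> elements, which usually hold the article body;
# navigation menus, reference lists, and footers mostly live in other tags
article_text <- read_html("https://en.wikipedia.org/wiki/Alan_Turing") %>%
  html_nodes("p") %>%
  html_text2() %>%
  paste(collapse = "\n")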

RVEST - Extracting text from table - Problems with access to the right table

I would like to extract the values in the table on the top right side of this webpage:
https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima
(Wärmster Monat : VALUE, Kältester Monat: VALUE, Jahresniederschlag: VALUE)
Unfortunately, if I use html_nodes("SelectorGadget's result for the specific value"), I receive the values for the table at the top of this link:
https://www.timeanddate.de/stadt/info/deutschland/karlsruhe
(The webpages are similar: if you click "Uhrzeit/Übersicht" on the top bar, you access the second page and table; if you click "Wetter" --> "Klima", you access the first page/table, the one I want to extract values from!)
num_link= "https://www.timeanddate.de/wetter/deutschland/Karlsruhe/klima"
num_page= read_html(num_link)
rain_year = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(3) > p:nth-child(1)") %>% html_text()
temp_warm = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
temp_cold = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
I get " character (empty) " for each variable . :(
THANK YOU IN ADVANCE!
You can use the html_table function in rvest, which is pretty good by now. It makes extraction a bit easier, but I do recommend learning to identify the right CSS selectors as well, since it does not always work. html_table always returns a list with all tables from the webpage, so in this case the steps are:
1. get the HTML
2. get the tables
3. index the right table (here there is only one)
4. reformat a little to extract the values
library(rvest)
library(tidyverse)
result <- read_html("https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima") %>%
  html_table() %>%
  .[[1]] %>%
  rename('measurement' = 1,
         'original' = 2) %>%
  mutate(value_num = str_extract_all(original, "[[:digit:]]+\\.*[[:digit:]]*") %>% unlist())
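If you only need the three values from the question, you can then filter the measurement column. A small sketch, assuming the German labels ("Wärmster Monat", "Kältester Monat", "Jahresniederschlag") actually appear in that column:
result %>%
  filter(str_detect(measurement, "Wärmster Monat|Kältester Monat|Jahresniederschlag")) %>%
  select(measurement, value_num)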

Why does rvest/html_table skip every second row in this table?

I am trying to scrape this table but for some reason every second row is skipped (which means I don't have data on half of the states). This is my code:
# read in the url
library(dplyr)
library(rvest)
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
Does anyone have any ideas why this is? The only thing I can think of is that every second row has background colour specified?
Thanks :)
See if this helps:
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_nodes(xpath = '/html/body/table') %>%
  html_table(fill = TRUE)
d <- as.data.frame(df)
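As a quick sanity check that rows are no longer being dropped, you can compare the row count of the original single-node approach with the xpath version:
# If every second row was indeed skipped before, the xpath version should
# have roughly twice as many rows as the original html_node("table") approach
nrow(webpage %>% html_node("table") %>% html_table(fill = TRUE))
nrow(d)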

Not getting expected output in web scraping with R

I have written a small program where I scrape a Google search results page and want all the URLs on it. But I'm getting character(0) in the output. Please help me.
Code:
library("rvest")
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".iUh30") %>% html_text() %>% as.character()
That class is not present in the returned HTML. You need a different selector strategy, and then extract the href attribute:
library(rvest)
library(stringr)
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".jfp3ef > a") %>% html_attr(., "href")
for (i in d) {
  res <- str_match_all(i, '(http.*?)&')
  print(res[[1]][, 2])
}
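If you would rather have the cleaned URLs as a character vector instead of printed output, the same str_match_all logic can be collapsed into one call. A small sketch:
# Extract the capture group (the URL up to the first "&") from every href
clean_urls <- unlist(lapply(d, function(x) str_match_all(x, '(http.*?)&')[[1]][, 2]))
clean_urls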

How to trigger a file download using R

I am trying to use R to trigger a file download on this site: http://www.regulomedb.org. Basically, an ID, e.g., rs33914668, is entered in the form and Submit is clicked. Then, on the new page, clicking Download in the bottom left corner triggers a file download.
I have tried rvest with help from other posts.
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("http://www.regulomedb.org")
POST(
  url = "http://www.regulomedb.org",
  body = list(
    data = "rs33914668"
  ),
  encode = "form"
) -> res
pg <- content(res, as="parsed")
By checking pg, I think I am still on the first page, not http://www.regulomedb.org/results. (I don't know how to inspect pg other than reading it line by line.) So I cannot reach the download button, and I cannot figure out why it does not jump to the next page.
By learning from some other posts, I managed to download the file without using rvest.
library(httr)
library(rvest)
library(RCurl)
session <- html_session("http://www.regulomedb.org")
form <- html_form(session)[[1]]
filledform <- set_values(form, `data` = "rs33914668")
session2 <- submit_form(session, filledform)
form2 <- html_form(session2)[[1]]
filledform2 <- set_values(form2)
thesid <- filledform2[["fields"]][["sid"]]$value
theurl <- paste0('http://www.regulomedb.org/download/',thesid)
download.file(theurl,destfile="test.bed",method="libcurl")
In filledform2, I found the sid. Using www.regulomedb.org/download/:sid, I can download the file.
I am new to HTML and fairly new to R, and don't even know what sid is. Although I made it work, I am not satisfied with the code. So, I hope some experienced users can provide better alternative solutions or improve my current one. Also, what is wrong with the POST/rvest method?
url<-"http://www.regulomedb.org/"
library(rvest)
page<-html_session(url)
download_page<-rvest:::request_POST(page,url="http://www.regulomedb.org/results",
body=list("data"="rs33914668"),
encode = 'form')
#This is a unique id on generated based on your query
sid<-html_nodes(download_page,css='#download > input[type="hidden"]:nth-child(8)') %>% html_attr('value')
#This is a UNIX time
download_token<-as.numeric(as.POSIXct(Sys.time()))
download_page1<-rvest:::request_POST(download_page,url="http://www.regulomedb.org/download",
body=list("format"="bed",
"sid"=sid,
"download_token_value_id"=download_token ),
encode = 'form')
writeBin(download_page1$response$content, "regulomedb_result.bed")
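To verify the download, you can read the file back in. Assuming the server returns a plain tab-separated BED file with no header row, something like this shows the first few records:
# BED output is tab-separated text without a header row
bed <- read.delim("regulomedb_result.bed", header = FALSE, sep = "\t",
                  stringsAsFactors = FALSE)
head(bed)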