RVEST - Extracting text from table - Problems with access to the right table - html

I would like to extract the values from the table at the top right of this web page:
https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima
(Wärmster Monat [warmest month]: VALUE, Kältester Monat [coldest month]: VALUE, Jahresniederschlag [annual precipitation]: VALUE)
Unfortunately, when I call html_nodes() with the selector that SelectorGadget gives me for a specific value, I get the values from the table at the top of this page instead:
https://www.timeanddate.de/stadt/info/deutschland/karlsruhe
(The two pages are similar: clicking "Uhrzeit/Übersicht" in the top bar opens the second page and its table, while clicking "Wetter" --> "Klima" opens the first page and its table, which is the one I want to extract values from.)
num_link= "https://www.timeanddate.de/wetter/deutschland/Karlsruhe/klima"
num_page= read_html(num_link)
rain_year = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(3) > p:nth-child(1)") %>% html_text()
temp_warm = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
temp_cold = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
I get "character (empty)" for each variable.
Thank you in advance!

You can use rvest's html_table() function, which works well these days. It makes the extraction a bit easier, but I still recommend learning to identify the right CSS selectors, because html_table() does not work on every page. Called on a whole page, html_table() returns a list with all the tables it finds, so in this case the steps are:
get the html
get the tables
index the right table (here there is only one)
reformat a little to extract the values
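library(rvest)
library(tidyverse)
result <- read_html("https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima") %>%
  html_table() %>%   # list of all tables on the page
  .[[1]] %>%         # pick the first (and only) table
  rename('measurement' = 1,
         'original' = 2) %>%
  # str_extract() returns exactly one match (or NA) per row, so the new
  # column always has the same length as the table
  mutate(value_num = str_extract(original, "[[:digit:]]+\\.*[[:digit:]]*"))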
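
The steps above can be tried offline on a small inline table. This is only a sketch: the markup and the values below are invented stand-ins, not the real page, and the decimal comma is an assumption about how a German-language page might format numbers.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for the climate table (labels and values are made up)
page <- minimal_html('
  <table>
    <tr><td>Warmest month</td><td>July (20,5 degC)</td></tr>
    <tr><td>Coldest month</td><td>January (1,8 degC)</td></tr>
    <tr><td>Annual precipitation</td><td>783 mm</td></tr>
  </table>
')

# On a whole document, html_table() returns a list with one entry per table
tables  <- html_table(page, header = FALSE)
climate <- tables[[1]]

# First number in each cell, accepting "." or "," as the decimal mark
values     <- str_extract(climate[[2]], "-?[0-9]+(?:[.,][0-9]+)?")
values_num <- as.numeric(str_replace(values, ",", "."))
```

Converting to numeric only after normalising the decimal mark avoids NA values from as.numeric() when a comma is present.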

Related

How to correctly read html node & content

I received some R code from a colleague who no longer works with me. The code is meant to scrape prices for multiple products from an online dealer.
Although the real code takes the product links from an internal Excel list, it looks roughly like this:
input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)
sess <- session(input_galaxus2[1]) # start the session
for (j in input_galaxus2) {
  sess <- sess %>% session_jump_to(j) # jump to URL
  i = i + 1
  try(vec_galaxus[i] <- read_html(sess) %>% # can read directly from sess
        html_nodes('div strong') %>%
        html_text() %>%
        nth(5))
  Sys.sleep(runif(1, min = 1, max = 2))
}
One of the articles (j in the code) is, for example, 14513929.
But when I run the code, I don't get the prices; I get "Service" or "Standorte" instead.
I guess this is because the nodes passed to html_text() are selected wrongly, but I don't know how to select the right ones.
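
One way to investigate this kind of problem is to print everything the selector matches and then switch to a more specific selector. The sketch below uses invented markup: the class name "price" and the page structure are assumptions for illustration, not the real structure of the shop's pages.

```r
library(rvest)

# Hypothetical markup; the real page will use different elements and classes
page <- minimal_html('
  <div><strong>Service</strong></div>
  <div><strong>Standorte</strong></div>
  <div class="price"><strong>49.90</strong></div>
')

# Step 1: list every match, to see which position (if any) holds the price
all_strong <- html_text(html_nodes(page, "div strong"))
print(all_strong)

# Step 2: prefer a selector tied to the element itself rather than a position
price <- html_text(html_node(page, "div.price strong"))
```

A positional pick such as nth(5) silently returns a different element as soon as the page layout changes, which is consistent with getting navigation labels instead of prices.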

Side-by-Side gt tables **WITH** footnotes

I am trying to create side-by-side gt tables, as the title suggests. I started with the very helpful answer found here: Arrange gt tables side by side or in a grid or table of tables. The key was to output the left and right tables as raw HTML (as_raw_html), combine them in a data frame, then send that back into gt and reformat it as markdown (fmt_markdown).
However, I ran into a problem that I couldn't solve. The fmt_markdown command skips the footnote, so the resulting table shows the raw HTML as a footnote.
I checked the gt documentation, and fmt_markdown takes columns and rows as input, but the footnote area is apparently considered neither a column nor a row.
So the crux is that I can't find any way to target the footnote area for reformatting as markdown.
Below is a reproducible example.
library(tidyverse)
library(gt)
# Make a table with a footnote
tL <- exibble %>%
  select(c(num, char, group)) %>%
  gt() %>%
  tab_footnote(
    footnote = html("**I'm an apricot**"),
    locations = cells_body(columns = char,
                           rows = char == "apricot")
  ) %>%
  tab_style(style = cell_text(color = "blue"),
            locations = cells_footnotes()) %>%
  as_raw_html()
# Make a copy
tR <- tL
# Side-By-Side
SideBySide <- data.frame(Ltable = tL, Rtable = tR) %>%
  gt() %>%
  fmt_markdown(columns = everything())
And the result looks like this:
Created on 2022-02-20 by the reprex package (v2.0.1)

Why does rvest/html_table skip every second row in this table?

I am trying to scrape this table but for some reason every second row is skipped (which means I don't have data on half of the states). This is my code:
# read in the url
library(dplyr)
library(rvest)
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
Does anyone have any ideas why this is? The only thing I can think of is that every second row has background colour specified?
Thanks :)
See if this helps:
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_nodes(xpath = '/html/body/table') %>%
  html_table(fill = TRUE)
d = as.data.frame(df)

Not getting expected output in web scraping R

I have written a small program that scrapes a Google search results page, and I want all the URLs on that page. But I'm getting character(0) as the output. Please help me.
Code:
library("rvest")
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".iUh30") %>% html_text() %>% as.character()
That class is not present in the HTML that read_html() receives. You need a different selector, and then you need to extract the href attribute rather than the text:
library(rvest)
library(stringr)
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".jfp3ef > a") %>% html_attr(., "href")
for (i in d) {
  res <- str_match_all(i, '(http.*?)&')
  print(res[[1]][, 2])
}
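
The regular expression in the loop can be checked without a network call. In this sketch the href strings are invented examples of the redirect form "/url?q=<target>&..."; that form is an assumption about what the selector returns, and it may not hold for every result.

```r
library(stringr)

# Hypothetical href values, shaped like redirect links
hrefs <- c(
  "/url?q=https://example.com/page&sa=U&ved=abc",
  "/url?q=http://r-project.org/&sa=U",
  "/search?q=something"   # no target URL inside
)

# str_match() is vectorised, so no loop is needed; column 2 is the capture group
urls <- str_match(hrefs, "(http.*?)&")[, 2]

# Drop entries where nothing matched
urls_found <- urls[!is.na(urls)]
```

Using str_match() instead of str_match_all() keeps one result per href, which makes it easy to see which links contained no target URL.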

Scraping html table and its href Links in R

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
The code above gives a nice data frame. However, the last column only contains the text "Pass", whereas I would like to have the link associated with it. I have tried the following to add the links, but they do not end up in the correct rows:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about HTML, CSS, etc., so I am not sure what I am missing to do this properly.
Thanks!!
You're looking for <a> elements inside table cells (td), and then you want the value of their href attribute. Here is one way, which returns a vector with all the URLs for the PDF downloads:
qi_webpage %>%
html_nodes(xpath = "//td/a") %>%
html_attr("href")
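
If some rows have no link, a flat vector of URLs will no longer line up with the table rows. A possible way around this, sketched here on invented markup (the table contents and file names are hypothetical), is to call html_node() once per row, since it returns a missing node, and hence NA, where nothing matches.

```r
library(rvest)

# Hypothetical table: the second row has no link
page <- minimal_html('
  <table>
    <tr><th>Product</th><th>Result</th></tr>
    <tr><td>A</td><td><a href="a.pdf">Pass</a></td></tr>
    <tr><td>B</td><td>Pending</td></tr>
    <tr><td>C</td><td><a href="c.pdf">Pass</a></td></tr>
  </table>
')

tbl <- html_table(html_node(page, "table"), header = TRUE)

# All rows except the header row
rows <- html_nodes(page, "table tr")[-1]

# html_node() yields one result per row, so the output stays aligned
links <- html_attr(html_node(rows, "td a"), "href")

tbl$link <- links
```

The difference from html_nodes() is that html_node() keeps one output per input row, so rows without a link get NA instead of being skipped.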