How to correctly read html node & content


I received some R code from a colleague who no longer works with me. The code is meant to scrape prices for multiple products from an online dealer.
Although the actual code takes the product links from an internal Excel list, it looks roughly like this:
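library(rvest)
library(dplyr)  # provides nth()

input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)
vec_galaxus <- rep(NA_character_, length(input_galaxus2))  # one slot per product
i <- 0  # index into vec_galaxus
sess <- session(input_galaxus2[1])  # to start the session
for (j in input_galaxus2) {
  sess <- sess %>% session_jump_to(j)  # jump to URL
  i <- i + 1
  try(vec_galaxus[i] <- read_html(sess) %>%  # can read directly from sess
        html_nodes('div strong') %>%
        html_text() %>%
        nth(5))
  Sys.sleep(runif(1, min = 1, max = 2))  # pause between requests
}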
One of the articles referred to as j in the code is, for example, 14513929.
But when I run the code, I don't get the prices; I get Service or Standorte instead.
I guess it's because html_nodes() or html_text() selects the wrong elements, but I can't really say how to select the right ones.
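
A possible explanation, sketched under assumptions: 'div strong' matches every <strong> element inside any <div>, so nth(5) depends on how many such elements come before the price, and those are presumably navigation labels such as Service or Standorte. The markup below is invented for illustration; the class names product and price are hypothetical and not taken from the real Galaxus page, which may also render prices with JavaScript.

```r
library(rvest)

# Hypothetical page: NOT the real Galaxus markup, only an illustration
page <- minimal_html('
  <div><strong>Service</strong></div>
  <div><strong>Standorte</strong></div>
  <div class="product"><strong class="price">CHF 49.90</strong></div>
')

# Positional selection depends on everything that precedes the price
by_position <- page %>% html_elements("div strong") %>% html_text2()
first_hit <- by_position[1]

# A selector tied to the element itself (hypothetical class "price")
price <- page %>% html_element("div.product strong.price") %>% html_text2()
```

Inspecting by_position for one product page would show at which position, if any, the price actually sits in the returned HTML.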

Related

RVEST - Extracting text from table - Problems with access to the right table

I would like to extract the values in the table on the top right side of this Webpage:
https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima
(Wärmster Monat : VALUE, Kältester Monat: VALUE, Jahresniederschlag: VALUE)
Unfortunately, if I use html_nodes("SelectorGadget's result for the specific value"), I receive the values for the table at the top of this page instead:
https://www.timeanddate.de/stadt/info/deutschland/karlsruhe
(The web pages are similar: if you click "Uhrzeit/Übersicht" in the top bar, you reach the second page and table; if you click "Wetter" --> "Klima", you reach the first page and table, which is the one I want to extract values from.)
num_link= "https://www.timeanddate.de/wetter/deutschland/Karlsruhe/klima"
num_page= read_html(num_link)
rain_year = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(3) > p:nth-child(1)") %>% html_text()
temp_warm = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
temp_cold = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
I get "character (empty)" for each variable. :(
Thank you in advance!

You can use the html_table function in rvest, which works well by now. It makes extraction a bit easier, but I do recommend learning to identify the right CSS selectors as well, because html_table does not always work. html_table always returns a list with all tables from the web page, so in this case the steps are:
1. get the html
2. get the tables
3. index the right table (here there is only one)
4. reformat a little to extract the values
library(rvest)
library(tidyverse)
result <- read_html("https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima") %>%
  html_table() %>%   # list of all tables on the page
  .[[1]] %>%         # pick the first (and only) table
  rename('measurement' = 1,
         'original' = 2) %>%
  # pull the numeric part out of each value
  mutate(value_num = str_extract_all(original, "[[:digit:]]+\\.*[[:digit:]]*") %>% unlist())
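
The four steps above can also be tried offline. This is a minimal sketch on an invented table: the labels and numbers are made up and are not the values from timeanddate.de, and base R's regmatches() is used in place of stringr purely to keep the example self-contained.

```r
library(rvest)

# Hypothetical table standing in for the climate summary; values are invented
page <- minimal_html('
  <table>
    <tr><td>Warmest month</td><td>July (20.5 C)</td></tr>
    <tr><td>Coldest month</td><td>January (2.1 C)</td></tr>
  </table>
')

tables <- html_table(page)   # always a list, one element per <table>
climate <- tables[[1]]       # index the table you need
names(climate) <- c("measurement", "original")

# Pull the first number out of each cell
# (regmatches() drops non-matching rows, so every row must contain a number)
climate$value_num <- as.numeric(
  regmatches(climate$original, regexpr("[0-9]+\\.?[0-9]*", climate$original))
)
```

On a page with several tables, length(tables) tells you how many were found, which helps when choosing the index.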

Why does rvest/html_table skip every second row in this table?

I am trying to scrape this table but for some reason every second row is skipped (which means I don't have data on half of the states). This is my code:
# read in the url
library(dplyr)
library(rvest)
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
Does anyone have any ideas why this is? The only thing I can think of is that every second row has background colour specified?
Thanks :)

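See if this helps. It selects the table through an absolute XPath with html_nodes() and then converts the resulting list to a data frame:
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_nodes(xpath = '/html/body/table') %>%
  html_table(fill = TRUE)
d <- as.data.frame(df)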

Not getting expected output in web scraping R

I have written a small program that scrapes a Google search results page, and I want all the URLs on that page. But I'm getting character(0) as the output. Please help me.
Code:
library("rvest")
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".iUh30") %>% html_text() %>% as.character()

That class is not present in the returned HTML. You need a different selector strategy, and then you need to extract the href attribute:
library(rvest)
library(stringr)
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".jfp3ef > a") %>% html_attr(., "href")
for (i in d) {
  res <- str_match_all(i, '(http.*?)&')  # capture the URL up to the first "&"
  print(res[[1]][, 2])
}
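
The extraction step in the loop above can be checked without a network request. This is a sketch on invented href values; the "/url?q=..." redirect form is an assumption about what the selector returns, and example.com is a placeholder.

```r
library(stringr)

# Invented href values in the redirect style the loop above expects
hrefs <- c(
  "/url?q=https://www.example.com/page&sa=U&ved=abc",
  "/search?q=google&start=10"
)

# str_match() is vectorised: column 2 holds the first capture group
urls <- str_match(hrefs, "(http.*?)&")[, 2]
urls <- urls[!is.na(urls)]   # drop hrefs without an embedded URL
```

Because str_match() works on the whole vector at once, the for loop is not needed in this variant.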

How to trigger a file download using R

I am trying to use R to trigger a file download on this site: http://www.regulomedb.org. Basically, an ID, e.g., rs33914668, is input in the form, click Submit. Then in the new page, click Download in the bottom left corner to trigger a file download.
I have tried rvest with the help from other posts.
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("http://www.regulomedb.org")
POST(
  url = "http://www.regulomedb.org",
  body = list(
    data = "rs33914668"
  ),
  encode = "form"
) -> res
pg <- content(res, as="parsed")
By checking pg, I think I am still on the first page, not the http://www.regulomedb.org/results. (I don't know how to check pg list other than reading it line by line). So, I cannot reach the download button. I cannot figure out why it cannot jump to the next page.
By learning from some other posts, I managed to download the file without using rvest.
library(httr)
library(rvest)
library(RCurl)
session <- html_session("http://www.regulomedb.org")
form <- html_form(session)[[1]]
filledform <- set_values(form, `data` = "rs33914668")
session2 <- submit_form(session, filledform)
form2 <- html_form(session2)[[1]]
filledform2 <- set_values(form2)
thesid <- filledform2[["fields"]][["sid"]]$value
theurl <- paste0('http://www.regulomedb.org/download/',thesid)
download.file(theurl,destfile="test.bed",method="libcurl")
In filledform2, I found the sid. Using www.regulomedb.org/download/:sid, I can download the file.
I am new to html or even R, and don't even know what sid is. Although I made it, I am not satisfied with the coding. So, I hope some experienced users can provide better, alternative solutions, or improve my current solution. Also, what is wrong with the POST/rvest method?

library(rvest)
url <- "http://www.regulomedb.org/"
page <- html_session(url)
# Submit the query to the results page
# (request_POST() is an unexported rvest function, hence the :::)
download_page <- rvest:::request_POST(page, url = "http://www.regulomedb.org/results",
                                      body = list("data" = "rs33914668"),
                                      encode = 'form')
# This is a unique ID generated from your query
sid <- html_nodes(download_page, css = '#download > input[type="hidden"]:nth-child(8)') %>% html_attr('value')
# This is the current UNIX time
download_token <- as.numeric(as.POSIXct(Sys.time()))
# Request the file with the sid and token
download_page1 <- rvest:::request_POST(download_page, url = "http://www.regulomedb.org/download",
                                       body = list("format" = "bed",
                                                   "sid" = sid,
                                                   "download_token_value_id" = download_token),
                                       encode = 'form')
# Write the raw response body to disk
writeBin(download_page1$response$content, "regulomedb_result.bed")
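
The sid used above is simply the value of a hidden form field. As a minimal sketch, the form below is invented; the field names mirror the ones posted in the code above, but the real RegulomeDB page may be structured differently. It shows how such a value can be read by field name instead of by position (nth-child(8)):

```r
library(rvest)

# Hypothetical results page; the real RegulomeDB form may differ
page <- minimal_html('
  <form id="download" action="/download" method="post">
    <input type="hidden" name="format" value="bed">
    <input type="hidden" name="sid" value="abc123">
    <input type="submit" name="go" value="Download">
  </form>
')

# Select the hidden field by its name attribute rather than by position
sid <- page %>%
  html_element('#download input[name="sid"]') %>%
  html_attr("value")
```

Selecting by name keeps working if the site adds or reorders hidden fields, whereas a positional selector would silently pick a different input.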

Scraping html table and its href Links in R

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
The above gives a nice data frame. However, the last column only contains the text "Pass" when I would like to have the link associated with it. I have tried to use the following to add the links, but they do not correspond to the correct row:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about html, css, etc, so I am not sure what I am missing to accomplish this properly.
Thanks!!

You're looking for <a> elements inside of table cells (<td>). Then you want the value of the href attribute. So here's one way, which will return a vector with all the URLs for the PDF downloads:
qi_webpage %>%
html_nodes(xpath = "//td/a") %>%
html_attr("href")
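
If some rows have no link, the vector above will be shorter than the table and the links will not line up with the rows. One way around that, sketched here on an invented table (the column names and file paths are hypothetical, not taken from the qimedical.com page), is to query each row separately so that every row yields exactly one result:

```r
library(rvest)

# Hypothetical table: the second data row has no link in its last cell
page <- minimal_html('
  <table>
    <tr><th>Product</th><th>Result</th></tr>
    <tr><td>A</td><td><a href="/docs/a.pdf">Pass</a></td></tr>
    <tr><td>B</td><td>Pending</td></tr>
    <tr><td>C</td><td><a href="/docs/c.pdf">Pass</a></td></tr>
  </table>
')

tbl <- html_table(page)[[1]]

# One result per data row: html_element() yields a missing node (and
# html_attr() an NA) where a row has no link, so the vector stays aligned
rows <- html_elements(page, "tr")[-1]   # drop the header row
tbl$link <- rows %>% html_element("td a") %>% html_attr("href")
```

Rows without a link get NA in the new column instead of shifting the remaining links up by one position.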
