Not getting expected output in web scraping R - html

I have written a small program that scrapes a Google search results page, and I want all the URLs on that page. But I'm getting character(0) in the output. Please help me.
Code:
library("rvest")
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".iUh30") %>% html_text() %>% as.character()

That class is not present in the returned HTML. You need a different selector strategy, and then you can extract the href attribute:
library(rvest)
library(stringr)
r_h = read_html("https://www.google.com/search?q=google&oq=google&aqs=chrome.0.69i59j0l2j69i60l2j69i65.1101j0j7&sourceid=chrome&ie=UTF-8")
d = r_h %>% html_nodes(".jfp3ef > a") %>% html_attr(., "href")
for (i in d) {
  res <- str_match_all(i, '(http.*?)&')
  print(res[[1]][, 2])
}
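The extraction step in the answer above can be tried offline. The following is a minimal sketch, assuming a made-up HTML fragment that reuses the `.jfp3ef` class from the answer; the markup and the example.com/example.org URLs are hypothetical, and the class names Google actually returns may differ. It also uses the vectorized `str_match()` in place of the loop over `str_match_all()`.

```r
library(rvest)
library(stringr)

# Hypothetical markup standing in for a results page; the real HTML differs.
html <- paste0(
  '<div class="jfp3ef"><a href="/url?q=https://www.example.com/&amp;sa=U">Example</a></div>',
  '<div class="jfp3ef"><a href="/url?q=https://www.example.org/page&amp;sa=U">Other</a></div>'
)
page <- read_html(html)

# Select the <a> children of the wrapper class and read their href attribute
hrefs <- page %>% html_nodes(".jfp3ef > a") %>% html_attr("href")

# Keep the part that starts at "http" and stops before the first "&";
# column 2 of the match matrix holds the first capture group
urls <- str_match(hrefs, "(http.*?)&")[, 2]
print(urls)
```

Because `str_match()` returns one row per input string, an href that does not contain the pattern yields NA rather than being dropped.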

Related

Using R code to scrape data from a webpage into an Excel file

I have written code in R that is supposed to retrieve certain information from a website and export it to an Excel file. It works for one website, but for this particular website it returns N/A values in Excel, and I don't know why.
library(tidyverse)
library(rvest)
library(stringr)
library(rebus)
library(lubridate)
library(xlsx)
library(readr)
setwd("C:/Users/user/Desktop/Tenders")
getwd()
ran=seq(300100,300000,-1)
result = data.frame(matrix(nrow = length(ran), ncol = 1))
colnames(result) <- c("111")
for (i in ran){
  url <- paste0("http://tenders.procurement.gov.ge/public/?go=", i)
  download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
  content <- read_html("scrapedpage.html")
  #111
  status=content %>% html_nodes("#print_area tr:nth-child(1) td + td")%>% html_text()
  status[length(status) == 0] <- NA
  status=as.data.frame(status)
  status=(if (nrow(status)>1){
    a=as.matrix(paste(unlist(status), collapse =" "))
  } else {as.matrix(status)
  })
  result[i, 1]=status
}
s=as.data.frame(ran)
final=result[-c(1:s[nrow(s),]),]
#Excel
write.xlsx(final,"C:/Users/user/Desktop/Tenders.xlsx", sheetName = "111")
I am using the SelectorGadget tool, a Chrome extension, to identify the HTML parts that the code should use to gather the information (for example, in the code above it is "#print_area tr:nth-child(1) td + td", which is the first entry on the linked page).
Can someone help me find out what the issue might be?
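One thing that may be worth checking is how an empty selector result turns into NA. The sketch below is only an illustration under assumptions: the three HTML fragments are invented stand-ins for downloaded pages (not the real site), and `collapse_or_na()` is a hypothetical helper. It shows that the question's selector yields NA whenever it matches nothing in the HTML that was actually downloaded, and it stores results by position rather than by tender ID.

```r
library(rvest)

# Hypothetical stand-ins for three downloaded pages (not the real site).
pages <- c(
  '<table id="print_area"><tr><td>Status</td><td>Open</td></tr></table>',
  '<table id="print_area"><tr><td>Status</td><td>Closed</td><td>Late</td></tr></table>',
  '<p>No tender here</p>'
)
ids <- c(300100, 300099, 300098)

# Hypothetical helper: one string per page, NA when nothing matched
collapse_or_na <- function(x) {
  if (length(x) == 0) return(NA_character_)
  paste(x, collapse = " ")
}

status <- vapply(pages, function(p) {
  read_html(p) %>%
    html_nodes("#print_area tr:nth-child(1) td + td") %>%
    html_text() %>%
    collapse_or_na()
}, character(1), USE.NAMES = FALSE)

# One row per requested ID, filled by position instead of result[i, 1]
result <- data.frame(id = ids, status = status, stringsAsFactors = FALSE)
print(result)
```

If every row comes back NA on the real site, saving one downloaded page and searching it for `print_area` would show whether the element is present in the raw HTML at all.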

How to correctly read html node & content

I received R code from a colleague who no longer works with me. The code is intended to scrape prices for multiple products from an online dealer.
Although the code itself takes the links to the products from an internal Excel list, it looks roughly like this:
input_galaxus2<-paste0('https://www.galaxus.ch/',input_galaxus$`Galaxus Artikel`)
sess <- session(input_galaxus2[1]) #to start the session
for (j in input_galaxus2){
  sess <- sess %>% session_jump_to(j) #jump to URL
  i=i+1
  try(vec_galaxus[i] <- read_html(sess) %>% #can read direct from sess
    html_nodes('div strong') %>%
    html_text() %>%
    nth(5))
  Sys.sleep(runif(1, min=1, max=2))
}
One of the articles marked as j in the code is, for example, 14513929.
But when I run the code, I don't get the prices; I get "Service" or "Standorte" instead.
I guess it's because the nodes passed to html_text() are selected wrongly, but I can't really say how to properly select the right ones.

RVEST - Extracting text from table - Problems with access to the right table

I would like to extract the values in the table on the top right side of this Webpage:
https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima
(Wärmster Monat : VALUE, Kältester Monat: VALUE, Jahresniederschlag: VALUE)
Unfortunately, if I use html_nodes("SelectorGadget result for the specific value"), I receive the values for the table at the top of this page:
https://www.timeanddate.de/stadt/info/deutschland/karlsruhe
(The webpages are similar: if you click "Uhrzeit/Übersicht" in the top bar, you reach the second page and table; if you click "Wetter" --> "Klima", you reach the first page and table, which is the one I want to extract values from.)
num_link= "https://www.timeanddate.de/wetter/deutschland/Karlsruhe/klima"
num_page= read_html(num_link)
rain_year = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(3) > p:nth-child(1)") %>% html_text()
temp_warm = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
temp_cold = num_page %>% html_nodes("#climateTable > div.climate-month.climate-month--allyear > div:nth-child(2) > p:nth-child(1)") %>% html_text()
I get "character (empty)" for each variable.
Thank you in advance!
You can use the html_table function in rvest, which works well by now. It makes extraction a bit easier, but I do recommend learning to identify the right CSS selectors as well, because html_table does not always work. Called on a whole page, html_table returns a list with all tables from the webpage, so in this case the steps are:
get the html
get the tables
index the right table (here there is only one)
reformat a little to extract the values
library(rvest)
library(tidyverse)
result <- read_html("https://www.timeanddate.de/wetter/deutschland/karlsruhe/klima") %>%
  html_table() %>%
  .[[1]] %>%
  rename('measurement' = 1,
         'original' = 2) %>%
  mutate(value_num = str_extract_all(original,"[[:digit:]]+\\.*[[:digit:]]*") %>% unlist())
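The four steps can also be tried without a network connection. This is a minimal sketch, assuming an invented two-column table with made-up values; the layout of the real page may differ. It swaps `str_extract_all()` plus `unlist()` for `str_extract()`, which returns exactly one match (or NA) per row, so the new column always has the same length as the table.

```r
library(rvest)
library(dplyr)
library(stringr)

# Hypothetical table with made-up values; the real page's layout may differ.
html <- paste0(
  "<table>",
  "<tr><td>Warmest month</td><td>July (19 C)</td></tr>",
  "<tr><td>Coldest month</td><td>January (1 C)</td></tr>",
  "<tr><td>Annual precipitation</td><td>700.5 mm</td></tr>",
  "</table>"
)

# Steps 1 and 2: parse the HTML and get a list with one element per <table>
tables <- read_html(html) %>% html_table()

# Steps 3 and 4: pick the first table, rename by position, pull out the number
result <- tables[[1]] %>%
  rename(measurement = 1, original = 2) %>%
  mutate(value_num = as.numeric(str_extract(original, "[0-9]+(\\.[0-9]+)?")))
print(result)
```

On a page that writes decimals with a comma, the pattern and the `as.numeric()` conversion would need adjusting.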

Why does rvest/html_table skip every second row in this table?

I am trying to scrape this table but for some reason every second row is skipped (which means I don't have data on half of the states). This is my code:
# read in the url
library(dplyr)
library(rvest)
webpage <- read_html ("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
Does anyone have any ideas why this is? The only thing I can think of is that every second row has background colour specified?
Thanks :)
See if this helps:
webpage <- read_html("https://oui.doleta.gov/unemploy/trigger/2013/trig_010613.html")
df <- webpage %>%
  html_nodes(xpath = '/html/body/table') %>%
  html_table(fill = TRUE)
d = as.data.frame(df)

Scraping html table and its href Links in R

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.
library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)
link <- "http://www.qimedical.com/resources/method-suitability/"
qi_webpage <- read_html(link)
qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]
The above gives a nice data frame. However, the last column only contains the text "Pass", when I would like to have the link associated with it. I have tried to use the following to add the links, but they do not correspond to the correct row:
qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))
qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]
qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))
I know little about HTML, CSS, etc., so I am not sure what I am missing to accomplish this properly.
Thanks!!
You're looking for <a> elements inside table cells (<td>). Then you want the value of the href attribute. So here's one way, which will return a vector with all the URLs for the PDF downloads:
qi_webpage %>%
  html_nodes(xpath = "//td/a") %>%
  html_attr("href")
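The vector above contains only the links that exist, so it can be shorter than the table when some rows have no link. A minimal sketch of one way to keep links aligned with rows follows, assuming an invented table (the product names and PDF file names are hypothetical). It relies on `html_node()` returning one result per input row, with a missing node where nothing matches.

```r
library(rvest)

# Hypothetical table: the second data row has no link.
html <- paste0(
  "<table>",
  "<tr><th>Product</th><th>Result</th></tr>",
  "<tr><td>A</td><td><a href='a.pdf'>Pass</a></td></tr>",
  "<tr><td>B</td><td>Pending</td></tr>",
  "<tr><td>C</td><td><a href='c.pdf'>Pass</a></td></tr>",
  "</table>"
)
page <- read_html(html)

# Data rows only: <tr> elements that contain at least one <td>
rows <- page %>% html_nodes(xpath = "//tr[td]")

# First link in each row; rows without a link give NA
links <- rows %>% html_node("td a") %>% html_attr("href")

# Attach the links to the parsed table, row by row
tbl <- page %>% html_node("table") %>% html_table(header = TRUE)
tbl$link <- links
print(tbl)
```

This keeps one entry per data row, so the new column lines up with the data frame without any `ifelse()` matching on the cell text.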