Parsing rvest output from an unstructured infobox - html

I am attempting to extract data from a wiki fandom website using the rvest package in R. However, I am running into several issues because the infobox is not structured as an HTML table. Below are my attempts at dealing with this issue:
library(tidyverse)
library(data.table)
library(rvest)
library(httr)
url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")
#See here that the infobox information does not appear when checking for HTML tables in the page
df <- read_html(url) %>%
  html_table()
#So now just extract data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()
The second attempt does succeed at extracting the raw data, but the result is formatted in a way that is not easy to parse into a clean dataframe. So I then attempted to extract each element of the table individually, which might be easier to clean and structure into a dataframe. However, when I attempt to do so using an XPath, I get an empty result:
df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
html_text2()
So I suppose my question is primarily: does anyone know of a good way to automatically extract the infobox in a dataframe-friendly format? If not, would someone be able to point me towards why my attempt to extract each panel individually is not working?

If you target the div.pi-data directly, you could do something like this:
bind_rows(
  read_html(url) %>%
    rvest::html_nodes("div.pi-data") %>%
    map(.f = ~ tibble(
      label = html_elements(.x, ".pi-data-label") %>% html_text2(),
      text  = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
    ) %>%
      unnest(text)
    )
)
Output:
# A tibble: 29 x 2
label text
<chr> <chr>
1 Homeworld Tatooine[1]
2 Born 41 BBY,[2] Tatooine[3]
3 Died 4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
4 Species Human[1]
5 Gender Male[1]
6 Height 1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]
7 Mass 120 kilograms in armor[7]
8 Hair color Blond,[8] light[9] and dark[10]
9 Eye color Blue,[11] later yellow (dark side)[12]
10 Skin color Light,[11] later pale[5]
# ... with 19 more rows
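If you would rather have one row per character with a column per field, the long label/text tibble can be reshaped with tidyr. A minimal sketch with toy data (the column names mirror the output above; the values are illustrative, and repeated labels like "Born" are collapsed into one cell first):

```r
library(tidyverse)

# Toy rows standing in for the scraped label/text pairs above
info <- tibble(
  label = c("Homeworld", "Born", "Born"),
  text  = c("Tatooine[1]", "41 BBY,[2]", "Tatooine[3]")
)

# Collapse repeated labels, then pivot so each label becomes a column
wide <- info %>%
  group_by(label) %>%
  summarise(text = paste(text, collapse = "; "), .groups = "drop") %>%
  pivot_wider(names_from = label, values_from = text)

wide$Born
#> [1] "41 BBY,[2]; Tatooine[3]"
```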

Related

"rvest" not fetching the product details using html_nodes()

I was using rvest to scrape product details (names, prices, and availability) from Amazon's product search results. I was able to fetch the webpage with read_html(), but I am not able to extract the product details. The page has a <span> tag with class = "a-size-medium a-color-base a-text-normal". I have used html_nodes("span.a-size-medium a-color-base a-text-normal"), but got NA.
Here is the reproducible code:
library(rvest)
library(xml2)
url <- "https://www.amazon.in/s?k=Smartphone&rh=n%3A1389401031&ref=nb_sb_noss"
page <- read_html(url)
data <- page %>%
  html_node("span.a-size-medium a-color-base a-text-normal") %>%
  html_text()
print(data)
You just need to change the CSS selector a little bit: chain the classes with dots instead of separating them with spaces. I was able to get the names and the prices; the availability was a little trickier :/
library(rvest)
library(xml2)
url <- "https://www.amazon.in/s?k=Smartphone&rh=n%3A1389401031&ref=nb_sb_noss"
page <- read_html(url)
name <- page %>% html_nodes(".a-size-medium.a-color-base.a-text-normal") %>% html_text()
price <- page %>% html_nodes(".a-price-whole") %>% html_text()
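To see why the original selector returned nothing, here is a self-contained sketch using an inline snippet (the markup is a stand-in for Amazon's, which can change at any time). In CSS, a space means "descendant element", so the three classes must be chained with dots to target a single element:

```r
library(rvest)

# Tiny inline document mimicking the product-name span
doc <- minimal_html(
  '<span class="a-size-medium a-color-base a-text-normal">Phone X</span>'
)

# Space-separated: looks for <a-color-base> elements *inside* the span -- no match
doc %>% html_nodes("span.a-size-medium a-color-base a-text-normal") %>% html_text()
#> character(0)

# Dot-chained: one element carrying all three classes -- matches
doc %>% html_nodes(".a-size-medium.a-color-base.a-text-normal") %>% html_text()
#> [1] "Phone X"
```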

Extracting text/number from tspan class tag HTML with R

I am trying to extract the current production number from this website: http://okg.se/sv/Produktionsinformation/ (shown in the blue gauge area on the page).
Here is the HTML code part I need to use:
<tspan dy="0" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);">518</tspan>
The example of the code I used:
url <- "http://okg.se/sv/Produktionsinformation//"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
content %>% html_nodes(".content__info__item__value")
But the result I get shows that there are no nodes available:
{xml_nodeset (0)}
Do you have any ideas on how to solve this issue?
Thanks in advance!
I am not entirely sure which value you need, but this works:
library(rvest)
# page url
url <- "http://okg.se/sv/Produktionsinformation/"
# current value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-current")
# max value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-max")
The HTML you see with your browser has been processed by JavaScript, so it isn't the same as the HTML you get with rvest.
The raw data you are looking for is actually stored in attributes of a div with the id "gauge", so you get it like this:
library(rvest)
#> Loading required package: xml2
"http://okg.se/sv/Produktionsinformation//" %>%
read_html() %>%
html_node("#gauge") %>%
html_attrs() %>%
`[`(c("data-current", "data-max"))
#> data-current data-max
#> "553" "1450"
Note that you don't need to save the HTML to your local drive to process it. You can read it directly from the internet by passing the URL to read_html().
Created on 2020-02-20 by the reprex package (v0.3.0)

How to get values of select menu?

I'm trying to get the values (all regions) of the select menu on this webpage. What am I doing wrong? I've tried almost every combination, but the result is always empty. One of them is:
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
regions <- page %>%
  html_nodes("div") %>%
  html_nodes("span") %>%
  html_nodes(xpath = '//*[@id="select2-ys-areaSelector-container"]') %>%
  html_attr("title")
Thanks in advance.
XPath is kind of an ugly beast. Get the id of the select element, then get all option groups, and finally their text nodes. Use html_text to convert them to an R character vector.
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
regions <- page %>%
  html_nodes(xpath = '//*[@id="ys-areaSelector"]/optgroup/*/text()') %>%
  html_text()
I'd use a CSS selector combination, assuming all option values are wanted:
library(rvest)
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
options <- page %>%
html_nodes('#ys-areaSelector [data-url]') %>%
html_text()
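Both answers assume the same underlying structure. Here is a self-contained sketch against an inline snippet mirroring that structure (the id and data-url attribute are taken from the answers above; the real page may differ), showing the CSS approach on a document you can run offline:

```r
library(rvest)

# Inline snippet standing in for the live page's select menu
doc <- minimal_html('
  <select id="ys-areaSelector">
    <optgroup label="Anadolu">
      <option data-url="/en/kadikoy">Kadikoy</option>
      <option data-url="/en/uskudar">Uskudar</option>
    </optgroup>
  </select>')

# Every element with a data-url attribute inside the select
doc %>% html_nodes('#ys-areaSelector [data-url]') %>% html_text()
#> [1] "Kadikoy" "Uskudar"
```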

R to change the values in html form and scrape web data

I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.
I am using the code given in this link Using r to navigate and scrape a webpage with drop down html forms.
However, I am not able to get the data, probably due to a change in the structure of the page. In the code from the above link, pgform <- html_form(pgsession)[[3]] was used to change the values of the form. I was not able to find a similar form in my case.
url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)
pgsource <- read_html(url)
pgform <- html_form(pgsession)
The result in my case:
> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's':
Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:
library(rvest)
page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()
station_id <- page %>% html_nodes('button#cityname + ul a') %>%
  html_attr('onclick') %>% # If you need names, grab the `href` attribute, too.
  sub(".*'(.*)'.*", '\\1', .)
which can then be put into expand.grid with the months and years to generate all the necessary combinations:
df <- expand.grid(station_id = station_id,
                  month = sprintf('%02d', 1:12),
                  year = 2014:2016)
(Note if you want 2017 data, you'll need to construct those separately and rbind so as not to construct months that haven't happened yet.)
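A minimal sketch of that partial-year case, with placeholder station IDs (use the scraped ones in practice) and the first six months of 2017 assumed for illustration:

```r
# Placeholder station IDs; substitute the scraped vector
station_id <- c("S24", "S104")

# Full years
df <- expand.grid(station_id = station_id,
                  month = sprintf('%02d', 1:12),
                  year  = 2014:2016)

# Partial year: only construct the months that exist (here Jan-Jun 2017)
df_2017 <- expand.grid(station_id = station_id,
                       month = sprintf('%02d', 1:6),
                       year  = 2017)

df <- rbind(df, df_2017)
nrow(df)
#> [1] 84
```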
The combinations can then be paste0ed into URLs:
urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_',
               df$station_id, '_', df$year, df$month, '.csv')
which can be lapplyed across to download all the files:
# Warning! This will download a lot of files! Make sure you're in a clean directory.
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})
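Once downloaded, the files can be read back and combined into a single data frame the same way. A toy demonstration that writes two small CSVs to a temp directory first, so it runs without the downloads (it assumes all files share the same columns, which is worth verifying for the real data):

```r
# Stand-in for the downloaded files: two tiny CSVs in a temp directory
dir <- tempfile()
dir.create(dir)
write.csv(data.frame(station = "S24", rain = 1.2),
          file.path(dir, "DAILYDATA_S24_201401.csv"), row.names = FALSE)
write.csv(data.frame(station = "S104", rain = 0.4),
          file.path(dir, "DAILYDATA_S104_201401.csv"), row.names = FALSE)

# Read every matching file and stack the results
files <- list.files(dir, pattern = "^DAILYDATA_.*\\.csv$", full.names = TRUE)
weather <- do.call(rbind, lapply(files, read.csv))
nrow(weather)
#> [1] 2
```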

Using rvest to scrape GoodReads pages

I'm trying to scrape ratings and review numbers on goodreads, but getting an NA result. Why is this?
SelectorGadget finds "span span" for the average rating on hover over, but there's no "valid path" found at the bottom.
Using the same method on other sites (e.g. IMDB, theatlantic.com) works fine.
Here's my code and result (I've also tried replacing html_text with html_tag):
Rating <- read_html("http://www.goodreads.com/book/show/22444789-delicious-foods")
Rating %>%
html_node("span span") %>%
html_text () %>%
as.numeric()
[1] NA
Warning message:
In function_list[[k]](value) : NAs introduced by coercion
I didn't have any success using SelectorGadget with the Goodreads site, but sometimes you just have to look at the HTML source and find what you're looking for that way.
In this case, you can use the .average class selector:
Rating %>%
  html_node(".average") %>%
  html_text() %>%
  as.numeric()