How to get values of select menu? - html

I'm trying to get the values (all regions) of a select menu on this webpage. What am I doing wrong? I've tried almost every combination, but the result is always empty. One of them is:
library(rvest)

page <- read_html("https://www.yemeksepeti.com/en/istanbul")
regions <- page %>%
  html_nodes("div") %>%
  html_nodes("span") %>%
  html_nodes(xpath = '//*[@id="select2-ys-areaSelector-container"]') %>%
  html_attr("title")
Thanks in advance.

XPath is kind of an ugly beast. Get the id of the select element, then get all of its option groups, and finally get their text data. Use html_text() to convert the result to an R character vector.
page <- read_html("https://www.yemeksepeti.com/en/istanbul")
regions <- page %>%
  html_nodes(xpath = '//*[@id="ys-areaSelector"]/optgroup/*/text()') %>%
  html_text()
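If you want the options' value attributes rather than their display text, here is a variant of the same idea (this assumes each option element carries a value attribute, which you should confirm in the page source):

library(rvest)

page <- read_html("https://www.yemeksepeti.com/en/istanbul")

# Select the <option> elements themselves instead of their text nodes,
# then pull the "value" attribute (assumed present on each option)
values <- page %>%
  html_nodes(xpath = '//*[@id="ys-areaSelector"]/optgroup/option') %>%
  html_attr("value")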

I'd use a CSS selector combination, assuming all option values are wanted:
library(rvest)

page <- read_html("https://www.yemeksepeti.com/en/istanbul")
options <- page %>%
  html_nodes('#ys-areaSelector [data-url]') %>%
  html_text()
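The scraped text often carries stray whitespace around each entry; a quick cleanup step in base R (no assumptions beyond the code above):

options <- trimws(options)
head(options)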

Related

Parsing rvest output from an unstructured infobox

I am attempting to extract data from a wiki Fandom website using the rvest package in R. However, I am running into several issues because the infobox is not structured as an HTML table. Please see below for my attempts at dealing with this issue:
library(tidyverse)
library(data.table)
library(rvest)
library(httr)
url <- c("https://starwars.fandom.com/wiki/Anakin_Skywalker")
#See here that the infobox information does not appear when checking for HTML tables in the page
df <- read_html(url) %>%
  html_table()

#So now just extract the data using the CSS selector
df <- read_html(url) %>%
  html_element("aside") %>%
  html_text2()
The second attempt does succeed at extracting the raw data, but it comes back in a form that is not easy to shape into a clean dataframe. So I then attempted to extract each element of the table individually, which might be easier to clean and structure into a dataframe. However, when I attempt to do so using the XPath, I get an empty result:
df <- read_html(url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/aside/section[1]') %>%
  html_text2()
So I suppose my question is primarily: does anyone know of a good way to automatically extract the infobox in a dataframe-friendly format? If not, could someone point me towards why my attempt to extract each panel individually is not working?
If you target the div.pi-data directly, you could do something like this:
bind_rows(
  read_html(url) %>%
    rvest::html_nodes("div.pi-data") %>%
    map(.f = ~ tibble(
      label = html_elements(.x, ".pi-data-label") %>% html_text2(),
      text = html_elements(.x, ".pi-data-value") %>% html_text2() %>% strsplit(split = "\n")
    ) %>% unnest(text))
)
Output:
# A tibble: 29 x 2
label text
<chr> <chr>
1 Homeworld Tatooine[1]
2 Born 41 BBY,[2] Tatooine[3]
3 Died 4 ABY,[4]DS-2 Death Star II Mobile Battle Station, Endor system[5]
4 Species Human[1]
5 Gender Male[1]
6 Height 1.88 meters,[1] later 2.03 meters (6 ft, 8 in) in armor[6]
7 Mass 120 kilograms in armor[7]
8 Hair color Blond,[8] light[9] and dark[10]
9 Eye color Blue,[11] later yellow (dark side)[12]
10 Skin color Light,[11] later pale[5]
# ... with 19 more rows

How to scrape ordered and unordered lists in Wikipedia using rvest relative to a header

I want to scrape the events for several countries from Wikipedia and place each individual event into a row of a table. A given date can have one event (where there is a single main bullet point) or multiple events (where there are "sub bullet points").
What I'm having trouble with is how to grab both the ordered and unordered lists at once and separate them cleanly. The code below grabs the "sub bullets" but not the "main" ones, and if I change the code to exclude the /li, it places the "sub bullets" into a single cell. I was wondering if there is a way to separate the "main" and "sub" bullet points more easily.
There appear to be slight differences in the HTML layout between the pages that contain events for different countries. Is it possible to specify an XML path based on a header (rather than a relative or absolute position) and then grab the elements after it? Unfortunately, being so new to HTML, I'm not quite sure how to do that, or whether it is even possible. Is it possible to find the header "Events by month", then the header "January", and then get all bullet points and sub bullet points into separate cells of a table?
Any help would be appreciated. Thank you.
# This gets the sub bullet points of the events, but not the main ones
page <- xml2::read_html("https://en.wikipedia.org/wiki/2020_in_the_United_States")
month_data <- page %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/ul[3]/li") %>%
  html_text()
This webpage has no structure; it is just one long list of tags without clear separation between the different sections.
This is a partial solution:
library(rvest)
library(xml2)
library(dplyr)

page <- xml2::read_html("https://en.wikipedia.org/wiki/2020_in_the_United_States")
lineitems <- page %>%
  html_nodes(xpath = "//html/body/div[3]/div[3]/div[5]/div[1]/ul[3]/li")

# Count the number of child ul nodes in each line item
subcount <- lineitems %>% html_node("ul") %>% xml_length()

output <- lapply(seq_along(subcount), function(i) {
  if (subcount[i] == 0) {
    out <- lineitems[i] %>% html_text()
  } else {
    out <- lineitems[i] %>%
      html_node("ul") %>%
      html_nodes(xpath = ".//li") %>%
      html_text()
  }
  out
})
# Name the list items with the date
names(output) <- lineitems %>%
  html_node("a") %>%
  html_attr("title")

# A list for each date
output
I didn't have the time or patience to refine this. You may have an easier time trying to select the nodes based on the available attributes instead of the particular html/xml tags.
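For instance, here is a minimal sketch of selecting relative to a heading, assuming the standard Wikipedia markup where each month heading contains a span whose id matches the month name (inspect the page source to confirm this holds):

library(rvest)
library(xml2)

page <- xml2::read_html("https://en.wikipedia.org/wiki/2020_in_the_United_States")

# Find the heading whose child span has id "January", then take the first
# <ul> that follows it in document order (assumes the event list comes
# directly after the month heading)
january <- page %>%
  html_nodes(xpath = "//span[@id='January']/parent::*/following-sibling::ul[1]/li") %>%
  html_text()

head(january)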

"rvest" not fetching the product details using html_nodes()

I was using rvest to scrape product details (name, price, and availability) from Amazon's product search results. I was able to fetch the webpage with read_html(), but I am not able to fetch the details of the products. The page has a <span> tag with class = "a-size-medium a-color-base a-text-normal". I have used html_nodes("span.a-size-medium a-color-base a-text-normal"), but got NA.
Here is the reproducible code:
library(rvest)
library(xml2)
url <- "https://www.amazon.in/s?k=Smartphone&rh=n%3A1389401031&ref=nb_sb_noss"
page <- read_html(url)
data <- page %>%
  html_node("span.a-size-medium a-color-base a-text-normal") %>%
  html_text()
print(data)
You just need to change the CSS selector a little bit: the spaces in "span.a-size-medium a-color-base a-text-normal" make it a descendant selector, so the three classes on the same element have to be joined with dots instead. I was able to get the names and the prices; the availability was a little bit trickier :/
library(rvest)
library(xml2)
url <- "https://www.amazon.in/s?k=Smartphone&rh=n%3A1389401031&ref=nb_sb_noss"
page <- read_html(url)
name <- page %>% html_nodes(".a-size-medium.a-color-base.a-text-normal") %>% html_text()
price <- page %>% html_nodes(".a-price-whole") %>% html_text()
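To line the two vectors up in one table, a minimal sketch (this assumes names and prices come back in matching order, which is not guaranteed on Amazon result pages, so check the lengths first):

# Guard against mismatched lengths before combining; sponsored listings
# or missing prices can throw the two vectors out of sync
n <- min(length(name), length(price))
products <- data.frame(name = name[seq_len(n)], price = price[seq_len(n)])
head(products)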

Extracting text/number from tspan class tag HTML with R

I am trying to extract the current production number from this website: http://okg.se/sv/Produktionsinformation/ (shown in the blue gauge area on the page).
Here is the HTML code part I need to use:
<tspan dy="0" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);">518</tspan>
Here is an example of the code I used:
url <- "http://okg.se/sv/Produktionsinformation//"
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
content %>% html_nodes(".content__info__item__value")
But the result I get shows that there are no nodes available:
{xml_nodeset (0)}
Do you have any ideas on how to solve this issue?
Thanks in advance!
I am not quite sure which value you need, but this works:
library(rvest)

# page url
url <- "http://okg.se/sv/Produktionsinformation/"

# current value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-current")

# max value
read_html(url) %>%
  html_nodes(".footer__gauge") %>%
  html_attr("data-max")
The html you see in your browser has been processed by JavaScript, so it isn't the same as the html you see with rvest.
The raw data you are looking for is actually stored in attributes of a div with the id "gauge", so you get it like this:
library(rvest)
#> Loading required package: xml2

"http://okg.se/sv/Produktionsinformation//" %>%
  read_html() %>%
  html_node("#gauge") %>%
  html_attrs() %>%
  `[`(c("data-current", "data-max"))
#> data-current     data-max
#>        "553"       "1450"
Note that you don't need to save the html to your local drive to process it; you can read it directly from the internet by passing the url to read_html().
Created on 2020-02-20 by the reprex package (v0.3.0)

Using rvest to scrape GoodReads pages

I'm trying to scrape ratings and review numbers on Goodreads, but I'm getting an NA result. Why is this?
SelectorGadget finds "span span" for the average rating when hovering over it, but there's no "valid path" found at the bottom.
Using the same method on other sites (e.g. IMDb, theatlantic.com) works fine.
Here's my code and result (I've also tried replacing html_text with html_tag):
library(rvest)

# html() is deprecated in rvest; read_html() is the current equivalent
Rating <- read_html("http://www.goodreads.com/book/show/22444789-delicious-foods")
Rating %>%
  html_node("span span") %>%
  html_text() %>%
  as.numeric()
[1] NA
Warning message:
In function_list[[k]](value) : NAs introduced by coercion
I didn't have any success using SelectorGadget with the Goodreads site either, but sometimes you just have to look at the html source and find what you're looking for that way.
In this case, you can use the .average class selector:
Rating %>%
  html_node(".average") %>%
  html_text() %>%
  as.numeric()
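The question also asks for the number of ratings; as a hypothetical follow-up, assuming the page exposes schema.org itemprop="ratingCount" markup (verify this in the actual page source before relying on it):

# Hypothetical selector: assumes itemprop="ratingCount" is present in
# the page's markup; strip non-digits before converting
Rating %>%
  html_node("[itemprop='ratingCount']") %>%
  html_text() %>%
  gsub("[^0-9]", "", .) %>%
  as.numeric()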