How can I extract this text from HTML using RSelenium?

I want to scrape the 57, but I'm only able to get the text that says "Search Results:".
This is the code that I use to grab the element with RSelenium:
element<- remDr$findElements(using = 'id','resultListHeadingName')
lapply(element,function (x) x$getElementText()) %>% unlist()
This is the output:
[1] "Search Results:"
This is the HTML from the website:
<h1 class="page-title alt selectorgadget_selected" xpath="1">
<span id="resultListHeadingName" class="color-p2">Search Results:</span> 1 - 50 of 57 </h1>

Find the parent:
element <- remDr$findElements(using = 'xpath', "//*[@id='resultListHeadingName']/..")
lapply(element,function (x) x$getElementText()) %>% unlist()
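The parent's text still contains the whole heading, so the total has to be pulled out of the string afterwards. A minimal Python sketch of that last step, assuming the heading always has the form "Search Results: 1 - 50 of 57" (the helper name total_results is hypothetical, and the same regular expression could be applied in R):

```python
import re

def total_results(heading_text):
    """Return the number after 'of' in a heading such as
    'Search Results: 1 - 50 of 57', or None if the pattern is absent."""
    match = re.search(r"of\s+([\d,]+)\s*$", heading_text.strip())
    if match is None:
        return None
    # Drop thousands separators before converting, e.g. '1,234' -> 1234.
    return int(match.group(1).replace(",", ""))

print(total_results("Search Results: 1 - 50 of 57"))  # prints 57
```

Returning None when the pattern is missing keeps an unexpected heading layout from being mistaken for a count.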


Scrape information from meta and button tags with rvest

I am trying to scrape the average user rating (out of 5 stars) and the number of ratings from a wine seller's page. The average rating out of 5 seems to be in a button tag, while the number of ratings is in a meta tag.
Here is the HTML:
<div class="bv_avgRating_component_container notranslate">
<button
type="button"
class="bv_avgRating"
aria-expanded="false"
aria-label="average rating value is 4.5 of 5."
id="avg-rating-button"
role="link"
itemprop="ratingValue"
>
4.5
</button>
</div>
<div class="bv_numReviews_component_container">
<meta itemprop="reviewCount" content="95" />
<button
type="button"
class="bv_numReviews_text"
aria-label="Read 95 Reviews"
aria-expanded="false"
id="num-reviews-button"
role="link"
>
(95)
</button>
</div>
What I've tried:
library(tidyverse)
library(rvest)
x <- "/wine/red-wine/cabernet-sauvignon/amici-cabernet-sauvignon-napa/p/20095750?s=918&igrules=true"
ratings <- read_html(paste0("https://www.totalwine.com", x)) %>%
  html_nodes(xpath = '//meta[@itemprop="reviewCount"]') %>%
  html_attr('content') # returns character(empty)
ratings <- read_html(paste0("https://www.totalwine.com", x)) %>%
  html_nodes("meta") %>%
  html_attr("content") # returns chr [1:33]
ratings <- read_html(paste0("https://www.totalwine.com", x)) %>%
  html_nodes("div meta") %>%
  html_attr("content") # returns chr [1:21]
ratings <- read_html(paste0("https://www.totalwine.com", x)) %>%
  html_nodes("meta[itemprop=reviewCount]") %>%
  html_attr("content") # returns character(empty)
At the end of the day, the two points I am trying to extract are 4.5 and content="95".
Open the Network tab of the browser's DevTools and reload the page. You'll see that the page loads its data from https://www.totalwine.com/product/api/product/product-detail/v1/getProduct/20095750-1?shoppingMethod=INSTORE_PICKUP&state=US-CA&storeId=918 (which is a JSON file).
You can get the rating and review count like this:
data <- jsonlite::fromJSON("https://www.totalwine.com/product/api/product/product-detail/v1/getProduct/20095750-1?shoppingMethod=INSTORE_PICKUP&state=US-CA&storeId=918")
rating <- data$customerAverageRating
reviews_count <- data$customerReviewsCount
Update: If you're new to web scraping, you may be wondering why I didn't use rvest at all. This page uses JavaScript to generate its content, and rvest cannot run JavaScript; it only reads the HTML as served, before any scripts have run.
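For readers working in Python rather than R, the same idea can be sketched with the standard library. This is an illustration only: the two field names come from the answer above, the sample payload is made up for the demo, and whether the endpoint still responds (or needs extra headers) is not something this sketch verifies.

```python
import json
from urllib.request import urlopen

def parse_product(payload):
    """Pull the rating and review count out of a product JSON string."""
    data = json.loads(payload)
    return data.get("customerAverageRating"), data.get("customerReviewsCount")

def fetch_product(url):
    """Download the JSON document and parse it (needs network access)."""
    with urlopen(url) as response:
        return parse_product(response.read().decode("utf-8"))

# Hypothetical sample payload containing only the two fields used above.
sample = '{"customerAverageRating": 4.5, "customerReviewsCount": 95}'
rating, reviews_count = parse_product(sample)
print(rating, reviews_count)
```

Keeping the parsing separate from the download makes it possible to check the field handling without hitting the site.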

Python web scraping using BeautifulSoup, how to merge two <p> text into one element of list

I use BeautifulSoup to do the web scraping, then put the result into a list.
The HTML looks like this:
<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>
My code is:
title = []
text = []
for newpage in list:
    webpage = urlopen(newpage).read()
    soup = BeautifulSoup(webpage, 'html.parser')
    header = soup.find_all("span", attrs={"id": "titletextonly"})
    info = soup.find_all("p", attrs={"class": "attrgroup"})
    for h in header:
        title.append(h.get_text())
    for m in info:
        text.append(m.get_text())
The resulting text list is:
["2013 Volkswagen Passat","condition:excellent"]
But I want the result to look like this:
["2013 Volkswagen Passat condition:excellent"]
How can I merge the two texts when putting them into the list?
Use the string method join(), which is called on the separator and takes the list as its argument.
title = []
for h in header:
    title.append(h.get_text())
title = ' '.join(title)  # ''.join([title]) would raise a TypeError
Alternatively, append the elements themselves to the list and use a list comprehension to join their text.
title = []
for h in header:
    title.append(h)
title = ' '.join([i.text for i in title])
Hope this helps! Cheers!
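To end up with one list element per page, as the question asks, the join can be applied to each page's paragraph texts before appending. A minimal sketch, assuming the texts have already been extracted (the helper name merge_texts and the sample strings are illustrative only):

```python
def merge_texts(parts, sep=" "):
    """Join the non-empty, whitespace-trimmed pieces into one string."""
    cleaned = [p.strip() for p in parts if p.strip()]
    return sep.join(cleaned)

text = []
# One inner list per scraped page; each holds that page's <p> texts.
pages = [["\n2013 Volkswagen Passat\n", "\ncondition: excellent\n"]]
for paragraphs in pages:
    text.append(merge_texts(paragraphs))  # one list element per page

print(text)  # ['2013 Volkswagen Passat condition: excellent']
```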
You can use stripped_strings:
from bs4 import BeautifulSoup
html = """<p class="attrgroup">
<span><b>2013 Volkswagen Passat</b></span>
<br>
</p>
<p class="attrgroup">
<span>condition: <b>excellent</b></span>
<br>
</p>"""
tag = BeautifulSoup(html, 'html.parser')
data = ' '.join(tag.stripped_strings)
print(data)

Remove div in shiny app's title

I'm using shinydashboard, and I want to put an image in the title with the following code:
header <- dashboardHeader(
title = div(img(src = 'logo.png',
height = 60,
width = 120))
)
Everything works, but when I open the app in Chrome, the browser tab shows the raw HTML below instead of a title.
Is there any way to keep this from showing in the browser tab and show some normal text instead?
<div> <img src="logo.png" height="60" width="120"/>
Since you're overwriting the dashboard's title, you have to set the page title explicitly with tags$title:
tags$title('This is my page')
For shinydashboard:
ui <- dashboardPage(title = 'This is my title', header, sidebar, body, skin='red')
The best solution could be the following:
header <- dashboardHeader(
title = HTML('<div> <img src="logo.png" height="60" width="120"/>')
)
Old thread, but as of 2022 titlePanel has a windowTitle argument:
titlePanel(title, windowTitle = title)
So it is simply a case of supplying that, as in the following where the tab gets its own text string:
titlePanel(
  title = htmltools::div(
    htmltools::img(
      src = "Transparent Logo No Slogan.png", style = "width:50px; height:50px"),
    "my application title"
  ),
  windowTitle = "my tab title"
)

rvest & parsing HTML: find a list of items, and extract specific information from each item

I'm struggling to find neat code to do the following:
1. Find a list of five items
2. Iterate over all five items
3. Extract 4 columns from each item
4. Return a dataframe with five rows, one for each item
Example HTML:
<div class="i-am-a-list">
<div class="item item-one"><a class="title"></a><p>sub-title</p></div>
<div class="item item-two"><a class="title-two"></a><p>sub-title</p></div>
<div class="item item-three"><a class="title-three"></a><p>sub-title</p></div>
<div class="item item-four"><a class="title-for"></a><p>sub-title</p></div>
<div class="item item-five"><a class="title-five"></a><p>sub-title</p></div>
</div>
Code thus far:
# find the upper list
coll <- read_html(doc.html) %>%
html_node('.i-am-a-list') %>%
html_nodes(".item")
# problems here, how do I iterate over the returned divs
# I was expecting something like
results <- coll %>%
do(parse_a_single_item) %>%
rbind_all()
Would it be possible to write such pretty code to do such a common task? :)
It's not really pretty and I feel like I'm missing some obvious method, but you can do:
library(rvest)
library(purrr)
read_html(x) %>%
  html_node('.i-am-a-list') %>%
  html_nodes(".item") %>%
  map_df(~{
    class = html_attr(.x, 'class')
    a1 = html_nodes(.x, 'a') %>% '['(1) %>% html_attr('href')
    a2 = html_nodes(.x, 'a') %>% '['(2) %>% html_attr('class')
    # or with CSS selector
    # a1 = html_nodes(.x, 'a:first-child') %>% html_attr('href')
    # a2 = html_nodes(.x, 'a:nth-child(2)') %>% html_attr('class')
    p = html_nodes(.x, 'p') %>% html_text()
    data.frame(class, a1, a2, p)
  })
# class a1 a2 p
# 1 item item-one title sub-title
# 2 item item-two title-two sub-title
# 3 item item-three title-three sub-title
# 4 item item-four title-for sub-title
# 5 item item-five title-five sub-title
data:
x <- '<div class="i-am-a-list">
<div class="item item-one"><a class="title"></a><p>sub-title</p></div>
<div class="item item-two"><a class="title-two"></a><p>sub-title</p></div>
<div class="item item-three"><a class="title-three"></a><p>sub-title</p></div>
<div class="item item-four"><a class="title-for"></a><p>sub-title</p></div>
<div class="item item-five"><a class="title-five"></a><p>sub-title</p></div>
</div>'

How to identify a node with its XML value in XPath?

I use R to scrape a web site, and when parsing the HTML code, I have this code below:
<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
Now I would like to get some values in this code.
How can I identify the span with the XML value "Number" and get that node, in order to extract "number extra"?
I know how to use xpathApply to identify nodes in order to get the xmlValue or some attributes (like href with xmlGetAttr). But I don't know how to identify a node when all I know is its xmlValue.
xpathApply(page, '//span[@class="property"]', xmlValue)
If I want to get the "value" 72 for the property "Surface", what is the most efficient way?
Here is what I have started to do:
First, I extract all "property":
xpathApply(page, '//span[@class="property"]', xmlValue)
Then I extract all "value":
xpathApply(page, '//span[@class="value"]', xmlValue)
Then I build a list or a matrix, so that I can identify the value of "Surface", which is 72. The problem is that sometimes a span with class="property" is not followed by a span with class="value" in its h2, so I cannot build a proper list.
Could this be the most efficient way: identify the span with class="property", then the h2 that contains this span, then the span with class="value"?
For your HTML made to be well-formed by adding a single root element,
<?xml version="1.0" encoding="UTF-8"?>
<r>
<div class="line">
<h2 class="clearfix">
<span class="property">Number
<div>number extra</div>
</span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>
</r>
(A) This XPath expression,
//span[@class='property' and starts-with(., 'Number')]/div/text()
will return
number extra
as requested.
(B) This XPath expression,
//h2[span[@class='property' and . = 'Surface']]/span[@class='value']/text()
will return
72
as requested.
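The lookup in (B) can also be done step by step: visit each h2, check its property span, then read the value span next to it. A hedged Python sketch using the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no "and", no starts-with()), so the text comparison is done in Python; the function name value_for is hypothetical:

```python
import xml.etree.ElementTree as ET

DOC = """<r>
  <div class="line"><h2 class="clearfix">
    <span class="property">Number<div>number extra</div></span>
    <span class="value">3</span>
  </h2></div>
  <div class="line"><h2 class="clearfix">
    <span class="property">Surface</span>
    <span class="value">72</span>
  </h2></div>
</r>"""

def value_for(root, name):
    """Return the text of the value span in the h2 whose property span
    starts with `name`, or None when there is no such pair."""
    for h2 in root.iter("h2"):
        prop = h2.find("span[@class='property']")
        value = h2.find("span[@class='value']")
        if prop is None or value is None:
            continue  # tolerate a property without a matching value
        if (prop.text or "").strip().startswith(name):
            return (value.text or "").strip()
    return None

root = ET.fromstring(DOC)
print(value_for(root, "Surface"))  # prints 72
print(value_for(root, "Number"))   # prints 3
```

Pairing property and value inside the same h2 sidesteps the misalignment that arises when the two are extracted as separate lists and a value span is missing.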
XPath can evaluate the contents of a tag using its own function text(). Using rvest for simplicity:
library(rvest)
html <- '<div class="line">
<h2 class="clearfix">
<span class="property">Number<div>number extra</div></span>
<span class="value">3</span>
</h2>
</div>
<div class="line">
<h2 class="clearfix">
<span class="property">Surface</span>
<span class="value">72</span>
</h2>
</div>'
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Number"]/*') %>% # select node
html_text() # get text contents of node
# [1] "number extra"
XPath also has selectors to follow family axes, in this case following::, which selects every node that comes after the matched one in document order:
html %>% read_html() %>% # read html
html_nodes(xpath = '//span[text()="Surface"]/following::*') %>% # select node
html_text() # get text contents of node
# [1] "72"