Rvest: getting a specific text from html_node

I want to extract only "Beech Valley Solutions - ". When I run

html_nodes('li') %>%
  html_nodes(".flexbox.empLoc") %>%
  html_text()

all the information comes out: "Beech Valley Solutions - Atlanta, GA Today 24hr"
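One way to keep only the employer prefix is to post-process the extracted text with a regular expression. A minimal sketch, assuming page holds the result of read_html() on the listing page (the selectors are the ones from the question):

library(rvest)
library(stringr)

# Selectors from the question; page is assumed to be read_html(url)
locs <- page %>%
  html_nodes("li") %>%
  html_nodes(".flexbox.empLoc") %>%
  html_text()

# Keep everything up to and including the first " - "
str_extract(locs, "^.+? - ")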

There is one more way of doing scraping with rvest. Instead of passing a CSS selector to html_nodes(), you can pass an XPath. Just an example below:

page %>% html_nodes(xpath = "//*[@id='series-matches']/div[20]/div[3]/div[1]/a[1]/span")
Reference:
https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/
The XPath is easy to fetch:
1. Right-click the section for which you want the XPath.
2. Select "Inspect" from the drop-down menu.
3. The HTML source will appear on the right side; right-click the highlighted element and press the Copy option.
4. A drop-down will appear, from which select "Copy XPath".
5. Paste (Ctrl+V) the XPath within html_nodes(xpath = "xpath here").
I hope this will help you.
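Putting the steps together, a short illustrative example (the URL is a placeholder and the XPath is the sample one from above, so both are assumptions):

library(rvest)

page <- read_html("https://example.com/series")  # placeholder URL
page %>%
  html_nodes(xpath = "//*[@id='series-matches']/div[20]/div[3]/div[1]/a[1]/span") %>%
  html_text()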

Extract text from dynamic Web page using R

I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#
None of the text is hard-coded; everything is dynamic, and I don't know where to start. I've tried a few things with the packages rvest and xml2, but I can't even tell whether I'm making progress.
I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:
Target      Attack
AAA News    Fake News
AAA News    Fake News
AAA News    A total disgrace
...         ...
Mr. ZZZ     A real nut job
but I'd like to show how to do everything programmatically (no copy/paste).
My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?
PS: I know that this could be a duplicate, I just can't tell which question it duplicates, since there are totally different approaches out there :\
I used up my free article allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.
If you use your browser's developer tools and look at the Network tab, you will find two CSV files:
tweets-full.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv
It looks like the reduced file creates the table quoted above, and tweets-full.csv contains the full tweets. You can download these files directly with read.csv() and then process the information as needed.
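For example, a minimal sketch (the column layout is whatever the CSV ships with, so we just peek at the first rows):

# Read the reduced CSV straight from the NYT static asset URL
tweets_reduced <- read.csv(
  "https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv",
  stringsAsFactors = FALSE
)
head(tweets_reduced)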
Be sure to read the terms of service before scraping any webpage.
Here's a programmatic approach with RSelenium and rvest:
library(RSelenium)
library(rvest)
library(tidyverse)

driver <- rsDriver(browser = "chrome", port = 4234L, chromever = "87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]

# Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div')

# Extract entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
                  html_text)

# Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
                map(html_nodes, css = 'div.g-twitter-quote-c') %>%
                map(html_text))

# Bind the entities and quotes together. Two letters are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {if (length(.x) > 0 & length(.y) > 0) {data.frame(Entity = .x, Insult = .y)} else {
           data.frame(Entity = NA, Insult = NA)}})) -> Result

# Strip out the curly quotation marks around each insult
Result %>%
  mutate(Insult = str_replace_all(Insult, "(^“)|([ .,!?]?”)", "") %>% str_trim) -> Result

# Take a look at the result
Result %>%
  slice_sample(n = 10)
   Entity                  Insult
1  Mitt Romney             failed presidential candidate
2  Hillary Clinton         Crooked
3  The “mainstream” media  Fake News
4  Democrats               on a fishing expedition
5  Pete Ricketts           illegal late night coup
6  The “mainstream” media  anti-Trump haters
7  The Washington Post     do nothing but write bad stories even on very positive achievements
8  Democrats               weak
9  Marco Rubio             Lightweight
10 The Steele Dossier      a Fake Dossier
The XPath was obtained by inspecting the webpage source (F12 in Chrome), hovering over elements until the correct one was highlighted, right-clicking, and choosing Copy XPath.

How to get information from HTML in R?

I want to get price information from this page: https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg
My code:

library(rvest)

url <- "https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg"
x <- xml2::read_html(url)
price <- x %>%
  html_node('span.product-price smaller-price') %>%
  html_text()

but it returns NA.
What can I do?
You have a space in your CSS selector where you really need a period. Try html_node('span.product-price.smaller-price') in your code and see if that works.
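A minimal sketch of the corrected call (assuming the page still serves the same markup):

library(rvest)

x <- xml2::read_html("https://www.coffeedesk.pl/product/16632/Espresso-Miesiaca-Lacava-Etiopia-Yirgacheffe-Rocko-Mountain-1Kg")

# A period selects one element carrying both classes; a space would instead
# look for a .smaller-price element nested inside a .product-price element
price <- x %>%
  html_node("span.product-price.smaller-price") %>%
  html_text(trim = TRUE)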

RSelenium: selecting a dropdown menu

Hi, I am trying to use RSelenium to select a dropdown menu.
The field I want to click for the dropdown menu is Date Range, so I looked it up in the HTML code (see picture below) and found
class="select2-choice"
to be the pointer so I invoke command to click on the dropdown menu
webElem <- rd$client$findElement(using = 'xpath',
                                 value = '//*[@class="select2-choice"]')
webElem$clickElement()
Then I want to select "Custom" in the dropdown field, so I looked it up in the HTML code (see picture below) and found it is under
select id="namedRange-3640"
and the option is
value="custom"
so I invoke the RSelenium commands again to click on this custom field:
webElem <- rd$client$findElement(using = 'xpath', "//select[@id='namedRange-3640']/option[@value='custom']")
webElem$clickElement()
However, there is no action on the webpage, and no warning from the code either. I tried this on another webpage with a much simpler structure, like the W3C tutorial on dropdown menus, and it works. In this case, though, it seems to be slightly more complicated, with something called ng-repeat which I have not come across before. Does anyone know how to select the custom field?
Many thanks
This could be the solution.
library(RSelenium)

remDr <- remoteDriver(browser = c("firefox"), port = 4445)
remDr$open()
remDr$navigate("your_web_site.com")

# Switch into the iframe that contains the dropdown
frame_ws <- remDr$findElement(using = 'id', value = "iframeResult")
remDr$switchToFrame(frame_ws)

# You can replace "today" with any of the values in the list
option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'today']")
option$clickElement()
If you want to dig deeper into the topic, you should visit here.
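Adapting the same idea to the page from the question, a hedged sketch (the id namedRange-3640 comes from the question, but ids like this are often generated dynamically, so matching the stable prefix may be more robust):

# Match the <select> by the stable part of its id, then pick the custom option
option <- remDr$findElement(
  using = 'xpath',
  "//select[starts-with(@id, 'namedRange')]/option[@value = 'custom']"
)
option$clickElement()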

Can't access specific content in html page with rvest and selectorGadget

I'm trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I have tried to find the relevant nodes with the SelectorGadget add-on in Chrome and with the code inspector in Firefox. I have tried this code:
require("dplyr")
require("rvest")
require("stringr")
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")

TestHTML <- GIwebPage %>%
  html_node("div.grid , div#maincontent.col.nine_col , div.sequence , pre.genebank , .feature") %>%
  html_text(trim = TRUE)
Then I try to find the relevant text but it is simply not there.
str_extract_all(TestHTML, pattern = "(synonym).{30}")
[[1]]
character(0)
str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
[[1]]
character(0)
All I seem to be accessing is some of the text content of the column on the right.
str_extract_all(TestHTML, pattern = "(protein).{30}")
[[1]]
[1] "protein codes including ambiguities a"
[2] "protein sequence for myosin-9 (NP_00"
[3] "protein should not be confused with t"
[4] "protein, partial [Homo sapiens]gi|294"
[5] "protein codes including ambiguities a"
I have tried so many combinations of node selections with html_node() that I no longer know what to try. Is this content buried in some structure I can't see? Or am I just not skilled enough to spot the right node to select?
Thanks a lot,
José.
The page loads the information dynamically; the underlying data is stored at another location.
Using the developer tools of your browser, look for the request: the information you are looking for is served by "viewer.fcgi". Right-click to copy the link.
See similar question/answers: R not accepting xpath query
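As an alternative to hunting for the link by hand, the same record is also available through NCBI's documented E-utilities efetch endpoint; a hedged sketch (the rettype=gp flat file normally carries the gene_synonym and GeneID fields, but the exact layout may vary):

library(stringr)

# Fetch the GenPept flat file for GI 29436380 via NCBI E-utilities
flat <- readLines(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=29436380&rettype=gp&retmode=text"
)

# Pull out the lines carrying gene synonyms and the GeneID cross-reference
str_subset(flat, "gene_synonym|GeneID")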

Getting data from an HTML page using R

Can anyone help me figure out why the code below does not return any data for the selected table?
library(httr)
library(rvest)

url <- read_html("http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB")
table <- html_node(url, "table#f05v5-sorting-table.border-top2.border-allside.clearboth")
Thanks!
You are missing some steps. Your workflow should look like this:
dat_html <- read_html(
  "http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB"
)
dat_nodes <- html_nodes(dat_html, xpath = "xxxx")
dat <- html_table(dat_nodes)
dat will be a list, so if you want a data frame, you could do something like:
dat_df <- as.data.frame(dat)
Or, if you like tibbles:
dat_tbl <- as_tibble(dat)
I cannot find the table you are interested in on that webpage, so you have to replace "xxxx" with the XPath of the table you are interested in.
To find the XPath while inspecting the page in Chrome or Chromium, right-click the node in the inspector window and look for Copy, then Copy XPath.
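One caveat, as a hedged sketch: if the table is rendered by JavaScript, read_html() will never see it, so it is worth first checking which tables exist in the raw HTML at all (fill = TRUE pads ragged rows):

library(rvest)

dat_html <- read_html(
  "http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB"
)

# Parse every <table> rvest can see and count them
tables <- html_table(html_nodes(dat_html, "table"), fill = TRUE)
length(tables)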