I am relatively new to web scraping.
I am having problems with child numbers when web scraping multiple patents. The child number changes according to the location of the table on the web page. Sometimes the child is "div:nth-child(17)" and other times it is "div:nth-child(18)" when searching for different patents.
My line of code is this one:
library(rvest)

IPCs <- sapply("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html", function(url1){
  tryCatch(
    url1 %>%
      as.character() %>%
      read_html() %>%
      html_nodes("#inner_content2 > div:nth-child(17) > div.disp_elm_value3 > table") %>%
      html_table(),
    error = function(e){ NA }
  )
})
When I search for another patent (for example: "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html") the child number changes to (18).
I am planning to analyse more than a thousand patents, so I would need code that works for both child numbers. Is there a CSS selector which allows me to select more children? I have tried "div:nth-child(n)" and "div:nth-child(*)", but they do not work.
I am also open to using a different method. Does anybody have any suggestions?
Try these chained pseudo-classes:
div:nth-child(n+17):nth-child(-n+18)
It's a range between the 17th and the 18th child: :nth-child(n+17) matches the 17th child and later, :nth-child(-n+18) matches the 18th child and earlier, so chained together they match only positions 17 and 18.
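As a rough sketch of how this could slot into the original rvest call (whether both patents' tables really fall inside the 17-18 window is an assumption to verify):

library(rvest)

urls <- c(
  "http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html",
  "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html"
)

IPCs <- sapply(urls, function(url1){
  tryCatch(
    url1 %>%
      read_html() %>%
      # the range selector matches the table whether it sits in the 17th or 18th child div
      html_nodes("#inner_content2 > div:nth-child(n+17):nth-child(-n+18) > div.disp_elm_value3 > table") %>%
      html_table(),
    error = function(e) NA
  )
})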
I aim to extract three data points from a URL. I am able to locate the specific top-level and individual CSS nodes and XPaths using SelectorGadget, and I aim to use the html_node() function (html_node(html(url), CSS)) to extract the elements I am interested in.
I have used the main node CSS (node "._2t2gK1hs") and was able to extract the first element as a string. The top CSS node appears to embed only the first element, not the other two, although all three elements (one text and two numeric) share the same CSS node classes (for all three, a heading in "._39sLqIkw" followed by the value in "._1NHwuRzF").
[Snapshot of SelectorGadget highlighting the specific data points I would like to extract.]
In attempting to extract the data I tried:
library(rvest)
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html")
html_node(page0_url, "._2t2gK1hs")
#Resulting in a string with the top element I aim to extract embedded.
{html_node}
<div class="_2t2gK1hs" data-tab="TABS_ABOUT" data-section-signature="about" id="ABOUT_TAB">
[1] <div>\n<div class="_39sLqIkw">PRICE RANGE</div>\n<div class="_1NHwuRzF">€124<!-- --> - <!-- -->€222<!-- --> <!-- -->(Based on Average Rates for a Standard Room) ...>
#Failed to extract the two remaining elements by selecting the individual CSS selectors or XPaths.
library(rvest)
page0_url <- read_html("https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html")
page0_url %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), " _1NHwuRzF ")]') %>%
  html_text(trim = TRUE)
#Tried, without success, passing the specific element node followed/preceded by #PRICE RANGE, #LOCATION, #NUMBER OF ROOMS.
#I wonder how I should pass the argument and what node(s) to use in the above function.
#Expected result
PRICE RANGE
122 222
LOCATION
Spain Catalonia Province of Gerona Ripoll
NUMBER OF ROOMS
5
Thank you
Those classes look dynamic. Here is a hopefully more robust selector strategy based on the relationship between more stable-looking elements, avoiding the likely dynamic class values:
library(rvest)
library(magrittr)
page0_url <- read_html('https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html')

data <- page0_url %>%
  html_nodes('.in-ssr-only [data-tab=TABS_ABOUT] div[class]') %>%
  html_text()

data
I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#
None of the text is hard-coded, everything is dynamic and I don't know where to start. I've tried a few things with packages rvest and xml2 but I can't even tell if I'm making progress or not.
I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:
Target    | Attack
--------- | ----------------
AAA News  | Fake News
AAA News  | Fake News
AAA News  | A total disgrace
...       | ...
Mr. ZZZ   | A real nut job
but I'd like to show how to do everything programmatically (no copy/paste).
My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?
PS: I know that this could be a duplicate; I just can't tell of which question, since there are totally different approaches out there :\
I used my free articles allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.
If you use your browser's developer tools and look at the Network tab, you will find two CSV files:
tweets-full.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv
It looks like the reduced file creates the table quoted above, and tweets-full.csv contains the full tweets. You can download these files directly with read.csv() and then process the information as needed.
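A minimal sketch of that, assuming the static01.nyt.com URLs above are still live (the column layout isn't documented here, so inspect it after loading):

tweets_reduced <- read.csv(
  "https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv",
  stringsAsFactors = FALSE
)

# Check the shape and column names before reshaping into the Target/Attack table
str(tweets_reduced)
head(tweets_reduced)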
Be sure to read the terms of service before scraping any webpage.
Here's a programmatic approach with RSelenium and rvest:
library(RSelenium)
library(rvest)
library(tidyverse)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]
#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
  html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div')
#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
html_text)
#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
map(html_nodes, css = 'div.g-twitter-quote-c') %>%
map(html_text))
#Bind the entities and quotes together. There are two letters that are blank, so fall back to NA
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {if(length(.x) > 0 & length(.y) > 0){data.frame(Entity = .x, Insult = .y)}else{
           data.frame(Entity = NA, Insult = NA)}})) -> Result
#Strip out the quotes
Result %>%
mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result
#Take a look at the result
Result %>%
slice_sample(n=10)
Entity Insult
1 Mitt Romney failed presidential candidate
2 Hillary Clinton Crooked
3 The “mainstream” media Fake News
4 Democrats on a fishing expedition
5 Pete Ricketts illegal late night coup
6 The “mainstream” media anti-Trump haters
7 The Washington Post do nothing but write bad stories even on very positive achievements
8 Democrats weak
9 Marco Rubio Lightweight
10 The Steele Dossier a Fake Dossier
The XPath was obtained by inspecting the webpage source (F12 in Chrome), hovering over elements until the correct one was highlighted, right-clicking, and choosing Copy XPath.
I'm trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I have tried to find the relevant nodes with the SelectorGadget add-on in Chrome and with the code inspector in Firefox. I have tried this code:
require("dplyr")
require("rvest")
require("stringr")
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")
TestHTML <- GIwebPage %>% html_node("div.grid , div#maincontent.col.nine_col , div.sequence , pre.genebank , .feature") %>% html_text(trim = TRUE)
Then I try to find the relevant text but it is simply not there.
str_extract_all(TestHTML, pattern = "(synonym).{30}")
[[1]]
character(0)
str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
[[1]]
character(0)
All I seem to be accessing is some of the text content of the column on the right.
str_extract_all(TestHTML, pattern = "(protein).{30}")
[[1]]
[1] "protein codes including ambiguities a"
[2] "protein sequence for myosin-9 (NP_00"
[3] "protein should not be confused with t"
[4] "protein, partial [Homo sapiens]gi|294"
[5] "protein codes including ambiguities a"
I have tried so many combinations of node selections with html_node() that I don't know what to try anymore. Is this content buried in some structure I can't see, or am I just not skilled enough to work out which node to select?
Thanks a lot,
José.
The page loads the information dynamically; the underlying data is stored at another location.
Using the developer tools in your browser, look for the request to "viewer.fcgi": the information you are looking for is stored there. Right-click the request to copy the link.
See similar question/answers: R not accepting xpath query
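As an alternative sketch that skips the browser entirely, the same record can usually be retrieved through NCBI's E-utilities efetch endpoint as a GenPept flat file; this is a different route than viewer.fcgi, and the exact qualifier names (gene_synonym, GeneID) are assumptions to confirm against the downloaded record:

library(stringr)

efetch_url <- paste0(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
  "?db=protein&id=29436380&rettype=gp&retmode=text"
)

# Download the GenPept flat file as plain text
record <- paste(readLines(efetch_url, warn = FALSE), collapse = "\n")

# Pull the qualifiers of interest out of the flat file
str_extract_all(record, "/gene_synonym=\"[^\"]+\"")
str_extract_all(record, "GeneID:[0-9]+")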
I'm trying to scrape a web page.
I want to get a dataset from two different HTML nodes: ".table-grosse-schrift" and "td.zentriert.no-border".
library(rvest)
url <- "https://www.transfermarkt.co.uk/serie-a/spieltag/wettbewerb/IT1/saison_id/2016/spieltag/4"
tt <- read_html(url) %>% html_nodes(".table-grosse-schrift") %>% html_text() %>% as.matrix()
temp1 <- data.frame(as.character(gsub("\r|\n|\t|\U00A0", "", tt[, ])))
temp2 <- read_html(url) %>% html_nodes("td.zentriert.no-border") %>% html_text() %>% data.frame()
The problem is that the order of the ".table-grosse-schrift" nodes on the web page keeps changing, so I cannot match the data from the two nodes.
I thought the solution could be to get both nodes' data at the same time, like this:
tt <- read_html(url) %>% html_nodes(".table-grosse-schrift") %>% html_nodes("td.zentriert.no-border") %>% html_text() %>% as.matrix()
But this code does not work.
If I understand correctly, you should be able to use following-sibling to select the next corresponding sibling in the pair of nodes that you need.
The following-sibling axis indicates all the nodes that have the same
parent as the context node and appear after the context node in the
source document. (Source: https://developer.mozilla.org/en-US/docs/Web/XPath/Axes/following-sibling)
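A rough sketch of how that might look with rvest, anchoring on each ".table-grosse-schrift" node and walking to the sibling cells that follow it; the exact following-sibling path is an assumption about the page's markup and will likely need adjusting:

library(rvest)

url <- "https://www.transfermarkt.co.uk/serie-a/spieltag/wettbewerb/IT1/saison_id/2016/spieltag/4"
page <- read_html(url)

# Grab each header node, then, relative to it, the centred no-border cells that
# follow it in the document, so the pairs stay together even if the overall order shifts.
headers <- html_nodes(page, xpath = '//*[contains(@class, "table-grosse-schrift")]')

paired <- lapply(headers, function(node) {
  list(
    header = html_text(node, trim = TRUE),
    cells  = node %>%
      html_nodes(xpath = './following-sibling::*//td[contains(@class, "zentriert") and contains(@class, "no-border")]') %>%
      html_text(trim = TRUE)
  )
})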
So I am wanting to scrape some NBA data. The following is what I have so far, and it is perfectly functional:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
The problem is that these tables don't include the team names, which is important. To get this information, I was thinking I would scrape the four factors table on the webpage; however, rvest doesn't seem to recognize it as a table. The div that contains the four factors table is:
<div class="overthrow table_container" id="div_four_factors">
And the table is:
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
This made me think that I could access the table via something along the lines of
table = html_nodes(webpage, '#div_four_factors')
but this doesn't seem to work, as I am getting just an empty list. How can I access the four factors table?
I am by no means an HTML expert, but it appears that the table you are interested in is commented out in the source code, and the comment is overridden at some point before the page is rendered.
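One way to deal with that directly, sketched here under the assumption that the commented-out block really does contain a table with id "four_factors", is to pull the comment text back out and re-parse it as HTML:

library(rvest)

url <- "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage <- read_html(url)

# Collect all comment nodes, re-parse their text as HTML, then read the table by id.
four_factors <- webpage %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html() %>%
  html_node('#four_factors') %>%
  html_table()

# The table has a two-row header, so the first row may still need the same
# clean-up applied to the away/home tables above.
four_factors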
Alternatively, if we assume that the Home team is always listed second, we can just use positional arguments and scrape another table on the page:
table = html_nodes(webpage,'#bottom_nav_container')
teams <- html_text(table[1]) %>%
stringr::str_split("Schedule\n")
away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously not the cleanest solution, but such is life in the world of web scraping.