I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#
None of the text is hard-coded; everything is dynamic, and I don't know where to start. I've tried a few things with the rvest and xml2 packages, but I can't even tell if I'm making progress or not.
I've used copy/paste and regexes in Notepad++ to get a tabular structure like this:
Target      Attack
AAA News    Fake News
AAA News    Fake News
AAA News    A total disgrace
...         ...
Mr. ZZZ     A real nut job
but I'd like to show how to do everything programmatically (no copy/paste).
My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?
PS: I know that this could be a duplicate; I just can't tell of which question, since there are totally different approaches out there :\
I've used up my free article allocation at The NY Times for the month, but here is some guidance. It looks like the page uses several scripts to create and display its content.
If you use your browser's developer tools and look at the Network tab, you will find two CSV files:
tweets-full.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv
It looks like the reduced file drives the table quoted above, while tweets-full.csv contains the full tweets. You can read these files directly with read.csv() and then process the information as needed.
Be sure to read the terms of service before scraping any webpage.
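For example, a minimal sketch (the column names are whatever the CSVs contain, so inspect them before processing further):

#Read the two CSVs the page's scripts load (URLs taken from the Network tab)
tweets_full <- read.csv("https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv")
tweets_reduced <- read.csv("https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv")
#Inspect the structure before any further processing
str(tweets_reduced)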
Here's a programmatic approach with RSelenium and rvest:
library(RSelenium)
library(rvest)
library(tidyverse)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]
#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
html_nodes(xpath = '//*[@id="mem-wall"]/div[2]/div')
#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
html_text)
#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
map(html_nodes, css = 'div.g-twitter-quote-c') %>%
map(html_text))
#Bind the entities and quotes together. Two letters are blank, so fall back to NA for those
map2_dfr(Entities, Quotes,
         ~ map2_dfr(.x, .y, ~ {
             if (length(.x) > 0 & length(.y) > 0) {
               data.frame(Entity = .x, Insult = .y)
             } else {
               data.frame(Entity = NA, Insult = NA)}})) -> Result
#Strip out the quotes
Result %>%
mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result
#Take a look at the result
Result %>%
slice_sample(n=10)
Entity Insult
1 Mitt Romney failed presidential candidate
2 Hillary Clinton Crooked
3 The “mainstream” media Fake News
4 Democrats on a fishing expedition
5 Pete Ricketts illegal late night coup
6 The “mainstream” media anti-Trump haters
7 The Washington Post do nothing but write bad stories even on very positive achievements
8 Democrats weak
9 Marco Rubio Lightweight
10 The Steele Dossier a Fake Dossier
The XPath was obtained by inspecting the page source (F12 in Chrome), hovering over elements until the correct one was highlighted, right-clicking, and choosing Copy XPath.
I wonder if you could give me a hint on how to get over a problem I encountered when trying to extract data from HTML files. I looked through other questions on the issue but still cannot figure out exactly what changes I should make. I have five HTML files in a folder. From each of them, I want to extract the links (/item.asp?id=) which I will use later. At first I did this without any trouble by reading each HTML file separately and creating a separate data frame for each one with the links I need. Then I used rbind() to merge the columns from the data frames. The catch is that the first three HTML pages have 20 rows of the data I need, the fourth has 16 rows, and the fifth and last has 9 rows.
The looping code works just fine when I loop over the first three pages, which have 20 rows each, but it fails on the fourth and fifth pages because their row counts are different. I get this error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c("/item.asp?id=22529120", : replacement has 16 rows, data has 20
The code is as follows:
#LOOP over others
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
out.file<-""
file.names <- dir(path, pattern =".html")
for (i in 1:length(file.names))
{
page <- read_html(file.names[i])
links <- page %>% html_nodes("a") %>% html_attr("href")
##get all links into a dataframe
df <-as.data.frame(links)
##get links which contain /item.asp
page_article <- df[grep("/item.asp", df$links), ]
##for each HTML save a separate data frame with links column
java[i] <-as.data.frame(page_article)
##save number of a page where this link is
page_num[i] <- paste(toString(i))
##save id of a person this page belongs to
id[i] <- as.character(file.names[i])
}
Can anyone give a bit of advice on how to solve this issue? If this works, I should then be able to create a single column with the links, another column with the id, and another with the number of the HTML page.
Write a function which returns a dataframe after reading from each HTML file.
library(rvest)

read_html_files <- function(filename) {
  page <- read_html(filename)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  #keep only the links that point to /item.asp
  page_article <- grep("/item.asp", links, value = TRUE)
  data.frame(filename, page_article)
}
Use purrr::map_df to apply this function to every file and combine the output into one dataframe (result).
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
file.names <- list.files(path, pattern ="\\.html$", full.names = TRUE)
result <- purrr::map_df(file.names, read_html_files, .id = 'id')
result
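If you then want the columns named exactly as described in the question (links, id, page number), a small dplyr step on top of result should get you there; links, id and page_num below are just illustrative column names:

library(dplyr)

result %>%
  rename(page_num = id, links = page_article) %>%
  mutate(id = basename(filename)) %>%
  select(links, id, page_num)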
I'm trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I have tried to find the relevant nodes with the SelectorGadget add-on in Chrome and with the inspector in Firefox. I have tried this code:
require("dplyr")
require("rvest")
require("stringr")
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")
TestHTML <- GIwebPage %>% html_node("div.grid , div#maincontent.col.nine_col , div.sequence , pre.genebank , .feature") %>% html_text(trim = TRUE)
Then I try to find the relevant text but it is simply not there.
str_extract_all(TestHTML, pattern = "(synonym).{30}")
[[1]]
character(0)
str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
[[1]]
character(0)
All I seem to be accessing is some of the text content of the column on the right.
str_extract_all(TestHTML, pattern = "(protein).{30}")
[[1]]
[1] "protein codes including ambiguities a"
[2] "protein sequence for myosin-9 (NP_00"
[3] "protein should not be confused with t"
[4] "protein, partial [Homo sapiens]gi|294"
[5] "protein codes including ambiguities a"
I have tried so many combinations of node selections with html_node() that I no longer know what to try. Is this content buried in some structure I can't see, or am I just not skilled enough to work out which node to select?
Thanks a lot,
José.
The page loads this information dynamically; the underlying data is stored at another location.
Using your browser's developer tools, look for the request:
The information you are looking for is stored at the "viewer.fcgi" request; right-click it to copy the link.
See similar question/answers: R not accepting xpath query
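If you want to stay in R, here is a minimal sketch that pulls the same record as plain text through NCBI's E-utilities efetch endpoint instead of the viewer.fcgi link (db, id, rettype and retmode are the standard efetch parameters; if you use the copied viewer.fcgi link instead, check its query string for the equivalent arguments):

library(stringr)

#efetch returns the GenPept flat file for the protein as plain text
url <- paste0("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
              "?db=protein&id=29436380&rettype=gp&retmode=text")
record <- paste(readLines(url), collapse = "\n")

#The fields of interest sit in the FEATURES block of the flat file
str_extract_all(record, 'gene_synonym="[^"]+"')
str_extract_all(record, 'GeneID:\\d+')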
I want to extract only "Beech Valley Solutions - "
When I run
html_nodes('li') %>%
html_nodes(".flexbox.empLoc") %>%
html_text()
All the information comes out. "Beech Valley Solutions - Atlanta, GA Today 24hr"
There is one more way of doing scraping with rvest.
Instead of passing a CSS selector to html_nodes(), you can pass an XPath expression. Just an example below:
page %>% html_nodes(xpath = "//*[@id='series-matches']/div[20]/div[3]/div[1]/a[1]/span")
Reference:
https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/
An XPath is easy to fetch:
1. Right-click the section for which you want to fetch the XPath.
2. Select Inspect from the menu; the HTML panel will appear on the right.
3. Right-click the highlighted element and press the Copy option.
4. From the submenu that appears, select "Copy XPath".
5. Paste (Ctrl+V) the XPath within html_nodes(xpath = "xpath here").
I hope this will help you.
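Alternatively, since the text you get back starts with the employer name followed by " - ", you could keep the original CSS selectors and trim the string afterwards with stringr. This is plain string post-processing rather than part of the XPath approach above, and page stands for whatever read_html() object you pipe in:

library(rvest)
library(stringr)

page %>%
  html_nodes('li') %>%
  html_nodes(".flexbox.empLoc") %>%
  html_text() %>%
  #keep everything up to and including the first " - "
  str_extract("^.*? - ")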
I am relatively new to web scraping.
I am having problems with child numbers when web scraping multiple patents. The child number changes according to the location of the table on the web page. Sometimes the child is "div:nth-child(17)" and other times it is "div:nth-child(18)" when searching for different patents.
My line of code is this one:
IPCs <-sapply("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html", function(url1){
tryCatch(url1 %>%
as.character() %>%
read_html() %>%
html_nodes("#inner_content2 > div:nth-child(17) > div.disp_elm_value3 > table") %>%
html_table(),
error = function(e){NA}
)
})
When I search for another patent (for example: "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html") the child number changes to (18).
I am planning to analyse more than a thousand patents, so I need code that works for both child numbers. Is there a CSS selector which allows me to select more children? I have tried "div:nth-child(n)" and "div:nth-child(*)", but they do not work.
I am also open to using a different method. Does anybody have any suggestions?
Try these pseudo-classes, which select the range between the 17th and 18th children:
div:nth-child(n+17):nth-child(-n+18)
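Dropped into the code from the question, that could look like the sketch below (just showing how the range selector slots into the existing html_nodes() call):

library(rvest)

url1 <- "http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html"

IPCs <- tryCatch(
  url1 %>%
    read_html() %>%
    #matches both the 17th and the 18th child div, whichever one holds the table
    html_nodes("#inner_content2 > div:nth-child(n+17):nth-child(-n+18) > div.disp_elm_value3 > table") %>%
    html_table(),
  error = function(e){NA}
)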
Using htmlParse and xpathSApply from the XML package in R, I've run into an issue where I cannot extract the xmlValue of a certain HTML element on a webpage. I'm fairly (if not entirely) new to using R for web scraping, so I'm not sure what I need to do to get the information I need.
Essentially, the part of the page's source I'm targeting reads:
<div class="panel-body">
<div id="primarycitation">
<h4>Tetracycline Repressor Allostery Does not Depend on Divalent Metal Recognition.
</h4>
So after establishing the link to the webpage (in a for-loop; hence the i)**:
pdbId <- strtrim(pp2[i, 1], 4)
url2 <- paste("http://www.rcsb.org/pdb/explore/explore.do?structureId=", pdbId, sep = "")
val <- htmlParse(url2)
body <- xmlChildren(xmlRoot(val))$body
I've used:
script2 <- xpathSApply(body,
                       "//div[@id = 'primarycitation']",
                       xmlValue)
But all I get from that is junk:
> script2
[1] "\n \n "
Again, I'm not very familiar with web scraping, but to the best of my knowledge, and from the experience I've had so far with all my other functions, the citation title should be the value pulled by xpathSApply. Any suggestions?
** To tack it on at the end the pdbId I'm using here is 4D7N.