rvest html_nodes() returns empty character vector

I am trying to scrape a website (https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281). In particular, I am trying to scrape all 281 "release dates" (with the first being '30-Oct-2006')
To do this, I am using the R package rvest and the SelectorGadget Chrome extension. I am using Mac version 10.15.6.
I attempted the following code:
library(rvest)
library(httr)
library(xml2)
library(dplyr)
link = "https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281"
page = read_html(link)
year = page %>% html_nodes("td:nth-child(4) ul") %>% html_text()
However, this returns `character(0)`.
I used the selector td:nth-child(4) ul because this is what SelectorGadget highlighted for each of the 281 release dates. I also tried "View page source" but could not find these dates anywhere in the source.
I have read that rvest does not always work depending on the type of website. In this case, what is a possible workaround? Thank you.

This site loads its data from the API call https://genelab-data.ndc.nasa.gov/genelab/data/study/all, which returns JSON. You can use httr to fetch the data and parse the JSON:
library(httr)

url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
output <- content(GET(url), as = "parsed", type = "application/json")

# sort by glds_id
output <- output[order(sapply(output, `[[`, i = "glds_id"))]

# build a data frame from the metadata of each study
result <- list()
index <- 1
for (t in output[length(output):1]) {
  result[[index]] <- t$metadata
  result[[index]]$accession <- t$accession
  result[[index]]$legacy_accession <- t$legacy_accession
  index <- index + 1
}
df <- do.call(rbind, result)

options(width = 1200)
print(df)
Output sample (without all columns)
accession legacy_accession public_release_date title
[1,] "GLDS329" "GLDS-329" "30-Oct-2006" "Transcription profiling of atm mutant, adm mutant and wild type whole plants and roots of Arabidops" [truncated]
[2,] "GLDS322" "GLDS-322" "27-Aug-2020" "Comparative RNA-Seq transcriptome analyses reveal dynamic time dependent effects of 56Fe, 16O, and " [truncated]
[3,] "GLDS320" "GLDS-320" "18-Sep-2014" "Gamma radiation and HZE treatment of seedlings in Arabidopsis"
[4,] "GLDS319" "GLDS-319" "18-Jul-2018" "Muscle atrophy, osteoporosis prevention in hibernating mammals"
[5,] "GLDS318" "GLDS-318" "01-Dec-2019" "RNA seq of tumors derived from irradiated versus sham hosts transplanted with Trp53 null mammary ti" [truncated]
[6,] "GLDS317" "GLDS-317" "19-Dec-2017" "Galactic cosmic radiation induces stable epigenome alterations relevant to human lung cancer"
[7,] "GLDS311" "GLDS-311" "31-Jul-2020" "Part two: ISS Enterobacteriales"
[8,] "GLDS309" "GLDS-309" "12-Aug-2020" "Comparative Genomic Analysis of Klebsiella Exposed to Various Space Conditions at the International" [truncated]
[9,] "GLDS308" "GLDS-308" "07-Aug-2020" "Differential expression profiles of long non-coding RNAs during the mouse pronucleus stage under no" [truncated]
[10,] "GLDS305" "GLDS-305" "27-Aug-2020" "Transcriptomic responses of Serratia liquefaciens cells grown under simulated Martian conditions of" [truncated]
[11,] "GLDS304" "GLDS-304" "28-Aug-2020" "Global gene expression in response to X rays in mice deficient in Parp1"
[12,] "GLDS303" "GLDS-303" "15-Jun-2020" "ISS Bacillus Genomes"
[13,] "GLDS302" "GLDS-302" "31-May-2020" "ISS Enterobacteriales Genomes"
[14,] "GLDS301" "GLDS-301" "30-Apr-2020" "Eruca sativa Rocket Science RNA-seq"
[15,] "GLDS298" "GLDS-298" "09-May-2020" "Draft Genome Sequences of Sphingomonas sp. Isolated from the International Space Station Genome seq" [truncated]
...........................................................................
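If all you need are the 281 release dates themselves, you can pull them straight out of the parsed list without building the full data frame. This is a minimal sketch, assuming each element of the parsed JSON keeps the metadata$public_release_date field shown in the output above:
library(httr)

url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
output <- content(GET(url), as = "parsed", type = "application/json")

# One release date per study (field name taken from the output above);
# swap vapply() for sapply() if some studies happen to lack the field
release_dates <- vapply(
  output,
  function(study) study$metadata$public_release_date,
  character(1)
)
head(release_dates)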

Related

Issue scraping website with reactive blocks

I am trying to extract the business name/address information from a website listing clinic locations. The locations that are displayed depend upon the search parameters in the Google Maps widget. My goal is to collect information about all of them in the US, so I zoomed out and tried the following in both Map & List View:
List View:
fyz <- read_html("https://www.fyzical.com/Locations")
loc_text <- fyz %>%
  html_nodes("div.psl-text-content") %>%
  html_text()
loc_text
# character(0)
And then in Map View:
loc <- fyz %>%
  html_nodes("script") %>%
  .[str_detect(., "maps\\.google")] %>%
  str_extract_all("\".*maps\\.google.*\"")
# Warning message:
# In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
#   argument is not an atomic vector; coercing
loc
# list()
Both came up empty. Using SelectorGadget to paste in the XPath produced the same results. I'm relatively new to this, so any help/insight would be greatly appreciated!
If you want to be able to zoom interactively, you'll probably need to use RSelenium. Here's how I did it. First, use this to navigate to the website. You should see the address bar with slanted, alternating light and dark red stripes in it.
library(RSelenium)
remDr <- rsDriver(browser='firefox', phantomver=NULL)
brow <- remDr[["client"]]
brow$open()
brow$navigate("https://www.fyzical.com/Locations")
Go to the browser window that now has the fyzical website loaded, input a zip code, and zoom the map out to where you want it. Following that, do this:
library(rvest)
h <- read_html(brow$getPageSource()[[1]])
addresses <- h %>% html_elements(css=".psl-text-address") %>% html_text()
head(addresses)
# [1] "6415 Kenai Spur Hwy, Kenai, AK, 99611" "650 N Shoreline Dr, Wasilla, AK, 99654"
# [3] "832 Princeton Ave SW, Birmingham, AL, 35211" "602 Corley Avenue, Boaz, AL, 35957"
# [5] "1218 13th Avenue SE, Decatur, AL, 35601" "101 Hwy 80 West, Demopolis, AL, 36732"
You'll see that the address list should have 445 entries. I've printed the first six here.
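When you are done, it is worth shutting the Selenium session down so the browser and the port it uses are released. A small follow-up sketch, using the remDr and brow objects from the code above:
# Close the automated browser window and stop the Selenium server started by rsDriver()
brow$close()
remDr[["server"]]$stop()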

Parse Google Scholar search results scraped with rvest

I am trying to use rvest to scrape one page of Google Scholar search results into a dataframe of author, paper title, year, and journal title.
The simplified, reproducible example below is code that searches Google Scholar for the example terms "apex predator conservation".
Note: to stay within the Terms of Service, I only want to process the first page of search results that I would get from a manual search. I am not asking about automation to scrape additional pages.
The following code already works to extract:
author
paper title
year
but it does not have:
journal title
I would like to extract the journal title and add it to the output.
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(jsonlite)
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, stringsAsFactors = FALSE)
df
source: https://stackoverflow.com/a/58192323/8742237
So the output of that code looks like this:
#> titles
#> 1 [HTML][HTML] Saving large carnivores, but losing the apex predator?
#> 2 Site fidelity and sex-specific migration in a mobile apex predator: implications for conservation and ecosystem dynamics
#> 3 Effects of tourism-related provisioning on the trophic signatures and movement patterns of an apex predator, the Caribbean reef shark
#> authors years
#> 1 A Ordiz, R Bischof, JE Swenson 2013
#> 2 A Barnett, KG Abrantes, JD Stevens, JM Semmens 2011
Two questions:
How can I add a column that has the journal title extracted from the raw data?
Is there a reference where I can read and learn more about how to work out how to extract other fields for myself, so I don't have to ask here?
One way to add them is this:
library(rvest)
library(xml2)
library(selectr)
library(stringr)
library(purrr)
library(jsonlite)
url_name <- 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C38&q=apex+predator+conservation&btnG=&oq=apex+predator+c'
wp <- xml2::read_html(url_name)
# Extract raw data
titles <- rvest::html_text(rvest::html_nodes(wp, '.gs_rt'))
authors_years <- rvest::html_text(rvest::html_nodes(wp, '.gs_a'))
# Process data
authors <- gsub('^(.*?)\\W+-\\W+.*', '\\1', authors_years, perl = TRUE)
years <- gsub('^.*(\\d{4}).*', '\\1', authors_years, perl = TRUE)
leftovers <- authors_years %>%
  str_remove_all(authors) %>%
  str_remove_all(years)

journals <- str_split(leftovers, "-") %>%
  map_chr(2) %>%
  str_extract_all("[:alpha:]*") %>%
  map(function(x) x[x != ""]) %>%
  map(~paste(., collapse = " ")) %>%
  unlist()
# Make data frame
df <- data.frame(titles = titles, authors = authors, years = years, journals = journals, stringsAsFactors = FALSE)
For your second question: the SelectorGadget Chrome extension is nice for finding the CSS selectors of the elements you want. But in your case all of these fields share the same CSS class (.gs_a), so the only way to disentangle them is with regex. So I guess learn a bit about CSS selectors and regex :)
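As an aside, if the purrr pipeline feels heavy, you can often get the journal with a couple of sub() calls on the same .gs_a text. A rough sketch, assuming the usual "authors - journal, year - publisher" layout of that line (which Google does change from time to time, and entries without a journal will need extra handling):
# Take the piece between the first and last " - ", then drop the trailing ", <year>"
journals_alt <- sub("^.*? - (.*) - .*$", "\\1", authors_years, perl = TRUE)
journals_alt <- sub(",?\\s*\\d{4}\\s*$", "", journals_alt)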

Scrape multiple URLs with rvest

How can I scrape multiple URLs with read_html() in rvest? The goal is to obtain a single document consisting of the text bodies of the respective URLs, on which to run various analyses.
I tried to concatenate the URLs:
url <- c("https://www.vox.com/","https://www.cnn.com/")
page <-read_html(url)
page
story <- page %>%
html_nodes("p") %>%
html_text
After read_html() I get an error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=3].
I'm not surprised, since read_html() probably only handles one path at a time. However, can I use a different function or transformation so that several pages can be scraped at once?
You can use purrr::map() (or, in base R, lapply()) to loop over every url element; here is an example (a base R version using lapply() follows below):
url <- c("https://www.vox.com/", "https://www.bbc.com/")
page <-map(url, ~read_html(.x) %>% html_nodes("p") %>% html_text())
str(page)
#List of 2
# $ : chr [1:22] "But he was acquitted on the two most serious charges he faced." "Health experts say it’s time to prepare for worldwide spread on all continents." "Wall Street is waking up to the threat of coronavirus as fears about the disease and its potential global econo"| __truncated__ "Johnson, who died Monday at age 101, did groundbreaking work in helping return astronauts safely to Earth." ...
# $ : chr [1:19] "" "\n The ex-movie mogul is handcuffed and led from cou"| __truncated__ "" "27°C" ...
The return object is a list.
PS. I've changed the second url element because "https://www.cnn.com/" returned NULL for html_nodes("p") %>% html_text().
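For completeness, the same thing in base R with lapply(), as mentioned above (a minimal sketch):
library(rvest)

url <- c("https://www.vox.com/", "https://www.bbc.com/")
page <- lapply(url, function(u) read_html(u) %>% html_nodes("p") %>% html_text())
str(page)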

How to follow links to the next page of results using RCurl?

I am trying to scrape the results table from the NCBI ClinVar website (http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1) using RCurl. I am able to do this and put it into a nice dataframe using the code:
library(RCurl)
library(XML)

clinVar <- getURL("http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1")
docForm2 <- htmlTreeParse(clinVar, useInternalNodes = TRUE)
xp_expr <- "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes <- getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info", "Gene", "Variation", "Freq", "Phenotype",
                             "Clinical significance", "Status", "Chr", "Location")
However, I can only extract the data on the first page, and the table spans multiple pages. How do you access data on the next page? I have looked at the HTML code for the website and the region that the "Next" button exists in is here (I believe!):
<a name="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page">Next ></a>
I would like to know how to access this link using getURL, postForm etc. I think I should be doing something like this to get data from the second page, but it's still just giving me the first page:
url <- "http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1"
clinVar <- postForm(url,
"EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.cPage" ="2")
docForm2 <- htmlTreeParse(clinVar,useInternalNodes = T)
xp_expr = "//table[#class= 'jig-ncbigrid docsum_table\']/tbody/tr"
nodes = getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info","Gene", "Variation","Freq","Phenotype","Clinical significance","Status", "Chr","Location")
Thanks to anyone who can help.
I would use E-utilities to access data at NCBI instead.
url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=brca1"
readLines(url)
[1] "<?xml version=\"1.0\" ?>"
[2] "<!DOCTYPE eSearchResult PUBLIC \"-//NLM//DTD eSearchResult, 11 May 2002//EN\" \"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd\">"
[3] "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_36649974_130.14.18.34_9001_1386348760_356908530</WebEnv><IdList>"
Pass the QueryKey and WebEnv to esummary and get the XML summary (these change with each esearch, so copy and paste the new keys into the URL below):
library(XML)
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&query_key=1&WebEnv=NCID_1_36649974_130.14.18.34_9001_1386348760_356908530"
brca1 <- xmlParse(url2)
Next, view a single record and then extract the fields you need. You may need to loop through the set for tags that can have zero to many values; others, like the clinical significance description, always have exactly one value.
getNodeSet(brca1, "//DocumentSummary")[[1]]
table(xpathSApply(brca1, "//clinical_significance/description", xmlValue) )
Benign conflicting data from submitters not provided other
129 22 6 1
Pathogenic probably not pathogenic probably pathogenic risk factor
508 68 19 43
Uncertain significance
284
Also, there are many packages that wrap the E-utilities on GitHub and Bioconductor (rentrez, reutils, genomes and others). Using the genomes package from Bioconductor, this simplifies to:
brca1 <- esummary( esearch("brca1", db="clinvar"), parse=FALSE )
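For comparison, a rough sketch of the same search with the rentrez package mentioned above (assuming its current entrez_search()/entrez_summary() interface):
library(rentrez)

# Search ClinVar for BRCA1 and keep the result set on the NCBI history server
res <- entrez_search(db = "clinvar", term = "BRCA1[gene]", use_history = TRUE)

# Fetch document summaries for the whole result set via the web history
summ <- entrez_summary(db = "clinvar", web_history = res$web_history)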
This approach also uses the E-utilities interface to the NCBI databases; see http://www.ncbi.nlm.nih.gov/books/NBK25500/ for more details.
## use the eSearch feature in E-utilities to search NCBI for ids corresponding to each row of data
## note: to see all ids, not just the top results, set retmax to a high number
## to get the query key and web env info, set usehistory=y
library(RCurl)
library(XML)
baseSearch <- ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=") ## eSearch
db <- "clinvar" ## database to query
gene <- "BRCA1" ## gene of interest
query <- paste('[gene]+AND+"','clinsig pathogenic"','[Properties]+AND+"','single nucleotide variant"','[Type of variation]&usehistory=y&retmax=1110',sep="") ## query, see below for details
baseFetch <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=" ## base fetch
searchURL <- paste(baseSearch, db, "&term=", gene, query, sep = "")
getSearch <- getURL(searchURL)
searchHTML <- htmlTreeParse(getSearch, useInternalNodes = TRUE)
nodes <- getNodeSet(searchHTML, "//querykey") ## this name "querykey" was extracted from the HTML source code for this page
querykey <- xmlToDataFrame(nodes)
nodes <- getNodeSet(searchHTML, "//webenv") ## this name "webenv" was extracted from the HTML source code for this page
webenv <- xmlToDataFrame(nodes)
fetchURL <- paste(baseFetch, db, "&query_key=", querykey, "&WebEnv=", webenv[[1]], "&rettype=docsum", sep = "")
getFetch <- getURL(fetchURL)
fetchHTML <- htmlTreeParse(getFetch, useInternalNodes = TRUE)
nodes <- getNodeSet(fetchHTML, "//position")
extractedDataAll <- xmlToDataFrame(nodes)
colnames(extractedDataAll) <- c("pathogenicSNPs")
print(extractedDataAll)
Please note, I found the query information by going to http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1, selecting my filters (pathogenic, etc.) and then clicking the Advanced button. The most recently applied filters should come up in the main box; I used this for the query.
ClinVar now offers XML download of the whole database, so web scraping is not necessary.

How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.
I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to parse HTML with regex, so you will definitely want to parse it with the XML package.
Here's an example to get you started:
require(RCurl)
require(XML)
webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
This results in a character vector of mostly just webpage text (along with some javascript):
> head(x)
[1] "Subscribe to Print Edition" "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"
[4] "  Make Haaretz your homepage" "/*check the search form*/" "function chkSearch()"
Your best bet may be the XML package; see for example this previous question.
I know you asked for R, but maybe Python + BeautifulSoup is the way forward here? Then do your analysis with R once you have scraped the screen with BeautifulSoup.