Extracting table from website using rvest in R - html

I am trying to extract the link "s3://sra-pub-src-5/SRR11393390/Run1_Plate3_S3_L004_R2_001.fastq.gz.1" from this page: "https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR11393390&display=data-access". I am using the rvest package in R, but I am not sure how to extract information from that table.
This is my R code:
library(rvest)
url <- "https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR11393390&display=data-access"
html_content <- read_html(url)
html_content %>% html_nodes("body") %>% html_nodes("#ph-maincontent")
Please help
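The run-browser page builds that table with JavaScript after the page loads, so read_html() on the static HTML never sees the S3 link. A minimal sketch of the usual workaround, assuming the data can be fetched as JSON from a backend endpoint spotted in the browser's network tab (the endpoint URL below is an assumption, not a verified API):

```r
library(httr)
library(jsonlite)

# The table is filled in by JavaScript, so the static HTML is empty.
# Open the browser dev tools, reload the page, and copy the JSON request
# it makes; the URL below is only a guessed placeholder for that request.
json_url <- "https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/run_new?acc=SRR11393390"
resp <- GET(json_url)
data <- parse_json(content(resp, "text", encoding = "UTF-8"))
# Inspect str(data) to locate the s3:// links inside the returned list.
```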

Related

Web-scraping using R: I want to extract some table-like data from a website

I'm having some problems scraping data from a website. I don't have a lot of experience with web scraping. My intended plan is to scrape some data using R from the following website: https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands
More precisely, I want to extract the brands on the right-hand side.
My idea so far:
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_nodes(xpath = '/html/body/div[1]/div/div[2]/div[2]/div[2]/div[4]/div/div/div[3]/div/div[1]/div') %>%
  html_text()
But this doesn't bring up the intended information. Some help would be really appreciated here! Thanks!
That data is dynamically pulled from a script tag. You can pull the content of that script tag, parse it as JSON, subset just the items of interest from the returned list, and then extract the brand names:
library(rvest)
library(jsonlite)
library(stringr)
data <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
  html_node('#__NEXT_DATA__') %>% html_text() %>%
  jsonlite::parse_json()
data <- data$props$pageProps$apolloState
mask <- str_detect(names(data), '^Brand:')  # str_detect is vectorised, so no purrr::map needed
data <- subset(data, mask)
brands <- lapply(data, function(x) x$name)
I find the above easier to read but you could try other methods such as
library(rvest)
library(jsonlite)
library(stringr)
brands <- read_html('https://www.shipserv.com/supplier/profile/s/w-w-grainger-inc-59787/brands') %>%
html_node('#__NEXT_DATA__') %>% html_text() %>%
jsonlite::parse_json() %>%
{.$props$pageProps$apolloState} %>%
subset(., {str_detect(names(.), 'Brand:')}) %>%
lapply(. , function(x){x$name})
Using {} to have the call treated as an expression rather than a function call is something I read in a comment by @asachet
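The brace trick can be seen in isolation with a toy pipe: without braces the pipe inserts the left-hand side as the first argument, while inside {} the right-hand side is evaluated as an expression with `.` bound to the left-hand side.

```r
library(magrittr)

# braces make the RHS an expression; `.` is the piped-in value
list(a = 1, b = 2) %>% {.$a}   # returns 1
```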

R programming, webscraping- I can not get a link from html

I am trying to get this html, but there is no class name.
How could I do that?
That is my code in R:
link <- read_html('https://www.coffeedesk.pl/kawa/')
hrefs <- link %>% html_nodes('div.product-title a') %>%
  html_attr('href')
link_www <- paste0("https://www.coffeedesk.pl", hrefs)
I prefer using XPath to parse HTML:
library(magrittr)
url <-"https://www.coffeedesk.pl/kawa/"
x <- xml2::read_html(url)
links <- xml2::xml_find_all(x, "//div[@class = 'product-title']//a") %>%
  xml2::xml_attr("href") %>%
  paste0("https://www.coffeedesk.pl", .)
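For comparison, a CSS-selector version of the same extraction, assuming the div.product-title structure used in the XPath answer; the fix to the original attempt is simply to paste the extracted hrefs rather than the document object:

```r
library(rvest)

page <- read_html("https://www.coffeedesk.pl/kawa/")
hrefs <- page %>%
  html_nodes("div.product-title a") %>%   # same nodes as the XPath version
  html_attr("href")
link_www <- paste0("https://www.coffeedesk.pl", hrefs)
```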

Scraping web table using R and rvest

I'm new to web scraping with R.
I'm trying to scrape the table generated by this link:
https://gd.eppo.int/search?k=saperda+tridentata.
In this specific case, it's just one record in the table but it could be more (I am actually interested in the first column but the whole table is ok).
I tried to follow the suggestion by Allan Cameron given here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, maybe because of my limited knowledge of how webpages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page".
Where can I get this link? In this specific case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the right one?
Below you have my code.
Thank you in advance!
library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
library(magrittr)  # for extract()
pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text")
json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8")
table_contents <- JSON %>%
{gsub("\\\\n", "\n", .)} %>%
{gsub("\\\\/", "/", .)} %>%
{gsub("\\\\\"", "\"", .)} %>%
strsplit("html\":\"") %>%
unlist %>%
extract(2) %>%
substr(1, nchar(.) -2) %>%
paste0("</tbody>")
new_page <- gsub("</tbody>", table_contents, resp)
read_html(new_page) %>%
html_nodes("table") %>%
html_table()
The data comes from another endpoint, which you can see in the network tab when refreshing the page. You can send a request with your search phrase in the params and then extract the JSON you need from the response.
library(httr)
library(jsonlite)
library(rvest)
params <- list('k' = 'saperda tridentata', 's' = 1, 'm' = 1, 't' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>% html_text())
print(data[[1]]$e)
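If the search returns more than one record, the same response can be flattened into a vector; this sketch assumes every element of `data` carries the same `e` field as the first one printed above (other field names are unknown, so inspect str(data)):

```r
# pull the `e` field out of every list element;
# the field name is taken from the answer, other fields are assumptions
codes <- vapply(data, function(x) x$e, character(1))
```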

How can I extract selective data from a webpage using rvest?

I have been trying to display the review rating of this album from Pitchfork (https://pitchfork.com/reviews/albums/us-girls-heavy-light/) using rvest in R. In this case, it is 8.5. But somehow my code returns the text of every span on the page instead of just the rating.
Here is my code
library(rvest)
library(dplyr)
library(RCurl)
library(tidyverse)
URL="https://pitchfork.com/reviews/albums/us-girls-heavy-light/"
webpage = read_html(URL)
cat("Review Rating")
webpage%>%
html_nodes("div span")%>%
html_text
We can get the relevant information from the div whose class is "score-circle".
library(rvest)
webpage %>% html_nodes('div.score-circle') %>% html_text() %>% as.numeric()
#[1] 8.5
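The one-liner generalizes naturally to other reviews; a small helper, assuming Pitchfork renders the score in div.score-circle on every review page:

```r
library(rvest)

# fetch a review page and return its numeric score;
# assumes the score still lives in div.score-circle
get_score <- function(url) {
  read_html(url) %>%
    html_node("div.score-circle") %>%
    html_text() %>%
    as.numeric()
}

get_score("https://pitchfork.com/reviews/albums/us-girls-heavy-light/")
```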

R extracting structured data between multiple html tags

I have downloaded my facebook data. It contains an .htm file with all my contacts. I would like to read it into R and create a contact.csv.
The usual structure is:
<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: email@email.com</li><li>contact: +123456789</li></ul></span></td></tr>
but some contacts may be missing the phone number
<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: email@email.com</li></ul></span></td></tr>
while some are missing the email
<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>
The csv should have the structure Firstname Lastname; email; tel number
I have tried:
library(rvest)
library(stringr)
html <- read_html("contact_info.htm")
p_nodes <- html %>% html_nodes('tr')
p_nodes_text <- p_nodes %>% html_text()
write.csv(p_nodes_text, "contact.csv")
This creates the csv, but unfortunately it merges the names with the "contact:" entries, does not create separate columns, and does not insert NA for missing phone numbers or emails.
How could I enhance my code to accomplish this?
Thanks
You can use regular expressions to identify the email and the telephone number:
xml1 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: email@email.com</li><li>contact: +123456789</li></ul></span></td></tr>'
xml2 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: email@email.com</li></ul></span></td></tr>'
xml3 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>'
docs <- c(xml1,xml2,xml3)
library(rvest)
df <- NULL
for (doc in docs) {
  page <- read_html(doc)
  name <- page %>% html_nodes("tr td:first-child") %>% html_text()
  meta <- page %>% html_nodes("span.meta li") %>% html_text()
  # an email is anything containing "@" with a dot somewhere after it
  ind_mail <- grep(".+@.+\\..+", meta)
  if (length(ind_mail) > 0) mail <- meta[ind_mail] else mail <- "UNKWN"
  # a phone number ends in a run of at least six digits
  ind_tel <- grep("[0-9]{6,}$", meta)
  if (length(ind_tel) > 0) tel <- meta[ind_tel] else tel <- "UNKWN"
  res <- cbind(name, mail, tel)
  df <- rbind(df, res)
}
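The loop above leaves `df` as a character matrix. A short follow-up, assuming that same `df`, writes out the semicolon-separated file the question asks for:

```r
# write the name/mail/tel matrix as a semicolon-separated contact.csv,
# matching the requested structure (Firstname Lastname; email; tel number)
write.table(df, "contact.csv", sep = ";", row.names = FALSE, quote = FALSE)
```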
Hope that helps,
Gottavianoni