Avoid getting "glued" words with R webscraping

Avoid getting "glued" words with R webscraping - html

When I use both of the two following blocks of code code, I get "glued" words, and by that i mean words that are not not separated by a space but they should, and this is a problem. In the original HTML, it seem like they're separated by a <b> and i'm not beeing able to handle this. The two blocks do the same thing by different ways.
library(XML)
library(RCurl)
# Block 1---------
url <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
u <- readLines(url)
h <- htmlTreeParse(file=u,
asText=TRUE,
useInternalNodes = TRUE,
encoding = "utf-8")
song <- getNodeSet(doc=h, path="//article", fun=xmlValue)
# Block 2---------
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
h <- htmlParse(getURL(u))
song <- xpathSApply(h, path = "//article", fun = xmlValue)
Which returns something like:
[1] "Sometimes I feelLike I don't have a partnerSometimes I feelLike my only friendIs the city I live inThe city of angelsLonely as I amTogether we cryI drive on her streets'Cause she's my companionI walk through her hills'Cause she knows who I amShe sees my good deedsAnd she kisses me windyI never worryNow that is a lieI don't ever wanna feelLike I did that dayBut take me to the place I loveTake me all the wayIt's hard to believeThat there's nobody out thereIt's hard to believeThat I'm all aloneAt...

I was able to retrieve words with the following code :
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
remDr$screenshot(display = TRUE, useViewer = TRUE)
page_Content <- remDr$getPageSource()[[1]]
list_Text_Song <- list()
for(i in 1 : 30)
{
print(i)
web_Obj <- tryCatch(remDr$findElement("xpath", paste0("//*[#id='js-lyric-cnt']/article/div[2]/div[2]/p[", i, "]")), error = function(e) NA)
list_Text_Song[[i]] <- tryCatch(web_Obj$getElementText(), error = function(e) NA)
}
list_Text_Song <- unlist(list_Text_Song)
list_Text_Song <- list_Text_Song[!is.na(list_Text_Song)]
The words are not glued with this approach.

Related

Obtaining data from NCBI gene database with R

Rentrez package
I was discovering rentrez package in RStudio (Version 1.1.442) on a lab computer in Linux (Ubuntu 20.04.2) according to this manual.
However, later when I wanted to run the same code on my laptop in Windows 8 Pro (RStudio 2021.09.0 )
library (rentrez)
entrez_dbs()
entrez_db_searchable("gene")
#res <- entrez_search (db = "gene", term = "(Vibrio[Organism] OR vibrio[All Fields]) AND (16s[All Fields]) AND (rna[All Fields]) AND (owensii[All Fields] OR navarrensis[All Fields])", retmax = 500, use_history = TRUE)
I can not get rid of this error, even after closing the session or reinstalling rentrez package
Error in curl::curl_fetch_memory(url, handle = handle) : schannel:
next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE
(0x80090326) - This error usually occurs when a fatal SSL/TLS alert is
received (e.g. handshake failed).
This is the main problem that I faced.
RSelenium package
Later I decided to address pages containing details about the genes and their sequences in FASTA format modifying a code that I have previously used. It uses rvest and rselenium packages and the results were perfect.
# Specifying a webpage
url <- "https://www.ncbi.nlm.nih.gov/gene/66940694" # the last 9 numbers is gene id
library(rvest)
library(RSelenium)
# Opening a browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(url)
# Clicked outside in an empty space next to the FASTA button and copied a full xPath (redirecting to a FASTA data containing webpage)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of a webpage: left it from the old code for the case of a long gene
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
# Let's get gene FASTA, for example
page <- read_html(remDr$getPageSource()[[1]])
fasta <- page %>%
html_nodes('pre') %>%
html_text()
print(fasta)
Output: ">NZ_QKKR01000022.1:c3037-151 Vibrio paracholerae strain
2016V-1111 2016V-1111_ori_contig_18, whole genome shotgun
sequence\nGGT...
The code worked well to obtain other details about the gene like its accession number, position, organism and etc.
Looping of the process for several gene IDs
Later I tried to change the code to get simultaneously the same information for several gene IDs following the explanations I got here for the other project of mine.
# Specifying a list of gene IDs
res_id <- c('57838769','61919208','66940694')
dt <- res_id # <lapply> looping function refused to work if an argument had a different name rather than <dt>
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
## Writing a function of GET_FASTA dependent on GENE_ID (x)
get_fasta <- function(x){
link = paste0('https://www.ncbi.nlm.nih.gov/gene/',x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
... there is a continuation below but an error was appearing here, saying that the same xPath, which was successfully used before, can not be found.
Error: Summary: NoSuchElement Detail: An element could not be located
on the page using the given search parameters. class:
org.openqa.selenium.NoSuchElementException Further Details: run
errorDetails method
I tried to delete /a[2] to get /html/.../p at the end of the xPath as it was working in the initial code, but an error was appearing later again.
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of FASTA on the website
fasta <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('pre') %>%
html_text()
fasta
return(fasta)
}
## Writing a function of GET_ACC_NUM dependent on GENE_ID (x)
get_acc_num <- function(x){
link = paste0( 'https://www.ncbi.nlm.nih.gov/gene/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p')$clickElement()
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of ACC_NUM on the website
acc_num <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.itemid') %>%
html_text() %>%
str_sub(start= -17)
acc_num
return(acc_num)
}
## Collecting all FUNCTION into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
fasta <- get_fasta(x)
acc_num <- get_acc_num(x)
# Combine into a tibble
combined_data <- tibble( Acc_Number = acc_num,
FASTA = fasta)
}
## Running FUNCTION for all x
df <- lapply(dt, get_data_table)
head(df)
I also tried to write the code
only with rvest,
to write the loop with for (i in res_id) {},
to introduce two different xPaths ending with /html/.../p/a[2] or .../p using if () {} else {}
but the results were even more confusing.
I am studying R coding while working on such tasks, so any suggestions and critics are welcome.

The node pre is not a valid one. We have to look for value inside class or 'id` etc.
webElem$sendKeysToElement(list(key = "end") you don't need this command as there is no necessity yo scroll the page.
Below is code to get you the sequence of genes.
First we have to get the links to sequence of genes which we do it by rvest
library(rvest)
library(dplyr)
res_id <- c('57838769','61919208','66940694')
link = vector()
for(i in res_id){
url = paste0('https://www.ncbi.nlm.nih.gov/gene/', i)
df = url %>%
read_html() %>%
html_node('.note-link')
link1 = xml_attrs(xml_child(df, 3))[["href"]]
link1 = paste0('https://www.ncbi.nlm.nih.gov', link1)
link = rbind(link, link1)
}
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_VARQ01000103.1?report=fasta&from=64&to=2616&strand=true"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true"
After obtaining the links we shall get the sequence of genes which we do it by RSelenium. I tried to do it with rvest but couldn't get the sequence.
Launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
Function to get the sequence
get_seq = function(link){
remDr$navigate(link)
Sys.sleep(5)
df = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes(xpath = '//*[#id="viewercontent1"]') %>%
html_text()
return(df)
}
df = lapply(link, get_seq)
Now we have list df with all the info.

Error in eval_tidy (xs [[j]], mask): object 'titles' not found

When I try this code, I get an error. I want to scrape the data from multiple pages. But when I try the script down below, I get an error that says:
Error in eval_tidy (xs [[j]], mask): object 'titles' not found
library(tidyverse)
library(rvest)
# function to scrape all elements also missing elements
scrape_css <- function(css, group, html_page) {
txt <- html_page %>%
html_nodes(group) %>%
lapply(
. %>%
html_nodes(css) %>%
html_text() %>%
ifelse(identical(., character(0)), NA, .)
) %>%
unlist()
return(txt)
}
# Get all elements from 1 page
get_one_page <- function(url) {
html <- read_html(url)
titles <- scrape_css(
".recipe-card_title__1oIb-",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
minutes <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(1)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
callories <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(2)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
}
return(tibble(titles = titles, minutes = minutes, callories = callories))
url <- ("https://www.ah.nl/allerhande/recepten/R-L1473207825981/suikerbewust")
appie <- get_one_page(url)

Two things:
return should be inside the curly braces at the end of the function
return(tibble(titles = titles, minutes = minutes, callories = callories))
}
You forgot to update the variable name such that you have
html <- read_html(url)
when you need
html_page <- read_html(url)

This does not per se answer #Laila's question, but I ran into the
Error in eval_tidy(xs[[j]], mask) : object '' not found
error, when I did sth similar to
tibble(
a = 1,
b = ,
c = NA
)
So basically, forgot to enter a value for a tibble column. But it is a rather uninformative error message. I am leaving this for the future me (and others) as a reference, as there are not many posts related to this error.

Data from page2 same as from page1 when scraping

I am trying to scrape all event links from https://www.tapology.com/fightcenter.
Have already quite some experience in webscraping using R but in this case I am stuck.
I am able to scrape from page 1, however when I input a second page as URL, I still obtain data from first page as if the page is being redirected back automatically.
I have tried various codes found here on the forum, still, something is wrong.
First page
url = "https://www.tapology.com/fightcenter"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched = as.data.frame(matched[[1]], stringsAsFactors = F)
Second page
url = 'https://www.tapology.com/fightcenter_events?page=2'
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched = as.data.frame(matched[[1]], stringsAsFactors = F)
Results are identical.
Could you please help me to solve this?
Thank you

Content is added dynamically via xhr. You can use httr (as mentioned in other answer) and add your headers. You also need to alter the page param that goes in the url during a loop/sequence. An example is shown below of a single request for a different page is shown (I just extract the fight links of person 1 v person 2 to show it is reading from that page). You could alter this to be a function returning info of interest in your loop or perhaps use purrr to map info across to an existing structure.
require(httr)
require(rvest)
require(magrittr)
require(stringr)
headers = c(
'User-Agent' = 'Mozilla/5.0',
'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
'X-Requested-With' = 'XMLHttpRequest'
)
params = list(
'page' = '2'
)
r <- httr::GET(url = 'https://www.tapology.com/fightcenter_events', httr::add_headers(.headers=headers), query = params)
x <- str_match_all(content(r,as="text") ,'html\\("(.*>)')
y <- gsub('"',"'",gsub('\\\\','', x[[1]][,2]))
z <- read_html(y) %>% html_nodes(., ".billing a") %>% html_attr(., "href")

You're getting redirected back, because the website checking the headers which are you sending. For getting a correct data, you need to have this headers set:
Accept: text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
Also, this request doesn't return the HTML of the webpage, but jQuery code, which updating the list on the website dynamically.

I have been able to extract the first three links of the first three pages with the following code :
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.tapology.com/fightcenter")
list_Matched <- list()
# We get results from first 3 pages
for(i in 1 : 3)
{
print(i)
if(i != 1)
{
# Press on next button
web_Elem_Link <- remDr$findElement("class name", "next")
web_Elem_Link$clickElement()
}
list_Link_Page <- list()
Sys.sleep(3)
# Get the first three links of the page ...
for(j in 1 : 3)
{
web_Elem_Link <- tryCatch(remDr$findElement("xpath", paste0('//*[#id="content"]/div[4]/section[', j, ']/div/div[1]/div[1]/span[1]/a')),
error = function(e) NA)
if(is.na(web_Elem_Link))
{
web_Elem_Link <- remDr$findElement("xpath", paste0('//*[#id="content"]/div[3]/section[', j, ']/div/div[1]/div[1]/span[1]/a'))
}
web_Elem_Link$clickElement()
Sys.sleep(3)
list_Link_Page[[j]] <- remDr$getCurrentUrl()
remDr$goBack()
Sys.sleep(3)
}
list_Matched[[i]] <- list_Link_Page
}

R - Issue with the DOM of the danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about their democratic process and they are uploading all the legislative documents on their website. I've been crawling over all pages starting 2008. Right now I'm parsing the information into a dataframe and I'm having an issue that I was not able to resolve so far.
If we look at the DOM we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16-19. To parse the information correctly for my dataframe I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my pattern match more than once and I don't know how to tell R that I only want the first match.
for the sake of an example I include some code:
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique (grep(paste(tomatch, collapse="|"), results, value = TRUE))
Maybe you can help me with that

My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" are related to the text. I was able to get the text of the webpage with the following code. Also, the following code identifies the position of the first "regex hit" of the different patterns to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
"C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}
Here is another approach :
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}

How to read a <li> table in a webpage

I debug the program many times to get the result as follows:
url 研究所知识库列表
/handle/1471x/1 力学研究所
/handle/1471x/8865 半导体研究所
However, no metter what parameters I use, the result is not correct. The content in this table is one part of the basis of my further analysis, and I am very trembled for it. I'm looking forward to your help with great sincerity.
## download community-list ---the 1st level of IR Grid
#loading webpage and analyzing
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
# get table specs
tableNodes <- getNodeSet(com_parsed, "//table")
com_tb<-readHTMLTable(tableNodes[[8]], header=TRUE)
# get External links
xpath <- "//a/#href"
getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)

it is unclear exactly what you want your end result to look like but if you modify your xpath statements a bit to take advantage of the DOM structure you can get something like this:
library(XML)
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
hrefs <- xpathSApply(com_parsed, '//li[#class="communityLink"]//#href', function(x) unname(x))
display_text <- xpathSApply(com_parsed, '//li[#class="communityLink"]//a', xmlValue)
table_data <- cbind(display_text, hrefs)
colnames(table_data) <- c(list_header, "url")
table_data
console output causes stackoverflow to think this answer is spam but here is a screen shot:

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Avoid getting "glued" words with R webscraping - html

Related

Obtaining data from NCBI gene database with R

Error in eval_tidy (xs [[j]], mask): object 'titles' not found

Data from page2 same as from page1 when scraping

R - Issue with the DOM of the danish parliament (webscraping)

How to read a <li> table in a webpage

Categories

Resources