Obtaining data from NCBI gene database with R - html

Rentrez package
I was discovering rentrez package in RStudio (Version 1.1.442) on a lab computer in Linux (Ubuntu 20.04.2) according to this manual.
However, later when I wanted to run the same code on my laptop in Windows 8 Pro (RStudio 2021.09.0 )
library (rentrez)
entrez_dbs()
entrez_db_searchable("gene")
#res <- entrez_search (db = "gene", term = "(Vibrio[Organism] OR vibrio[All Fields]) AND (16s[All Fields]) AND (rna[All Fields]) AND (owensii[All Fields] OR navarrensis[All Fields])", retmax = 500, use_history = TRUE)
I can not get rid of this error, even after closing the session or reinstalling rentrez package
Error in curl::curl_fetch_memory(url, handle = handle) : schannel:
next InitializeSecurityContext failed: SEC_E_ILLEGAL_MESSAGE
(0x80090326) - This error usually occurs when a fatal SSL/TLS alert is
received (e.g. handshake failed).
This is the main problem that I faced.
RSelenium package
Later I decided to address pages containing details about the genes and their sequences in FASTA format modifying a code that I have previously used. It uses rvest and rselenium packages and the results were perfect.
# Specifying a webpage
url <- "https://www.ncbi.nlm.nih.gov/gene/66940694" # the last 9 numbers is gene id
library(rvest)
library(RSelenium)
# Opening a browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(url)
# Clicked outside in an empty space next to the FASTA button and copied a full xPath (redirecting to a FASTA data containing webpage)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of a webpage: left it from the old code for the case of a long gene
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
# Let's get gene FASTA, for example
page <- read_html(remDr$getPageSource()[[1]])
fasta <- page %>%
html_nodes('pre') %>%
html_text()
print(fasta)
Output: ">NZ_QKKR01000022.1:c3037-151 Vibrio paracholerae strain
2016V-1111 2016V-1111_ori_contig_18, whole genome shotgun
sequence\nGGT...
The code worked well to obtain other details about the gene like its accession number, position, organism and etc.
Looping of the process for several gene IDs
Later I tried to change the code to get simultaneously the same information for several gene IDs following the explanations I got here for the other project of mine.
# Specifying a list of gene IDs
res_id <- c('57838769','61919208','66940694')
dt <- res_id # <lapply> looping function refused to work if an argument had a different name rather than <dt>
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
## Writing a function of GET_FASTA dependent on GENE_ID (x)
get_fasta <- function(x){
link = paste0('https://www.ncbi.nlm.nih.gov/gene/',x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p/a[2]')$clickElement()
... there is a continuation below but an error was appearing here, saying that the same xPath, which was successfully used before, can not be found.
Error: Summary: NoSuchElement Detail: An element could not be located
on the page using the given search parameters. class:
org.openqa.selenium.NoSuchElementException Further Details: run
errorDetails method
I tried to delete /a[2] to get /html/.../p at the end of the xPath as it was working in the initial code, but an error was appearing later again.
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of FASTA on the website
fasta <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('pre') %>%
html_text()
fasta
return(fasta)
}
## Writing a function of GET_ACC_NUM dependent on GENE_ID (x)
get_acc_num <- function(x){
link = paste0( 'https://www.ncbi.nlm.nih.gov/gene/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div[1]/form/div[1]/div[5]/div/div[6]/div[2]/div[3]/div/div/div[3]/div/p')$clickElement()
webElem <- remDr$findElement("css", "body")
for (i in 1:5){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of ACC_NUM on the website
acc_num <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.itemid') %>%
html_text() %>%
str_sub(start= -17)
acc_num
return(acc_num)
}
## Collecting all FUNCTION into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
fasta <- get_fasta(x)
acc_num <- get_acc_num(x)
# Combine into a tibble
combined_data <- tibble( Acc_Number = acc_num,
FASTA = fasta)
}
## Running FUNCTION for all x
df <- lapply(dt, get_data_table)
head(df)
I also tried to write the code
only with rvest,
to write the loop with for (i in res_id) {},
to introduce two different xPaths ending with /html/.../p/a[2] or .../p using if () {} else {}
but the results were even more confusing.
I am studying R coding while working on such tasks, so any suggestions and critics are welcome.

The node pre is not a valid one. We have to look for value inside class or 'id` etc.
webElem$sendKeysToElement(list(key = "end") you don't need this command as there is no necessity yo scroll the page.
Below is code to get you the sequence of genes.
First we have to get the links to sequence of genes which we do it by rvest
library(rvest)
library(dplyr)
res_id <- c('57838769','61919208','66940694')
link = vector()
for(i in res_id){
url = paste0('https://www.ncbi.nlm.nih.gov/gene/', i)
df = url %>%
read_html() %>%
html_node('.note-link')
link1 = xml_attrs(xml_child(df, 3))[["href"]]
link1 = paste0('https://www.ncbi.nlm.nih.gov', link1)
link = rbind(link, link1)
}
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_ADAF01000001.1?report=fasta&from=257558&to=260444"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_VARQ01000103.1?report=fasta&from=64&to=2616&strand=true"
link1 "https://www.ncbi.nlm.nih.gov/nuccore/NZ_QKKR01000022.1?report=fasta&from=151&to=3037&strand=true"
After obtaining the links we shall get the sequence of genes which we do it by RSelenium. I tried to do it with rvest but couldn't get the sequence.
Launch browser
library(RSelenium)
driver = rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
Function to get the sequence
get_seq = function(link){
remDr$navigate(link)
Sys.sleep(5)
df = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes(xpath = '//*[#id="viewercontent1"]') %>%
html_text()
return(df)
}
df = lapply(link, get_seq)
Now we have list df with all the info.

Related

Scraping several webpages from a website (newspaper archive) using RSelenium

I managed to scrape one page from a newspaper archive according to explanations here.
Now I am trying to automatise the process to access a list of pages by running one code.
Making a list of URLs was easy as the newspaper's archive has a similar pattern of links:
https://en.trend.az/archive/2021-XX-XX
The problem is with writing a loop to scrape such data as title, date, time, category. For simplicity, I tried to work only with article headlines from 2021-09-30 to 2021-10-02.
## Setting data frames
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
list_of_url <- character() # or str_c()
## Generating subpage list
for (i in format(seq(d1, d2, by="days"), format="%Y-%m-%d")) {
list_of_url[i] <- str_c ("https://en.trend.az", "/archive/", i)
# Launching browser
driver <- rsDriver(browser = c("firefox")) #Version 93.0 (64-bit)
remDr <- driver[["client"]]
remDr$errorDetails
remDr$navigate(list_of_url[i])
remDr0$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(2)
webElem$sendKeysToElement(list(key = "end"))
}
page <- read_html(remDr$getPageSource()[[1]])
# Scraping article headlines
get_headline <- page %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
get_time <- str_sub(get_time, start= -5)
length(get_time)
}
}
In total length should have been 157+166+140=463. In fact, I did not manage to collect all data even from one page (length(get_time) = 126)
I considered that after the first set of commands in the loop, I obtained three remDr for the 3 dates specified, but they were not recognised later independently.
Because of that I tried to initiate a second loop inside the initial one before or after page <- by
for (remDr0 in remDr) {
page <- read_html(remDr0$getPageSource()[[1]])
# substituted all remDr-s below with remDr0
OR
page <- read_html(remDr$getPageSource()[[1]])
for (page0 in page)
# substituted all page-s below with page0
However, these attempts ended with different errors.
I would appreciate the help of specialists as it is my first time using R for such purposes.
Hope it will be possible to correct the existing loop that I made or maybe even suggest a shorter pathway, by making a function, for example.
Slight broadening for scraping multiple categories
library(RSelenium)
library(dplyr)
library(rvest)
Mention the date period
d1 <- as.Date("2021-09-30")
d2 <- as.Date("2021-10-02")
dt = seq(d1, d2, by="days")#contains the date sequence
#launch browser
driver <- rsDriver(browser = c("firefox"))
remDr <- driver[["client"]]
### `get_headline` Function for newspaper headlines
get_headline = function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
headlines = remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
headlines
return(headlines)
}
get_time Function for the time of publishing
get_time <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selector of time on the website
time <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-date') %>%
html_text() %>%
str_sub(start= -5)
time
return(time)
}
Numbering of all articles from one page/day
get_number <- function(x){
link = paste0( 'https://en.trend.az/archive/', x)
remDr$navigate(link)
remDr$findElement(using = "xpath", value = '/html/body/div[1]/div/div[1]/h1')$clickElement()
webElem <- remDr$findElement("css", "body")
#scrolling to the end of webpage, to load all articles
for (i in 1:25){
Sys.sleep(1)
webElem$sendKeysToElement(list(key = "end"))
}
# Addressing selectors of headlines on the website
headline <- remDr$getPageSource()[[1]] %>%
read_html() %>%
html_nodes('.category-article') %>% html_nodes('.article-title') %>%
html_text()
number <- seq(1:length(headline))
return(number)
}
Collection of all functions into tibble
get_data_table <- function(x){
# Extract the Basic information from the HTML
headline <- get_headline(x)
time <- get_time(x)
headline_number <- get_number(x)
# Combine into a tibble
combined_data <- tibble(Num = headline_number,
Article = headline,
Time = time)
}
Used lapply to loop through all the dates in dt
df = lapply(dt, get_data_table)

Web scraping in R with Selenium to click new pages

I am trying to enter the different pages of this dynamic web (https://es.gofundme.com/s?q=covid). In this search engine, my intention is to enter each project. There are 12 projects per page.
Once you have entered each of these projects and have obtained the desired information (that is, if I get it), I want you to continue to the next page. That is, once you have obtained the 12 projects on page 1, you must obtain the 12 projects on page 2 and so on.
How can it be done? You help me a lot. Thanks!
This is my code:
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of
library(purrr) # for 'map_chr' to get reply
library(tidyr) #extract_numeric(years)
library(stringr)
df_0<-data.frame(project=character(),
name=character(),
location=character(),
dates=character(),
objective=character(),
collected=character(),
donor=character(),
shares=character(),
follow=character(),
comments=character(),
category=character())
#Specifying the url for desired website to be scraped
url <- 'https://es.gofundme.com/f/ayuda-a-ta-josefina-snchez-por-covid-en-pulmn?qid=00dc4567cb859c97b9c3cefd893e1ed9&utm_campaign=p_cp_url&utm_medium=os&utm_source=customer'
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# 1) Project name
project <- html_obj %>% html_nodes(".a-campaign-title") %>% html_text()
# 2) name
info <- html_obj %>% html_nodes(".m-person-info") %>% html_text()
# 3) location
location <- html_obj %>% html_nodes(".m-person-info-content") %>% html_text()
# 4) dates
dates <- html_obj %>% html_nodes(".a-created-date") %>% html_text()
# 5) Money -collected -objective
money <- html_obj %>% html_nodes(".m-progress-meter-heading") %>% html_text()
# 6) doner, shares and followers
popularity <- html_obj %>% html_nodes(".text-stat-value") %>% html_text()
# 7) Comments
comments <- html_obj %>% html_nodes(".o-expansion-list-wrapper") %>% html_text()
# 8) Category
category <- html_obj %>% html_nodes(".a-link") %>% html_text()
# create the df with all the info
review_data <- data.frame(project=project,
name= gsub("\\Organizador.*","",info[7]),
location=str_remove(location[7], "Organizador"),
dates = dates,
collected = unlist(strsplit(money, " "))[1],
objective = unlist(strsplit(money, " "))[8],
donor = popularity[1],
shares = popularity[2],
follow = popularity[3],
comments = extract_numeric(comments),
category = category[17],
stringsAsFactors = F)
The page does a POST request that you can mimic/simplify. To keep dynamic you need to first grab an api key and application id from a source js file, then pass those in the subsequent POST request.
In the following I simply extract the urls from each request. I set the querystring for the POST to have the max of 20 results per page. After an initial request, in which I retrieve the number of pages, I then map a function across the page numbers, extracting urls from the POST response for each; altering the page param.
You end up with a list of urls for all the projects you can then visit to extract info from; or, potentially make xmlhttp requests to.
N.B. Code can be re-factored a little as tidy up.
library(httr)
library(stringr)
library(purrr)
library(tidyverse)
get_df <- function(x){
df <- map_dfr(x, .f = as_tibble) %>% select(c('url')) %>% unique() %>%
mutate( url = paste0('https://es.gofundme.com/f/', url))
return(df)
}
r <- httr::GET('https://es.gofundme.com/static/js/main~4f8b914b.bfe3a91b38d67631e0fa.js') %>% content(as='text')
matches <- stringr::str_match_all(r, 't\\.algoliaClient=r\\.default\\("(.*?)","(.*?)"')
application_id <- matches[[1]][,2]
api_key <-matches[[1]][,3]
headers = c(
'User-Agent' = 'Mozilla/5.0',
'content-type' = 'application/x-www-form-urlencoded',
'Referer' = 'https://es.gofundme.com/'
)
params = list(
'x-algolia-agent' = 'Algolia for JavaScript (4.7.0); Browser (lite); JS Helper (3.2.2); react (16.12.0); react-instantsearch (6.8.2)',
'x-algolia-api-key' = api_key,
'x-algolia-application-id' = application_id
)
post_body <- '{"requests":[{"indexName":"prod_funds_feed_replica_1","params":"filters=status%3D1%20AND%20custom_complete%3D1&exactOnSingleWordQuery=word&query=covid&hitsPerPage=20&attributesToRetrieve=%5B%22fundname%22%2C%22username%22%2C%22bene_name%22%2C%22objectID%22%2C%22thumb_img_url%22%2C%22url%22%5D&clickAnalytics=true&userToken=00-e940a6572f1b47a7b2338b563aa09b9f-6841178f&page='
page_num <- 0
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST(url = 'https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
num_pages <- res$results[[1]]$nbPages
df <- get_df(res$results[[1]]$hits)
pages <- c(1:num_pages-1)
df2 <- map_dfr(pages, function(page_num){
data <- paste0(post_body, page_num, '"}]}')
res <- httr::POST('https://e7phe9bb38-dsn.algolia.net/1/indexes/*/queries', httr::add_headers(.headers=headers), query = params, body = data) %>% content()
temp_df <-get_df(res$results[[1]]$hits)
}
)
df <- rbind(df, df2)
#David Perea, see this page for differentiation of scraping methods, including Selenium. The method proposed by QHarr is very good, but doesn't use Selenium and also requires good knowledge of HTTP.

Error in eval_tidy (xs [[j]], mask): object 'titles' not found

When I try this code, I get an error. I want to scrape the data from multiple pages. But when I try the script down below, I get an error that says:
Error in eval_tidy (xs [[j]], mask): object 'titles' not found
library(tidyverse)
library(rvest)
# function to scrape all elements also missing elements
scrape_css <- function(css, group, html_page) {
txt <- html_page %>%
html_nodes(group) %>%
lapply(
. %>%
html_nodes(css) %>%
html_text() %>%
ifelse(identical(., character(0)), NA, .)
) %>%
unlist()
return(txt)
}
# Get all elements from 1 page
get_one_page <- function(url) {
html <- read_html(url)
titles <- scrape_css(
".recipe-card_title__1oIb-",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
minutes <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(1)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
callories <- scrape_css(
".recipe-card-properties_property__2tGuH:nth-child(2)",
".recipe-grid-lane_recipeCardColumn__2ILMo",
html_page
)
}
return(tibble(titles = titles, minutes = minutes, callories = callories))
url <- ("https://www.ah.nl/allerhande/recepten/R-L1473207825981/suikerbewust")
appie <- get_one_page(url)
Two things:
return should be inside the curly braces at the end of the function
return(tibble(titles = titles, minutes = minutes, callories = callories))
}
You forgot to update the variable name such that you have
html <- read_html(url)
when you need
html_page <- read_html(url)
This does not per se answer #Laila's question, but I ran into the
Error in eval_tidy(xs[[j]], mask) : object '' not found
error, when I did sth similar to
tibble(
a = 1,
b = ,
c = NA
)
So basically, forgot to enter a value for a tibble column. But it is a rather uninformative error message. I am leaving this for the future me (and others) as a reference, as there are not many posts related to this error.

Scrape Web Page When Selector Does Not Update URL

I am trying to scrape this webpage (https://nc.211counts.org) for a given region and time ('Onslow', 'Yesterday' for example). I want to pull all of the information from that top left table (COVID, Housing, etc through Other). Unfortunately, the URL does not update when the filters are selected. I have been following the tutorial here but cannot find a way to pull in the position of the region names I need to scrape for. Since the html_nodes function is returning empty, I think there is something to the mapping that is off.
What am I missing here?
# docker run -d -p 4445:4444 selenium/standalone-chrome
# docker ps
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
port = 4445L,
browserName = "chrome")
remDr$open()
remDr$navigate("https://nc.211counts.org")
remDr$screenshot(display = TRUE)
nc211 <- xml2::read_html(remDr$getPageSource()[[1]])
str(nc211)
body_nodes <- nc211 %>%
html_node('body') %>%
html_children()
body_nodes
body_nodes %>%
html_children()
rank <- nc211 %>%
rvest::html_nodes('body') %>%
xml2::xml_find_all("//span[contains(#class, 'col-lg-12 chosen-select')]") %>%
rvest::html_text()
# this returns empty
nc211 %>%
rvest::html_nodes("#region") %>%
rvest::html_children() %>%
rvest::html_text()
# guessing at an element number to see what happens
element<- remDr$findElement(using = 'css selector', "#region > option:nth-child(1)")
element$clickElement()
Content is dynamically updated through xhr POST requests when you make your selections and press Search. You can use the network tab to analyse these requests and reproduce them without resorting to selenium (as an alternative). You will need to pick up the param options from the initial page.
Below I show you how to make a request for a particular zipcode and also how to find out all the zip codes and their corresponding param ids to use in request. The latter needs to come from the initial url.
library(httr)
library(rvest)
data = list(
'id' = '{"ids":["315"]}', # zip 27006 is id 315 seen in value attribute of checkbox node
'timeIntervalId' = '18',
'centerId' = '7',
'type' = 'Z'
)
#post request that page makes using your filter selections e.g. zip code
r <- httr::POST(url = 'https://nc.211counts.org/dashBoard/barChart', body = data)
page <- read_html(r)
categories <- page %>% html_nodes(".categoriesDiv .toolTipSubCategory, #totalLabel") %>% html_text
colNodes <- page %>% html_nodes(".categoriesDiv .value")
percentages <- colNodes %>% html_attr('data-percentage')
counts <- colNodes %>% html_attr('data-value')
df <- as.data.frame(cbind(categories, percentages, counts))
print(df)
#Lookups e.g. zip codes. Taken from initial url
initial_page <- read_html('https://nc.211counts.org/')
ids <- initial_page %>% html_nodes('.zip [value]') %>% html_attr('value')
zips <- initial_page %>% html_nodes('.zip label') %>% html_text() %>% trimws()
print(ids[match('27006', zips)])

R - Issue with the DOM of the danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about their democratic process and they are uploading all the legislative documents on their website. I've been crawling over all pages starting 2008. Right now I'm parsing the information into a dataframe and I'm having an issue that I was not able to resolve so far.
If we look at the DOM we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16-19. To parse the information correctly for my dataframe I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my pattern match more than once and I don't know how to tell R that I only want the first match.
for the sake of an example I include some code:
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique (grep(paste(tomatch, collapse="|"), results, value = TRUE))
Maybe you can help me with that
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" are related to the text. I was able to get the text of the webpage with the following code. Also, the following code identifies the position of the first "regex hit" of the different patterns to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
"C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}
Here is another approach :
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}