I am writing a loop (in R) to webscrape Reddit posts - using Reddit's API ("Pushshift").
Essentially, I would like to get every comment that contains the word "Trump" between now and until 20,000 hours ago at an hourly basis. The API stores the comments in a JSON frame - I wrote the following code in R to obtain these comments (note - I made it so that the results are saved after every 200 iterations in case of a crash):
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:20000)
{tryCatch({
{
url_i<- paste0(part1, i+1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
ifelse(i %% 200 == 0, saveRDS(results, "results_index.RDS"), "" )
}
}, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
The code runs - but I am looking for tips to increase the speed and efficiency of this code. For example, I have noticed that:
Sometimes this code seems to take a really long time on certain iterations
I also have a feeling that as the "list" grows in size and the global environment with R becomes more full, things are also slowing down.
Sometimes, the webscraping stops collecting new results (i.e. I added a statement which shows the cumulative number of results that have been collected at each iteration - sometimes, this number stops updating)
I used "tryCatch()" to skip errors to prevent the loop from crashing - but perhaps there might have been some way around this that could have potentially resulted in more Reddit comments being scraped?
Could someone please recommend some tips on how to optimize and speed this code up? Perhaps someone could try running this code and let me know what they think?
Thank you!
There are two things you can do : 1) save the data.frame into a ".RData file" at each iteration. You need less memory when you do this because you do not store data in the RAM 2) use parallel calculations. Here is an example :
library(parallel)
library(doParallel)
library(RSelenium)
fn_Par <- function(core_Id, all_Index, list_remDr, nb_Core)
{
library(jsonlite)
library(RSelenium)
remDr <- list_remDr[[core_Id]]
remDr$open()
setwd("D:\\")
part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"
nb_Index_All <- length(all_Index)
nb_Id_Per_Core <- floor(nb_Index_All / nb_Core)
index_To_Extract <- all_Index[(1 + (core_Id - 1) * nb_Id_Per_Core) : min((core_Id * nb_Id_Per_Core), nb_Index_All)]
for(i in index_To_Extract)
{
url_i <- paste0(part1, i + 1, part2, i, part3)
remDr$navigate(url_i)
Sys.sleep(0.5)
web_Obj <- remDr$findElement("css selector", 'body > pre')
r_i <- tryCatch(data.frame(fromJSON(web_Obj$getElementText()[[1]])), error = function(e) NA)
if(is.null(dim(r_i)) == FALSE)
{
Sys.sleep(10)
remDr$navigate(url_i)
web_Obj <- remDr$findElement("css selector", 'body > pre')
r_i <- tryCatch(data.frame(fromJSON(web_Obj$getElementText()[[1]])), error = function(e) NA)
}
save(r_i, file = paste0(i, "_core_Id_", core_Id, ".RData"))
Sys.sleep(0.5)
}
}
nb_CPU <- 4
cluster <- parallel::makeCluster(nb_CPU)
doParallel::registerDoParallel(cl = cluster)
list_remDr <- list()
list_rd <- list()
for(i in 1 : nb_CPU)
{
print(i)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
list_rd[[i]] <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
list_remDr[[i]] <- list_rd[[i]]$client
}
parLapply(cluster, X = 1 : nb_CPU, fun = fn_Par, all_Index = 1 : 2000, list_remDr = list_remDr, nb_Core = nb_CPU)
Here is another approach that can be considered :
library(parallel)
library(doParallel)
fn_Par <- function(core_Id, all_Index, nb_Core)
{
library(jsonlite)
setwd("D:\\")
part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"
nb_Index_All <- length(all_Index)
nb_Id_Per_Core <- floor(nb_Index_All / nb_Core)
index_To_Extract <- all_Index[(1 + (core_Id - 1) * nb_Id_Per_Core) : min((core_Id * nb_Id_Per_Core), nb_Index_All)]
for(i in index_To_Extract)
{
url_i <- paste0(part1, i + 1, part2, i, part3)
r_i <- tryCatch(data.frame(fromJSON(url_i)), error = function(e) NA)
if(is.null(dim(r_i)) == TRUE)
{
Sys.sleep(5)
r_i <- tryCatch(data.frame(fromJSON(url_i)), error = function(e) NA)
}
if(is.null(dim(r_i)) == TRUE)
{
Sys.sleep(5)
r_i <- tryCatch(data.frame(fromJSON(url_i)), error = function(e) NA)
}
if(is.null(dim(r_i)) == TRUE)
{
Sys.sleep(5)
r_i <- tryCatch(data.frame(fromJSON(url_i)), error = function(e) NA)
}
save(r_i, file = paste0(i, "_core_Id_", core_Id, ".RData"))
}
}
nb_CPU <- 4
cluster <- parallel::makeCluster(nb_CPU)
doParallel::registerDoParallel(cl = cluster)
parLapply(cluster, X = 1 : nb_CPU, fun = fn_Par, all_Index = 1 : 2000, nb_Core = nb_CPU)
Related
I want to create a function that takes an externally defined variable and uses it in a group by using dplyr. Here is what I have so far:
data(mtcars)
my_grp_col <- 'gear'
calculate_mean <- function(data, grouping_column, target){
data %>%
group_by(cyl, am, {{my_grp_col}}, target) %>%
summarize(mean(target, na.rm = T))
}
calculate_mean(data = mtcars, grouping_column = my_grp_col, target = mpg)
Essentially, I want to group by cyl, am, gear (which I have defined externally) and then calculate the mean of target (mpg).
The following would work (note that you need also {{...}} around target in this case):
data(mtcars)
my_grp_col <- 'gear'
calculate_mean <- function(data, grouping_column, target){
data %>%
group_by(cyl, am, !!sym(grouping_column), {{target}}) %>%
summarize(mean(target, na.rm = T))
}
calculate_mean(data = mtcars, grouping_column = my_grp_col, target = mpg)
However, it would look much nicer if you also directly give grouping_column without defining it as string before:
calculate_mean <- function(data, grouping_column, target){
data %>%
group_by(cyl, am, {{grouping_column}}, {{target}}) %>%
summarize(mean(target, na.rm = T))
}
calculate_mean(data = mtcars, grouping_column = gear, target = mpg)
I'm developing at the moment my first Shiny app and I have a little issue:
In the "server part" of the app, I have 3 output objects (renderTables) named:
output$results
output$results2
output$results3
In the two first outputs, I compute some parameters using information comming from inputs. These computations are done in a conditionnal loop (if/if else) and for some of them stock in a variable "Resultats".
What I want to do is to reuse these parameters in the third output ("output$results3").
Is there a way to store these parameters when computing them and then to reuse them ?
As an example: I want to be able to use the value of "SP" (computed in out$results2) in the output$results3.
If I don't store them, the output$results3 just can't find the parameters...
I have already tried reactivevalues() function but it didn't work.
Do you have an idea how to do it?
# Define server logic to summarize and view selected dataset
server <- function(input, output) {
# 1. Calculs et formattage des résultats
#Demande
output$results <- renderTable({
if (input$shape == "QofPD") {
a <- input$a
b <- input$b
}
if (input$shape == "PofQD") {
a <- input$a/input$b
b <- 1/input$b
}
Q0 <- max(0,a - b*input$P0)
epd <- -b*input$P0/Q0
SC <- (a/b-input$P0)*Q0/2
DT0 <- input$P0*Q0
Variables <- c("Demande:", "Demande inverse:", "Prix (P*) = ","Quantité (Q*) = ",paste0("Elasticité-prix (",intToUtf8(949),") = "), "Dépense totale (DT*) = ", "Surplus du consommateur (SC) = ")
Resultats <- c(paste0("Q = ",round(a,2)," - ",round(b,2),"P"),paste0("P = ",round(a/b,2)," - ",round(1/b,2),"Q"),input$P0,round(Q0,2),round(epd,2),round(DT0,2),round(SC,2))
df.results <- data.frame(Variables, Resultats)
df.results
},
striped=TRUE, colnames=FALSE)
# Offre
output$results2 <- renderTable({
if (input$shape2 == "QofPO") {
a2 <- input$a2
b2 <- input$b2
inter <- -a2/b2
slope <- 1/b2
Q02 <- max(0,a2+b2*input$P0)
epo <- b2*input$P0/Q02
SP <- (Q02+a2)/2*input$P0
RT0 <- input$P0*Q02
Variables <- c("Offre:", "Offre inverse:", "Prix (P*) = ","Quantité (Q*) = ",paste0("Elasticité-prix (",intToUtf8(949),") = "), "Recette totale (RT*) = ", "Surplus du producteur (SP) = ")
Resultats <- c(paste0("Q = ",a2," + ",b2,"P"),paste0("P = ",round(slope,2),"Q"," ",round(inter,2)),input$P0,round(Q02,2),round(epo,2),round(RT0,2),round(SP,2))
df.results <- data.frame(Variables, Resultats)
df.results
}
else if (input$shape2 == "PofQO") {
a2 <- input$a2
b2 <- input$b2
inter <- a2
slope <- b2
Q02 <- max(0,(input$P0-a2)/b2)
epo <- (1/b2)*input$P0/Q02
SP <- (input$P0-inter)*Q02/2
RT0 <- input$P0*Q02
Variables <- c("Offre:", "Offre inverse:", "Prix (P*) = ","Quantité (Q*) = ",paste0("Elasticité-prix (",intToUtf8(949),") = "), "Recette totale (RT*) = ", "Surplus du producteur (SP) = ")
Resultats <- c(paste0("Q = ",round(-a2/b2,2)," + ",round(1/b2,2),"P"),paste0("P = ",round(inter,2)," + ",round(slope,2),"Q"),input$P0,round(Q02,2),round(epo,2),round(RT0,2),round(SP,2))
df.results <- data.frame(Variables, Resultats)
df.results
}
},
striped=TRUE, colnames=FALSE)
#Equilibre
output$results3 <- renderTable({
# Here I want to enter the parameters SP / SC...etc to be able to make new computation.
},
striped=TRUE, colnames=FALSE)
}
shinyApp(ui=ui, server=server)
I hope I was clear enough.
Thanks for your help.
And I stay available for more information if needed
Valentin
I'd like to scrape the content of a page once the province (and the commune) are selected.
The following code correctly outputs the provinces and their values.
library(rvest)
page <- read_html(x = "https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati/")
text <- page %>% html_nodes(xpath='//select[#name="provincia"]/option')%>% html_text()
values <- page %>% html_nodes(xpath='//select[#name="provincia"]/option')%>% html_attr("value")
Res <- data.frame(text = text, values = values, stringsAsFactors = FALSE)
Res
Now, I'd like to access the page for each value, e.g. this might be helpful for getting access to value=19.
text <- page %>% html_nodes(xpath="//*/option[#value = '19']")%>% html_text()
text
The source code is the following
<div class="row results_form_search">
<form role="search" method="POST" class="search-form" action="/progetto-torelli/progetto-torelli-risultati/" id="search_location">
<input type="hidden" name="comune_from" value="" />
<div class="form-row">
<input type="text" name="cognome" placeholder="Cognome" autocomplete="off" value="">
<select name="provincia">
<option value="0" selected>Seleziona Provincia</option>
<option value="74"
>-
</option>
<option value="75"
>AGRIGENTO
</option>
<option value="19"
>ALESSANDRIA
This is where the content that I want to scrape might be.
<div class="row">
<ul class="listing_search">
</ul>
</div>
Thank you so much for your advice!
RSelenium may end up being the way to go. However, if you can insert some judicious waits, or chunk your requests, so server isn't swamped with requests, you can use rvest and make the same requests the page does.
You first need to generate all the combinations of province and comune (filtering out unwanted values); this can be done by making xmlhttp requests, using the value attribute for the options within the select for province, to gather back the comune dropdown options and their associated values.
You then make further requests, for each combination pair, to get the page content, which you would get when making selections from each of those dropdowns manually and pushing CERCA.
Pauses are needed as there are 10,389 valid combinations, by my reckoning, and, if you attempt to make all those requests one after the other, following the initial requests as well, the server will cut-off the connection.
Another option would be to chunk up combined into smaller dataframes and make requests for those at timed intervals and then combine the results.
library(rvest)
library(dplyr)
library(purrr)
get_provincias <- function(link) {
nodes <- read_html(link) %>%
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
df <- data.frame(
Provincia = nodes %>% html_text(trim = T),
id0 = nodes %>% html_attr("value")
)
return(df)
}
get_comunes <- function(id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
# print(link)
nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')
df <- data.frame(
id0 = id, # id1
Comune = nodes %>% html_text(trim = T),
id3 = nodes %>% html_attr("value")
)
return(df)
}
get_page <- function(prov_id, com_id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
prov_id,
com_id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
page <- read_html(link)
# print(page %>% html_node(".listing_name") %>% html_text(trim = T))
# print(tibble(id3 = com_id, page = page))
return(tibble(id3 = com_id, page = page))
}
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")
comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")
combined <- dplyr::right_join(provincias, comunes, by = "id0")
# length(combined$Comune) -> 10389
results <- map2_dfr(combined$id0, combined$id3, .f = get_page)
final <- dplyr::inner_join(combined, results, by = "id3")
Below is a longer version, with the additional info you requested, where I played around with adding pauses. I still found that I could run up to, and including
combined <- dplyr::right_join(provincias, comunes, by = "id0")
in one go. But after that I needed to chunk requests into about 2000 requests batches with 20-30 minutes in between. You can try tweaking the timings below. I ended up using the commented out section to run each batch and then left a pause of 30 mins in between.
Some things to consider:
It seems that you can have comunes values like ... which still return listings. With that in mind you may wish to remove the :not parts of this:
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
as I assumed that was filtering out invalid results.
Next, you might consider writing a helper function with httr and retry,
to make the requests with backoff/retry, rather than use pauses.
Such a function might look like this:
httr::RETRY(
"GET",
<request url>,
times = 3,
pause_min = 20*60,
pause_base = 20*60)
Anyway, those are some ideas. Even without the server cutting the connection, via uses of waits, I still found it started to throttle requests, meaning some requests took quite a long time to complete. Optimizing this could potentially take a lot of time and effort. I spent a good few days playing around looking at chunk size and waits.
library(rvest)
library(dplyr)
library(purrr)
get_provincias <- function(link) {
nodes <- read_html(link) %>%
html_nodes('[name="provincia"] > option:not([selected]):not(:contains("-")):not(:contains("\u0085"))')
df <- data.frame(
Provincia = nodes %>% html_text(trim = T),
id0 = nodes %>% html_attr("value")
)
return(df)
}
get_comunes <- function(id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_comuni.php?id0=%s&id1=0&_=%i",
id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
# print(link)
nodes <- read_html(link) %>% html_nodes('option:not([value="0"])')
df <- data.frame(
id0 = id, # id1
Comune = nodes %>% html_text(trim = T),
id3 = nodes %>% html_attr("value")
)
return(df)
}
get_data <- function(prov_id, com_id) {
link <- sprintf(
"https://www.solferinoesanmartino.it/db-torelli/_get_soldati.php?id0=1&id1=&id2=%s&id3=%s&_=%i",
prov_id,
com_id,
as.numeric(as.POSIXct(Sys.Date(), format = "%Y-%m-%d"))
)
# print(link)
page <- read_html(link)
df <- data.frame(
cognome = page %>% html_nodes(".listing_name") %>% html_text(trim = T),
livello = page %>% html_nodes(".listing_level") %>% html_text(trim = T),
id3 = com_id,# for later join back on comune
id0 = prov_id
)
Sys.sleep(.25) # pause for . sec
return(df)
}
get_chunks <- function(df, chunk_size) { # adapted from #BenBolker https://stackoverflow.com/a/7060331
n <- nrow(df)
r <- rep(1:ceiling(n / chunk_size), each = chunk_size)[1:n]
d <- split(df, r)
return(d)
}
write_rows <- function(df, filename) {
flag <- file.exists(filename)
df2 <- purrr::map2_dfr(df$id0, df$id3, .f = get_data)
write.table(df2,
file = filename, sep = ",",
append = flag,
quote = F, col.names = !flag,
row.names = F
)
Sys.sleep(60*10)
}
provincias <- get_provincias("https://www.solferinoesanmartino.it/progetto-torelli/progetto-torelli-risultati")
Sys.sleep(60*5)
comunes <- map_df(provincias$id0, get_comunes) %>% filter(Comune != "-")
Sys.sleep(60*10)
combined <- dplyr::right_join(provincias, comunes, by = "id0")
Sys.sleep(60*10)
chunked <- get_chunks(combined, 2000) # https://stackoverflow.com/questions/7060272/split-up-a-dataframe-by-number-of-rows
filename <- "prov_com_cog_liv.csv"
map(chunked, ~ write_rows(.x, filename))
## #### test case #####################
# df <- chunked[[6]]
#
# flag <- file.exists(filename)
#
# df2 <- map2_dfr(df$id0, df$id3, .f = get_data)
#
# write.table(df2,
# file = filename, sep = ",",
# append = flag,
# quote = F, col.names = !flag,
# row.names = F
# )
####################################
results <- read.csv(filename)
final <- dplyr::right_join(combined, results, by = "id3")
I found a solution to this in the past and still using it but I never looked for something else. I would like to share this one and get some feedback from the community about this and if better solution exists.
Actually my next idea on this is to implement selection tool a bit like editableCell. I actuallz could do it but the main raison to use this would be to use copy paste to paste the picture somewhere else as they are positioned. Unfortunately it does not copy the picture.
I have a list of jpeg with a specific name:
010003000Color3_0.jpg
010003000Color3_1.jpg
010003000Color3_2.jpg
I need to get the information from those names are create a grid of those pictures depending on the information contained.
E.g. here I have a name corresponding to _.jpg
Here the complete code, I let you go through it. Read the comments that are partly explaining it.
# Setup -------------------------------------------------------------------
library(data.table)
library(dplyr)
library(tableHTML)
library(knitr)
wd <- "C:/Users/gravier/Downloads/Example_Yann_Html/Jpeg"
# Function ----------------------------------------------------------------
addimgbalise <- function(vect, pisize=200) {
# to add a html balise with a pixel size for the image
vect <- paste0("<img src='", vect,"' width=", pisize, "/>")
return(vect)
}
write.html.link <- function(data,
filename,
caption = filename,
wdfunction,
color.bg = "#00000",
color.line = "#b4bac4",
color.text = "#b4bac4",
font.size = "8px",
font.family = "Arial",
text.align = "center") {
# permit to write a html table with a local link to click
tabhtml <- tableHTML(data,
rownames = FALSE,
caption = caption,
theme="default") %>%
add_css_table(css = list('border', color.line)) %>%
add_css_table(css = list('text-align', text.align)) %>%
add_css_table(css = list('font-family', font.family)) %>%
add_css_table(css = list('font-size', font.size)) %>%
add_css_table(css = list('color', color.text))
write_tableHTML(tabhtml, file = paste0(wdfunction, "/", filename, ".html"))
# unfortunately write_tableHTML has been changed in the past and is not write the '<' '>' characters of the addimgbalise function, so we have to read again the html and exchange those characters again
temp <- suppressWarnings(readLines(paste0(wdfunction, "/", filename, ".html")))
temp <- gsub( ">", ">", temp)
temp <- gsub( "<", "<", temp)
temp <- c("<body bgcolor='", color.bg, "'>", temp)
writeLines(temp, paste0(wdfunction, "/", filename, ".html"))
}
# Parameters (where in function normally, but I detail the process --------
patternlist <- ".jpg"
regexjpeg <- "([0-9]*)(.*)_(.*)" # this the regex expression regarding the name of the files
regexposID <- "\\1" # position of the different variable I want to extract from the name
regexposWhat <- "\\2"
regexposField <- "\\3"
regexposZ <- "\\3"
formulaPV <- "row+Field~col+What" # then the dcast formula regarding how to arrange the pictures
pixelimg <- 150 # size in html but the picture are kept of real resolution so you can zoom
base64 <- F
listimg<- data.table(path = list.files(wd, full.names = T))
listimg[, ID := gsub(paste0(regexjpeg, patternlist), regexposID, basename(path))]
listimg[, What := gsub(paste0(regexjpeg, patternlist), regexposWhat, basename(path))]
listimg[, Field := as.numeric(gsub(paste0(regexjpeg,patternlist), regexposField, basename(path)))]
listimg[, Z := as.numeric(gsub(paste0(regexjpeg,patternlist), regexposZ, basename(path)))]
listimg[, row := LETTERS[as.numeric(substr(ID, 1, 3))]]
listimg[, col := as.numeric(substr(ID, 4, 6))]
if( base64 ) {
for(i in 1:nrow(listimg) ) {
listimg[i, code := paste0(addimgbalise(image_uri(path), pixelimg), "\n", row, col, "-", Field, "-", What)]
}
} else {
listimg[, code := paste0(addimgbalise(path, pixelimg), "\n", row, col, "-", Field, "-", What)]
}
listimg3 <- dcast(listimg, as.formula(formulaPV), value.var = "code")
listimg3 <- data.frame(listimg3)
listimg3[nrow(listimg3)+1,] <- colnames(listimg3)
write.html.link(data = listimg3, filename = "Picture_grid", caption = paste0("Picture_grid", " - ", formulaPV), wdfunction = dirname(wd))
# Some other example ------------------------------------------------------
formulaPV <- "row+Field+What~col" # then the dcast formula regarding how to arrange the pictures
listimg3 <- dcast(listimg, as.formula(formulaPV), value.var = "code")
listimg3 <- data.frame(listimg3)
listimg3[nrow(listimg3)+1,] <- colnames(listimg3)
write.html.link(data = listimg3, filename = "Picture_grid_other_formula", caption = paste0("Picture_grid", " - ", formulaPV), wdfunction = dirname(wd))
# base64
for(i in 1:nrow(listimg) ) {
listimg[i, code := paste0(addimgbalise(image_uri(path), pixelimg), "\n", row, col, "-", Field, "-", What)]
}
listimg3 <- dcast(listimg, as.formula(formulaPV), value.var = "code")
listimg3 <- data.frame(listimg3)
listimg3[nrow(listimg3)+1,] <- colnames(listimg3)
write.html.link(data = listimg3, filename = "Picture_grid_base64", caption = paste0("Picture_grid", " - ", formulaPV), wdfunction = dirname(wd))
The final result is like this:
When I use both of the two following blocks of code code, I get "glued" words, and by that i mean words that are not not separated by a space but they should, and this is a problem. In the original HTML, it seem like they're separated by a <b> and i'm not beeing able to handle this. The two blocks do the same thing by different ways.
library(XML)
library(RCurl)
# Block 1---------
url <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
u <- readLines(url)
h <- htmlTreeParse(file=u,
asText=TRUE,
useInternalNodes = TRUE,
encoding = "utf-8")
song <- getNodeSet(doc=h, path="//article", fun=xmlValue)
# Block 2---------
u <- "https://www.letras.mus.br/red-hot-chili-peppers/32739/"
h <- htmlParse(getURL(u))
song <- xpathSApply(h, path = "//article", fun = xmlValue)
Which returns something like:
[1] "Sometimes I feelLike I don't have a partnerSometimes I feelLike my only friendIs the city I live inThe city of angelsLonely as I amTogether we cryI drive on her streets'Cause she's my companionI walk through her hills'Cause she knows who I amShe sees my good deedsAnd she kisses me windyI never worryNow that is a lieI don't ever wanna feelLike I did that dayBut take me to the place I loveTake me all the wayIt's hard to believeThat there's nobody out thereIt's hard to believeThat I'm all aloneAt...
I was able to retrieve words with the following code :
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.letras.mus.br/red-hot-chili-peppers/32739/")
remDr$screenshot(display = TRUE, useViewer = TRUE)
page_Content <- remDr$getPageSource()[[1]]
list_Text_Song <- list()
for(i in 1 : 30)
{
print(i)
web_Obj <- tryCatch(remDr$findElement("xpath", paste0("//*[#id='js-lyric-cnt']/article/div[2]/div[2]/p[", i, "]")), error = function(e) NA)
list_Text_Song[[i]] <- tryCatch(web_Obj$getElementText(), error = function(e) NA)
}
list_Text_Song <- unlist(list_Text_Song)
list_Text_Song <- list_Text_Song[!is.na(list_Text_Song)]
The words are not glued with this approach.