R: Parsing group of html files with loop - html

The following code works for individual .html files:
doc <- htmlParse("New folder/1-4.html")
plain.text <- xpathSApply(doc, "//td", xmlValue)
plain.text <- gsub("\n", "", plain.text)
gregexpr("firstThing", plain.text)
firstThing <- substring(plain.text[9], 41, 50)
gregexpr(secondThing, plain.text)
secondThing <- substring(plain.text[7], 1, 550)
But the following loop does not and gives me the error:
XML content does not seem to be XML
file.names <- dir(path = "New folder")
for(i in 1:length(file.names)){
doc <- htmlParse(file.names[i])
plain.text <- xpathSApply(doc, "//td", xmlValue)
gsub("\n", "", plain.text)
firstThing[i] <- substring(plain.text[9], 41, 50)
secondThing[i] <- substring(plain.text[7], 1, 550)
}
I'm simply trying to extract the information (as I've been able to do in the first batch of code), and create a vector of information.
Any ideas on how to resolve this issue?

Two things. First, your paths were wrong. To fix this, use:
filenames = dir(path = "New folder", full.names = TRUE)
Secondly, a better way than filling two variables inside a for loop is to generate structured data in a list function:
result = lapply(filenames, function (filename) {
doc = htmlParse(filename)
plain_text = xpathSApply(doc, "//td", xmlValue)
c(first = substring(plain_text[9], 41, 50),
second = substring(plain_text[7], 1, 550))
})
Now result is a list of elements, where each element is a vector with names first and second.
A few other remarks:
Be wary of dots in variable names — S3 uses dots in names to determine the class of a generic method. Using dots for anything else in variable names causes confusion and should be avoided.
The gsub statement in your loop has no effect.

Related

Shiny - assigning argument to function

I am trying to create a shiny app that applies a self-made function to an uploaded dataset, then allows to download the modified results. Here is my code:
library(shiny)
library(tidyverse)
namkurz <- function(data, a_spalte) {
kuerzel <- vector(length = length(data$a_spalte))
for (i in 1:length(data$a_spalte)){
spez = data$Art[i]
s = unlist(strsplit(spez, " ", fixed = TRUE))
s = substr(s, 1, 2)
s = paste(s, collapse = ' ')
kuerzel[[i]] = s
}
data <- data %>%
mutate(kurz = kuerzel)
}
ui <- fluidPage(
fileInput('upload','Deine Kartierungsdaten'),
textInput('art', 'Wie heißt die Spalte mit Artnamen?'),
downloadButton('analyse','Artenkürzel hinzufügen')
)
server <- function(input, output, session) {
data <- reactive({
req(input$upload)
ext <- tools::file_ext(input$upload$name)
switch(ext,
csv = vroom::vroom(input$upload$datapath, delim = ";"),
validate("Invalid file; Please upload a .csv file")
)
})
art <- reactive(input$art)
output$analyse <- downloadHandler(
filename = function() {
paste0('mit_kuerzel', ".csv")
},
content = function(file) {
ergebnis <- reactive(namkurz(data(), art()))
vroom::vroom_write(ergebnis(), file)
}
)
}
shinyApp(ui, server)
When trying to save the output I get a 'Warning: Unknown or uninitialised column:' error. I think my problem is in the assignment of argument 'art' to the 'ergebnis' object, but I can't find the way to fix it.
I recommend a few things:
(Required) In your function, a_spalte is a character vector and not the literal name of column in the frame, so you need to use [[ instead of $, see The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe and Dynamically select data frame columns using $ and a character value.
Change all references of data$a_spalte to data[[a_spalte]].
namkurz <- function(data, a_spalte) {
kuerzel <- vector(length = length(data[[a_spalte]]))
for (i in 1:length(data[[a_spalte]])){
spez = data$Art[i]
s = unlist(strsplit(spez, " ", fixed = TRUE))
s = substr(s, 1, 2)
s = paste(s, collapse = ' ')
kuerzel[[i]] = s
}
data <- data %>%
mutate(kurz = kuerzel)
}
Your function is a bit inefficient doing things row-wise, we can vectorize that operation.
namkurz <- function(data, a_spalte) {
spez <- strsplit(data$Art, " ", fixed = TRUE)
data$kurz <- sapply(spez, function(z) paste(substr(z, 1, 2), collapse = " "))
data
}
(Optional) The content= portion of downloadHandler is already reactive, you do not need to wrap namkurz in reactive. Because of this, you also don't need to treat ergebnis as reactive.
output$analyse <- downloadHandler(
filename = ...,
content = function(file) {
ergebnis <- namkurz(data(), art())
vroom::vroom_write(ergebnis, file)
}
)
(Optional) Your output filename is fixed, so two things here: if it's always going to be "mit_kuerzel.csv", then there's no need for paste0, just use function() "mit_kuerzel.csv".
However, if you are intending to return a file named something based on the original input filename, one could do something like:
filename = function() {
paste0(tools::file_path_sans_ext(basename(input$upload$name)),
"_mit_kuerzel.",
tools::file_ext(input$upload$name))
},
to add _mit_kuerzel to the base portion of the uploaded filename. Note that the file in the content= section is never this name, the new_mit_kuerzel.csv is the filename offered to the downloading browser as a suggestion, that is all.
(Optional) You are using a .csv file extension in the downloadHandler, but the default for vroom::vroom_write is to use delim = "\t", which is not a CSV. I suggest either adding delim = ";" (or similar), or changing the returned filename extension to .tsv instead.

R - Issue with the DOM of the danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about their democratic process and they are uploading all the legislative documents on their website. I've been crawling over all pages starting 2008. Right now I'm parsing the information into a dataframe and I'm having an issue that I was not able to resolve so far.
If we look at the DOM we can see that they named most of the objects div.tingdok-normal. The number of objects varies between 16-19. To parse the information correctly for my dataframe I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my pattern match more than once and I don't know how to tell R that I only want the first match.
for the sake of an example I include some code:
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim =TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique (grep(paste(tomatch, collapse="|"), results, value = TRUE))
Maybe you can help me with that
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" are related to the text. I was able to get the text of the webpage with the following code. Also, the following code identifies the position of the first "regex hit" of the different patterns to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
"C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}
Here is another approach :
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1 : nb_Tomatch)
{
# Locates the first hit of the regex
# To locate all regex hit, use stringr::str_locate_all
list_Position[[i]] <- stringr::str_locate(text , pattern = tomatch[i])
list_Text[[i]] <- stringr::str_sub(string = text,
start = list_Position[[i]][1, 1],
end = list_Position[[i]][1, 2])
}

rbind fromJSON page: duplicate rowname error

I was trying to rbind some json data scraped from api
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
# Generate url for each page
url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',i)
# Get json data from each page and transform it into dataframe
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Changing the row.names to null doesn't work. I heard from someone it is due to the fact that some data are stored as lists here, which I don't quite understand.
I understand that there is an alternative package WDI to access this data and it works well, but I want to know how to resolve the duplicates row name problem here in general so that I can deal with similar situation where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is a single bracket can select multiple elements whereas [[ selects only one.
You can see how this works with some fake data.
foo <- list(
a = rnorm(100),
b = rnorm(100),
c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your answer will depend on which subset operator you're using. For your problem, you'll want to use [[ to select a single data.frame inside of the list. Then, you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
url <- paste0(
'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
i
)
res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2) # Use lapply to select the 2 element from each list element.
resl <- do.call(rbind, resl) # This takes all the elements of the list and uses those elements as the arguments for rbind.

Embed html in Jupyter with R kernel [duplicate]

I just started using Jupyter with R, and I'm wondering if there's a good way to display HTML or LaTeX output.
Here's some example code that I wish worked:
library(xtable)
x <- runif(500, 1, 50)
y <- x + runif(500, -5, 5)
model <- lm(y~x)
print(xtable(model), type = 'html')
Instead of rendering the HTML, it just displays it as plaintext. Is there any way to change that behavior?
A combination of repr (for setting options) and IRdisplay will work for HTML. Others may know about latex.
# Cell 1 ------------------------------------------------------------------
library(xtable)
library(IRdisplay)
library(repr)
data(tli)
tli.table <- xtable(tli[1:20, ])
digits(tli.table) <- matrix( 0:4, nrow = 20, ncol = ncol(tli)+1 )
options(repr.vector.quote=FALSE)
display_html(paste(capture.output(print(head(tli.table), type = 'html')), collapse="", sep=""))
# Cell 2 ------------------------------------------------------------------
display_html("<span style='color:red; float:right'>hello</span>")
# Cell 3 ------------------------------------------------------------------
display_markdown("[this](http://google.com)")
# Cell 4 ------------------------------------------------------------------
display_png(file="shovel-512.png")
# Cell 5 ------------------------------------------------------------------
display_html("<table style='width:20%;border:1px solid blue'><tr><td style='text-align:right'>cell 1</td></tr></table>")
I found a simpler answer, for the initial, simple use case.
If you call xtable without wrapping it in a call to print, then it totally works. E.g.,
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
In Jupyter, you can use Markdown. Just be sure to change the Jupyter cell from a code cell to a Markdown cell. Once you have done this you can simply place a double dollar sign ("$$") before and after the LaTex you have. Then run the cell.
The steps are as follows:
1. Create a Markdown cell.
2. $$ some LaTex $$
3. Press play button within Jupyter.
Defining the following function in the session will display objects returned by xtable as html generated by xtable:
repr_html.xtable <- function(obj, ...){
paste(capture.output(print(obj, type = 'html')), collapse="", sep="")
}
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
Without the repr_html.xtable function, because the returned object is also of class data.frame, the display system in the kernel will rich display that object (=html table) via repr::repr_html.data.frame.
Just don't print(...) the object :-)
Render/Embed html/Latex table to IR Kernel jupyter
Some packages in R give tables in html format like "knitr", so if you want to put this tables in the notebook:
library(knitr)
library(kableExtra)
library(IRdisplay) #the package that you need
#we create the table
dt <- mtcars[1:5, 1:6]
options(knitr.table.format = "html")
html_table= kable(dt) %>%
kable_styling("striped") %>%
add_header_above(c(" " = 1, "Group 1" = 2, "Group 2" = 2, "Group 3" = 2))
#We put the table in our notebook
display_html(toString(html_table))
Or for example if you have a file
display_latex(file = "your file path")

How to render LaTeX / HTML in Jupyter (R)?

I just started using Jupyter with R, and I'm wondering if there's a good way to display HTML or LaTeX output.
Here's some example code that I wish worked:
library(xtable)
x <- runif(500, 1, 50)
y <- x + runif(500, -5, 5)
model <- lm(y~x)
print(xtable(model), type = 'html')
Instead of rendering the HTML, it just displays it as plaintext. Is there any way to change that behavior?
A combination of repr (for setting options) and IRdisplay will work for HTML. Others may know about latex.
# Cell 1 ------------------------------------------------------------------
library(xtable)
library(IRdisplay)
library(repr)
data(tli)
tli.table <- xtable(tli[1:20, ])
digits(tli.table) <- matrix( 0:4, nrow = 20, ncol = ncol(tli)+1 )
options(repr.vector.quote=FALSE)
display_html(paste(capture.output(print(head(tli.table), type = 'html')), collapse="", sep=""))
# Cell 2 ------------------------------------------------------------------
display_html("<span style='color:red; float:right'>hello</span>")
# Cell 3 ------------------------------------------------------------------
display_markdown("[this](http://google.com)")
# Cell 4 ------------------------------------------------------------------
display_png(file="shovel-512.png")
# Cell 5 ------------------------------------------------------------------
display_html("<table style='width:20%;border:1px solid blue'><tr><td style='text-align:right'>cell 1</td></tr></table>")
I found a simpler answer, for the initial, simple use case.
If you call xtable without wrapping it in a call to print, then it totally works. E.g.,
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
In Jupyter, you can use Markdown. Just be sure to change the Jupyter cell from a code cell to a Markdown cell. Once you have done this you can simply place a double dollar sign ("$$") before and after the LaTex you have. Then run the cell.
The steps are as follows:
1. Create a Markdown cell.
2. $$ some LaTex $$
3. Press play button within Jupyter.
Defining the following function in the session will display objects returned by xtable as html generated by xtable:
repr_html.xtable <- function(obj, ...){
paste(capture.output(print(obj, type = 'html')), collapse="", sep="")
}
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
Without the repr_html.xtable function, because the returned object is also of class data.frame, the display system in the kernel will rich display that object (=html table) via repr::repr_html.data.frame.
Just don't print(...) the object :-)
Render/Embed html/Latex table to IR Kernel jupyter
Some packages in R give tables in html format like "knitr", so if you want to put this tables in the notebook:
library(knitr)
library(kableExtra)
library(IRdisplay) #the package that you need
#we create the table
dt <- mtcars[1:5, 1:6]
options(knitr.table.format = "html")
html_table= kable(dt) %>%
kable_styling("striped") %>%
add_header_above(c(" " = 1, "Group 1" = 2, "Group 2" = 2, "Group 3" = 2))
#We put the table in our notebook
display_html(toString(html_table))
Or for example if you have a file
display_latex(file = "your file path")