How to render LaTeX / HTML in Jupyter (R)?

I just started using Jupyter with R, and I'm wondering if there's a good way to display HTML or LaTeX output.
Here's some example code that I wish worked:
library(xtable)
x <- runif(500, 1, 50)
y <- x + runif(500, -5, 5)
model <- lm(y~x)
print(xtable(model), type = 'html')
Instead of rendering the HTML, it just displays it as plaintext. Is there any way to change that behavior?

A combination of repr (for setting options) and IRdisplay will work for HTML. Others may know more about LaTeX.
# Cell 1 ------------------------------------------------------------------
library(xtable)
library(IRdisplay)
library(repr)
data(tli)
tli.table <- xtable(tli[1:20, ])
digits(tli.table) <- matrix( 0:4, nrow = 20, ncol = ncol(tli)+1 )
options(repr.vector.quote=FALSE)
display_html(paste(capture.output(print(head(tli.table), type = 'html')), collapse="", sep=""))
# Cell 2 ------------------------------------------------------------------
display_html("<span style='color:red; float:right'>hello</span>")
# Cell 3 ------------------------------------------------------------------
display_markdown("[this](http://google.com)")
# Cell 4 ------------------------------------------------------------------
display_png(file="shovel-512.png")
# Cell 5 ------------------------------------------------------------------
display_html("<table style='width:20%;border:1px solid blue'><tr><td style='text-align:right'>cell 1</td></tr></table>")

I found a simpler answer for the initial, simple use case.
If you call xtable without wrapping it in a call to print, it just works. For example:
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)

In Jupyter, you can use Markdown. Just be sure to change the Jupyter cell from a code cell to a Markdown cell. Once you have done this, you can simply place a double dollar sign ("$$") before and after the LaTeX you have, then run the cell.
The steps are as follows:
1. Create a Markdown cell.
2. Type $$ some LaTeX $$ in the cell.
3. Press the play (Run) button in Jupyter.
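For example, putting the following in a Markdown cell and running it renders a fitted-line equation (the formula here is just an illustration, not from the original post):
$$ \hat{y} = \beta_0 + \beta_1 x $$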

Defining the following function in the session will make objects returned by xtable display as the HTML that xtable generates:
repr_html.xtable <- function(obj, ...) {
  paste(capture.output(print(obj, type = 'html')), collapse = "", sep = "")
}
library(xtable)
data(cars)
model <- lm(speed ~ ., data = cars)
xtable(model)
Without the repr_html.xtable function, the kernel's display system falls back to repr::repr_html.data.frame, because the returned object is also of class data.frame, and rich-displays it as a plain HTML table.
Just don't print(...) the object :-)
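A quick way to see the S3 dispatch at work (a sketch, assuming the repr_html.xtable method above has been defined in the session):
library(repr)
library(xtable)
tab <- xtable(head(cars))
class(tab)  # "xtable" "data.frame" -- repr_html() dispatches on the first class
cat(substr(repr_html(tab), 1, 80))  # start of the HTML the method produced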

Render/embed an HTML/LaTeX table in the Jupyter IR kernel
Some R packages, such as knitr, produce tables in HTML format, so if you want to put these tables in the notebook:
library(knitr)
library(kableExtra)
library(IRdisplay)  # the package that does the embedding
# create the table
dt <- mtcars[1:5, 1:6]
options(knitr.table.format = "html")
html_table <- kable(dt) %>%
  kable_styling("striped") %>%
  add_header_above(c(" " = 1, "Group 1" = 2, "Group 2" = 2, "Group 3" = 2))
# put the table in the notebook
display_html(toString(html_table))
Or, for example, if you have a file:
display_latex(file = "your file path")

Related

How to pass multiple values as IDs to `rvest`

I want to extract a bunch of numbers from the website https://www.bcassessment.ca/ using PIDs (unique IDs). A list of sample PIDs is shown below:
PID <- c("012-215-023", "024-521-647", "025-891-669")
For these values, I opened the website manually, chose PID from the list of available search options, and searched for these numbers. The search redirected me to the following URLs:
URL <- c("https://www.bcassessment.ca//Property/Info/QTAwMDAwM1hIUA==",
"https://www.bcassessment.ca//Property/Info/QTAwMDAwNEJKMA==",
"https://www.bcassessment.ca//Property/Info/QTAwMDAwMUc5OA==")
Then, for each of these URLs, I ran the code shown below to extract the total value of the property:
library(rvest)  # for read_html(), html_nodes(), html_text()

out <- c()
for (i in 1:length(URL)) {
  url <- URL[i]
  out[i] <- url %>%
    read_html() %>%
    html_nodes('span#lblTotalAssessedValue') %>%
    html_text()
}
which gives me the final result
[1] "$543,000" "$957,000" "$487,000"
The problem is that I have a list of more than 50,000 PIDs, and I cannot manually search the website for each of them to find the actual link before running rvest to scrape it. How would you recommend automating this process, so that I only provide PIDs and get the price as output?
Summary: for a list of known PIDs, I want to open https://www.bcassessment.ca/ and extract the most up-to-date price of each property, automatically.
Test PIDs
I've added a list of PIDs so you can check that the code works:
structure(list(P.I.D.. = c("004-050-541", "016-658-540", "016-657-861",
"016-657-764", "019-048-386", "025-528-360", "800-058-036", "025-728-954",
"028-445-783", "027-178-048", "028-445-571", "025-205-145", "015-752-798",
"026-041-308", "024-521-698", "027-541-631", "024-360-651", "028-445-040",
"025-851-411", "025-529-293", "024-138-436", "023-893-796", "018-496-768",
"025-758-721", "024-219-665", "024-359-866", "018-511-015", "026-724-979",
"023-894-253", "006-331-505", "025-961-012", "024-219-690", "027-309-878",
"028-445-716", "025-759-060", "017-692-733", "025-728-237", "028-447-221",
"023-894-202", "028-446-020", "026-827-611", "028-058-798", "017-574-412",
"023-893-591", "018-511-457", "025-960-199", "027-178-714", "027-674-941",
"027-874-826", "025-110-390", "028-071-336", "018-257-984", "023-923-393",
"026-367-203", "027-601-854", "003-773-922", "025-902-989", "018-060-641",
"025-530-003", "018-060-722", "025-960-423", "016-160-126", "009-301-461",
"025-960-580", "019-090-315", "023-464-283", "028-445-503", "006-395-708",
"028-446-674", "018-258-549", "023-247-398", "029-321-166", "024-519-871",
"023-154-161", "003-904-547", "004-640-357", "006-314-864", "025-960-521",
"013-326-783", "003-430-049", "027-490-084", "024-360-392", "028-054-474",
"026-076-179", "005-309-689", "024-613-509", "025-978-551", "012-215-066",
"024-034-002", "025-847-244", "024-222-038", "003-912-019", "024-845-264",
"006-186-254", "026-826-691", "026-826-712", "024-575-569", "028-572-581",
"026-197-774", "009-695-958", "016-089-120", "025-703-811", "024-576-671",
"026-460-751", "026-460-149", "003-794-181", "018-378-684", "023-916-745",
"003-497-721", "003-397-599", "024-982-211", "018-060-129", "018-061-231",
"017-765-714", "027-303-799", "028-565-312", "018-061-010", "006-338-232",
"023-680-024", "028-983-971", "028-092-490", "006-293-239", "018-061-257",
"028-092-376", "018-060-137", "004-302-664", "016-988-060", "003-371-166",
"027-325-342", "011-475-480", "018-060-200")), row.names = c(NA,
-131L), class = c("tbl_df", "tbl", "data.frame"))
P.S. The website I mentioned is public; anyone can open it and enter an address to find the estimated price of a property, so I don't think there is any problem with scraping it, as it's a public database.
When you submit the PID through the form, it triggers the following call:
GET https://www.bcassessment.ca/Property/Search/GetByPid/012215023?PID=012215023&_=1619713418473
The call above has the following parameters:
012215023 is the PID from your input with the dashes (-) removed; it appears both as a path parameter and as a query parameter
1619713418473 is the current timestamp in milliseconds since 1970 (a Unix timestamp)
The result of the call above is a JSON response like this:
{
  "sEcho": 1,
  "aaData": [
    ["XXXXXXX", "XXXXXXXX", "XXXXXXXXXXXX", "200-027-615-115-48-0004", "QTAwMDAwM1hIUA=="]
  ]
}
The call above returns the response with a text/plain content type rather than application/json, so we have to parse it ourselves with jsonlite. Then we pick the last item of the aaData array (in this case QTAwMDAwM1hIUA==) and build the resulting URL like the ones in your post.
The following code takes a list of PIDs and extracts the assessed $ value for each one:
library(rvest)

getValueForPID <- function(pid) {
  pidNum <- gsub("-", "", pid)
  time <- as.numeric(as.POSIXct(Sys.time())) * 1000  # unix timestamp in ms
  # query the search endpoint; the "_" parameter is the timestamp
  output <- httr::content(httr::GET(paste0("https://www.bcassessment.ca/Property/Search/GetByPid/", pidNum), query = list(
    "PID" = pidNum,
    "_" = format(time, digits = 13)
  )), "text", encoding = "UTF-8")
  if (output == "found_no_results") {
    return("")
  }
  data <- jsonlite::fromJSON(output)
  id <- data$aaData[5]  # last item of aaData: the encoded property ID
  text <- paste0("https://www.bcassessment.ca/Property/Info/", id) %>%
    read_html() %>%
    html_nodes('span#lblTotalAssessedValue') %>%
    html_text()
  return(text)
}
PID <- c("004-050-541", "016-658-540", "016-657-861", "016-657-764", "019-048-386", "025-528-360", "800-058-036")
out <- c()
count <- 1
for (i in PID) {
print(i)
out[count] <- getValueForPID(i)
count <- count + 1
}
print(out)
sample output:
[1] "$543,000" "$957,000" "$487,000"
kaggle link: https://www.kaggle.com/bertrandmartel/bcassesment-pid

rbind fromJSON page: duplicate rowname error

I was trying to rbind some JSON data scraped from an API:
library(jsonlite)
pop_dat <- data.frame()
for (i in 1:3) {
  # Generate the url for each page
  url <- paste0('http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=', i)
  # Get the json data from each page and transform it into a dataframe
  dat <- as.data.frame(fromJSON(url)[2], flatten = TRUE, row.names = NULL)
  pop_dat <- rbind(pop_dat, dat)
}
However, it returns the following error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘10’, ‘11’, ‘12’, ‘13’, ‘14’, ‘15’, ‘16’, ‘17’, ‘18’, ‘19’, ‘2’, ‘20’, ‘21’, ‘22’, ‘23’, ‘24’, ‘25’, ‘26’, ‘27’, ‘28’, ‘29’, ‘3’, ‘30’, ‘31’, ‘32’, ‘33’, ‘34’, ‘35’, ‘36’, ‘37’, ‘38’, ‘39’, ‘4’, ‘40’, ‘41’, ‘42’, ‘43’, ‘44’, ‘45’, ‘46’, ‘47’, ‘48’, ‘49’, ‘5’, ‘50’, ‘6’, ‘7’, ‘8’, ‘9’
Setting row.names to NULL doesn't help. I heard from someone that it is because some of the data are stored as lists here, which I don't quite understand.
I understand that there is an alternative package, WDI, to access this data, and it works well, but I want to know how to resolve the duplicate row-name problem in general, so that I can deal with similar situations where no alternative package is available.
I heard from someone it is due to the fact that some data are stored as lists...
This is correct. The solution is fairly simple, but I find it really easy to get tripped up by this. Right now you're using:
dat <- as.data.frame(fromJSON(url)[2],flatten = TRUE, row.names = NULL)
The problem comes from fromJSON(url)[2]. This should be fromJSON(url)[[2]] instead. According to the documentation, the key difference between [ and [[ is that a single bracket can select multiple elements, whereas [[ selects only one.
You can see how this works with some fake data.
foo <- list(
  a = rnorm(100),
  b = rnorm(100),
  c = rnorm(100)
)
With [, you can select multiple values inside this list.
foo[c("a", "b")]
length(foo["a"]) # Result is 1 not 100 like you might expect.
With [[ the results are different.
foo[[c("a", "b")]] # Raises a subscript error.
foo[["a"]] #This works.
length(foo[["a"]]) # Result is 100.
So, your result will depend on which subset operator you're using. For your problem, you'll want to use [[ to select the single data.frame inside the list. Then you should be able to use rbind correctly.
final <- data.frame()
for (i in 1:10) {
  url <- paste0(
    'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=',
    i
  )
  res <- jsonlite::fromJSON(url, flatten = TRUE)[[2]]
  final <- rbind(final, res)
}
Alternative solution with lapply:
urls <- sprintf(
  'http://api.worldbank.org/v2/countries/all/indicators/SP.POP.TOTL?format=json&page=%s',
  1:10
)
resl <- lapply(urls, jsonlite::fromJSON, flatten = TRUE)
resl <- lapply(resl, "[[", 2)  # select the 2nd element from each list element
resl <- do.call(rbind, resl)   # pass all list elements to rbind at once

Read HTML into R

I would like R to take each word in a column of a dataset and return a value from a website. The code I have so far is below. For each word in the data frame column, it should go to the website and return the pronunciation (for example, the pronunciation on http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=word&stress=-s is "W ER1 D"). I have looked at the HTML of the website, and it's unclear what I would need to select to return this value: it's between <tt> and </tt>, but there are many of these. I'm also not sure how to then get that value into R. Thank you.
library(xml2)
for (word in df$word) {
  result <- read_html(paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s"))
}
Parsing HTML is a tricky task in R, but there are a couple of ways to do it. If the HTML converts well to XML and the website/API always returns the same structure, you can use XML-parsing tools. Otherwise, you could fall back to regex and call stringr::str_extract() on the HTML, as sketched below.
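For illustration, here is a minimal sketch of that regex fallback (the html string below is made up, not the site's actual markup):
library(stringr)
# pull the contents of every <tt>...</tt> pair out of a chunk of HTML
html <- "<tt>first</tt> some text <tt>W ER1 D .</tt>"
matches <- str_extract_all(html, "(?s)(?<=<tt>).*?(?=</tt>)")[[1]]
matches[2]  # on the real page, the second <tt> holds the pronunciation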
For your case, it is fairly easy to get the value you're looking for using XML tools. It's true that there are a lot of <tt> tags, but the one you want is always the second instance, so you can just pull that one out.
# load packages. dplyr is just for the pipe (%>%)
library(httr)
library(XML)
library(dplyr)

# test words
wordlist <- c('happy', 'sad')

for (word in wordlist) {
  # build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)
  # parse the HTML
  resXML <- htmlParse(content(res, as = "text"))
  # retrieve the second <tt>
  print(getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue))
  # don't abuse your API
  Sys.sleep(0.1)
}
>[1] "HH AE1 P IY0 ."
>[1] "S AE1 D ."
Good luck!
EDIT: This code will return a dataframe:
# load packages. dplyr is just for the pipe (%>%)
library(httr)
library(XML)
library(dplyr)

# test words
wordlist <- c('happy', 'sad')

# initialize the dataframe with a pronunciation field
pronunciation_list <- data.frame(pronunciation = character(), stringsAsFactors = FALSE)

# loop over the words
for (word in wordlist) {
  # build the url and GET the result
  url <- paste0("http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=", word, "&stress=-s")
  h <- handle(url)
  res <- GET(handle = h)
  # parse the HTML
  resXML <- htmlParse(content(res, as = "text"))
  # retrieve the second <tt>
  to_add <- data.frame(pronunciation = (getNodeSet(resXML, '//tt[2]') %>% sapply(., xmlValue)))
  # bind the data
  pronunciation_list <- rbind(pronunciation_list, to_add)
  # don't abuse your API
  Sys.sleep(0.1)
}


R: Parsing a group of HTML files with a loop

The following code works for individual .html files:
library(XML)  # provides htmlParse() and xpathSApply()

doc <- htmlParse("New folder/1-4.html")
plain.text <- xpathSApply(doc, "//td", xmlValue)
plain.text <- gsub("\n", "", plain.text)
gregexpr("firstThing", plain.text)
firstThing <- substring(plain.text[9], 41, 50)
gregexpr("secondThing", plain.text)
secondThing <- substring(plain.text[7], 1, 550)
But the following loop does not and gives me the error:
XML content does not seem to be XML
file.names <- dir(path = "New folder")
for (i in 1:length(file.names)) {
  doc <- htmlParse(file.names[i])
  plain.text <- xpathSApply(doc, "//td", xmlValue)
  gsub("\n", "", plain.text)
  firstThing[i] <- substring(plain.text[9], 41, 50)
  secondThing[i] <- substring(plain.text[7], 1, 550)
}
I'm simply trying to extract the information (as I was able to do in the first batch of code) and collect it into vectors.
Any ideas on how to resolve this issue?
Two things. First, your paths are wrong: dir() returns bare file names, not full paths, so htmlParse() cannot find the files. To fix this, use:
filenames = dir(path = "New folder", full.names = TRUE)
Second, rather than filling two variables inside a for loop, it's better to build structured data with lapply:
result = lapply(filenames, function (filename) {
    doc = htmlParse(filename)
    plain_text = xpathSApply(doc, "//td", xmlValue)
    c(first = substring(plain_text[9], 41, 50),
      second = substring(plain_text[7], 1, 550))
})
Now result is a list of elements, where each element is a vector with names first and second.
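If you'd rather have a single table than a list, a common follow-up (a sketch, not part of the original answer) is to row-bind the named vectors:
# one row per file, with columns "first" and "second"
result_df <- as.data.frame(do.call(rbind, result), stringsAsFactors = FALSE)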
A few other remarks:
Be wary of dots in variable names: S3 uses dots in function names to determine the class a generic method dispatches on, so using dots for anything else causes confusion and should be avoided (see the short illustration after these remarks).
The gsub call in your loop has no effect, because its result is never assigned back to plain.text.
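A tiny illustration of the dots-and-dispatch point (the class and function names here are made up):
# print.myclass reads, to S3, as "the print method for class 'myclass'"
print.myclass <- function(x, ...) cat("custom print\n")
obj <- structure(list(), class = "myclass")
print(obj)  # dispatches to print.myclass
# by the same convention, a variable named plain.text looks like
# a "plain" method for class "text", which is misleading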